k-prototypes for customer segmentation

You work for a medical device manufacturer in Switzerland. Your company manufactures orthopedic devices and sells them worldwide. The company sells directly to individual doctors who use them on rehabilitation and physical therapy patients.

Historically, the sales and customer support departments have grouped doctors by geography. However, the region is not a good predictor of the number of purchases a doctor will make or their support needs.

Your team wants to use a data-centric approach to segmenting doctors to improve marketing, customer service, and product planning.

1 - Imports

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
!pip install venn &> /dev/null
from venn import venn
!pip install kmodes &> /dev/null
from kmodes.kprototypes import KPrototypes
import plotly.express as px
from sklearn import preprocessing

doctors = pd.read_csv('data/doctors.csv')
orders = pd.read_csv('data/orders.csv')
complaints = pd.read_csv('data/complaints.csv')
instructions = pd.read_csv('data/instructions.csv')

1.1 Exploratory Data Analysis

Kinds of table relationships.

First, let's analyze the relationship between the tables and the records of the doctors in each one.
So to deal with the small intersection between the tables, I'll do an outer join and fill in the missing values with zeros.
304 doctors, almost 50% of the records, do not present data on orders, instructions or complaints, which will make them part of a large cluster of doctors with little data beyond the record.
210 doctors, more than 30% of the records, have complaints, but they are not classified in the record of the table doctors. Which will also integrate much of a cluster label.

ndoc_doc = set(doctors.DoctorID)
ndoc_inst = set(instructions.DoctorID)
ndoc_ord = set(orders.DoctorID)
ndoc_comp = set(complaints.DoctorID)
doc_data = {
    "doctors": ndoc_doc,
    "instructions": ndoc_inst,
    "orders": ndoc_ord,
    "complaints": ndoc_comp
}

venn(doc_data);
print(f"Total of doctors on register :{len(ndoc_doc.union(ndoc_inst,ndoc_ord,ndoc_comp))}")

Doctors

Doctors contains information on doctors. Each row represents one doctor.

"DoctorID" - is a unique identifier for each doctor.
"Region" - the current geographical region of the doctor.
"Category" - the type of doctor, either 'Specialist' or 'General Practitioner.'
"Rank" - is an internal ranking system. It is an ordered variable: The highest level is Ambassadors, followed by Titanium Plus, Titanium, Platinum Plus, Platinum, Gold Plus, Gold, Silver Plus, and the lowest level is Silver.
"Incidence rate" and "R rate" - relate to the amount of re-work each doctor generates.
"Satisfaction" - measures doctors' satisfaction with the company.
"Experience" - relates to the doctor's experience with the company.
"Purchases" - purchases over the last year.

# Satisfaction must be float, so we will use nan so that empty values do not interfere with the average.
def isnumb(x):
    '''
    Function  is used to accept a string and convert it into a float. If the input string does not contain a numeral value or If the first character of thestring is not a number 	then it returns NaN.
    '''
    return re.sub('[-.]+','',x).isnumeric()

# Satisfaction to float 
doctors['Satisfaction'] = doctors['Satisfaction'].apply(lambda x : float(x) if isnumb(x) else np.nan)
sns.pairplot(doctors)
plt.show();

###
cor_doc = doctors.corr()
mask = np.triu(np.ones_like(cor_doc, dtype=bool))
sns.heatmap(cor_doc,
            annot = True,
            fmt='.2f',
           mask = mask);

###
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))

doctors.Category.value_counts().plot(kind='bar',title = "Count by Category", ax=axes[0])
doctors.Rank.value_counts().plot(kind='bar',title = "Count Doctors by Rank", ax=axes[1])
fig.show()

💪 2. How many doctors are there in each region? What is the average number of purchases per region?

Counting the number of doctors grouped by region and checking the average rating and purchases per region

Region '1 13' has the highest number of doctors.
Region '1 19 20' has only 1 doctor, but this one has a high level of purchases.

doc_by_reg = doctors.groupby(['Region']).agg({'DoctorID':'count',
                                              'Purchases': 'mean',
                                              'R rate':'mean',
                                              'Satisfaction':'mean'
                                             }).sort_values(by='DoctorID',ascending=False)
doc_by_reg.rename(columns = {'DoctorID':'Doctor'}, inplace = True)
display(doc_by_reg.head())

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20,9))
sns.set_theme(style="whitegrid")
plt.xticks(rotation=90)
for i,k in enumerate(doc_by_reg.columns):
    axes[i//2,i%2].tick_params(labelrotation=90)
    sns.barplot(x = doc_by_reg.index, y = k,
                data = doc_by_reg, color="c",
                ax = axes[i//2,i%2]).set(title='Distribution of '+k+' for region')
fig.tight_layout()
plt.show()

print('Show doctor order by Purchase')
display(doctors.sort_values(by ='Purchases', ascending=False).head(5))

Conditions H, G and B have the highest number of orders, accounting for almost 60% of orders.

Therefore, these conditions (settings) are of great importance in the production priority.

Orders contains details on orders. Each row represents one order; a doctor can place multiple orders.

"DoctorID" - doctor id (matches the other tables).
"OrderID" - order identifier.
"OrderNum" - order number.
"Conditions A through J" - map the different settings of the devices in each order. Each order goes to an individual patient.

dfcond = orders.sum()[3:].sort_values(ascending  = False)
plt.figure(figsize=[10,5])
dfcond.plot(kind='bar', title = 'Oders by condition')
plt.show()

dfcond = (dfcond/dfcond.sum()).to_frame()
dfcond['cumsum'] = dfcond.cumsum()
dfcond.rename(columns={0:'Percentual'}, inplace = True)
dfcond

💪 3. Can you find a relationship between purchases and complaints?

There is a weak relationship between the number of complaints and the number of purchases, in addition, the record of complaints has few records in common with that of doctors.
There is a strong relationship between the number of complaints and how correct they are. One reason is that most complaints are correct.As shown below.

Analysis of categories as the largest number of purchases.

Despite being the lowest level, silver is the fifth best level in terms of purchase averages.
The other levels maintain the relationship of the internal ranking system.
Specialists are the ones who make the most purchases on average and total.

‌
‌
‌