Rafael Monteiro/

k-prototypes for customer segmentation


Can you find a better way to segment your customers?

📖 Background

You work for a medical device manufacturer in Switzerland. Your company manufactures orthopedic devices and sells them worldwide. The company sells directly to individual doctors who use them on rehabilitation and physical therapy patients.

Historically, the sales and customer support departments have grouped doctors by geography. However, the region is not a good predictor of the number of purchases a doctor will make or their support needs.

Your team wants to use a data-centric approach to segmenting doctors to improve marketing, customer service, and product planning.

1 - Imports

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
!pip install venn &> /dev/null
from venn import venn
!pip install kmodes &> /dev/null
from kmodes.kprototypes import KPrototypes
import as px
from sklearn import preprocessing

doctors = pd.read_csv('data/doctors.csv')
orders = pd.read_csv('data/orders.csv')
complaints = pd.read_csv('data/complaints.csv')
instructions = pd.read_csv('data/instructions.csv')

1.1 Exploratory Data Analysis

Kinds of table relationships.

  • First, let's analyze the relationship between the tables and the records of the doctors in each one.
    So to deal with the small intersection between the tables, I'll do an outer join and fill in the missing values with zeros.

  • 304 doctors, almost 50% of the records, do not present data on orders, instructions or complaints, which will make them part of a large cluster of doctors with little data beyond the record.

  • 210 doctors, more than 30% of the records, have complaints, but they are not classified in the record of the table doctors. Which will also integrate much of a cluster label.

ndoc_doc = set(doctors.DoctorID)
ndoc_inst = set(instructions.DoctorID)
ndoc_ord = set(orders.DoctorID)
ndoc_comp = set(complaints.DoctorID)
doc_data = {
    "doctors": ndoc_doc,
    "instructions": ndoc_inst,
    "orders": ndoc_ord,
    "complaints": ndoc_comp

print(f"Total of doctors on register :{len(ndoc_doc.union(ndoc_inst,ndoc_ord,ndoc_comp))}")


Doctors contains information on doctors. Each row represents one doctor.
  • "DoctorID" - is a unique identifier for each doctor.
  • "Region" - the current geographical region of the doctor.
  • "Category" - the type of doctor, either 'Specialist' or 'General Practitioner.'
  • "Rank" - is an internal ranking system. It is an ordered variable: The highest level is Ambassadors, followed by Titanium Plus, Titanium, Platinum Plus, Platinum, Gold Plus, Gold, Silver Plus, and the lowest level is Silver.
  • "Incidence rate" and "R rate" - relate to the amount of re-work each doctor generates.
  • "Satisfaction" - measures doctors' satisfaction with the company.
  • "Experience" - relates to the doctor's experience with the company.
  • "Purchases" - purchases over the last year.
# Satisfaction must be float, so we will use nan so that empty values do not interfere with the average.
def isnumb(x):
    Function  is used to accept a string and convert it into a float. If the input string does not contain a numeral value or If the first character of thestring is not a number 	then it returns NaN.
    return re.sub('[-.]+','',x).isnumeric()

# Satisfaction to float 
doctors['Satisfaction'] = doctors['Satisfaction'].apply(lambda x : float(x) if isnumb(x) else np.nan)

cor_doc = doctors.corr()
mask = np.triu(np.ones_like(cor_doc, dtype=bool))
            annot = True,
           mask = mask);

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))

doctors.Category.value_counts().plot(kind='bar',title = "Count by Category", ax=axes[0])
doctors.Rank.value_counts().plot(kind='bar',title = "Count Doctors by Rank", ax=axes[1])

💪 2. How many doctors are there in each region? What is the average number of purchases per region?

Counting the number of doctors grouped by region and checking the average rating and purchases per region

  • Region '1 13' has the highest number of doctors.
  • Region '1 19 20' has only 1 doctor, but this one has a high level of purchases.
doc_by_reg = doctors.groupby(['Region']).agg({'DoctorID':'count',
                                              'Purchases': 'mean',
                                              'R rate':'mean',
doc_by_reg.rename(columns = {'DoctorID':'Doctor'}, inplace = True)

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20,9))
for i,k in enumerate(doc_by_reg.columns):
    sns.barplot(x = doc_by_reg.index, y = k,
                data = doc_by_reg, color="c",
                ax = axes[i//2,i%2]).set(title='Distribution of '+k+' for region')

print('Show doctor order by Purchase')
display(doctors.sort_values(by ='Purchases', ascending=False).head(5))

Conditions H, G and B have the highest number of orders, accounting for almost 60% of orders.

  • Therefore, these conditions (settings) are of great importance in the production priority.
Orders contains details on orders. Each row represents one order; a doctor can place multiple orders.
  • "DoctorID" - doctor id (matches the other tables).
  • "OrderID" - order identifier.
  • "OrderNum" - order number.
  • "Conditions A through J" - map the different settings of the devices in each order. Each order goes to an individual patient.
dfcond = orders.sum()[3:].sort_values(ascending  = False)
dfcond.plot(kind='bar', title = 'Oders by condition')

dfcond = (dfcond/dfcond.sum()).to_frame()
dfcond['cumsum'] = dfcond.cumsum()
dfcond.rename(columns={0:'Percentual'}, inplace = True)

💪 3. Can you find a relationship between purchases and complaints?

  • There is a weak relationship between the number of complaints and the number of purchases, in addition, the record of complaints has few records in common with that of doctors.
    There is a strong relationship between the number of complaints and how correct they are. One reason is that most complaints are correct.As shown below.

Analysis of categories as the largest number of purchases.

  • Despite being the lowest level, silver is the fifth best level in terms of purchase averages.
  • The other levels maintain the relationship of the internal ranking system.
  • Specialists are the ones who make the most purchases on average and total.
Complaints collects information on doctor complaints.
  • "DoctorID" - doctor id (matches the other tables).
  • "Complaint Type" - the company's classification of the complaints.
  • "Qty" - number of complaints per complaint type per doctor.
# Set index to DoctorID to join
doctors2 = doctors.set_index('DoctorID')

# "Qty" - number of complaints per complaint *type* per doctor. 'Complaint Type'
complaints['Complaint Type'].value_counts().plot(kind='bar')
complaints2 = complaints.pivot('DoctorID','Complaint Type','Qty').fillna(0)
complaints2 = complaints2[complaints2.columns[1:]]

# Grouping the complaints of each type by doctor
select_cols = ['Total_complaint', 'Correct']

complaints3 = pd.concat([doctors2,complaints2], axis =1) # Join doctors and complaints

complaints_purch = complaints3[select_cols+['Purchases']]

# Pairplot for quantitative variables

# Heatmap of correlations
cor_compl = complaints_purch.corr()

mask = np.triu(np.ones_like(cor_compl, dtype=bool))
            annot = True,
           mask = mask);

fig = px.scatter(complaints3.dropna(), y="Total_complaint", x="Purchases", color="Rank")
fig.update_layout(title_text="Relationship between the number of Total_complaint and purchases by Rank")

💪 4. Define new doctor segments that help the company improve marketing efforts and customer service.

Checking the relationship between the number of orders and purchases by doctor.

  • Specialist is the category that has the highest number of orders and purchases, both on average and in total.
Instructions has information on whether the doctor includes special instructions on their orders.
  • "DoctorID" - doctor id (matches the other tables).
  • "Instructions" - 'Yes' when the doctor includes special instructions, 'No' when they do not.
# Change yes and no values to 1 and 0, and set DoctorId to index
instructions2 = instructions.copy()
instructions2.Instructions ={'Yes':1,'No':0})

# Group orders by doctor, I will not consider the conditions, to better distribute the weights of the features in the clustering.
ord_by_doct = orders.groupby('DoctorID').agg({'OrderID':'count'})

# Join orders, doctors, complaints, instructions
# I will not consider type of complaints, to better distribute the weights of the features in the clustering
dfodc = ord_by_doct.join([complaints3.iloc[:,:9].dropna(how = 'all'),instructions2], how='outer') 

dctagg = {'OrderID':['sum','mean'],'Incidence rate':'mean',
          'R rate':'mean','Satisfaction':'mean','Experience':'mean',

print('Specialists are the ones who make the most orders on total and average too.')
display(dfodc.groupby('Category').agg(dctagg).sort_values(by=('Purchases','mean'), ascending=False))

print('Identifying the categories that have the highest volume of purchases.')
display(dfodc.groupby('Rank').agg(dctagg).sort_values(by=[('Purchases','sum'),('Purchases','mean')], ascending=False))

fig = px.scatter(dfodc.fillna(0), y="OrderID", x="Purchases", color="Rank")
fig.update_layout(title_text="Relationship between the number of orders and purchases by doctor")


Adjusting the scale of variables, choosing the appropriate k-prototypes method for clustering with mixed variables (categorical and numerical), as well as defining the appropriate parameters for the efficient operation of the method used.

  • First finding the clusters
  • After checking the main characteristics of each cluster.
# As we have many categorical values, I will use k prototypes.
# K-Means is one of the conventional clustering methods commonly used in clustering techniques. However, its method is not good and suitable for data that contains categorical variables. 

# Categorical columns
catcols = ['Region','Category', 'Rank']
iloccatcols = [dfodc.columns.get_loc(k) for k in catcols]

# Numeric columns
numcols = [k for k in dfodc.columns if k not in catcols]
dfodc.fillna(0, inplace=True)

for k in catcols:
    dfodc.loc[:,k] = dfodc.loc[:,k].astype(str)

# To eliminate possible problems of scale difference between the variables, we will apply the standardization.
customers_norm = dfodc.copy()
scaler = preprocessing.StandardScaler()
# Apply standardization to numeric columns
customers_norm[numcols] = scaler.fit_transform(customers_norm[numcols])
costs = []
n_clust = []

# Plot for Elbow Method for best choice of number of clusters.

# Random state 0 for reproducing the problem the same every time it is run
# Selecting ‘Huang’ as the init, the model will select the first k distinct objects from the data set as initial k-modes
for i in range(3, 20):
        kproto = KPrototypes(n_clusters = i, init='Huang',random_state = 0)
        clusters = kproto.fit_predict(customers_norm, categorical = iloccatcols )
plt.title('Elbow Method')
plt.xlabel('Number of clusters')

def optimal_number_of_clusters(sse,n_clust):
    x1, y1 = n_clust[0], sse[0]
    x2, y2 = n_clust[-1], sse[-1]
    distances = []
    for x0,y0 in zip(n_clust,sse):
        numerator = abs((y2-y1)*x0 - (x2-x1)*y0 + x2*y1 - y2*x1)
        denominator = np.sqrt((y2 - y1)**2 + (x2 - x1)**2)
    return distances

distances = optimal_number_of_clusters(costs,n_clust)
n_max_dist = n_clust[distances.index(max(distances))]

plt.plot(n_clust, distances)
plt.plot([n_max_dist,n_max_dist],[0,max(distances)], label = 'Optimal number of clusters')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')

print("The Optimal number of clusters is :",n_max_dist)

kproto = KPrototypes( n_clusters= n_max_dist , init='Huang',random_state = 0)
clusters = kproto.fit_predict(customers_norm, categorical = iloccatcols )
#join data with labels 
labeledCustomers = dfodc.copy()
labeledCustomers['labels'] = clusters

plt.title("Size of clusters")

# Groupby labels for mean and mode
mode = lambda x:x.value_counts().index[0]
aggfunc = list(map(lambda x: 'mean' if x not in catcols else mode,labeledCustomers.columns[:-1]))
dctag = {k:j for k,j in zip(labeledCustomers.columns,aggfunc)}
display(labeledCustomers.groupby('labels').agg(dctag).sort_values(by = 'Purchases', ascending = False))

# Groupby labels for count and total of Orders, Purchases and complaints
display(labeledCustomers.groupby('labels').agg({'Rank':'count','OrderID':'sum','Purchases':'sum','Total_complaint':'sum'}).sort_values(by = 'Purchases', ascending = False))

sns.catplot(data=labeledCustomers, x="OrderID", y="Purchases", hue="labels",height = 4 , aspect = 2.5)

for i,j in enumerate(catcols+['OrderID','Purchases']):
	sns.catplot(x = j, y='labels', hue='labels', data=labeledCustomers,height = 4 , aspect = 2.5)
	if j in ['Region','Purchases']: