You have worked hard to build a loyal customer base or a loyal employee base. You do not want your "A players", your loyal spokespeople, to churn away from a good business. After all, customers and employees are your lifeblood. How do you anticipate churn among employees and customers and put effective retention strategies in place ahead of time? Past attrition reveals patterns that can be used to identify the predominant causes of churn, which in turn translate into specific retention strategies. Below, we work through an example of employee attrition on simulated data to understand the leading causes of churn.
We will use employee data to demonstrate attrition models. Please remember that this is a statistical model and does not consider the events, or the temporal sequence of events, that led to attrition. Most analyses use both quantitative methods and survival models to predict churn.
import pandas as pd
# Download the data
uri = 'https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx'
# Ingest the data
attrition_pd = pd.read_excel(uri)
pd_display(attrition_pd, "Simulated employee churn data")
Let us explore the data characteristics (like datatypes and statistics). We notice there are 35 recorded attributes and 1470 observations of data.
# Print attributes of the data and their data types
pd.DataFrame(
    list(zip(attrition_pd.columns, attrition_pd.dtypes)),
    columns=['Attribute', 'Data Type']).set_index('Attribute')
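The shape and summary statistics of a frame can be checked directly. The following is a minimal sketch on a toy frame; the `toy` data and its values are made up for illustration, and `attrition_pd` would be substituted in practice.

```python
import pandas as pd

# Hypothetical stand-in for attrition_pd; the values are illustrative only
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

# shape is (number of observations, number of attributes)
n_rows, n_cols = toy.shape
# describe() summarizes the numeric columns only (count, mean, std, quartiles)
numeric_summary = toy.describe()
# value_counts() shows how balanced the label is
label_counts = toy["Attrition"].value_counts()

print(n_rows, n_cols)
print(numeric_summary)
print(label_counts)
```

Checking the label balance early is useful because accuracy figures are only meaningful relative to the share of the majority class.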
The data contains
# Here are the data definitions from the second sheet of the Excel spreadsheet
pd.read_excel(uri, sheet_name=1, names=['Attribute', 'Legend']).fillna('').T
Visualize the features to see if they make sense. You are looking for attributes with sensible arity (number of distinct values) and a visible correlation with the attrition outcome.
%matplotlib inline
import math
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(18, 72))
cols = 3
label_col = 'Attrition'
# Draw one distribution plot per column, split by the attrition label
rows = int(math.ceil(float(attrition_pd.shape[1]) / cols))
for i, column in enumerate(attrition_pd.columns):
    if column.lower() == label_col.lower():
        continue
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    if attrition_pd.dtypes[column] == object:
        # Categorical attribute: stacked bar chart of counts per category
        cts = attrition_pd[[label_col, column]]
        cts = cts.groupby([label_col, column]).size()
        cts.unstack().T.plot(kind='bar', ax=ax, stacked=True, alpha=0.5)
    else:
        # Numeric attribute: one histogram per attrition class on shared bins
        cts = attrition_pd[[label_col, column]]
        (xmin, xmax) = (cts[column].min(), cts[column].max())
        cts.groupby(label_col)[column].plot(
            bins=16,
            kind='hist',
            stacked=True,
            alpha=0.5,
            legend=True,
            ax=ax,
            range=[xmin, xmax])
# Display plots
plt.subplots_adjust(hspace=0.7, wspace=0.2)
Looking at the plots and comparing them with our hypotheses, can we confirm that we are heading in the right direction?
Here are notable visual observations we find (we will confirm statistically later) --
# In line with our observations, let us weed out attributes that carry no information
df = attrition_pd.copy()
# Remove ID attributes and the attributes that have no variance
# See observations above for justification
df.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'],
        axis=1, inplace=True)
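Columns like these can also be found programmatically rather than by eye. The sketch below is one possible heuristic, assuming a toy frame whose column names mirror the dataset but whose values are invented: a column with a single distinct value has no variance, and a column with one distinct value per row behaves like an identifier.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],      # constant: no variance
    "Over18": ["Y", "Y", "Y", "Y"],     # constant: no variance
    "EmployeeNumber": [1, 2, 4, 5],     # unique per row: an identifier
    "Age": [41, 49, 37, 41],
})

# Number of distinct values per column
unique_counts = toy.nunique()

# Exactly one distinct value means the column cannot separate any two rows
constant_cols = unique_counts[unique_counts == 1].index.tolist()

# As many distinct values as rows suggests an identifier.
# Caution: a continuous measurement can also be unique per row, so review by hand.
id_like_cols = unique_counts[unique_counts == len(toy)].index.tolist()

cleaned = toy.drop(constant_cols + id_like_cols, axis=1)
print(constant_cols, id_like_cols, cleaned.columns.tolist())
```

The identifier check is only a hint, so the candidates it returns are best confirmed against the data legend before dropping them.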
Our exploration has given us a chance to gauge a few attributes and their characteristics. We have imputed and filtered values as needed.
Of the many attributes that seem to affect the attrition outcome, we do not know which is the most discriminating predictor. To find out, let us build a model...
We will use a simple decision tree to investigate predictor strength. We choose a decision tree because we want our model to be explainable and palatable. Since decision trees handle categorical, ordinal, and continuous data reasonably well, little transformation of the data is needed. sklearn, however, does require categorical variables to be mapped to one-hot encoded numeric columns, e.g. Trump -> [1, 0] and Clinton -> [0, 1], where the first bit indicates Trump and the second bit indicates Clinton. Numeric and ordinal attributes are fine as they are.
# Let us one-hot encode the data (dtype=int yields 0/1 columns)
# The binary label and OverTime are each collapsed into a single 0/1 column
one_hot_encoded = pd.get_dummies(df, dtype=int).drop(
    label_col + '_No',
    axis=1).rename(columns={label_col + '_Yes': label_col}).drop(
        'OverTime_No', axis=1).rename(columns={'OverTime_Yes': 'OverTime'})
# Display the data set
pd_display(one_hot_encoded, "The dataset encoded")
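To make the encoding concrete, here is a small sketch on a toy frame (the rows are invented for illustration). `pd.get_dummies` expands each categorical column into one indicator column per level and leaves numeric columns untouched; for a binary attribute, one of the two indicator columns is redundant and can be dropped.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "Department": ["Sales", "R&D", "Sales"],
    "OverTime": ["Yes", "No", "Yes"],
    "Age": [41, 49, 37],
})

# One indicator column per category level, encoded as integers 0/1
encoded = pd.get_dummies(toy, dtype=int)

# OverTime_No is simply 1 - OverTime_Yes, so keep a single column
encoded = encoded.drop("OverTime_No", axis=1).rename(
    columns={"OverTime_Yes": "OverTime"})

print(encoded)
```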
Let us check whether our 21 observations (hypotheses) stand up. It turns out that some of them do and some do not. Below is a simple correlation plot of the attributes with respect to attrition. As we observe --
# Plot the correlation of every attribute with the attrition label
one_hot_encoded.corr().loc[label_col].drop(label_col).sort_values().plot(kind='barh', figsize=(6, 8))
# Define shortcuts for separating X and Y from pandas dataframe
X_set = lambda df: df.drop([label_col], axis=1)
Y_set = lambda df: df[label_col]
# Create a stratified KFold; each of its 10 folds is a 90-10 train/test split
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=10)
# Use a single fold so that the training and test records do not overlap
train_idx, test_idx = next(
    skf.split(X_set(one_hot_encoded), Y_set(one_hot_encoded)))
# Let us separate the training X and test X
X_train, y_train, X_test, y_test = (
    X_set(one_hot_encoded).iloc[train_idx], Y_set(one_hot_encoded).iloc[train_idx],
    X_set(one_hot_encoded).iloc[test_idx], Y_set(one_hot_encoded).iloc[test_idx])
# Let us preview the data
pd_display(X_train,
           "Training data with the prediction label omitted")
Using the gradient boosted tree (the ensemble method) is...
# Let us use a Gradient Boosted Tree classifier (an ensemble of decision trees) to predict yes/no attrition
from IPython.display import HTML
from sklearn import ensemble
# Classifier params
params = {'n_estimators': 400, 'max_depth': 3}
# Create classifier
clf = ensemble.GradientBoostingClassifier(learning_rate=0.01, **params)
# Train
clf.fit(X_train, y_train)
# Predict on the test set and compute accuracy
acc = clf.score(X_test, y_test)
# Print
HTML("<h3 align='center'>Accuracy with Ensemble Methods is <u>{:.2f}%</u></h3>".format(acc * 100))
If we used a logistic regression model instead...
from sklearn import linear_model, datasets, metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Training Logistic regression
logistic_classifier = linear_model.LogisticRegression(C=100.0)
# Train and predict the outcomes for test set
acc1 = logistic_classifier.fit(X_train, y_train).score(X_test, y_test)
# Print
HTML("<h3 align='center'>Accuracy with Logistic Regression is <u>{:.2f}%</u></h3>".format(acc1 * 100))
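A single train/test split gives one accuracy number per model; averaging over several stratified folds gives a steadier comparison. The sketch below is illustrative only: it uses synthetic data from `make_classification` (not the attrition data), and the class weights, fold count, and hyperparameters are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem standing in for the attrition data
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=0)

# Each fold keeps the class ratio; every record is tested exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=50, max_depth=3, learning_rate=0.1, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# cross_val_score refits a fresh clone of the model on each training fold
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(name, fold_scores.mean(), fold_scores.std())
```

Because the model is refit on each training fold and scored only on the held-out fold, no record is ever used for both fitting and scoring within a fold.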
Which features seem to explain attrition the most?
%matplotlib inline
import seaborn as sns
# Compute feature importance, scaled so the strongest feature is 100%
importances = pd.DataFrame(
    list(zip(X_train.columns.values,
             100 * clf.feature_importances_ / clf.feature_importances_.max())),
    columns=['Feature', 'Importance %']).sort_values(
        ['Importance %'], ascending=[False])
# Chart most important features predicting the attrition outcome
sns.set_palette("Blues")
importances.head(20).plot(kind='barh', x='Feature', y='Importance %');
# Let us see how we predicted. What false positives and false negatives did we yield...
from IPython.display import HTML, display
from sklearn.metrics import confusion_matrix
# Predict the attrition outcomes on the test set
y_pred = clf.predict(X_test)
# Draw the confusion matrix
def plot_confusion_matrix(cm, labels, title='Confusion matrix', cmap=plt.cm.Blues):
    # Show the confusion matrix as an image
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    # Put labels on axis
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels, rotation=45)
    # Pack it together
    plt.tight_layout()
    # Render the DataFrame as a table for easy view
    cmpd = pd.DataFrame(cm, columns=labels)
    cmpd.index = labels
    display(HTML('<b align="center">Confusion matrix</b>'))
    display(cmpd)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
# Labels
labels = [0, 1]
# Compute confusion matrix (rows are actual classes, columns are predicted classes)
cm = confusion_matrix(y_test, y_pred, labels=labels)
# Show the confusion matrix
plt.figure()
plot_confusion_matrix(cm, labels=labels)
plt.show()
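The four cells of a binary confusion matrix translate directly into accuracy, precision, and recall. The following worked sketch uses ten invented labels (not results from the model above) so that the arithmetic can be followed by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions: 1 = leaves, 0 = stays
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of predicted leavers, how many really left
recall = tp / (tp + fn)                     # of real leavers, how many were caught

print(tn, fp, fn, tp)  # 5 1 2 2
print(accuracy, precision, recall)
```

Here accuracy is 0.7 even though only half of the real leavers are caught, which is why recall on the "leaves" class deserves attention when the classes are imbalanced.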
How do we interpret these results?
# What does our decision tree look like...
# Since GBT is an ensemble of many decision trees, we fit a single decision tree just to render it on the screen
from IPython.display import Image
from os import system
from sklearn import tree
dtree = tree.DecisionTreeClassifier(max_depth=3)
dtree = dtree.fit(X_train, y_train)
# Write the tree in Graphviz DOT format
with open("output.dot", "w") as output_file:
    tree.export_graphviz(
        dtree,
        out_file=output_file,
        filled=True,
        rounded=True,
        feature_names=X_train.columns.values,
        class_names=['Stays', 'Leaves'],
        special_characters=True)
# Convert the DOT file to PNG (requires the Graphviz "dot" command-line tool)
system('dot -Tpng -o dtree2.png output.dot')
Image("dtree2.png")
# Double-click the image to enlarge it.
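If Graphviz is not installed, a fitted tree can also be printed as plain-text rules with `sklearn.tree.export_text`. This is a sketch on synthetic data; the feature names below are hypothetical labels attached to random columns and do not reflect the attrition data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the names are placeholders for illustration only
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=0)
feature_names = ["Age", "MonthlyIncome", "TotalWorkingYears", "OverTime"]

# Same depth limit as the rendered tree, to keep the rules readable
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each indented line is a split; each "class:" line is a leaf's majority class
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

Reading a branch from the root to a `class:` line gives the same conjunction of conditions that one would trace by eye on the rendered image.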
What does the decision tree really tell us? Focus on the leaf nodes that show class="Leaves". Each node is assigned the majority ("winner takes all") class, churn or retention, of the training records that reach it.
Traversing the left-most "Leaves" branch suggests that full-time employees who are not research scientists and have less than 1.5 years of experience are about as likely to leave as to stay. The age and profession of the full-time employee base are therefore a concern, warranting better incentives and cross-functional opportunities for young non-research trainees in the organization.
Traversing the other "Leaves" branch suggests that young full-time employees under 34 years of age with an income below $3750 show a higher propensity to leave. This warrants better incentives for younger employees.
Similarly, though not shown here, another rendition of the tree suggested that exempt employees who work overtime and earn lower daily rates leave as well. They seek a better work-reward balance.
All this leads to simple outcomes --
To save employees from churning, offer better rotational opportunities, financial incentives, and contract-to-hire opportunities.
Now that we have a model, one that achieves roughly 90% accuracy, can we predict whether new employees (or current employees as they age) will leave? Yes.
# For the purposes of this demo, let us assume the test data is our scoring data. Ideally, you want to score unseen data...
# Predict the attrition outcome
y_pred = clf.predict(X_test)
emit = X_test.copy()
emit['Really Attrited'] = y_test
emit['Predicted Attrited'] = y_pred
# Show the employees predicted to leave
pd_display(
    emit[emit['Predicted Attrited'] == 1].drop_duplicates(),
    "These are the predicted churn outcomes for the existing employee base. Please reach out to the employees that have a prediction of attrition = 1"
)
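Hard 0/1 predictions hide how confident the model is. A possible refinement, sketched here on synthetic data rather than the attrition set, is to use `predict_proba` and rank employees by their estimated probability of leaving, so that outreach can start with the highest-risk cases. The column name `Attrition Probability` is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the encoded attrition data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=["f{}".format(i) for i in range(5)])

# Fit on the first 150 rows; treat the remaining 50 as unseen scoring data
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X[:150], y[:150])

scored = X[150:].copy()
# predict_proba returns one column per class in model.classes_ order ([0, 1]),
# so column 1 is the estimated probability of class 1 ("leaves")
scored["Attrition Probability"] = model.predict_proba(scored)[:, 1]

# Highest-risk records first
ranked = scored.sort_values("Attrition Probability", ascending=False)
print(ranked.head())
```

A ranked list also lets the business choose its own cut-off instead of relying on the default 0.5 threshold behind `predict`.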
This is a simple attrition analysis model: a purely quantitative model that looks at the attributes of past, look-alike cases of churn.
Survival models that predict outcomes based on prior "activities" (telephone calls, HR escalations, paystub downloads, etc.) also play a critical role in churn analysis. We do not consider them here, but we encourage you to search for nPath analysis (http://blogs.sas.com/content/sascom/2014/08/19/path-analysis-with-sas-visual-analytics/) to learn more about a few of these models.
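For readers new to survival analysis, the sketch below shows one classic building block, the Kaplan-Meier estimator, written by hand in NumPy. It is not the nPath technique mentioned above and is not part of this notebook's model; the tenure values and the function name `kaplan_meier` are invented for illustration. At each time t where someone leaves, the survival estimate is multiplied by (1 - leavers at t / employees still at risk at t); employees who are still employed (censored) count as at risk until their last observed tenure.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return (event times, estimated probability of staying beyond each time)."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    # Only times at which an attrition event was actually observed change the curve
    times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                  # still employed just before t
        events = np.sum((durations == t) & observed)      # left exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return times, np.array(survival)

# Hypothetical tenures in years; observed=1 means the employee left,
# observed=0 means the employee is still employed (censored)
tenure = [1, 2, 2, 3, 4, 5]
left = [1, 1, 0, 1, 0, 1]

times, survival = kaplan_meier(tenure, left)
for t, s in zip(times, survival):
    print(t, round(s, 4))
```

Unlike the classifier above, this estimate uses how long each employee has been observed, which is the temporal information a purely attribute-based model ignores.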