The problem statement and analysis use a Telco Customer Churn data set made available by IBM.
The data set lends itself to understanding customer behavior in order to analyze and predict retention and avoid churn. The data identify customers who left service within the last 30 days. Customer characteristics include demographics, account information, and services. The data set is from the IBM Watson Analytics blog on Using Customer Behavior Data to Improve Customer Retention.
The problem is how to use the data to develop a model that helps understand and predict customer behavior, specifically, a customer’s likelihood of churning, or dropping the service.
The IBM Watson Analytics site provides its own analysis of the problem and identifies customer tenure, contract length, and whether the customer has online security in place as the three main drivers of churn. It concludes that a customer on a month-to-month contract, with fiber optic internet service, tenure of less than six months, multiple lines, and less than $266 in total charges has an 86% probability of churning.
The data set consists of 7043 rows and 21 variables and occupies 1.5 MB of memory. Four features are stored as numbers: SeniorCitizen, tenure, MonthlyCharges, and TotalCharges, though the first is really a categorical flag. Counting SeniorCitizen as categorical, there are 17 categorical features, three true numeric features, and one target, the categorical feature Churn. For Churn, a “Yes” indicates the customer left within the last 30 days. Historically, 28% of customers in the data set churned in the previous month. Two features, TotalCharges and MonthlyCharges, are highly correlated. This is unsurprising, since the former is a function of the latter and there are only three numeric features in total.
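The correlation between TotalCharges and MonthlyCharges can be illustrated with a minimal sketch. It does not load the IBM data: the values below are synthetic, and the relationship "total charges are roughly tenure times the monthly charge" is an assumption made for illustration. The `pearson` helper is a hypothetical name, not something from the accompanying notebook.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic stand-ins for the real columns (all values are made up).
rng = random.Random(0)
tenure = [rng.randint(1, 72) for _ in range(1000)]
monthly_charges = [rng.uniform(20.0, 120.0) for _ in range(1000)]

# Assumption: total charges are roughly tenure times the monthly charge.
total_charges = [t * m for t, m in zip(tenure, monthly_charges)]

r = pearson(monthly_charges, total_charges)
print(f"correlation(MonthlyCharges, TotalCharges) = {r:.2f}")
```

Even though tenure and the monthly charge are drawn independently here, the derived total is clearly positively correlated with the monthly charge, which is the effect described above.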
The best solution developed in this analysis is the logistic regression model, with an 82% accuracy score. The logistic regression model is built out in the annotated code accompanying this problem statement.
The decision tree analysis achieved 79% accuracy. The random forest ensemble method also delivered 79% accuracy. By contrast, a guess that no customers would ever churn would have an accuracy score of 76%. So the logistic regression model improved on that naive guess by six percentage points, while the two tree-based models improved on it by three.
The benchmark model in this analysis is the naive approach of guessing no customers would churn, which yields 76% accuracy.
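The naive benchmark amounts to always predicting the most common class. A minimal sketch follows; the function name `majority_baseline_accuracy` is hypothetical, and the labels are made up so that 76 of 100 customers did not churn, mirroring the 76% figure above rather than reproducing the real test split.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Made-up labels: 76 customers who stayed and 24 who churned.
labels = ["No"] * 76 + ["Yes"] * 24

baseline_label, baseline_accuracy = majority_baseline_accuracy(labels)
print(f"always predict {baseline_label!r}: accuracy = {baseline_accuracy:.0%}")
```

Any model worth keeping has to beat this number, since it can be reached without looking at a single feature.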
The performance metric is a simple measure of accuracy (the percentage of instances in which the model correctly predicted the churn status of customers in the testing data set). Confusion matrices are also used to indicate the rate of false positives and false negatives.
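As a rough illustration of how accuracy relates to the confusion matrix, the sketch below counts the four outcomes by hand. The function names and the tiny label lists are hypothetical and are not taken from the accompanying notebook, which may compute these quantities with a library instead.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives, treating `positive` as churn."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that match the actual churn status."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Made-up example: five customers, two of whom actually churned.
actual = ["Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "Yes", "No", "No"]

counts = confusion_counts(actual, predicted)
print(counts)
print(f"accuracy = {accuracy(counts):.2f}")
```

Here a false positive is a customer flagged as churning who stayed, and a false negative is a churner the model missed. Accuracy alone hides that split, which is why the confusion matrix is reported alongside it.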
The Markdown Notebook containing the annotated code for the analysis can be found at this link. The raw code can be found on a GitHub repository.