How effective are Starbucks app offers?

Dominik Sieber
16 min read · May 24, 2021
https://images.unsplash.com/photo-1496379896897-7b57622f431b

Overview

This blog post summarizes the results of my Udacity Data Science nanodegree capstone project. To complete the nanodegree, we obtained a data set for the Starbucks rewards mobile app which contains simulated data mimicking customer behaviour. From time to time, Starbucks sends out an offer to a limited number of potential customers: an advertisement for a drink or an actual offer such as a discount or a BOGO (buy one, get one free).

Starbucks does not send out the same offer to all users, which is a challenge for this data set.

One task of this capstone project is to combine transaction, demographic and offer data. With this combined data set we are able to see which demographic group responds best to which offer. The data is simplified in that it does not contain any information about the actual product purchased.

In the Starbucks app, each offer is valid for a limited period of time before it expires. The transactional data shows the purchases each customer made via the app, including the purchase time and the actual money spent. In addition, for each offer the data records when a user receives it, when a user views it, and when a user completes it.

Problem statement

The major objective of this blog post is to analyze the data sets provided by Udacity/Starbucks. We want to learn from the data whether we can predict, based on demographic data, if a person is willing to complete an offer, i.e. whether he/she will make a purchase at Starbucks when given an offer.

We plan to answer the following questions:

  • What are the persona clusters among the customers in the data set?
  • Can we predict in general whether a customer takes an offer, and to which offer type (bogo or discount) he/she is more responsive?
  • Which characteristics have the strongest influence on the prediction?

Strategy

In order to achieve this, we apply CRISP-DM (Cross Industry Standard Process for Data Mining), which involves the following steps:

  • Business Understanding — we have raised the three key questions above about the Starbucks data set
  • Data Understanding — we have explored the three data sets provided by Starbucks/Udacity
  • Data Preparation — we have cleaned the data by dropping features, by imputing values, and by converting categorical into numerical data from the original data set
  • Modeling — we have modelled the clustering and prediction questions with a k-means algorithm and a random forest classifier
  • Evaluation — we have evaluated the resulting models. For the k-means model, we have used the silhouette score; for the random forest classifier, we have used the accuracy score
  • Deployment — we have discussed the findings as a final report in this medium blog post

The modelling step is twofold:

  • For the clustering of the customers into personas, we employ a k-means algorithm. This is a common approach to clustering problems. In order to find the number of clusters, we apply a hyper-parameter tuning approach
  • For the prediction of the willingness of a customer to take an offer, we employ a RandomForestClassifier, where we apply a hyper-parameter tuning approach to find optimal parameters for the machine learning model.

Metrics for Validation

In order to validate the model we use the following metrics:

Silhouette score

For the k-means algorithm we employ the silhouette score. The silhouette score is useful in this case since we don’t have labeled data for our data set. The higher the score, the better defined the clusters produced by the model. This gives an intuitive measure of cluster quality.

More information on the silhouette score can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score
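As a small illustration (using made-up two-dimensional points rather than the Starbucks data), the silhouette score can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated toy blobs; made-up data, not the Starbucks set.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Ranges from -1 (wrong assignment) to +1 (dense, well-separated clusters).
score = silhouette_score(X, labels)
print(round(score, 3))
```

For blobs as clearly separated as these, the score comes out close to 1.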

Accuracy score

For a classification algorithm, the accuracy score is the most intuitive metric. It gives the percentage of predicted labels for a sample that exactly match the actual labels.

More information on the accuracy score can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
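A minimal sketch with hypothetical labels (1 = offer completed, 0 = not completed; the values are made up):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five customers.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Fraction of predictions that exactly match the true labels: 4 of 5 here.
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.8
```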

Data Exploration and Visualization

In the following section, we explore the data. We inspect the content of the data sets and compute the ratio of missing values in each column. The data provided by Udacity/Starbucks consists of three files in total:

  • portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) — offer id
  • offer_type (string) — type of offer ie BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

profile.json

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript.json

  • event (str) — record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record
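As a sketch of how such line-delimited JSON files can be loaded with pandas, here is a tiny inline sample standing in for profile.json (the sample rows are made up; in practice one would pass the file path to pd.read_json instead):

```python
import io
import pandas as pd

# Each file is line-delimited JSON; this made-up sample mimics profile.json.
sample = io.StringIO(
    '{"gender": "F", "age": 55, "id": "abc", "became_member_on": 20170715, "income": 112000.0}\n'
    '{"gender": null, "age": 118, "id": "def", "became_member_on": 20180712, "income": null}\n'
)
profile = pd.read_json(sample, orient='records', lines=True)

print(profile.shape)          # number of rows and columns
print(profile.isna().mean())  # ratio of missing values per column
```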

Data Cleaning and Preprocessing

In this section we start to clean the data. Data cleaning consists of dropping features, of imputing values, and of converting categorical into numerical data from the original data set. We do this for each data set separately.

Portfolio Data Set

The portfolio data set contains offer ids and meta data about each offer (duration, type, etc.).

Cleaning

We clean the portfolio data set as follows: the channels and the offer_type are converted from categorical data into numerical data. Furthermore, we rename id and reward to offer_id and portfolio_reward, respectively, in order to avoid ambiguity, since those names also appear in the other data sets.
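This cleaning step can be sketched as follows, using two made-up portfolio rows (the variable names and sample values are assumptions for illustration, not the actual project code):

```python
import pandas as pd

# Minimal made-up portfolio rows mirroring the schema above.
portfolio = pd.DataFrame({
    'id': ['o1', 'o2'],
    'offer_type': ['bogo', 'informational'],
    'difficulty': [5, 0],
    'reward': [5, 0],
    'duration': [7, 3],
    'channels': [['email', 'web'], ['email', 'social']],
})

# One-hot encode the channels list and the offer type.
channel_dummies = portfolio['channels'].str.join('|').str.get_dummies()
type_dummies = pd.get_dummies(portfolio['offer_type'], prefix='offer_type')

# Drop the categorical originals and rename the ambiguous columns.
clean = (portfolio.drop(columns=['channels', 'offer_type'])
         .join([channel_dummies, type_dummies])
         .rename(columns={'id': 'offer_id', 'reward': 'portfolio_reward'}))

print(clean.columns.tolist())
```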

Visualisation

In this section, we visualize the results from the cleaned portfolio data set. This gives us some more insights into the data. We start by plotting the counts of offer types. As depicted in Fig. 1, there are four offers each of type bogo and discount, while only two are informational.

Next, we look at the different channel types. As Fig. 2 shows, email is the most frequently used channel, while social media is the least frequently used.

Profile Data Set

The profile data set contains demographic data for each customer such as age, gender, and the tenure of the Starbucks membership. This data set gives us deep insights into the customers of Starbucks.

Cleaning

We clean the profile data set by first converting the became_member_on column to a datetime format. With this date we can compute the number of days each customer has been a member. We also rename id to customer_id, since id is too generic. We add an additional feature to the data set which states whether data is missing. Finally, we one-hot encode the gender and fill all missing data with mean values.
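The steps above can be sketched like this; the sample rows and the reference date are made up for illustration (in the real data, age 118 marks an unknown age, as discussed below):

```python
import pandas as pd

# Made-up sample rows; in the real data, age 118 marks an unknown age.
profile = pd.DataFrame({
    'id': ['c1', 'c2', 'c3'],
    'gender': ['F', None, 'M'],
    'age': [55, 118, 40],
    'income': [72000.0, None, 60000.0],
    'became_member_on': [20170715, 20180712, 20160101],
})

# Membership tenure in days, relative to an assumed reference date.
joined = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')
profile['member_since_in_days'] = (pd.Timestamp('2021-05-24') - joined).dt.days

# Flag customers with missing demographics, rename, encode, and impute.
profile['missing_data'] = profile['gender'].isna().astype(int)
profile['age'] = profile['age'].where(profile['age'] != 118)  # 118 -> NaN
profile = profile.rename(columns={'id': 'customer_id'})
profile = profile.join(pd.get_dummies(profile['gender'], prefix='gender'))
for col in ['age', 'income']:
    profile[col] = profile[col].fillna(profile[col].mean())

print(profile[['customer_id', 'member_since_in_days', 'missing_data', 'age']])
```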

Visualisation

We are now ready to visualize the demographic data in our data set. This will help us to obtain a deeper insight into our data.

The bar chart of the gender data is depicted in Fig. 3. We have more males than females in the data. There is also a notable number of people who selected ‘O’ for other. Note that other does not denote members with missing data.

When plotting the age column in Fig. 4, we can see that the average is a little below 60 years. Note that many people are marked as 118 years old. We assume that 118 is used to label an unknown age, which is why we dropped this value in the cleaned data frame.

We plot the distribution of the income for all users in Fig. 5. Note that the average is around 70,000. Only few users earn more than 100,000.

The following histogram in Fig. 6 shows how long each customer has been a member. This plot is particularly interesting for two reasons:

  • the longest-tenured member has been active for around 1800 days
  • there are two edges where many new members became active: one around 300 days, the other around 1100 days.

The bar chart in Fig. 7 shows the amount of missing data. Missing data means that the customer provided neither gender, nor age, nor income; we have labeled such records accordingly. We can clearly see that most customers provided their data, but the number of customers with missing data is non-negligible.

Transcript Data Set

The transcript data set keeps the records for transactions, offers received, offers viewed, and offers completed. This is particularly interesting since it captures the reaction of a customer to an offer.

Cleaning

We clean the data set as follows: the person column is renamed to customer_id, since person is too generic, and in order to be aligned with the other data sets. Furthermore, we expand the value column, which is a dict, into separate columns. Hence, we obtain additional columns for event_offer_completed, event_offer_received, event_offer_viewed, and transaction:
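A minimal sketch of this cleaning step on made-up events (in the real transcript the value dicts carry more keys, and the key spelling for the offer id is not fully consistent across event types):

```python
import pandas as pd

# Made-up sample events mirroring the transcript schema.
transcript = pd.DataFrame({
    'person': ['c1', 'c1', 'c1'],
    'event': ['offer received', 'offer viewed', 'transaction'],
    'time': [0, 6, 12],
    'value': [{'offer id': 'o1'}, {'offer id': 'o1'}, {'amount': 9.5}],
})

# Expand the value dict into separate columns and one-hot encode the event.
values = pd.json_normalize(transcript['value'])
events = pd.get_dummies(transcript['event'].str.replace(' ', '_'), prefix='event')

clean = (transcript.drop(columns=['value', 'event'])
         .rename(columns={'person': 'customer_id'})
         .join([values, events]))

print(clean.columns.tolist())
```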

Visualization

We are now ready to visualize the insights of the transcript data. This is especially helpful for the later modeling of our problem statements.

First, we plot the histogram of the transcript rewards in Fig. 8. We can observe that the rewards take only a few discrete values, i.e. 2, 3, 5, and 10, of which 5 is the most common.

When we plot the amount of money spent in a transaction in Fig. 9, we can see that most customers spend less than 5 per transaction, and no customer spends more than 50.

Fig. 10 shows the time in the transcript as a histogram. We can see five peaks, roughly around 150, 330, 400, 500, and 580.

We plot the occurrences of different events in the transcript in Fig. 11. We can see that transaction is the most frequently occurring event in the transcript data set. Offer received, offer viewed, and offer completed are less common. This is expected since not all customers received an offer, and the offer events must follow a fixed sequence: first the customer receives an offer, then he/she views it, and finally he/she completes it. At each of these steps, some customers decide not to take the next step.

Model Building and Evaluation

We are now ready to build the machine learning models. The models are used to cluster and classify the data into groups. The goal of the clustering task is then to group the customers into persona groups based on the demographic data.
Based on the clustering result we implement a classification pipeline which predicts the willingness of a specific customer to take the offer.

Clustering into personas

Our first goal is to cluster the customers into categories. We use the profile data for this, i.e. we use the following properties of the customer as features:

  • age
  • income
  • tenure of membership
  • gender
  • missing data

We then employ a clustering algorithm based on this subset.

Model implementation

In order to cluster the customers into groups, we first employ a robust scaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html). Robust scalers are robust to outliers and standardize all the input data. After the data is scaled, we employ a k-means algorithm (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). This algorithm groups the data into n subgroups of equal variance by minimizing a criterion known as the within-cluster sum of squares. For k-means, we have to specify the number of clusters a priori. The algorithm is widely used in different domains since it scales well to large amounts of data.

We then combine the scaler and the estimator into a pipeline and apply a hyperparameter tuning approach with GridSearchCV. We decided to tune the following parameters:

  • n_clusters: the number of clusters to form
  • max_iter: maximum number of iterations of the k-means algorithm
  • tol: relative tolerance with regards to Frobenius norm of the difference in the cluster centers
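This tuning setup can be sketched on synthetic blob data as follows; the silhouette_scorer helper is an assumption for wiring the metric into GridSearchCV, not necessarily the author's actual implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Three synthetic, well-separated blobs stand in for the profile features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, size=(50, 3)) for loc in (0, 5, 10)])
X = X[rng.permutation(len(X))]  # shuffle so CV folds mix the blobs

# GridSearchCV accepts a callable scorer; here we score a fitted pipeline
# by the silhouette of the cluster labels it predicts for the held-out fold.
def silhouette_scorer(estimator, X_fold, y=None):
    return silhouette_score(X_fold, estimator.predict(X_fold))

pipeline = Pipeline([
    ('scaler', RobustScaler()),
    ('kmeans', KMeans(n_init=10, random_state=42)),
])

param_grid = {
    'kmeans__n_clusters': [2, 3, 4, 5, 6],
    'kmeans__max_iter': [100, 300],
    'kmeans__tol': [1e-4, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, scoring=silhouette_scorer, cv=3)
search.fit(X)
print(search.best_params_['kmeans__n_clusters'])
```

On this toy data the search recovers the three generated blobs; on the real profile data the author's tuning selected six clusters, as described below.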

We employ the silhouette score to assess the result. This is quite useful here as we noted before. Furthermore, we also assessed Calinski Harabasz Score and Davies Bouldin Score for more information.

More information on clustering performance analysis can be found at https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation.

As a result, we found that 6 is the optimal number of clusters. Hence, we are now ready to create scatter plots of the clusters in Fig. 12. Note that we plot the following data for each gender (male, female, other, unknown):

  • income over age
  • member_since_in_days over age
  • member_since_in_days over income

As a result, we have revealed the following clusters, which can be briefly described as:
  • Cluster 0: Low income, low age, new member
  • Cluster 1: Female, low income, high age, new member
  • Cluster 2: High income, high age, new member
  • Cluster 3: Male, low income, high age, new member
  • Cluster 4: Old members
  • Cluster 5: Members with unknown data

We are now ready to merge the three preprocessed data sets (portfolio, profile, transcript) into a single one. We merge transcript and profile based on customer_id. Then, we merge the portfolio into the result based on offer_id.
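The merge can be sketched with tiny made-up stand-ins for the three cleaned data sets:

```python
import pandas as pd

# Tiny made-up stand-ins for the three cleaned data sets.
transcript = pd.DataFrame({'customer_id': ['c1', 'c1'],
                           'offer_id': ['o1', None],
                           'time': [0, 12]})
profile = pd.DataFrame({'customer_id': ['c1'],
                        'age': [55], 'income': [72000.0]})
portfolio = pd.DataFrame({'offer_id': ['o1'],
                          'portfolio_reward': [5], 'duration': [7]})

# Left joins keep every transcript row, even pure transactions
# that are not linked to any offer.
merged = (transcript
          .merge(profile, on='customer_id', how='left')
          .merge(portfolio, on='offer_id', how='left'))

print(merged.shape)
```

Rows without an offer id simply receive NaN for the portfolio columns, which preserves plain transactions in the merged set.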

With the merged data set, we can now have a look at the events with respect to the offer type in Fig. 13. The common sequence for the offer events is: first received, then viewed, and finally completed. From the data we can clearly see that discount and bogo offers were received equally often. The bogo offers were viewed more often, but in the end discount offers were completed more often by the customers.

From a different perspective, we can look at each offer type and inspect the offer events, as depicted in Fig. 14. We can see that for each offer type the count decreases from received to viewed, which is clearly expected. We can also see that for discount offers the drop from received to viewed is much larger, while for the bogo offer type the strongest decrease is from viewed to completed.

We now do the same for each cluster in Fig 15. Here, three things are noteworthy for the reader:

  • Cluster 2 is very responsive to offers, no matter whether the offer type is discount or bogo.
  • Cluster 0, 3, 5 are not really responsive to offers.
  • None of the clusters is more responsive to bogo than to discount.

In Fig. 16 we plot the histogram for the amount of money spent. We do this once for the group that didn’t receive any offer and once for the group that used an offer before buying. We can see that with an offer the amount of money spent is larger. Without an offer, people tend to spend less money, as can be seen by the large proportion between 0 and 10. With an offer, the average amount spent is somewhere around 15, which is clearly more than without.

We do the same for each group of cluster as we can see in Fig. 17. We can summarize the remarkable results as follows:

  • For clusters 1 and 2, the amounts spent are almost the same, no matter whether an offer is used.
  • For clusters 0, 3, 4, and 5, the average amount spent increases when we compare purchases with and without an offer.

We are now ready to visualize the correlation between the different features in Fig. 18. We are dropping the event_offers, the amount, the transcript reward, and some events (received, viewed, transaction).

We can see that age correlates with income. Note also that females tend to be older and have a higher income, since gender_F has a positive coefficient with both income and age. The features from the portfolio data are also more strongly correlated with each other.

For each cluster, the correlation is depicted in the following figure. Note that we didn’t find any interesting correlations here.

Predicting if customer takes offer

Our second goal is to predict, or rather classify, whether a customer is willing to take an offer. We use the event_offer_completed column as the target and all the remaining columns as features. Then we split this set into training and test data.

We then employ a classification algorithm here as follows.

Model implementation

In order to predict whether a customer takes an offer, we first transform the customer id with a column transformer (https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) using an ordinal encoder. This encodes the uuid of each customer into a single numerical value that can be used by the algorithm. Afterwards, we scale the data with a standard scaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) so that it can be used with the classifier. As a classifier, we use the RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
With all these estimators we create a pipeline whose hyperparameters are tuned with GridSearchCV. We use the following parameters to tune the pipeline:

  • n_estimators: the number of estimators
  • bootstrap: if bootstrap samples are used when building trees
  • criterion: quality of a split: gini for the Gini impurity and entropy for the information gain
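The pipeline described above can be sketched on synthetic data as follows; the column names and the data-generating rule are made up, and handle_unknown is an added assumption so that customer ids unseen in a fold don't break the ordinal encoder:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for the merged data set: the target is whether an
# offer was completed, here made (noisily) dependent on income.
rng = np.random.default_rng(0)
n = 300
data = pd.DataFrame({
    'customer_id': rng.choice(['c%d' % i for i in range(50)], size=n),
    'age': rng.integers(18, 90, size=n),
    'income': rng.normal(70000, 15000, size=n),
})
y = (data['income'] + rng.normal(0, 5000, size=n) > 70000).astype(int)

preprocess = ColumnTransformer([
    ('ids', OrdinalEncoder(handle_unknown='use_encoded_value',
                           unknown_value=-1), ['customer_id']),
    ('num', StandardScaler(), ['age', 'income']),
])

pipeline = Pipeline([
    ('prep', preprocess),
    ('forest', RandomForestClassifier(random_state=42)),
])

param_grid = {
    'forest__n_estimators': [50, 100],
    'forest__bootstrap': [True, False],
    'forest__criterion': ['gini', 'entropy'],
}

X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.25, random_state=42)

search = GridSearchCV(pipeline, param_grid, scoring='accuracy', cv=3)
search.fit(X_train, y_train)
acc = search.score(X_test, y_test)
print(round(acc, 3))
```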

We employ the accuracy score to assess the result. This is quite useful here, as we noted before. In addition, we rank the features which are most relevant for predicting whether a customer is willing to take the offer. Here, we employ the feature importances of the forest.

More information on accuracy score analysis can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html. The creation of the feature importance is inspired by https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity
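A minimal sketch of ranking features by the forest's mean-decrease-in-impurity importances, on made-up data where only income drives the target:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; only 'income' actually determines the made-up target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'age': rng.integers(18, 90, size=200),
    'income': rng.normal(70000, 15000, size=200),
    'member_since_in_days': rng.integers(0, 1800, size=200),
})
y = (X['income'] > 70000).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; sort to rank the most relevant features first.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```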

For the overall data, we can see that the most important feature is the augmented one called missing_data. This is followed by portfolio_reward, difficulty, and duration. In particular, the gender and the channels have a minor influence on the decision.

We can see that the accuracy on the test data lies between 0.77 and 0.95, where the model performs best for cluster 5 (the one with missing data) and worst for cluster 2.

The top 3 features for the clusters are the following:

  • cluster 0: missing_data, member_since_in_days, income
  • cluster 1: missing_data, income, member_since_in_days
  • cluster 2: missing_data, income, member_since_in_days
  • cluster 3: missing_data, income, member_since_in_days
  • cluster 4: missing_data, member_since_in_days, income
  • cluster 5: missing_data, portfolio_reward, difficulty

Conclusion

The overall accuracy score of 0.95 on the test data set is quite high. This means that we have a good machine learning model for predicting whether a customer takes an offer or not. Despite this good result, we are concerned that the result mostly depends on the newly established meta-feature which we called missing_data. Initially, we would have expected the classification to be more strongly coupled to the demographic data such as age, income, or gender.

The approach we chose is general and can be easily adapted to new methods like new classifiers or pipelines in the future.

Reflection

The jupyter notebook in my repository https://github.com/domsieb/dsnd_capstone_starbucks and this blog post are the major outputs of my Udacity data science nanodegree. In order to finish the nanodegree, the task was to inspect and analyze the Starbucks data provided for this capstone project. In general, I found the capstone project very interesting, but also rather difficult at times.

The two most difficult things for me during this capstone project were the following:

  • what the key problems for the data set are and how to formulate them
  • after obtaining an intermediate result with the clustering algorithm, I was uncertain how to continue with the clusters in the next step, since the correlations didn’t show any significant difference between the clusters.

Improvement

For future improvement, we suggest employing different classification approaches in order to obtain a better and more accurate prediction. In addition, we suggest creating a web app where one can type in the customer details and then, based on the prediction model, decide whether to send him/her an offer.
