Starbucks Capstone Challenge

Data Scientist NanoDegree Program

Nima Ghasemitafreshi
11 min read · Feb 27, 2021

Definition

Project Overview

We have a simulated dataset that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offers during certain weeks.

What we are trying to do is to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type. In addition to that, we will use machine learning algorithms to provide a model that predicts how many of the given offers each user would complete.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. Informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

Not all users receive the same offer, and that is the challenge to solve with this data set.

Problem Statement:

The goal here is to:

  • Inspect and process the data, transforming it into more interpretable formats
  • Create an offer-user dataset for accessing offer-responder statistics
  • Visualize and determine which demographic group responds best to which offer type
  • Preprocess and merge all the data we have for each user, preparing a complete and clean user-offer dataset for model training
  • Choose a machine learning method and build a model to predict the number of responses by each user to the given offers

Metrics

For our model, we can use micro-averaged and macro-averaged precision/recall as our metrics, since they are well suited to a multiclass classifier:

Here, everything is done per label. For each label, the metrics are computed, and then they are aggregated. In other words, you compute the precision/recall for each label over the entire dataset, just as in binary classification (since each label has a binary assignment), and then aggregate the results.

This is simply the multiclass extension of the standard binary-classification metrics.

TP_j, FP_j, TN_j, and FN_j are the true positive, false positive, true negative, and false negative counts, respectively, for the j-th label alone, and B stands for any confusion-matrix-based metric, such as precision or recall.
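Since the formula images are not reproduced here, the standard definitions can be written out directly, with q denoting the number of labels:

```latex
% Macro-average: compute the metric B per label, then average over the q labels
B_{\mathrm{macro}} = \frac{1}{q} \sum_{j=1}^{q} B\!\left(TP_j,\, FP_j,\, TN_j,\, FN_j\right)

% Micro-average: pool the per-label counts first, then compute the metric once
B_{\mathrm{micro}} = B\!\left(\sum_{j=1}^{q} TP_j,\, \sum_{j=1}^{q} FP_j,\, \sum_{j=1}^{q} TN_j,\, \sum_{j=1}^{q} FN_j\right)
```

Micro-averaging weights every prediction equally (so frequent labels dominate), while macro-averaging weights every label equally (so rare labels count as much as common ones).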

Analysis

Data Exploration

The data is contained in three files:

  • portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio

  • id (str) — offer id
  • offer_type (str) — type of offer, i.e. BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for the offer to be open, in days
  • channels (list of strings) — channels the offer is sent through (web, email, mobile, social)

profile

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since the start of the test; the data begins at time t=0
  • value (dict of strings) — either an offer id or a transaction amount, depending on the record

Exploring the ‘profile’ dataset:

Checking the distribution of age, gender, and income:
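A minimal sketch of how these distributions can be plotted, assuming the line-delimited JSON layout of the standard project files; the path and figure styling are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the demographic data (line-delimited JSON, as shipped with the project)
profile = pd.read_json('data/profile.json', orient='records', lines=True)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Age distribution (in this dataset, age 118 typically encodes a missing value)
profile['age'].hist(bins=30, ax=axes[0])
axes[0].set_title('Age')

# Gender counts, including the 'O' (other) category and missing entries
profile['gender'].value_counts(dropna=False).plot(kind='bar', ax=axes[1])
axes[1].set_title('Gender')

# Income distribution
profile['income'].hist(bins=30, ax=axes[2])
axes[2].set_title('Income')

plt.tight_layout()
plt.show()
```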

Processing and Cleaning ‘transcript’ dataset:

With a quick look, we can easily see that the most challenging dataset to clean and process is ‘transcript’. As you can see below, this dataset contains multi-label columns, dictionary values, and mixed-type data.

Transcript data set

Clearly, we have to clean this dataset before doing further analysis and processing. First, we need to unpack the dictionaries in the ‘value’ column and stack them into separate columns. After that, we can divide the table into three separate tables (offers, transactions, and rewards), each with a different type of ‘value’ column.

Three analyzable datasets derived from ‘transcript’
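A sketch of this unpacking and splitting, assuming transcript is the raw DataFrame; note that in this dataset the dictionary keys are slightly inconsistent (‘offer id’ for received/viewed events, ‘offer_id’ for completed ones), which has to be reconciled:

```python
import pandas as pd

transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

# Expand the 'value' dictionaries into their own columns
values = pd.json_normalize(transcript['value'].tolist())
transcript = pd.concat([transcript.drop(columns='value'), values], axis=1)

# Reconcile the inconsistent offer-id keys into a single column
transcript['offer_id'] = transcript['offer_id'].fillna(transcript['offer id'])
transcript = transcript.drop(columns='offer id')

# Divide into the three analyzable tables described above
offer_events = ['offer received', 'offer viewed', 'offer completed']
offers = transcript[transcript['event'].isin(offer_events)]
transactions = transcript.loc[transcript['event'] == 'transaction',
                              ['person', 'time', 'amount']]
rewards = transcript.loc[transcript['event'] == 'offer completed',
                         ['person', 'time', 'offer_id', 'reward']]
```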

Determine which demographic group responds best to which offer type:

To get a better perspective on the characteristics of responders to each offer, we need to create a dataset grouped by the unique offer ids, containing detailed statistics about the responders to each offer.

For this matter, we can merge our profile data (which contains all the details about each user) with the transcript_offer_id dataset which we processed in the last section. By grouping this dataset by the offer ids, and making some changes to the date of birth and age columns, we can analyze the demographic groups more easily.

Using this dataset, I defined a function that takes (‘offer_person’, ‘offer_id’) as arguments and returns a data frame of the age categories and genders of the responders to the given offer id, together with their percentages of the whole sample. Now we can iterate through each offer id and get the demographic details about the responders to that offer.
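A rough sketch of what such a function could look like; the frame and column names here (offer_person, age_group) are hypothetical stand-ins for the ones in the repository:

```python
def offer_demographics(offer_person, offer_id):
    """Age-group/gender breakdown of the responders to one offer,
    expressed as percentages of that offer's responders (a sketch,
    not the original implementation)."""
    responders = offer_person[offer_person['offer_id'] == offer_id]
    counts = responders.groupby(['age_group', 'gender']).size()
    return (100 * counts / len(responders)).rename('percent').reset_index()

# Collect the demographic breakdown for every offer id
stats = {oid: offer_demographics(offer_person, oid)
         for oid in offer_person['offer_id'].unique()}
```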

As an example, we can take a look at one of the graphs demonstrating the statistics on “2906b810c7d4411798c6938adc9daaa5”:

And here is another graph showing details on “fafdcd668e3743c1bb461111dcafc2a4”:

A quick comparison between the two graphs shows that among male users in the 20–30 age group, about 20% responded to the first offer and only 10% responded to the second. Looking at the ‘channels’ data for these two offers, we notice that the first offer used social media in addition to the second offer’s channels, which could be one of the reasons for the gap, since younger users are more active on social media. In conclusion, for this group it is best to send more of the first offer and less of the second. (You can find the graphs for all offer ids in the image folder of my GitHub repository.)

We noticed that there were only 8 unique offer ids in the offer_person dataset. After investigating the portfolio and offer_person datasets, we can easily see that the 2 missing offers are ‘informational’ offers, for which there are no ‘offer completed’ events.

Preprocess and Building a Model

Non-engaging users challenge:

One of the main challenges of this dataset is accounting for users who will make purchases even if they don’t receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn’t want to send them a “spend 10 dollars, get 2 dollars off” offer.

To filter these users out of our final dataset, I defined a get_times() function that takes (person_id, df=transcript_offer_id) as arguments and returns view_time_mean and action_time_mean lists for each person id, which we can then add to our person_offer dataset.

The most challenging part of this function was dealing with NaN and negative values when calculating the means, which I resolved with some try-except blocks.

The reason I did this is that the raw time differences make it easy to point out offers with a negative difference. These are offers completed before being viewed, so the data related to them, such as the transactions, could be misleading for model training.

Moreover, some offers are completed without ever being viewed. The data related to these records also has to be removed.
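A simplified sketch of the idea behind get_times(); this is a loose reconstruction, not the original code, and it returns per-person means rather than the raw lists:

```python
import numpy as np

def get_times(person_id, df):
    """Mean hours from receipt to viewing, and from viewing to completion,
    across one person's offers. NaN deltas mark never-viewed offers;
    negative action deltas mark offers completed before viewing."""
    rows = df[df['person'] == person_id]
    view_deltas, action_deltas = [], []
    for offer_id in rows['offer_id'].dropna().unique():
        events = rows[rows['offer_id'] == offer_id]
        received = events.loc[events['event'] == 'offer received', 'time'].min()
        viewed = events.loc[events['event'] == 'offer viewed', 'time'].min()
        completed = events.loc[events['event'] == 'offer completed', 'time'].min()
        view_deltas.append(viewed - received)
        action_deltas.append(completed - viewed)

    def safe_mean(deltas):
        # Drop NaNs before averaging; return NaN if nothing valid remains
        valid = [d for d in deltas if not np.isnan(d)]
        return float(np.mean(valid)) if valid else np.nan

    return safe_mean(view_deltas), safe_mean(action_deltas)
```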

During the cleaning process, we should keep track of what portion of the data we are deleting; it should not be more than 25% of the whole dataset.

Cleaning the final dataset and preparing for modeling

We want to create a model that predicts how many offers a user would complete, given the type of offer, details of the user, and other metrics. So we should finalize the data cleaning, encode all categorical values, and prepare for modeling.

Our data cleaning and transformation steps were as follows (a condensed sketch of a few of them appears after the list):

  • Merging all three datasets extracted and transformed from the transcript data.
  • Adding the profile data on users to the main dataset by merging on ‘person’.
  • Converting the ‘gender’ column to dummy variables and adding an ‘unknown’ column for missing values.
  • Imputing the mean for missing ‘age’ values, since those rows contain valuable information regarding offers, transactions, and rewards.
  • Imputing the mean for missing ‘income’ values, for the same reason.
  • Imputing zeros for missing ‘offer received’, ‘offer viewed’, ‘offer completed’, ‘reward received’, and ‘transaction amount’ values, because a missing value literally means zero counts or amounts under these categories for that user.
  • Converting ‘became_member_on’ values to timestamps, calculating years of membership, and replacing the old column with it.
  • Dropping all rows where the number of completed offers exceeds the number of received offers, since these users have obviously completed offers they had not even received.
  • Adding the number of each offer id received by each user to our data, and imputing zeros for missing values, since a missing value literally means no offer was received for that user-offer combination.
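A condensed sketch of a few of these steps; user_offer is a hypothetical name for the merged dataset, and the reference date for membership years is an assumption:

```python
import pandas as pd

# Gender dummies, with an explicit 'unknown' category for missing values
# (the resulting columns can then be renamed to gender_male, gender_other, ...)
user_offer['gender'] = user_offer['gender'].fillna('unknown')
user_offer = pd.get_dummies(user_offer, columns=['gender'], prefix='gender')

# Mean imputation keeps rows whose offer/transaction data is still valuable
for col in ['age', 'income']:
    user_offer[col] = user_offer[col].fillna(user_offer[col].mean())

# Zero imputation: a missing count or amount literally means "none recorded"
zero_cols = ['offer received', 'offer viewed', 'offer completed',
             'reward_received', 'transaction_amount']
user_offer[zero_cols] = user_offer[zero_cols].fillna(0)

# Replace the membership date with years of membership
became = pd.to_datetime(user_offer['became_member_on'].astype(str), format='%Y%m%d')
reference = pd.Timestamp('2021-01-01')  # assumption: any fixed reference date works
user_offer['years_of_membership'] = (reference - became).dt.days / 365.25
user_offer = user_offer.drop(columns='became_member_on')

# Drop inconsistent rows: more completions than receipts
user_offer = user_offer[user_offer['offer completed'] <= user_offer['offer received']]
```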

Finalize the data cleaning by ordering all the features as:

  • ‘person’
  • all offer id columns
  • ‘gender_male’
  • ‘gender_other’
  • ‘gender_unknown’
  • ‘age’
  • ‘income’
  • ‘years_of_membership’
  • ‘offer received’
  • ‘offer viewed’
  • ‘transaction_amount’
  • ‘reward_received’
  • ‘offer completed’

Checking for Outliers:

To keep model training smooth, we have to remove outliers as much as we can. The column most likely to contain outliers is ‘transaction_amount’:

It is obvious that the ‘transaction_amount’ column has considerable outliers, which need some treatment.

We can drop the top 3% of the data (everything above the 97th percentile) to solve the problem. Let’s see what happens:
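This percentile cut is a one-liner in pandas; user_offer and the exact column name are hypothetical here:

```python
# Keep everything at or below the 97th percentile of transaction amounts
cutoff = user_offer['transaction_amount'].quantile(0.97)
user_offer = user_offer[user_offer['transaction_amount'] <= cutoff]
```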

It seems we have succeeded in removing the outliers, and we can continue.

Here, it is important to keep track of how much data we have deleted from the whole sample set and to make sure we are still on the safe side.

Split the Data and Train the Model

Now we should choose all the features as inputs (all columns except ‘person’ and ‘offer completed’) and ‘offer completed’ as the target.

We also need to split our data into train and test subsets so that we can train and evaluate our model.
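With scikit-learn, this might look like the following; the test size and random seed are assumptions, not values from the original project:

```python
from sklearn.model_selection import train_test_split

# All columns except the identifier and the target serve as features
X = user_offer.drop(columns=['person', 'offer completed'])
y = user_offer['offer completed']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```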

The most challenging part of this section is choosing the best-fitting classifier and tuning the hyperparameters. After some trial and error, I found the random forest classifier to be a proper choice for our purpose here.

A random forest classifier is an ensemble of decision trees that each vote for a class; the votes are then aggregated, and the majority class is chosen. This classifier is useful when we have different types of features, and if we take good care of the hyperparameters, we can avoid the overfitting problem.

Fitting the model and tuning the hyperparameters:

After fitting our model, we feel the need for some tuning, because the training score is a little too high and we are probably overfitting. My method of choice here is cross-validated grid search (scikit-learn’s GridSearchCV).
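A sketch of that search; the parameter grid below is illustrative, not the exact set of values used in the project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values to cross-validate over (illustrative)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

model = search.best_estimator_
print('train accuracy:', model.score(X_train, y_train))
print('test accuracy:', model.score(X_test, y_test))
```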

After running many trials with different hyperparameter values, I came up with a combination that gives a 0.97 accuracy score on the training set and a 0.89 accuracy score on the test set.

This is a perfectly acceptable score, and we can continue to check more results from this model.

Checking Evaluation Metrics and Confusion Matrix:

Since we are working with a classifier, we need to check the confusion matrix to evaluate the results, in addition to our main metrics, the micro- and macro-averaged precision/recall scores:
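These can all be computed with scikit-learn, for example:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
for avg in ('micro', 'macro'):
    p = precision_score(y_test, y_pred, average=avg, zero_division=0)
    r = recall_score(y_test, y_pred, average=avg, zero_division=0)
    print(f'{avg}-averaged: precision={p:.3f}, recall={r:.3f}')
```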

At first glance, the confusion matrix may suggest that the model does well on lower values and not so well on larger values, but this may simply be because there is much more data at the lower values. We can check this by inspecting our data, just to be more sure.

This shows that our guess was right: the distribution is concentrated at smaller values and sparse at larger ones, and we indeed have more false negatives and false positives at larger values. The reason is that there is less data there and, as a result, less accurate training in that region.

Now we can deploy our model for future use or use it in our web application.

Conclusion

We analyzed and processed the simulated dataset from Starbucks that mimics customer behavior on their rewards mobile app.

  • We projected the demographics of the responders to each offer and discussed the reasons behind them
  • We transformed our datasets and created a complete data frame consisting of all the data we have on each user
  • We cleaned and encoded our dataset and prepared it for model training
  • We trained a classifier on our data to predict the number of offers each user would complete, given the type and number of offers and the channels by which they reach the user
  • We tuned our model and checked the results with evaluation metrics and a confusion matrix
  • We deployed our model for further use

There are certainly other ways to improve our model’s accuracy and precision. One of them is involving more hyperparameters in the cross-validation process and aiming for higher accuracy.

My Github Repository
