Machine Learning basic principles, Part II
Data science and the whole ML workflow
The first part of our machine learning series was all about helping you understand some basic machine learning (ML) concepts and jargon (read it here). This post builds on it, and goes a bit deeper in its explanations of how a machine learning model is born.
ML is only growing in importance. As more and more models are trained, its applications keep widening, and they are present in our everyday lives right now. Think about pedestrian detection systems in modern cars, Google completing your emails and correcting your grammar, and so on—ML is all around us.
So, how do we get from concept to real-world application? How does a data science team work? What are the steps within their workflow when they tackle analytical and machine learning problems? Let’s get into it…
Data science workflow
Every data science workflow follows a similar pattern. It starts with gathering as much information as possible about what we want to analyze. When we have a clear idea of what the problem is, and how we will solve it, we need to get the right data for it.
The data gets imported from different sources—it comes from our data warehouse (DWH), events that were performed by our users, external data, and other sources. There are often mistakes in the data, like missing data or problems with formatting. And, importing and cleaning the data can get time-consuming—the reality is that we spend a lot of time day-to-day on data cleaning.
Next, we get into a cycle of transforming, visualising, and modelling the data to understand it better. Within this step, we learn how to best represent the data and visualise it to make our findings easier to communicate (that’s the next step). At Showmax we are firm believers in data-driven decision making, and we use these visualisations with our findings to help our key stakeholders make critical business decisions.
Machine learning workflow
In most cases, simple data science (importing, transforming, and visualising the data) is enough to solve the problems at hand. But sometimes we face more challenging problems that call for an ML-based approach. For example, we wanted to find out why users choose to continue using our service, what makes them happy, and what compels them to become long-term subscribers.
This is a complex task that requires many different data sources to be able to even begin explaining the phenomenon—there are simply way too many possibilities. So, we decided to use ML to see what we could find.
Putting together an ML workflow is a little bit different from a data science workflow. It starts with the same steps, “problem and data analysis,” which is similar to the import, clean, and visualize steps in a regular data science workflow. The next step is “preprocessing and feature extraction,” the process of generating more features from the existing data. Think of features as columns in the dataset, e.g. date, time, first subscription day, and so on. In this case, feature extraction means taking the “date” feature and deriving new features like “is_weekday”, “is_holiday”, and more.
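In pandas, this kind of date-based feature extraction is only a few lines. The column names and the holiday list below are illustrative, not the actual Showmax pipeline:

```python
import pandas as pd

# Hypothetical dataset with a "date" column, as in the example above.
df = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-01-04", "2020-01-06"])})

# Derive new features from the existing "date" feature.
df["is_weekday"] = df["date"].dt.dayofweek < 5  # Mon=0 .. Sun=6

# A stand-in holiday list; a real pipeline would use a proper holiday calendar.
holidays = {pd.Timestamp("2020-01-01")}
df["is_holiday"] = df["date"].isin(holidays)
```

Each derived column then becomes one more feature the model can learn from.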
After the dataset is available, it has to be split into multiple datasets. There is the training dataset, which is used to train the model itself. The validation set is used to evaluate the model and set the hyperparameters correctly. And last is the test set, which is used for reporting the final metrics you want to optimize, like accuracy, weighted accuracy, recall, MSE, or RMSE.
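One common way to produce the three splits is to apply scikit-learn’s `train_test_split` twice. The data and the 60/20/20 ratio here are just an illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels; a real dataset would come from the steps above.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First carve off the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
# Result: 60% train, 20% validation, 20% test.
```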
The next step is “model training and hyperparameter tuning,” which involves choosing the right model and setting the hyperparameters so that the performance is as good as it can possibly be. Hyperparameters are model-specific parameters that can be altered to change the behavior of the model. For example, in a decision tree, the maximum depth (the number of decisions the tree can make before producing a prediction) is a hyperparameter.
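A typical way to tune such a hyperparameter is a cross-validated grid search. This sketch uses synthetic data and a decision tree’s `max_depth` purely as an example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for real features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try several values of max_depth (which bounds how many decisions the tree
# makes) and keep the one with the best cross-validated accuracy.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 5, 10]},
                      scoring="accuracy", cv=5)
search.fit(X, y)
best_depth = search.best_params_["max_depth"]
```

The validation-set idea from the previous step is what the cross-validation folds implement here; the held-out test set is still only used at the very end.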
The last part is the analysis of your model, where you compare different models (if you trained multiple models). With this feedback, you can run through the flow once again to improve upon the previous iteration. Don’t be afraid to come up with a new feature to add or another ML algorithm to try out.
An example: clustering our users
Let’s have a look at an example to understand the steps within the workflow. It’s an example we often use when talking to fellow Showmaxers—a very simple model that clusters our colleagues based on movies and series they’ve watched. In fact, this exercise can bring you some valuable insights. You might use it to form groups in Slack to connect people with similar interests, for example.
There’s no data science without the data, right? All Showmaxers have access to the Showmax app and most of them watch Showmax content. So, to use the user behavior data, we first need to import the data.
Because we want to work only with data from our colleagues, we have to use some filters. We should filter the dataset for email addresses ending with “showmax.com”. To limit the size of our dataset and keep the assets as comparable as possible, we should also limit the dataset by time. Third, we only want to include people who really watched something, excluding those who only tested some feature. So, we put a lower limit of 30 seconds on net_time, defined as the time the user spends watching.
SELECT email, master_party_id, rvds.asset_id, grand_parent_name, genre, tag
FROM cached_data.reports_viewing_durations_streaming rvds
JOIN reporting.dim_party dp ON dp.master_party_id::text = rvds.user_id::text
JOIN reporting.dim_asset da ON da.asset_id::text = rvds.asset_id::text
WHERE dp.email LIKE '%showmax.com'
  AND rvds.start_time > '2020-01-01'
  AND rvds.net_time > 30
The data we have now is still not the data we want to train a ML algorithm on. It needs to be tidied and transformed. These two actions together are called pre-processing.
In the tidy phase, you drop data that is incorrect or would confuse the algorithm. Let’s have a look at some examples:
- Remove users who are in the dataset multiple times
- Remove users with unrealistic clicking behaviour (as they were most likely testers)
- Remove users who don’t have a ‘.’ in their email address as that is not compliant with the Showmax email address structure
- Filter out entries where the asset_id is missing
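Most of these tidy-phase rules map directly onto pandas operations. The column names and rows below are made up for illustration, and the “unrealistic clicking behaviour” rule is omitted since it depends on event-level data:

```python
import pandas as pd

# Hypothetical raw frame; column names are assumptions for illustration.
df = pd.DataFrame({
    "email": ["a.b@showmax.com", "a.b@showmax.com", "nodot@showmax.com", "c.d@showmax.com"],
    "asset_id": ["s1", "s1", "s2", None],
})

df = df.drop_duplicates()                  # users present multiple times
df = df[df["email"].str.split("@").str[0]  # local part must contain a '.'
          .str.contains(".", regex=False)]
df = df.dropna(subset=["asset_id"])        # asset_id must be present
```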
Here we can see some of the email addresses that are not compliant with the rules above.
In our original data frame, we have the users together with the assets they watched. We call this a long form dataset. For ML purposes, we transform the format from long to wide, as seen below. It basically means that, instead of listing the users with assets they watched, each entry in the dataset is just one user. Each column stands for a specific series, and the value for the user is 1 if they watched it, and 0 if they didn’t. By transforming the dataset like this, each row in the dataset represents one user and we can immediately see what they did and did not watch.
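One way to sketch this long-to-wide transformation is with `pandas.crosstab`, using made-up users and series names:

```python
import pandas as pd

# Long form: one row per (user, asset) viewing event (names are illustrative).
long_df = pd.DataFrame({
    "email": ["a.b@showmax.com", "a.b@showmax.com", "c.d@showmax.com"],
    "asset": ["Greys Anatomy", "The River", "The River"],
})

# Wide form: one row per user, one 0/1 column per asset.
wide_df = (pd.crosstab(long_df["email"], long_df["asset"]) > 0).astype(int)
```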
A crucial part in any data science or ML project is data visualization. By visualizing the data you make it more understandable to you, your stakeholders, and a wider audience.
This dataset is not very big, as it includes only a few different features, but we can still find some interesting things in it. For example, we can see that Grey’s Anatomy is the most-watched series: almost 10% of the users in our dataset watched at least a bit of it. This is followed by The River, a Showmax Original series, and then comes Lockdown. Our live channels, CNN, SABC News, etc., also score quite well.
Another way to visualize the high-scoring assets is to see how long users watched on average. You should, however, take into account the different average lengths of the assets. You cannot compare a movie that lasts 75-120 minutes to a series in which every episode is roughly 20-30 minutes long. We can see that the top 15 looks quite different, and there is little overlap. Grey’s Anatomy, the most-watched series in the previous visualization, is now #11. Combining these two visualizations, we see that this series is the most popular one within our dataset. It is watched by the most people, and is also in the top 15 in terms of mean seconds watched.
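Both rankings come from simple group-bys over the viewing data. The toy frame below (hypothetical columns and numbers) shows why the two top lists can disagree:

```python
import pandas as pd

# Illustrative viewing events: user, series, and net seconds watched.
df = pd.DataFrame({
    "email": ["u1", "u2", "u3", "u1", "u2"],
    "series": ["Greys Anatomy", "Greys Anatomy", "Greys Anatomy", "The River", "The River"],
    "net_time": [600, 300, 900, 2400, 1800],
})

# Ranking 1: how many distinct users watched each series.
viewers = df.groupby("series")["email"].nunique().sort_values(ascending=False)

# Ranking 2: mean seconds watched per viewing. A different picture, since it
# ignores reach and is affected by asset length.
mean_seconds = df.groupby("series")["net_time"].mean().sort_values(ascending=False)
```

In this toy data, Grey’s Anatomy tops the viewer-count ranking while The River tops the mean-seconds one, mirroring the mismatch described above.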
Now we get to the part that most people consider “the work of people in ML.” You can see that we have already spent quite a lot of time on gathering, tidying, and transforming the data, and only now are we actually training a machine learning algorithm.
For this simple use case, we went with a KMeans model with hyperparameter n_clusters = 50. The model is then able to classify the data and assign a cluster number to it. This process is called predicting, and we save these predictions to a variable ‘y’. Then we can add it to our dataframe as a column named ‘cluster’. This allows us to see the original data frame and filter it on a certain cluster.
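In scikit-learn this step is short. The wide user-by-series matrix here is randomly generated for the sketch, so it uses a smaller n_clusters than the 50 used on the real data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy wide 0/1 user-by-series matrix, standing in for the real one.
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(rng.integers(0, 2, size=(200, 8)),
                       columns=[f"series_{i}" for i in range(8)])

# n_clusters=50 was used on the real data; a small toy set needs fewer.
model = KMeans(n_clusters=5, n_init=10, random_state=0)
y = model.fit_predict(wide_df)   # cluster label per user
wide_df["cluster"] = y           # attach for filtering, e.g. wide_df[wide_df["cluster"] == 3]
```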
For example, we saw a cluster of users watching Westworld, an American sci-fi series created by Jonathan Nolan and Lisa Joy, and produced by HBO. Finding out that our colleagues watch it on Showmax, we could connect them with each other.
When we have the results, we need to communicate them to the different stakeholders. It is important at this stage to keep in mind that not all your stakeholders will speak the same language in terms of the technology and methods you are using. You will need different visualisations and explanations for each of the different audiences.
For all target audiences, we always use a couple of sections:
- The data we used
- The outcome
- The business outcomes
Having read this post, you can see that training the ML algorithm itself is only a tiny percentage of our workload. We don’t want to disappoint the ML enthusiasts, but being a machine learning practitioner means spending more than half of your time on gathering, tidying, and transforming data.
We hope we gave you a basic understanding of some ML terms, and a bit of guidance for how to classify and solve a machine learning problem. Enjoy!