Machine Learning basic principles, Part I
We are very enthusiastic about machine learning (ML) here at Showmax. We have a solid amount of internal knowledge on the topic, and we’re constantly learning more. Recently, we implemented our first ML model: it measures our own Happiness metric and tries to predict churn based on it (read more about it in this blog post). To help you understand that piece, and our upcoming ML projects, we put together a summary of some need-to-know terms.
What you may expect to learn
- What ML is
- Some basic ML concepts
- Differences between types of algorithms
- Workflow of producing ML models
- ML applications at Showmax
Humanize Machine Learning
Let’s start with a simple exercise. Ask yourself: “What is the weather going to be tomorrow?”
Give it time, think about it, and guide your thoughts with the illustrative pictures below.
Now, investigate the genesis of your answer and ask yourself: “Why did I answer the question like this? What is my answer based on?” Most likely you will have used some information that you know already:
- Your location
- What the weather’s like today
- A long-term weather forecast you saw
- The time of year
In ML, the data and extra information you used are referred to as features. All of these features can be fed into an ML algorithm. The result is a weather prediction model that can tell you what the weather will be like tomorrow based on the features you give it.
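To make the idea of features concrete, here is a minimal sketch of a toy weather predictor. The past observations, the three features (season, today's weather, the forecast you saw), and the function name `predict_tomorrow` are all made up for illustration. It simply looks for the past days that share the most features with today and takes a majority vote.

```python
from collections import Counter

# Hypothetical past observations: (features, weather on the following day).
# Features are (season, today's weather, forecast seen).
history = [
    (("summer", "sunny", "sunny"), "sunny"),
    (("summer", "sunny", "rain"), "rain"),
    (("summer", "rain", "sunny"), "sunny"),
    (("winter", "snow", "snow"), "snow"),
    (("winter", "cloudy", "snow"), "snow"),
    (("winter", "cloudy", "cloudy"), "cloudy"),
]


def predict_tomorrow(features, history):
    """Majority vote among the past days that share the most features."""
    def overlap(past_features):
        return sum(a == b for a, b in zip(features, past_features))

    best = max(overlap(past) for past, _ in history)
    votes = Counter(label for past, label in history if overlap(past) == best)
    return votes.most_common(1)[0][0]


# A winter day that is sunny today, with snow in the forecast.
prediction = predict_tomorrow(("winter", "sunny", "snow"), history)
print(prediction)
```

A real model would learn from far more data and handle numeric features, but the shape is the same: features go in, a prediction comes out.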
Now, let’s take a look at a Showmax-specific example. At Showmax we need a recommendation engine, a machine learning algorithm that will guide our users and show them relevant content to watch.
If I recommend something to watch to a friend, I’ll most likely consider the following:
- My own preferences: “I really liked movie X and you should watch it”
- Your interests: “Because you like sports, I recommend Y”
- What you’ve seen before: “Because you watched A and B, I recommend Z”
- Popularity/virality: “Everyone’s watching ABC, and you should check it out”
How do machines do this?
Actually, the process is very similar, but the machines don’t use the information of one user (like yourself), or a couple of users (yourself and your friends). They use the viewing history of every single Showmax user. So, instead of a couple of data points, there are data points for hundreds of thousands of users.
The advantages that machines have over people are quite clear:
- Much greater capabilities in terms of memory
- Ability to collect viewing statistics for thousands of people
- Ability to make correlations in a fraction of the time
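As a rough sketch of "because you watched A and B, I recommend Z" done across many users, the snippet below scores unseen titles by how much the other users' histories overlap with yours. The viewing histories and the `recommend` function are hypothetical and far simpler than a real recommendation engine.

```python
from collections import Counter

# Hypothetical viewing histories: user -> set of watched titles.
viewing_history = {
    "user_1": {"A", "B", "Z"},
    "user_2": {"A", "B", "Z"},
    "user_3": {"A", "C"},
    "user_4": {"B", "Z", "D"},
}


def recommend(watched, viewing_history, top_n=1):
    """Recommend unseen titles, weighted by overlap with other users' histories."""
    scores = Counter()
    for other in viewing_history.values():
        shared = len(watched & other)  # how similar this user is to us
        for title in other - watched:  # only titles we have not seen yet
            scores[title] += shared
    return [title for title, score in scores.most_common(top_n) if score > 0]


recommendations = recommend({"A", "B"}, viewing_history)
print(recommendations)
```

With hundreds of thousands of users the principle stays the same; only the amount of data, and the care needed to process it efficiently, grows.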
Some basic terms
When thinking about different problems that could be solved with ML, it’s good to have a framework. The first question you have to ask yourself is: “Do I have labels?”
By labels, we mean a known outcome for at least a part of the dataset. For example, if you have 2000 pictures, 1000 of dogs and 1000 of cats, then “dog” and “cat” are the labels. But if you just have a lot of pictures and you don’t know what’s on them, the dataset is unlabelled.
There are two main ML families:
- Supervised Learning - with labelled data
- Unsupervised Learning - with unlabelled data
Supervised learning only works with labelled data. It comes in two types, depending on the kind of labels you have. If the labels are numeric values such as “income” (60,000 dollars per year, 80,000 dollars per year, etc.), we call it a regression problem. If the labels are categories, as in the example of the pictures of cats and dogs, it is a classification problem.
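The difference between the two types can be sketched in a few lines. All numbers below are invented, and the two features used for the animals (weight and height) stand in for whatever a real model would extract from pictures. The regression part fits a straight line to numeric labels; the classification part returns the category of the closest labelled example.

```python
# Regression: the label is a number (yearly income in dollars).
years_experience = [1, 3, 5, 7]
income = [40_000, 60_000, 80_000, 100_000]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(years_experience, income)
predicted_income = slope * 4 + intercept  # a number

# Classification: the label is a category ("cat" or "dog").
# Hypothetical features: (weight in kg, height in cm).
animals = [
    ((4.0, 25.0), "cat"),
    ((5.0, 23.0), "cat"),
    ((20.0, 55.0), "dog"),
    ((30.0, 60.0), "dog"),
]


def classify(sample, labelled):
    """1-nearest-neighbour: return the label of the closest labelled example."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(labelled, key=lambda item: squared_distance(sample, item[0]))[1]


predicted_label = classify((6.0, 27.0), animals)  # a category
print(predicted_income, predicted_label)
```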
In unsupervised learning, you don’t have labels, but there is still value to be extracted from this kind of dataset. One option is clustering, which could be used in marketing to create different target groups for campaigns. The clusters could be based on users’ watching behaviour, and we could then send each cluster notifications for related content, for example letting a “sports fans” cluster know about an upcoming sporting event or documentary.
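One common clustering algorithm is k-means. The sketch below is a bare-bones version run on invented weekly viewing hours. The data, the choice of two clusters, and the starting centroids are assumptions made for illustration. Note that no labels are involved: the groups emerge from the data alone.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then move centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = mean of the cluster; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical weekly viewing hours per user: (sports, drama).
users = [(9, 1), (8, 2), (10, 0), (1, 7), (0, 9), (2, 8)]
clusters, centroids = kmeans(users, centroids=[users[0], users[3]])
print(clusters)
print(centroids)
```

Here the first cluster collects the heavy sports watchers, which is the kind of group a “sports fans” notification could target.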
Another example is association, which is often used in recommendation engines. It discovers interesting relationships between variables in large databases. The aim is to identify strong rules using some measure of interestingness, such as how often items occur together.
Some example applications of unsupervised techniques:
- Clustering: analyzing user sessions
- Association: recommendations
- Dimensionality reduction: plotting assets on a plane (film2vec)
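To give a feel for what a "strong rule" means in association, the sketch below computes two standard measures, support and confidence, for the rule "watched A and B, so also watched Z". The sessions are invented, and real systems search over many candidate rules rather than checking a single one.

```python
# Hypothetical viewing sessions: each set is what one user watched.
sessions = [
    {"A", "B", "Z"},
    {"A", "B", "Z"},
    {"A", "B"},
    {"A", "C"},
    {"C", "Z"},
]


def support(itemset, sessions):
    """Fraction of sessions that contain every item in `itemset`."""
    return sum(itemset <= s for s in sessions) / len(sessions)


def confidence(antecedent, consequent, sessions):
    """Of the sessions containing `antecedent`, the share that also contain `consequent`."""
    return support(antecedent | consequent, sessions) / support(antecedent, sessions)


rule_support = support({"A", "B", "Z"}, sessions)
rule_confidence = confidence({"A", "B"}, {"Z"}, sessions)
print(rule_support, rule_confidence)
```

In this toy data, two of the five sessions contain all three titles, and two of the three sessions with A and B also contain Z.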
Bias & Variance
If you’ve ever taken even a basic ML course, it’s likely that you’ve heard of bias and variance. Both are easily explained by picturing an archery target. Bias is how far your shots land from the bull’s eye on average; variance is how widely they scatter. In a perfect world, you have low variance and low bias, which means that you hit the bull’s eye every time. In ML, it means that your accuracy, your Mean Squared Error (MSE), or whatever other metric you decide to use, is good.
When you train a model and have low bias but high variance, your shots are centred on the target but scattered around it, so individual shots can be well off. On the other hand, if the bias is high and the variance low, your shots are very consistent (low variance) but always land off target in the same direction. The worst case is high variance combined with high bias: you are not consistent (high variance) and don’t really get near the target either (high bias).
In ML, we call this the bias-variance trade-off. It is the property of a model that the variance of its parameter estimates across samples can be reduced by increasing the bias in those estimates. In other words, the more complex the model you are training, the lower the bias and the higher the variance tend to be. As such, there is an optimal model complexity at which the combined error from bias and variance is at a minimum.
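For squared error, the trade-off can be written down explicitly. The notation here is our assumption, not something defined elsewhere in this post: the data follow $y = f(x) + \varepsilon$, the noise $\varepsilon$ has zero mean and variance $\sigma^2$, and $\hat{f}$ is the model trained on a random sample. The expected error at a point $x$ then splits into three parts:

```latex
\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The noise term is out of your hands, so the only way to lower the total error is to find the model complexity that balances the first two terms.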
When discussing bias and variance, it’s impossible not to mention model complexity, overfitting, and underfitting. All of these terms are closely related, and inform how you can achieve low bias and low variance.
If you build a very simple model, like a linear regression model on a dataset that isn’t linear, you will encounter underfitting. The model is too simple for the dataset at hand, and you end up with a high-bias, low-variance model that will not do a great job.
On the other hand, if you build a very deep decision tree (lots of different decisions), you can end up with a model that is overfitted to the current dataset. In this case, you get high variance and low bias. This model might perform excellently on your training dataset, but when you feed it new data and predict the outcome, it might give rather bad results.
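Underfitting and overfitting are easiest to see by comparing training error with error on new data. The sketch below uses invented data (roughly y = 2x, measured twice with different noise) and three stand-in models: a constant that is too simple, a straight line that matches the data, and a "memorizer" that plays the role of a very deep tree by returning the label of the nearest training point. None of this is our production code.

```python
def mse(model, xs, ys):
    """Mean squared error of `model` (a function x -> prediction) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y is roughly 2 * x, measured twice with different noise.
train_xs = [0, 1, 2, 3, 4, 5, 6, 7]
train_ys = [1, 1, 5, 5, 9, 9, 13, 13]
test_xs = [0, 1, 2, 3, 4, 5, 6, 7]
test_ys = [-1, 3, 3, 7, 7, 11, 11, 15]

mean_x = sum(train_xs) / len(train_xs)
mean_y = sum(train_ys) / len(train_ys)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(train_xs, train_ys))
    / sum((x - mean_x) ** 2 for x in train_xs)
)


def constant(x):
    """Too simple: always predict the training mean (high bias, low variance)."""
    return mean_y


def line(x):
    """Reasonable: a least-squares straight line."""
    return mean_y + slope * (x - mean_x)


def memorizer(x):
    """Too complex: repeat the label of the nearest training point."""
    nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[nearest]


models = {"constant": constant, "line": line, "memorizer": memorizer}
train_mse = {name: mse(m, train_xs, train_ys) for name, m in models.items()}
test_mse = {name: mse(m, test_xs, test_ys) for name, m in models.items()}
print(train_mse)
print(test_mse)
```

The constant model is poor on both datasets (MSE of 20 on training, 24 on new data), which is underfitting. The memorizer is perfect on the training data (MSE of 0) but noticeably worse on new data (MSE of 4), which is overfitting. The straight line sits in between on training error (about 0.95) and has the lowest error on new data (about 1.14).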
A basic understanding of machine learning concepts and jargon is increasingly in demand, as new models are trained every day. We hope that this post comes in handy. The next piece in our ML series will highlight one specific problem that we have already introduced in presentations and at hackathons. It will also walk through the steps of the ML workflow and show how much of our time is actually spent on developing the model itself. Stay tuned!