“What to watch recommendation” and our Film2vec model
To deliver the best content to the right users, we use asset metadata - movie or series features like genre, length, stars, etc. - but this approach has its limitations. The set of possible tags for assets is always finite, and adding extra data manually costs time and money.
To improve things further, we use viewing behavior to generate some more features for the assets - at virtually no extra cost. In fact, we created an algorithm that converts assets to vectors based solely on views, which gives us a new set of features to better describe them.
Our algorithm is based on the generic node2vec model that converts nodes in a graph into vectors. The idea is that nodes that are close to each other in the graph should be close to each other in the vector space. In many cases, the shortest path between two nodes could be a good definition of the closeness between them. In cases when the graph is so dense that almost every pair of nodes has an edge, one can use weights for edges - the higher the edge weight, the closer the nodes are.
One of the best-known models of this kind is word2vec, presented by Tomáš Mikolov and his colleagues in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text, and represents each distinct word with a vector. The closeness of two words is given by how often the pair occurs in the same context (within n words of each other in the same text).
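To make that notion of closeness concrete, here is a minimal sketch that counts how often two words appear within n positions of each other. It only computes the co-occurrence statistic and is not word2vec training itself; the function name `cooccurrence_counts` and the toy sentence are made up for illustration.

```python
from collections import Counter

def cooccurrence_counts(tokens, window):
    """Count unordered word pairs that appear at most `window` positions apart."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only forward so that each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=2)
```

Pairs that co-occur more often (here "sat" and "the") would end up closer together in the vector space than pairs that never share a context.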
Most neural network models require structured data (e.g. matrices) for training, and graphs do not provide that. One way to overcome this - and the one we decided to take - is to generate data points using a random walk on the graph. One can start at a random node and, from there, sample an edge to get to an adjacent node. If the graph has weighted edges, the weights can be used to calculate transition probabilities.
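A single random-walk step could look roughly like the sketch below. The toy graph, the function names, and the choice to make transition probabilities directly proportional to edge weights are assumptions for illustration, not a description of our production code.

```python
import random

# Hypothetical toy graph: node -> {neighbor: edge weight}.
graph = {
    "A": {"B": 3.0, "C": 1.0},
    "B": {"A": 3.0, "C": 2.0},
    "C": {"A": 1.0, "B": 2.0},
}

def transition_probabilities(graph, node):
    """Normalize the outgoing edge weights of `node` into probabilities."""
    neighbors = graph[node]
    total = sum(neighbors.values())
    return {n: w / total for n, w in neighbors.items()}

def sample_pair(graph, rng):
    """Pick a random origin node, then step to a neighbor chosen by edge weight."""
    origin = rng.choice(sorted(graph))
    neighbors = list(graph[origin])
    weights = [graph[origin][n] for n in neighbors]
    # random.choices normalizes the weights internally.
    target = rng.choices(neighbors, weights=weights, k=1)[0]
    return origin, target

rng = random.Random(0)  # fixed seed so the sketch is reproducible
pairs = [sample_pair(graph, rng) for _ in range(5)]
```

Each element of `pairs` is one (origin, target) data point; heavier edges are simply sampled more often.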
That way, you generate a pair of origin and target nodes. To be able to train the model, you also need to generate some negative samples: randomly selected nodes other than the two that have already been selected. You can then shuffle the target node in among the negative samples. Note that the shuffling is completely optional; sometimes we talk about neural networks in the context of artificial intelligence, but this network cannot even figure out that the target node is always the first one :).
The model input is the index of the origin node, plus the indices of the target node and the negative samples. The ground truth for the output is an array with zeros at the positions of the negative samples and a one at the position of the target node.
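Put together, one training example could be assembled as in the following sketch. The helper name `make_example`, the integer node indices, and the uniform choice of negatives are assumptions made for the example.

```python
import random

def make_example(origin, target, num_nodes, num_negatives, rng):
    """Build (origin, candidates, labels) for one origin/target pair."""
    # Negative samples: random nodes other than the origin and the target.
    pool = [n for n in range(num_nodes) if n not in (origin, target)]
    negatives = rng.sample(pool, num_negatives)
    candidates = [target] + negatives
    rng.shuffle(candidates)  # optional: hides the target among the negatives
    # Ground truth: 1 at the target's position, 0 everywhere else.
    labels = [1 if c == target else 0 for c in candidates]
    return origin, candidates, labels

origin, candidates, labels = make_example(
    origin=0, target=3, num_nodes=10, num_negatives=4, rng=random.Random(1)
)
```

The `labels` list always contains exactly one 1, at whatever position the shuffle put the target.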
The inputs and outputs are fed into a neural network with just three layers. The first is an embedding layer that is shared between the origin node and the other nodes. It’s literally just a matrix of weights from which the model takes the corresponding row (vector) for a given asset. The next layer calculates the dot product between the origin node's vector and the vector of each of the other nodes. The resulting array is fed to a softmax layer that produces numbers between 0 and 1, which are compared with the ground truth. The error is propagated back to the weights in the embedding matrix.
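For readers who prefer code, the three layers and the weight update can be sketched in plain Python. This is a toy version under stated assumptions: a cross-entropy loss, plain SGD, and made-up sizes and learning rate. A real implementation would use a deep learning framework, and the text above does not specify the loss or the optimizer.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(emb, origin, candidates, labels, lr=0.1):
    """One SGD step on one example; returns the loss before the update."""
    # Layer 1: embedding lookup (rows of the shared matrix, copied).
    v = emb[origin][:]
    cand = [emb[c][:] for c in candidates]
    # Layer 2: dot product between the origin and every candidate.
    scores = [sum(a * b for a, b in zip(v, u)) for u in cand]
    # Layer 3: softmax, compared with the ground truth via cross-entropy.
    probs = softmax(scores)
    loss = -sum(y * math.log(p) for y, p in zip(labels, probs))
    # For softmax + cross-entropy, d(loss)/d(score_i) = prob_i - label_i.
    errs = [p - y for p, y in zip(probs, labels)]
    dim = len(v)
    grad_v = [sum(e * u[k] for e, u in zip(errs, cand)) for k in range(dim)]
    # Propagate the error back into the embedding matrix.
    for c, e in zip(candidates, errs):
        for k in range(dim):
            emb[c][k] -= lr * e * v[k]
    for k in range(dim):
        emb[origin][k] -= lr * grad_v[k]
    return loss

rng = random.Random(0)
num_nodes, dim = 6, 4  # toy sizes, chosen for illustration
emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(num_nodes)]

# One hypothetical example: origin 0, target 3 hidden among negatives 5 and 1.
losses = [train_step(emb, 0, [5, 3, 1], [0, 1, 0]) for _ in range(200)]
```

Repeating the step on the same example drives the loss down, which is the whole training signal: the origin's vector and the target's vector are pulled together, while the negatives are pushed away.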
In our case, we don’t search for vectors representing words; we search for vectors representing each of the movies and series in our library. Another difference in our case is that our graph is bipartite. That means that we have two types of nodes - assets and users - with the edges between them representing the user watching the asset. That leads to two possible solutions for how to generate training data for the model.
The first one is to use the bipartite graph as it stands, with a two-step random walk. We start at a random asset and take a step to a random user who watched the asset. Then we take the second step to another asset that the user also watched (excluding the original asset). This leads to a straightforward generalization of the original idea. But the graph size is proportional to the product of the number of users and the number of assets, which is too big for us to work with.
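The two-step walk can be sketched as follows. The viewing data, the helper names, and the decision to return `None` when the chosen user watched nothing else are assumptions for illustration.

```python
import random

# Hypothetical viewing data: user -> set of watched assets.
views = {
    "u1": {"film_a", "film_b"},
    "u2": {"film_b", "film_c"},
    "u3": {"film_a", "film_b", "film_c"},
    "u4": {"film_d"},
}

def build_watchers(views):
    """Invert the mapping: asset -> list of users who watched it."""
    watchers = {}
    for user, assets in views.items():
        for asset in assets:
            watchers.setdefault(asset, []).append(user)
    return watchers

def two_step_walk(views, watchers, rng):
    """Asset -> random viewer -> another asset that viewer watched."""
    origin = rng.choice(sorted(watchers))
    user = rng.choice(sorted(watchers[origin]))
    others = sorted(views[user] - {origin})  # exclude the original asset
    if not others:
        return None  # this user watched nothing else; the caller can retry
    return origin, rng.choice(others)

rng = random.Random(0)
watchers = build_watchers(views)
samples = [two_step_walk(views, watchers, rng) for _ in range(50)]
pairs = [s for s in samples if s is not None]
```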
Another solution is to use a weighted graph with assets only. The weight of an edge is equal to the number of users who watched both of the assets it connects. Then we can get back to the one-step random walk using probabilities proportional to the edge weights. The graph size is “only” quadratic in the number of assets, which is much easier to work with. Both methods lead to very similar results, so we picked the one that fit into memory.
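As a rough sketch of this second approach, the co-watch weights can be counted directly from the viewing data and then used for the one-step walk. The data and function names below are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical viewing data: user -> set of watched assets.
views = {
    "u1": {"film_a", "film_b"},
    "u2": {"film_b", "film_c"},
    "u3": {"film_a", "film_b", "film_c"},
}

def co_watch_weights(views):
    """Edge weight = number of users who watched both assets."""
    weights = Counter()
    for assets in views.values():
        # Sorting gives every pair one canonical (a, b) key with a < b.
        for a, b in combinations(sorted(assets), 2):
            weights[(a, b)] += 1
    return weights

def sample_neighbor(weights, origin, rng):
    """One-step walk: pick a neighbor with probability proportional to weight."""
    neighbors, counts = [], []
    for (a, b), count in weights.items():
        if origin in (a, b):
            neighbors.append(b if a == origin else a)
            counts.append(count)
    return rng.choices(neighbors, weights=counts, k=1)[0]

weights = co_watch_weights(views)
neighbor = sample_neighbor(weights, "film_a", random.Random(0))
```

Users disappear from the graph entirely; only the pairwise counts remain, which is why the size depends on the number of assets alone.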
Using this method, we generate one data point. Obviously, we need many more, so we generate dozens or hundreds of data points and reuse them over multiple training epochs to get the most out of the generated data. Then we can generate another dataset for the next season (a season is what we call the run of epochs on the same training set), and repeat the process. After a few seasons, the embedding matrix contains vectors that represent the assets.
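The season/epoch nesting is easy to get backwards, so here is a small skeleton of the loop. The two callables are placeholders: `generate_dataset` stands for the random-walk sampling and `train_epoch` for one pass over the data. Both names, and the stand-ins used to exercise the loop, are hypothetical.

```python
def train_in_seasons(generate_dataset, train_epoch, num_seasons, epochs_per_season):
    """Run several seasons; each season reuses one dataset for several epochs."""
    for season in range(num_seasons):
        dataset = generate_dataset(season)  # fresh samples for this season
        for _ in range(epochs_per_season):
            train_epoch(dataset)  # same data, another pass

# Stand-ins that only record what was called, to show the order of events.
generated, trained_on = [], []

def fake_generate(season):
    generated.append(season)
    return f"dataset-{season}"

train_in_seasons(fake_generate, trained_on.append, num_seasons=2, epochs_per_season=3)
```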
The vectors themselves cannot be used directly for content recommendation; they need some model on top of them to be useful. As we mentioned above, the vectors can be used as a set of extra features that can enhance assets in any existing recommendation engine. Apart from that, we came up with a few other applications:
- Calculate vectors for all users by aggregating the vectors of the assets that they watched - the users’ vectors in this case represent their taste or preferences. Then we can recommend assets to users based on dot products between their vectors and the vectors of the assets. Note that the vectors don’t tell us much about the quality or popularity of an asset, so this should be done only with a pre-selected set of assets.
- Assign metadata to assets or find issues with existing metadata. For example, if an asset has neighbors (in terms of the dot product between their respective vectors) that are all historical romances, we can expect the asset itself to be a historical romance, too.
- Make clusters of similar assets that work both for content recommendations and further analyses - it greatly helps us understand the types of content we have on our platform.
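The first two applications above can be sketched in a few lines. The two-dimensional vectors are invented for the example, and averaging is just one possible way of aggregating asset vectors into a user vector; the list above does not say which aggregation is used.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical asset vectors; real ones would come from the embedding matrix.
asset_vectors = {
    "film_a": [1.0, 0.0],
    "film_b": [0.9, 0.1],
    "film_c": [0.0, 1.0],
    "film_d": [0.8, 0.2],
}

def user_vector(asset_vectors, watched):
    """Aggregate watched assets into one taste vector (assumption: plain mean)."""
    dim = len(next(iter(asset_vectors.values())))
    return [sum(asset_vectors[a][k] for a in watched) / len(watched) for k in range(dim)]

def rank_assets(user_vec, asset_vectors, candidates):
    """Order a pre-selected candidate set by dot product with the user vector."""
    return sorted(candidates, key=lambda a: dot(user_vec, asset_vectors[a]), reverse=True)

def nearest_neighbors(asset_vectors, asset, k):
    """The k assets with the largest dot product against `asset`."""
    others = [a for a in asset_vectors if a != asset]
    others.sort(key=lambda a: dot(asset_vectors[asset], asset_vectors[a]), reverse=True)
    return others[:k]

taste = user_vector(asset_vectors, ["film_a", "film_b"])
ranking = rank_assets(taste, asset_vectors, ["film_c", "film_d"])
neighbors = nearest_neighbors(asset_vectors, "film_d", k=2)
```

Here the user who watched film_a and film_b gets film_d ranked ahead of film_c, and the neighbors of film_d are the assets whose metadata one would compare it against.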
We started this as a small and fun project just to see whether something like this was even possible. It turned out to be a great model, and whenever you enter the Showmax app you’ll see content that has been ordered by it. We’re still exploring other ways that it can help us with our daily tasks.