Recently, our Content Discovery team, stationed at the intersection of our vast video content library and the users choosing what to watch on our streaming platform, embarked on an essential mission: clustering users based on their similar viewing behaviors to refine our video content recommendations. We explored various modern algorithms, but interestingly, it was the classic k-means algorithm that proved to be the most effective. Let's dive into our journey and see how we tackled the challenge of user clustering.
In the world of personalized experiences and custom offerings, our way of creating group-based content recommendations offers a really interesting option. Group-based recommendations are, by nature, coarser than fully individual ones, but they come with a notable advantage – they require significantly fewer computational resources. On top of that, they open up a whole new space for diverse analyses and A/B testing within the newly defined user cohorts. Some things simply work better for certain groups of users, and we wanted a reliable way to see that. Now we have one.
The Naive Approach
We kicked off with a straightforward approach: constructing a matrix where each user corresponds to a row and each movie or TV series is represented by a column. The matrix elements were binary, taking the value of 0 or 1 based on whether the user had viewed a particular title. With the groundwork laid, we proceeded to implement a clustering algorithm of our choosing and eagerly anticipated the outcomes.
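To make this concrete, here is a minimal sketch of how such a matrix can be assembled with pandas. The viewing log, user IDs, and title IDs below are made-up placeholders, not our production schema.

```python
import pandas as pd

# Hypothetical viewing log: one row per (user, title) watch event.
views = pd.DataFrame({
    "user_id":  ["alice", "alice", "bob", "bob", "carol"],
    "title_id": ["top_gun", "dune", "top_gun_maverick", "dune", "paw_patrol"],
})

# Pivot into a binary user x title matrix: 1 if the user watched the title, 0 otherwise.
user_title = (
    views.assign(watched=1)
         .pivot_table(index="user_id", columns="title_id",
                      values="watched", fill_value=0)
)
print(user_title)
```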
However, as we delved into the results, we discovered significant drawbacks inherent in this approach. Firstly, the distance between two user vectors is affected just as much by titles neither of them watched as by titles both of them watched – agreement on zeros counts exactly like agreement on ones. Imagine a catalogue of 10,000 titles and two users who have each watched 5 of them, with only 3 titles in common. The two vectors agree on 9,996 positions, almost all of them titles neither user has watched, while the actual shared viewing experience comprises just 3 titles. Our intention was to give more weight to the positive data – to emphasize shared content rather than the shared absence of it – but the results didn't reflect that emphasis.
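A tiny numerical illustration of the problem, using the same numbers as above (plain NumPy arithmetic, not Showmax code):

```python
import numpy as np

n_titles = 10_000
a = np.zeros(n_titles)
b = np.zeros(n_titles)
a[:5] = 1     # user A watched titles 0-4
b[2:7] = 1    # user B watched titles 2-6, so 3 titles overlap

# Euclidean distance only "sees" the 4 positions where the vectors differ;
# the 9,996 positions where the users agree (mostly both zeros) contribute nothing,
# so two users who watched almost nothing can look extremely close.
print(np.linalg.norm(a - b))   # 2.0 (= sqrt(4))
print(int((a == b).sum()))     # 9996 positions in agreement
```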
Furthermore, another issue emerged: the similarity measure treated users who watched related titles exactly the same as users with vastly different viewing preferences. For instance, consider Alice, who watched "Top Gun," and Bob, who saw "Top Gun: Maverick" – two closely related titles. Their dissimilarity is nevertheless exactly the same as that of Xavier and Yasmine, who respectively viewed "Paw Patrol" and "Big Brother" – two very different titles. We realized we needed to address these challenges for a more effective user clustering solution.
A Detour to Graphs
To address these challenges head-on, we pivoted toward the realm of graph algorithms. We opted to represent our users as individual nodes within a graph, with edges signifying instances where two users shared a common content-viewing experience. These edges carried weights, indicating the number of titles shared in common – a vital shift that allowed us to focus on the positive data, accentuating the shared preferences rather than disparities. Additionally, this approach unveiled an intriguing aspect: when two users had similar viewing histories, it often meant numerous other users had also watched the same titles, leading to an abundance of shared neighbors within the graph. It seemed like the path forward.
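As an illustration, here is one way such a weighted co-viewing graph could be built with networkx. The viewers_by_title mapping is a made-up stand-in for real viewing data, and the pair-counting loop is just a sketch of the idea.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical mapping of title -> set of users who watched it.
viewers_by_title = {
    "top_gun":          {"alice", "bob", "carol"},
    "top_gun_maverick": {"alice", "bob"},
    "paw_patrol":       {"carol", "dave"},
}

# Count shared titles for every pair of users who co-watched something.
shared = defaultdict(int)
for users in viewers_by_title.values():
    for u, v in combinations(sorted(users), 2):
        shared[(u, v)] += 1

# Build the graph: users are nodes, edge weights are shared-title counts.
G = nx.Graph()
G.add_weighted_edges_from((u, v, w) for (u, v), w in shared.items())
print(list(G.edges(data=True)))
```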
However, it's important to note that this graph-based solution came with a significant memory overhead, unlike the initial matrix approach. In big O notation, the memory requirement was proportional to the square of the number of users, while the matrix approach was simply a product of the number of users and the number of titles (considering Showmax has several orders of magnitude more users than titles). To mitigate this, we explored various divide-and-conquer techniques, which indeed yielded meaningful results and stunning visualizations (as depicted in the figure below).
Visualization of clustered users using the graph-tool library. Each user is represented by a node on the exterior circle, and edges signify commonly watched titles.
Despite these successes, we encountered a roadblock: the graph-based approach consumed excessive CPU resources, leading us to make the decision to close this particular avenue of research. Still, we kept some of the charts as stunning and CDR-relevant office decorations.
Pre-Processing the Matrix
After our experiments with graph algorithms proved too costly to sustain, we circled back to the original matrix approach and sought to refine the user vectors for a more accurate depiction of user preferences. Let me first discuss some of the approaches we considered but ultimately did not put into production.
One straightforward method involves the use of dimensionality reduction algorithms, such as PCA (Principal Component Analysis), t-SNE (t-Distributed Stochastic Neighbor Embedding), or UMAP (Uniform Manifold Approximation and Projection). The underlying concept here is that when two titles are watched by the same users, the corresponding columns in the matrix should exhibit a strong correlation, and the resulting lower-dimensional representation should have a composite column that captures the essence of both titles. In such cases, the specific title a user watches becomes less significant – the resulting vectors should be remarkably similar.
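For illustration, a dimensionality-reduction pass over such a matrix might look like the sketch below. The random stand-in data, scikit-learn's PCA, and the arbitrary 50 components are illustrative choices, not our production setup.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real binary user x title matrix (random data, ~10 watched titles per user).
rng = np.random.default_rng(0)
user_title = (rng.random((1_000, 10_000)) < 0.001).astype(float)

# Columns of titles watched by the same audience are strongly correlated,
# so PCA collapses them into shared components and users with overlapping
# (but not identical) histories end up with similar low-dimensional vectors.
pca = PCA(n_components=50)
user_vectors = pca.fit_transform(user_title)
print(user_vectors.shape)  # (1000, 50)
```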
Another related technique is the use of autoencoders. These are neural networks with identical data on the input and output layers and a relatively compact hidden layer in the middle. Essentially, the network attempts to squeeze as much information as possible through this artificially created bottleneck, and we are particularly interested in the vector at the narrowest point, which holds the compressed information. Autoencoders find extensive application in recommender systems because, much like a recommender, the network predicts what a user might watch based on their past viewing history – with the twist that the input is exactly what is being predicted.
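A minimal autoencoder sketch in Keras, assuming the 10,000-title binary vectors from earlier; the layer sizes and the 32-dimensional bottleneck are illustrative, not a production architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_titles = 10_000  # width of the binary viewing vector

# The encoder squeezes the viewing vector into a 32-number bottleneck;
# the decoder tries to reconstruct the original vector from it.
inputs = keras.Input(shape=(n_titles,))
x = layers.Dense(256, activation="relu")(inputs)
bottleneck = layers.Dense(32, activation="relu", name="bottleneck")(x)
x = layers.Dense(256, activation="relu")(bottleneck)
outputs = layers.Dense(n_titles, activation="sigmoid")(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(user_title, user_title, epochs=10, batch_size=256)

# After training, the bottleneck activations serve as compact 32-dimensional user vectors.
encoder = keras.Model(inputs, bottleneck)
```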
Leveraging a Previous Result
However, we had another tool at our disposal – a content embedding model. This model allowed us to measure the similarity between two titles using cosine similarity. We integrated this into the refinement of our original matrix structure. When a user had watched a particular title, we assigned a value of 1 to that matrix position. When a title wasn't watched, we searched for the most similar content they had viewed and plugged the similarity score into our matrix. As a result, our matrix columns became strikingly correlated for similar titles. This transformation enabled us to choose a subset of reference titles, setting the stage for a relatively simple clustering algorithm, k-means, in our case.
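The sketch below shows one plausible way to implement this filling step. The function name, the loop-based implementation, and the assumption that the content embeddings arrive as a plain NumPy array are ours for illustration, not production code.

```python
import numpy as np

def soft_viewing_matrix(watched, title_embeddings):
    """
    watched:          binary (n_users x n_titles) matrix, 1 = user watched the title
    title_embeddings: (n_titles x d) content-embedding vectors, one per title

    Returns a matrix where watched titles stay 1 and every unwatched title gets
    the cosine similarity to the most similar title the user *did* watch.
    """
    # Normalise embeddings so a dot product equals cosine similarity.
    norms = np.linalg.norm(title_embeddings, axis=1, keepdims=True)
    emb = title_embeddings / np.clip(norms, 1e-12, None)
    title_sim = emb @ emb.T                       # (n_titles x n_titles)

    soft = np.zeros_like(watched, dtype=float)
    for u in range(watched.shape[0]):
        seen = watched[u].astype(bool)
        if not seen.any():
            continue
        # For every title, take its best similarity to anything the user watched...
        soft[u] = title_sim[:, seen].max(axis=1)
        # ...and keep the watched titles themselves pinned at exactly 1.
        soft[u, seen] = 1.0
    return soft
```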
This approach also had an interesting side benefit - it made the process more explainable. The user features were designed based on "how similar the content is to the title the user watched," and the k-means clustering provided cluster centers that held valuable insights. This allowed us to describe the clusters in terms of their similarities to the reference titles. For example, we could say that a cluster was most akin to titles of a certain genre while having little similarity to another genre. Alternatively, we could explore which groups of users resembled a specific set of titles the most. This unexpected advantage brought a layer of clarity and understanding to our clustering results, enhancing the overall value of our approach.
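Interpreting the cluster centers could then look roughly like this, assuming the similarity-filled matrix from the previous sketch and a hypothetical reference_idx list of reference-title columns (both replaced by random stand-ins here so the snippet runs on its own).

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-ins: `soft` would be the similarity-filled matrix from the previous sketch,
# and `reference_idx` a hand-picked list of reference-title columns.
rng = np.random.default_rng(0)
soft = rng.random((1_000, 200))
reference_idx = list(range(0, 200, 10))   # 20 hypothetical reference titles

features = soft[:, reference_idx]
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(features)

# Each cluster centre is a vector of "how similar this cohort is to each reference title",
# so a cluster can be described by the reference titles with the largest coordinates.
for c, centre in enumerate(kmeans.cluster_centers_):
    top = centre.argsort()[::-1][:3]
    print(f"cluster {c}: most similar to reference titles {top}")
```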
Conclusion
In conclusion, our journey through user clustering led us to explore graph algorithms and content embedding models, blending classic and modern techniques and arriving at an innovative solution. We discovered that, when it comes to understanding preferences and improving group recommendations, prioritizing explainability and efficiency pays off. Our adaptation of the k-means algorithm empowered us to create user cohorts that provide not just valuable analytical insights, but also a clear, interpretable perspective on the preferences of various user groups. Our approach demonstrates how the fusion of classic methods and modern technologies can advance the understanding of user behavior in the ever-evolving digital landscape.