
Sunday, January 27, 2019

The Browsemaps: Collaborative Filtering at LinkedIn (2014)

Wu, Lili, Sam Shah, Sean Choi, Mitul Tiwari, and Christian Posse. "The Browsemaps: Collaborative Filtering at LinkedIn." In RSWeb@RecSys, 2014. [pdf]

This work describes a centralized system, Browsemaps, that provides collaborative filtering information for various LinkedIn products, such as “companies you may want to follow”, “similar companies”, “similar profiles”, and “suggested profile updates”.

Take “companies you may want to follow” for example. Browsemaps provides a list of companies that are similar to the companies a user already follows. The similarity is calculated from user activity data, in particular, co-follows.

The system calculates the similar items offline in Hadoop, and saves the similar item information to a key-value store.
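As a rough illustration of that offline step, here is a minimal co-follow counting sketch in Python. The names and toy data are hypothetical; the production system does this counting in Hadoop at scale and writes the resulting lists to a key-value store.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user_id, company_id) follow events.
follows = [("u1", "acme"), ("u1", "globex"), ("u2", "acme"),
           ("u2", "globex"), ("u2", "initech"), ("u3", "acme")]

# Group followed companies by user, then count co-follows per pair.
by_user = defaultdict(set)
for user, company in follows:
    by_user[user].add(company)

co_follow = defaultdict(int)
for companies in by_user.values():
    for a, b in combinations(sorted(companies), 2):
        co_follow[(a, b)] += 1

# "Similar companies" for each company, ranked by co-follow count;
# in production this table would be written to a key-value store.
similar = defaultdict(list)
for (a, b), count in co_follow.items():
    similar[a].append((b, count))
    similar[b].append((a, count))
for company in similar:
    similar[company].sort(key=lambda pair: -pair[1])

print(similar["acme"])  # [('globex', 2), ('initech', 1)]
```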

Product teams can combine the collaborative filtering information with content information for recommendations. In the case of the “companies you may want to follow” product, a collaborative filtering feature would be whether a candidate company appears among the companies similar to those the user already follows. Content features include how well the user matches the company on industry, location and years of experience.

Cold start is an issue for newly posted items under a collaborative filtering approach, but one can leverage a user’s historical activity to make recommendations. For example, when a member views several jobs and lands on a newly posted job that does not have similar jobs generated yet, it is useful to show the similar jobs of the jobs the member viewed prior to the new one.

Because Browsemaps is centralized, it sometimes takes time to discover data quality issues. For example, there can be an issue capturing an activity event, but it can be hard to catch until it surfaces in downstream product applications.

There are a few steps that made the system more robust:

  • Transformed the data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system
  • Added robust auditing to ensure reliable per-hop data transfer, from the frontend all the way to the relevance systems
  • Compared input and output coverage and offline metrics, alerting if there is significant deviation
  • Added code-driven test automation for tracking events, so most regressions are caught as part of the continuous integration process, not after release

Amazon.com recommendations: Item-to-item collaborative filtering (2003)

Linden, Greg, Brent Smith, and Jeremy York. "Amazon.com recommendations: Item-to-item collaborative filtering." IEEE Internet Computing 7.1 (2003): 76-80. [pdf]

The article first reviewed a few methods.

Traditional collaborative filtering represents a customer as an N-dimensional vector of items, each element indicating the customer’s rating of that item. Commonly one measures the similarity between two customer vectors by cosine similarity. The algorithm can then recommend similar customers’ items (e.g., an item is ranked higher when more similar customers purchased it).
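For concreteness, a small sketch of the cosine measure over two toy customer vectors (each component is the customer’s rating of an item, 0 meaning not purchased):

```python
import numpy as np

# Each customer is an N-dimensional vector over the item catalog;
# a nonzero entry is that customer's rating of the item.
customer_a = np.array([5.0, 0.0, 3.0, 0.0, 1.0])
customer_b = np.array([4.0, 0.0, 0.0, 2.0, 1.0])

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(customer_a, customer_b))  # ~0.77
```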

The algorithm needs to scan every customer, so for M customers and N items the worst case is O(MN). In practice the customer vectors are extremely sparse (only a few customers purchased a significant percentage of items), so the performance is closer to O(M+N).

For 10 million or more customers and 1 million items, the algorithm runs into performance issues. One can reduce M by randomly sampling the customers or discarding customers with few purchases, but the selected customers may be less similar to the given user.

One may also reduce N by discarding very popular or unpopular items, but these items will then never appear as recommendations.

Clustering can be used to reduce M or N by a large factor. Clustering the item space has the same effect as eliminating low-frequency items.

Clustering customers reduces M, but when the number of clusters is small, the recommendations computed for a cluster are less relevant to the individual customers in it; when the number of clusters is large, the scalability issue remains.

Given a user’s purchased items, search- or content-based methods recommend popular items by the same author or artist, or with similar keywords or subjects. For users with few purchases, search-based methods scale and perform well, but the quality decreases as more purchases are made: the recommendations become either too broad (all best-selling drama DVD titles) or too narrow (all books by the same author). Recommendations should help a user discover new, relevant, and interesting items.

[Note: content-based methods mostly rely on human-defined metadata such as keywords. They may be effective for the cold-start issue of collaborative filtering methods.]

Item-to-Item collaborative filtering

Amazon recommends products based on the items in the shopper’s cart. The algorithm needs to scale to tens of millions of customers and products.

The algorithm finds items similar to the user’s purchased/rated items and combines those similar items into a recommendation list.

The authors used cosine similarity to measure the similarity between items, with each item represented as a vector over the customers who purchased it. Sampling customers who purchase best-selling titles reduces runtime with little reduction in quality.

Given the similar-items table, the algorithm finds items similar to each of the user’s purchases, aggregates them, and recommends the most popular or correlated items. The computation is very quick, depending only on the number of items the user purchased.
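A minimal sketch of both steps, assuming a small dense purchase matrix (the paper does the offline part at Amazon scale, where sampling and sparsity make it tractable):

```python
import numpy as np

# Rows = customers, columns = items; 1 = purchased. Toy data.
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Offline: item-item cosine similarity between item column vectors.
norms = np.linalg.norm(purchases, axis=0)
sim = (purchases.T @ purchases) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)  # ignore an item's similarity to itself

# Online: aggregate similarities across the user's purchases and
# recommend the highest-scoring items they have not bought yet.
def recommend(user_items, top_k=2):
    scores = sim[user_items].sum(axis=0)
    scores[user_items] = -np.inf  # do not re-recommend purchases
    return np.argsort(scores)[::-1][:top_k]

print(recommend([0, 1]))  # items most related to items 0 and 1
```

Note that the online step only touches the similarity rows for the user’s own items, which is why serving cost depends on the user’s purchase count rather than on the catalog size.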

At the time, Amazon had more than 29 million customers and several million items.




Saturday, January 26, 2019

Deep neural networks for YouTube recommendations (2016)


Covington, Paul, Jay Adams, and Emre Sargin. “Deep neural networks for YouTube recommendations.” Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.

Challenges

Scale: a large number of users and videos.

Freshness: A large number of new videos are uploaded every second. The system should be responsive to new videos and latest user activities.

Noise: explicit user satisfaction signals are scarce, so the system mostly relies on implicit feedback (watches). Video metadata is also poorly structured.

As with other products at Google, YouTube started to use deep learning as a general-purpose solution for learning problems. TensorFlow provides a flexible framework for deep learning with large-scale distributed training. In this case, the model has about one billion parameters and is trained on hundreds of billions of examples.

System Overview

The system consists of two deep learning networks: candidate generation and ranking. Candidate generation takes a user’s activity history as input and retrieves a small subset (hundreds) of videos, intended to be relevant to the user with high precision. The candidate generation network provides broad personalization; similarity between users is expressed in terms of IDs of watched videos, search query tokens and demographics.
The ranking network assigns a score to each candidate video (for the user) using a rich set of features describing the video and the user. The highest-scoring videos are presented to the user. This two-step process makes it possible to surface a small number of videos from a corpus of millions.

The system is developed against offline metrics like precision, recall, ranking loss, etc. Final effectiveness is tested through live A/B testing.

Candidate Generation

Candidate generation is formulated as an extreme multi-class classification problem: predicting the video w_t watched at time t by user U in context C:

P(w_t = i | U, C) = e^(v_i · u) / Σ_j e^(v_j · u)

where u is the embedding of the user, and v_j is the embedding for each video.

In the classification, the videos watched by the users are positive examples.

Due to the large number of videos, the number of classes is very large. For efficient training, the cross-entropy loss of each example is minimized over the true label and a subset of randomly sampled negative classes.
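A rough sketch of that idea in plain NumPy, with hypothetical sizes; the paper’s sampler additionally corrects for the sampling distribution via importance weighting:

```python
import numpy as np

rng = np.random.default_rng(0)
num_videos, dim = 100_000, 64
V = rng.normal(scale=0.01, size=(num_videos, dim))  # video embeddings v_j
u = rng.normal(size=dim)  # user embedding produced by the network

def sampled_softmax_loss(u, true_id, num_negatives=20):
    # Score the true class plus a few randomly sampled negatives
    # instead of all num_videos classes (for simplicity we ignore a
    # possible collision between a negative and the true id).
    neg_ids = rng.choice(num_videos, size=num_negatives, replace=False)
    ids = np.concatenate(([true_id], neg_ids))
    logits = V[ids] @ u
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]  # cross-entropy with the true label at index 0

print(sampled_softmax_loss(u, true_id=42))
```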

A user’s recent watch history is represented by a sequence of videos, each represented by its embedding. Strategies like component-wise average and max can then be applied to the embedding vectors, producing a single vector representing the video watches.
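For example, pooling a variable-length watch history into one fixed-size input vector might look like:

```python
import numpy as np

# A user's recent watches as a (num_watches, dim) stack of embeddings.
watch_embeddings = np.random.default_rng(1).normal(size=(7, 256))

avg_pooled = watch_embeddings.mean(axis=0)  # component-wise average
max_pooled = watch_embeddings.max(axis=0)   # component-wise max

print(avg_pooled.shape, max_pooled.shape)   # (256,) (256,)
```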

Similarly, the user’s recent search history (each historic query is tokenized into unigrams and bigrams) produces one vector, and the user’s geographic information produces one vector. Simple features like age and gender are normalized to [0,1] and used as direct inputs to the network. The age of each video is also used as a feature, to capture the fact that users prefer fresh content.

Features are concatenated into a first layer, followed by several fully connected layers of rectified linear units (ReLU), and finally the softmax layer.

At serving time, the softmax output would have to be computed for all classes. A more efficient solution is a nearest-neighbor search in the dot-product space (v_i · u).
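A brute-force version of that lookup (real deployments would use an approximate nearest-neighbor index, but the scoring rule is the same):

```python
import numpy as np

def top_k_videos(u, V, k=10):
    # Rank all videos by the dot product v_i . u and keep the top k.
    scores = V @ u
    top = np.argpartition(scores, -k)[-k:]      # top k, unordered
    return top[np.argsort(scores[top])[::-1]]   # sort them by score

rng = np.random.default_rng(2)
V = rng.normal(size=(100_000, 64))  # candidate video embeddings
u = rng.normal(size=64)             # user embedding
print(top_k_videos(u, V))
```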

In addition, training examples are generated from all YouTube watches, not just watches of recommended videos, to avoid a bias towards exploitation. Also, it is useful to generate a fixed number of examples per user, to keep active users from dominating the model.

The authors also found it useful to withhold information from the classifier in order to prevent the model from exploiting the structure of the site. E.g., if a user’s last search is “Taylor Swift”, the recommendations may simply reproduce the search results page. Representing the search queries as unordered tokens avoids the issue.

[Note: is this because the deep learning model only takes care of the exploit part but not the explore part?]

Ranking

The candidate generation phase selects a small subset of videos for ranking, so it becomes feasible to compute many more features describing each video and the user’s relationship with it.

The task is also a classification task: given an impression of a video, will the user click it? A positive example is an impression that was clicked; a negative example is one that was not.

But one also cares about the expected watch time: e.g., deceptive videos (“clickbait”) may have a high click-through rate but are rarely watched to completion.

This was achieved by logistic regression under cross-entropy loss, with the positive impressions weighted by their watch time. Negative (unclicked) impressions receive unit weight.
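A sketch of the weighting (the variable names are illustrative, not from the paper):

```python
import numpy as np

def weighted_cross_entropy(p, clicked, watch_seconds):
    # Positive (clicked) impressions are weighted by watch time;
    # negative (unclicked) impressions get unit weight.
    weights = np.where(clicked == 1, watch_seconds, 1.0)
    losses = -(clicked * np.log(p) + (1 - clicked) * np.log(1 - p))
    return float((weights * losses).sum() / weights.sum())

p = np.array([0.9, 0.2, 0.7])            # predicted click probabilities
clicked = np.array([1, 0, 1])
watch_seconds = np.array([120.0, 0.0, 30.0])
print(weighted_cross_entropy(p, clicked, watch_seconds))
```

The paper observes that the odds learned under this weighting approximate the expected watch time of an impression, which is what the ranking ultimately scores by.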

Feature engineering

Video IDs and search query terms have very large cardinality, so they are truncated by keeping only the top N values by frequency in clicked impressions. Out-of-vocabulary values are mapped to the zero embedding.

For inputs with variable length (e.g., a new user may have only one video watched, while an active user may have hundreds), the embedding vectors are averaged so the network always receives a fixed-length input.

Although multiple features may refer to the same ID space (e.g., the video ID of the impression and the last video ID watched), they share the same underlying embedding.

Neural networks are sensitive to the scaling and distribution of their inputs. Each continuous feature is normalized to [0,1) using its cumulative distribution, and additional super- and sub-linear terms are added: x² and sqrt(x).
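A sketch of that normalization using the empirical CDF (ranks), plus the extra terms:

```python
import numpy as np

def normalize_feature(values):
    # Map each value to its percentile via the empirical CDF,
    # yielding values equidistributed in [0, 1).
    ranks = values.argsort().argsort()
    x = ranks / len(values)
    # Feed the network x plus super-linear and sub-linear versions.
    return np.column_stack([x, x ** 2, np.sqrt(x)])

raw = np.array([3.0, 120.0, 7.0, 0.5, 42.0])
print(normalize_feature(raw))
```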

[Note: a deeper/wider network could capture the non-linear relationships, but a compact network is preferred here to serve recommendations in real time]

The deep learning network outperforms the matrix factorization approach previously used at YouTube, but the size of the gain is not specified.

Thursday, January 24, 2019

The YouTube video recommendation system (2010)


Davidson, J., et al. | Google | ACM Conference on Recommender Systems | 2010 [link]

Goal: recommend the top N personalized videos, with control over user privacy and recommendations that are interpretable to users.

Challenges
- little or no metadata
- the number of videos is of similar magnitude to the number of active users
- most videos are under 10 minutes, with short, noisy user interactions
- short cycle from upload to going viral (recommendations must stay fresh)

High-level design
- Intuition: based on a user’s watched, favorited and liked videos, a large set of similar videos is ranked for relevance and diversity.
- System: the components of the system are decoupled for easier debugging and maintenance, and need to be resilient to failures.

Data
- mainly two types of data:
 — content data: the video stream and metadata such as title, description, etc.
 — user activity data: explicit actions like rating, liking and subscribing, and implicit activity like watching (including the % of the video watched)
- issues
 — metadata may be missing, outdated or incorrect
 — that a user watched a video is not enough to conclude the user actually liked it
 — lag in implicit activity (generated asynchronously), e.g., the user closes the browser before a long-watch notification is received

Similar videos
Similar videos of v are the videos a user is likely to watch after watching v. The relatedness of two videos is measured as

C_ij / (C_i * C_j)

where C_ij is the co-visitation count (the number of users who watched both V_i and V_j within a 24-hour period), and C_i and C_j are the total view counts of each video.

Note this is similar to a cosine similarity score, except that cosine would use L2 norms in the denominator, whereas this score divides by the raw view counts (L1 norms).

Then, for each watched video, the top N most similar videos are picked. A minimum threshold on the overall view count is also applied to reduce noise.
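A toy version of this relatedness score, assuming we can enumerate the videos each user watched within the 24-hour window:

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: videos one user watched within 24 hours.
sessions = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c", "d"]]

view_count = Counter(v for s in sessions for v in s)
co_count = Counter()
for s in sessions:
    for vi, vj in combinations(sorted(set(s)), 2):
        co_count[(vi, vj)] += 1

def relatedness(vi, vj, min_views=2):
    # r(v_i, v_j) = C_ij / (C_i * C_j); drop rarely viewed videos.
    if view_count[vi] < min_views or view_count[vj] < min_views:
        return 0.0
    return co_count[tuple(sorted((vi, vj)))] / (view_count[vi] * view_count[vj])

print(relatedness("a", "b"))  # 2 / (3 * 3) ≈ 0.22
```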

Generating Recommendation Candidates
For the videos a user watched or liked, a set of similar videos is generated. Sometimes this set is too close to what the user has already watched; it can be narrow and miss content truly new to the user. A solution is to also include the similar videos of the similar videos, recursively.
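A sketch of that expansion, assuming a similar(v) lookup into the related-videos table; the paper expands the seed set transitively up to some small distance:

```python
def candidates(seeds, similar, depth=2):
    # Follow related-video links up to `depth` steps away from the
    # seeds, never revisiting a video.
    frontier, seen = set(seeds), set(seeds)
    for _ in range(depth):
        frontier = {v for s in frontier for v in similar(s)} - seen
        seen |= frontier
    return seen - set(seeds)

# Toy related-videos table.
table = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(candidates(["a"], lambda v: table.get(v, []), depth=2))  # {'b', 'c'}
```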

Ranking
1) Video quality: view count, ratings, sharing activity.
2) User specificity: similar videos generated from higher-quality seed videos (as judged by the user’s view count, time of watch, etc.) are ranked higher.
A linear combination of the two signals is used to generate a ranked list. Diversity is then considered by removing videos that are too similar to each other. One way is to limit the number of recommendations associated with a single seed video, or with the same channel. More complex approaches such as topic clustering could also be used.
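A sketch of this ranking step with hypothetical quality/specificity scores and a per-seed cap for diversity:

```python
def rank(candidates, w_quality=0.5, w_specificity=0.5, per_seed_cap=2):
    # candidates: (video, quality, specificity, seed_video) tuples.
    scored = sorted(candidates,
                    key=lambda c: -(w_quality * c[1] + w_specificity * c[2]))
    picked, per_seed = [], {}
    for video, quality, specificity, seed in scored:
        # Diversity: cap how many recommendations one seed contributes.
        if per_seed.get(seed, 0) < per_seed_cap:
            picked.append(video)
            per_seed[seed] = per_seed.get(seed, 0) + 1
    return picked

cands = [("v1", 0.9, 0.2, "s1"), ("v2", 0.8, 0.9, "s1"),
         ("v3", 0.5, 0.5, "s1"), ("v4", 0.4, 0.6, "s2")]
print(rank(cands))  # ['v2', 'v1', 'v4']
```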

User interface
Seed videos are used for interpretability (“because you watched xxx”).

Implementations
A batch-oriented pre-computation approach is used for better scalability. Recommendations are generated using MapReduce jobs and saved in BigTable.

Recommendations are regenerated several times per day to reduce the delay.

Results
The recommendations generated by this approach had more than twice the click-through rate of the baseline methods: most viewed videos of the day, top rated videos of the day, and top favorited videos of the day.

