Covington, Paul, Jay Adams, and Emre Sargin. “Deep Neural Networks for YouTube Recommendations.” Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.
Challenges
Scale: a large number of users and videos.
Freshness: a large number of new videos are uploaded every second. The system should be responsive to new videos and to the latest user activity.
Noise: explicit signals of user satisfaction are scarce, so the system mostly relies on implicit feedback (watches). Metadata of the videos is also poorly structured.
As with other Google products, YouTube has moved to deep learning as a general-purpose solution for learning problems. TensorFlow provides a flexible framework for deep learning with large-scale distributed training. In this case, the model has about one billion parameters and is trained on hundreds of billions of examples.
System Overview
The system consists of two deep learning networks — candidate generation and ranking. Candidate generation takes a user’s activity history as input and retrieves a small subset (hundreds) of videos with high precision. The candidate generation network provides broad personalization; similarity between users is expressed in terms of IDs of video watches, search query tokens and demographics.
The ranking network assigns a score to each candidate video (for the user) using a rich set of features describing the video and the user. The highest-scoring videos are presented to the user. This two-step process makes it possible to surface a small number of videos out of millions.
The system is developed using offline metrics such as precision, recall, and ranking loss; final effectiveness is measured through live A/B testing.
Candidate Generation
Formulated as extreme multi-class classification: predict the specific video i watched at time t by user U in context C:

P(w_t = i | U, C) = e^(v_i · u) / Σ_j e^(v_j · u)

where u is the embedding of the user (and context), and v_j is the embedding of each video.
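A tiny numpy sketch of this softmax scoring (u, V and the sizes below are made up for illustration):

import numpy as np

def watch_probabilities(u, V):
    """Softmax over dot products: P(w_t = i | U, C) for every video i.

    u: user embedding, shape (d,)
    V: video embeddings, shape (num_videos, d)
    """
    logits = V @ u                 # v_i . u for every video
    logits -= logits.max()         # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Example: 5 videos with 4-dimensional embeddings
u = np.random.randn(4)
V = np.random.randn(5, 4)
print(watch_probabilities(u, V))   # probabilities summing to 1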
In this classification, the videos a user has watched are the positive examples.
Because of the huge number of videos, the number of classes is very large. For efficient training, the cross-entropy loss is minimized over the true label and a subset of randomly sampled negative classes for each example.
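A rough sketch of such a sampled-softmax loss for one training example, assuming the true class plus a few uniformly sampled negatives (the paper’s actual sampling and importance correction are more involved; in practice TensorFlow’s built-in sampled softmax would be used):

import numpy as np

def sampled_softmax_loss(u, V, true_idx, num_sampled=4, rng=None):
    """Cross-entropy restricted to the true video plus sampled negative classes."""
    rng = rng or np.random.default_rng()
    num_videos = V.shape[0]
    # Sample negatives uniformly, excluding the true class (simplified).
    negatives = rng.choice(
        [i for i in range(num_videos) if i != true_idx],
        size=num_sampled, replace=False)
    candidates = np.concatenate(([true_idx], negatives))
    logits = V[candidates] @ u
    logits -= logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]            # the true class sits at position 0

u = np.random.randn(8)
V = np.random.randn(1000, 8)
print(sampled_softmax_loss(u, V, true_idx=42))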
A user’s recent watch history is represented as a sequence of videos, each represented by its embedding. Strategies such as a component-wise average or max can then be applied to these embedding vectors, producing a single vector that represents the watch history.
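For example, the component-wise average might look like this (embedding_table and watched_ids are made-up names):

import numpy as np

embedding_table = np.random.randn(1000, 16)   # one 16-d embedding per video ID
watched_ids = [12, 87, 87, 401]               # a user's recent watches

watch_vector = embedding_table[watched_ids].mean(axis=0)   # component-wise average
# or: embedding_table[watched_ids].max(axis=0) for a component-wise max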
Similarly, the user’s recent search history (each historical query is tokenized into unigrams and bigrams) produces one vector, and the user’s geographic information produces another. Simple features like age and gender are normalized to [0, 1] and used directly as inputs to the network. The age of the training example is also fed in as a feature so the model can learn that users prefer fresh content.
All features are concatenated into a wide first layer, followed by several fully connected layers of rectified linear units (ReLU), and finally the softmax layer.
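A minimal Keras-style sketch of that tower (input names, embedding sizes and layer widths are illustrative, not the paper’s exact configuration; during training the sampled softmax over video embeddings sits on top of the final ReLU output):

import tensorflow as tf

d = 256
watch_vec = tf.keras.Input(shape=(d,), name="avg_watch_embedding")
search_vec = tf.keras.Input(shape=(d,), name="avg_search_embedding")
geo_vec = tf.keras.Input(shape=(32,), name="geo_embedding")
extra = tf.keras.Input(shape=(3,), name="age_gender_example_age")

x = tf.keras.layers.Concatenate()([watch_vec, search_vec, geo_vec, extra])
for units in (1024, 512, 256):           # tower of fully connected ReLU layers
    x = tf.keras.layers.Dense(units, activation="relu")(x)
user_embedding = x                        # u, the output of the last ReLU layer

model = tf.keras.Model([watch_vec, search_vec, geo_vec, extra], user_embedding)
model.summary()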
At serving time, computing the softmax output for all classes would be too expensive. A more efficient solution is a nearest-neighbor search in the dot-product space (v_i · u).
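In other words, serving becomes a top-k maximum inner product search; a brute-force sketch (a production system would use an approximate nearest-neighbor index):

import numpy as np

def top_k_videos(u, V, k=10):
    """Return indices of the k videos with the largest dot product v_i . u."""
    scores = V @ u
    top = np.argpartition(scores, -k)[-k:]        # unordered top-k
    return top[np.argsort(scores[top])[::-1]]     # sorted by score, descending

u = np.random.randn(8)
V = np.random.randn(100000, 8)
print(top_k_videos(u, V, k=5))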
In addition, training examples are generated from all YouTube watches — not just watches of recommended videos — to avoid a bias towards exploitation. It is also useful to generate a fixed number of examples per user, so that highly active users do not dominate the model.
The authors also found it useful to withhold information from the classifier in order to prevent the model from exploiting the structure of the site. E.g., if a user’s last search was “Taylor Swift”, the recommendations might simply reproduce the search results page. Representing search queries as an unordered bag of tokens avoids this issue.
[Note: is this because the deep learning model only takes care of the exploit part but not the explore part?]
Ranking
The candidate generation phase selects a subset of videos for ranking, so it becomes feasible to use many more features describing the videos and the user’s relationship with them.
The task is again a classification task: given an impression of a video, will the user click it? Positive examples are impressions that were clicked; negative examples are impressions that were not.
But one also cares about the expected watch time: e.g., deceptive (“clickbait”) videos may have a high click-through rate but are rarely watched to completion.
This is achieved by logistic regression under cross-entropy loss, with positive (clicked) impressions weighted by the observed watch time and negative (unclicked) impressions receiving unit weight. The odds learned by this weighted logistic regression then approximate the expected watch time at serving.
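A sketch of that weighted logistic loss and the corresponding serving-time score (simplified to a single weight vector; all names are illustrative):

import numpy as np

def weighted_logistic_loss(w, b, x, clicked, watch_time):
    """Cross-entropy where positive (clicked) impressions are weighted by watch time
    and negative impressions get unit weight."""
    logits = x @ w + b
    p = 1.0 / (1.0 + np.exp(-logits))
    weights = np.where(clicked == 1, watch_time, 1.0)
    loss = -np.where(clicked == 1, np.log(p), np.log(1.0 - p))
    return np.mean(weights * loss)

def expected_watch_time(w, b, x):
    """At serving time the odds e^(w.x + b) approximate the expected watch time."""
    return np.exp(x @ w + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
clicked = rng.integers(0, 2, size=100)
watch_time = rng.exponential(60.0, size=100)   # only meaningful for clicked impressions
w, b = rng.normal(size=5), 0.0
print(weighted_logistic_loss(w, b, x, clicked, watch_time))
print(expected_watch_time(w, b, x[:3]))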
Feature Engineering
Video IDs and search query terms have very large cardinality, so they are truncated to the top N values by frequency in clicked impressions. Out-of-vocabulary values are mapped to the zero embedding.
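A toy sketch of that truncation, with out-of-vocabulary IDs mapped to a reserved index whose embedding is held at zero (all names are made up):

import numpy as np
from collections import Counter

clicked_video_ids = ["a", "b", "a", "c", "a", "b", "d"]   # IDs from clicked impressions
N = 3
vocab = [vid for vid, _ in Counter(clicked_video_ids).most_common(N)]
index = {vid: i + 1 for i, vid in enumerate(vocab)}        # index 0 reserved for OOV

embedding_table = np.random.randn(N + 1, 16)
embedding_table[0] = 0.0                                    # OOV -> zero embedding

def embed(video_id):
    return embedding_table[index.get(video_id, 0)]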
For inputs of variable length (e.g., a new user may have only one watched video while an active user may have hundreds), the embedding vectors are averaged so the input has a fixed length.
Although multiple features may refer to the same ID space (e.g., the video ID of the impression and the ID of the last watched video), they share a single underlying embedding for that ID space.
Neural networks are sensitive to the scaling and distribution of their inputs. Each continuous feature is normalized to [0, 1) using its cumulative distribution, and super-/sub-linear transforms of the normalized value x̃ are added as extra inputs: x̃² and sqrt(x̃).
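One way to implement this, sketched with an empirical CDF built from a sample of training values (function and variable names are assumptions):

import numpy as np

def make_cdf_normalizer(training_values):
    """Map a raw value to [0, 1) via the empirical CDF of the training data,
    and also emit super-/sub-linear transforms of the normalized value."""
    sorted_vals = np.sort(np.asarray(training_values, dtype=float))
    n = len(sorted_vals)

    def normalize(x):
        x_tilde = np.searchsorted(sorted_vals, x, side="right") / (n + 1)
        return np.array([x_tilde, x_tilde ** 2, np.sqrt(x_tilde)])

    return normalize

normalize = make_cdf_normalizer(np.random.exponential(size=10000))
print(normalize(1.3))   # [x, x^2, sqrt(x)] with x in [0, 1)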
[Note: a deeper/wider network could capture the non-linear relationships, but a compact network is preferred here in order to serve recommendations in real time]
The deep learning approach outperforms the matrix factorization approach previously used at YouTube, but the size of the gain is not specified.