The YouTube video recommendation system
Davidson, J., et al. | Google | ACM conference on Recommender systems | 2010 [link]
Goal: recommend the top N personalized videos, while preserving user privacy and keeping the recommendations interpretable to users.
Challenges
- missing or poor metadata
- the number of videos is on the same order as the number of active users
- most videos are under 10 minutes, so user interactions are short and noisy
- short cycle from upload to viral (recommendations must stay fresh)
High-level design
- Intuition: based on a user's watched, favorited and liked videos, a large set of similar videos is generated and then ranked for relevance and diversity.
- System: the components are decoupled for easier debugging and maintenance, and the system needs to be resilient to failure.
Data
- mainly two types of data:
— content data: video stream and metadata such as title, description etc.
— user activity data: explicit type like rating, liking and subscribing, and implicit activity like watching (and % of video watched)
- issues
— metadata may be missing, outdated or incorrect
— the fact that a user watched a video is not enough to conclude that the user actually liked it
— lag in implicit activity (generated asynchronously), e.g., the user closes the browser before a long-watch notification is received.
Similar videos
The similar videos of v are the videos a user is likely to watch after having watched v. The similarity between two videos V_i and V_j is measured as
C_ij / (C_i * C_j)
where C_ij is the co-visitation count, i.e. the number of times a user watched both V_i and V_j within a 24-hour period, and C_i and C_j are the individual counts of each video.
Note that this is very similar to the cosine similarity score over binary watch vectors: cosine divides C_ij by the product of the L2 norms, sqrt(C_i) * sqrt(C_j), whereas the score here divides by the product of the L1 norms, C_i * C_j.
The top N most similar videos are then picked for a given watched video. A minimum threshold is also applied to the overall view count to reduce noise.
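The scoring above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `related_videos`, the `sessions` input (one list of video ids per user per time window) and the `min_views` parameter are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def related_videos(sessions, top_n=2, min_views=1):
    """Return {video: top_n most similar videos} from co-visitation counts.

    sessions: iterable of lists of video ids, one list per user per window
    (hypothetical input format). min_views is an assumed noise threshold.
    """
    views = Counter()     # C_i: number of sessions containing video i
    co_views = Counter()  # C_ij: number of sessions containing both i and j
    for session in sessions:
        unique = sorted(set(session))
        views.update(unique)
        co_views.update(combinations(unique, 2))

    scores = {}
    for (i, j), c_ij in co_views.items():
        # Skip rarely viewed videos: their scores are dominated by noise.
        if views[i] < min_views or views[j] < min_views:
            continue
        r = c_ij / (views[i] * views[j])
        scores.setdefault(i, []).append((r, j))
        scores.setdefault(j, []).append((r, i))

    # Highest score first; ties are broken by video id for determinism.
    return {
        v: [u for _, u in sorted(cands, key=lambda t: (-t[0], t[1]))[:top_n]]
        for v, cands in scores.items()
    }


sessions = [["a", "b"], ["a", "b"], ["a", "c"], ["b", "c", "d"]]
print(related_videos(sessions))
print(related_videos(sessions, min_views=2))
```

In this toy data, video "d" was viewed only once, yet without a threshold it ranks first for both "b" and "c" because its small count shrinks the denominator. Raising `min_views` to 2 removes it, which is the kind of noise the minimum threshold is meant to suppress.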
Generating Recommendation Candidates
For each video a user watched or liked (the seed videos), a set of similar videos is generated. This set is often very close to what the user has already watched, so the recommendations can be narrow and miss content that is truly new to the user. A solution is to also include the similar videos of the similar videos, recursively.
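A minimal sketch of this expansion, assuming the similar-video lists are available as a plain dictionary. The names `expand_candidates`, `related` and `max_distance` are hypothetical, and bounding the recursion by a hop count is an assumption of this sketch.

```python
def expand_candidates(seeds, related, max_distance=2):
    """Collect videos reachable from the seeds within max_distance hops.

    related: {video: [similar videos]} (hypothetical precomputed mapping).
    Returns {candidate: seed that first reached it}; seeds are excluded.
    """
    seed_set = set(seeds)
    origin = {}
    frontier = [(s, s) for s in seeds]  # (video, originating seed)
    for _ in range(max_distance):
        next_frontier = []
        for video, seed in frontier:
            for neighbor in related.get(video, []):
                # Skip videos the user already has or that were already found.
                if neighbor in seed_set or neighbor in origin:
                    continue
                origin[neighbor] = seed
                next_frontier.append((neighbor, seed))
        frontier = next_frontier
    return origin


related = {"a": ["b", "c"], "b": ["a", "d"], "c": ["e"], "d": ["f"]}
print(expand_candidates(["a"], related, max_distance=1))
print(expand_candidates(["a"], related, max_distance=2))
```

Keeping the originating seed for each candidate costs little here and makes it possible to explain a recommendation later in terms of the video that triggered it.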
Ranking
1) Video quality: view count, ratings, sharing activities.
2) User specificity: higher rank for similar videos generated from higher-quality seed videos, as defined by view count, time of watch, etc.
A linear combination of the two signals is used to generate a ranked list. Diversity is then addressed by removing videos that are too similar to each other. One way is to limit the number of recommendations associated with a single seed video or with the same channel. More complex approaches such as topic clustering could also be used.
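The ranking step can be illustrated as follows. This is a sketch under assumptions, not the paper's implementation: the `Candidate` fields, the equal default weights and the `max_per_seed` cap are all hypothetical choices, and both signals are assumed to be pre-normalized to comparable ranges.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    video: str
    seed: str            # seed video that produced this candidate
    quality: float       # assumed normalized video-quality signal
    specificity: float   # assumed normalized user-specificity signal


def rank(candidates, w_quality=0.5, w_specificity=0.5, max_per_seed=2, top_n=5):
    """Score by a linear combination, then cap recommendations per seed."""
    scored = sorted(
        candidates,
        key=lambda c: (-(w_quality * c.quality + w_specificity * c.specificity), c.video),
    )
    per_seed = {}
    result = []
    for c in scored:
        # Diversity: drop candidates once their seed has used up its quota.
        if per_seed.get(c.seed, 0) >= max_per_seed:
            continue
        per_seed[c.seed] = per_seed.get(c.seed, 0) + 1
        result.append(c.video)
        if len(result) == top_n:
            break
    return result


candidates = [
    Candidate("v1", "s1", 0.9, 0.9),
    Candidate("v2", "s1", 0.8, 0.8),
    Candidate("v3", "s1", 0.7, 0.7),
    Candidate("v4", "s2", 0.2, 0.4),
    Candidate("v5", "s2", 0.1, 0.1),
]
print(rank(candidates, top_n=3))
```

With the cap of two per seed, "v3" is skipped in favor of the lower-scored "v4" from a different seed, trading a little relevance for diversity.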
User interface
Seed videos are used for interpretability — “because you watched xxx”
Implementations
A batch-oriented pre-computation approach is used for better scalability. Recommendations are generated by MapReduce jobs and saved in Bigtable.
The pipeline runs several times per day to reduce the delay.
Results
The recommendations generated by this approach achieve more than twice the click-through rate of the baseline methods: most viewed videos in a day, top rated videos in a day, and top favorited videos in a day.