The article first reviewed a few earlier recommendation approaches and their limitations.
Traditional collaborative filtering represents a customer as an N-dimensional vector over the items, where each element indicates the customer's rating (or purchase). Similarity between two customer vectors is commonly measured by cosine similarity. The algorithm then recommends items bought by similar customers (e.g., an item is ranked higher when more of the similar customers purchased it).
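A minimal Python sketch of this user-based scheme, assuming a small dense ratings matrix; the toy data and the names cosine and recommend_for are illustrative, not from the article:

```python
import numpy as np

# Toy ratings matrix: rows = customers, columns = items, 0 = no rating (illustrative data).
ratings = np.array([
    [5, 0, 3, 0],
    [4, 0, 3, 1],
    [0, 2, 0, 5],
    [1, 0, 0, 4],
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two customer vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return (u @ v) / denom if denom else 0.0

def recommend_for(user_idx, k=2, top_n=2):
    """Rank unseen items by how heavily the k most similar customers rated them."""
    target = ratings[user_idx]
    sims = sorted(((cosine(target, ratings[j]), j)
                   for j in range(len(ratings)) if j != user_idx), reverse=True)
    neighbors = [j for _, j in sims[:k]]
    scores = ratings[neighbors].sum(axis=0)
    scores[target > 0] = 0          # never recommend items the user already rated
    return [int(i) for i in np.argsort(-scores)[:top_n] if scores[i] > 0]

print(recommend_for(0))             # -> [3] with this toy data
```

Note that scoring a single user already requires a similarity computation against every other customer, which is where the scaling problem below comes from.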
The algorithm needs to scan every customer. Because only a few customers have purchased a significant percentage of the items, the customer vectors are sparse, so for M customers and N items the performance is approximately O(M + N) in practice rather than the worst-case O(MN).
For 10 million or more customers and 1 million items, the algorithm runs into severe performance and scaling issues.
For a given user, one can reduce M by randomly sampling the customers or by discarding customers with few purchases, but the selected customers may then be less similar to that user.
One may also reduce N by discarding very popular or very unpopular items; these items will then never appear as recommendations.
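A small sketch of both reductions on a binary purchase matrix; the function name, thresholds, and sampling policy are illustrative assumptions, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

def shrink(purchases, sample_m=1000, min_purchases=2, min_buyers=2, max_buyers=50):
    """Reduce M by sampling non-sparse customers and N by dropping extreme items."""
    # Reduce M: drop customers with too few purchases, then sample at random.
    active = np.flatnonzero((purchases > 0).sum(axis=1) >= min_purchases)
    rows = rng.choice(active, size=min(sample_m, active.size), replace=False)
    reduced = purchases[rows]
    # Reduce N: drop items bought by almost nobody or by almost everybody.
    buyers = (reduced > 0).sum(axis=0)
    cols = np.flatnonzero((buyers >= min_buyers) & (buyers <= max_buyers))
    return reduced[:, cols], rows, cols
```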
Clustering can be used to reduce M or N by a large factor. Clustering the item space has the same effect as eliminating low-frequency items. Clustering customers hurts quality: when the number of clusters is small, every customer in a cluster receives the same recommendations, which are less relevant to each individual; when the number of clusters is large, the scalability problem remains.
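A rough sketch of the customer-clustering variant, assuming scikit-learn is available; the cluster count and toy data are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy purchase matrix: rows = customers, columns = items (illustrative data).
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Offline: assign each customer to a cluster (few clusters -> generic recommendations).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(purchases)

def recommend_from_cluster(user_idx, top_n=2):
    """Recommend the items most purchased within the user's own cluster."""
    members = purchases[labels == labels[user_idx]]
    scores = members.sum(axis=0)
    scores[purchases[user_idx] > 0] = 0      # skip items the user already owns
    return [int(i) for i in np.argsort(-scores)[:top_n] if scores[i] > 0]

print(recommend_from_cluster(0))
```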
Given a user’s purchased items, search- or content-based methods recommend popular items by the same author or artist, or items with similar keywords or subjects. For users with few purchases, search-based methods scale and perform well, but quality decreases as the user makes more purchases: the recommendations become either too broad (all best-selling drama DVD titles) or too narrow (all books by the same author). Recommendations should instead help a user discover new, relevant, and interesting items.
[Note: content-based methods mostly rely on human-defined metadata such as keywords. They may be effective for mitigating the cold-start problem of collaborative filtering methods.]
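A minimal sketch of such a search-based recommender; the catalog structure, field names, and data are hypothetical, meant only to show why results skew toward popular items that share an author or keyword with past purchases:

```python
# Hypothetical catalog: item id -> metadata (author, keywords, popularity).
catalog = {
    "b1": {"author": "Le Guin", "keywords": {"fantasy", "classic"}, "popularity": 90},
    "b2": {"author": "Le Guin", "keywords": {"sci-fi"},             "popularity": 70},
    "b3": {"author": "Herbert", "keywords": {"sci-fi", "desert"},   "popularity": 95},
    "b4": {"author": "Austen",  "keywords": {"classic", "romance"}, "popularity": 80},
}

def search_based_recommend(purchased, top_n=2):
    """Recommend popular catalog items sharing an author or keyword with past purchases."""
    authors = {catalog[i]["author"] for i in purchased}
    keywords = set().union(*(catalog[i]["keywords"] for i in purchased))
    candidates = [
        item for item, meta in catalog.items()
        if item not in purchased
        and (meta["author"] in authors or meta["keywords"] & keywords)
    ]
    # Ranking purely by popularity is what makes results too broad or too narrow.
    return sorted(candidates, key=lambda i: -catalog[i]["popularity"])[:top_n]

print(search_based_recommend({"b1"}))   # -> ['b4', 'b2'] with this toy catalog
```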
Item-to-Item collaborative filtering
Amazon recommends products based on the items in the shopper’s cart. The algorithm must scale to tens of millions of customers and millions of products.
The algorithm finds items similar to each of the user’s purchased or rated items and combines those similar items into a recommendation list.
The authors used cosine similarity to measure how similar two items are, where each item is represented as a vector over the customers who purchased it. Sampling the customers who purchase best-selling titles reduces the offline runtime with little reduction in quality.
Given the similar-items table, the online algorithm finds items similar to the user’s purchases and ratings, aggregates those items, and recommends the most popular or most correlated ones. This online computation is very fast, depending only on the number of items the user has purchased or rated, not on the total number of customers or catalog items.
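A compact sketch of the two stages as summarized above (not the production implementation): the offline stage builds a similar-items table from cosine similarity between item column vectors, and the online stage aggregates the neighbors of the user’s items. The matrix and names are illustrative:

```python
import numpy as np

# Toy purchase matrix: rows = customers, columns = items, 1 = purchased (illustrative data).
purchases = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

def build_similar_items(matrix, k=2):
    """Offline: for each item, keep the k most similar items by cosine of column vectors."""
    norms = np.linalg.norm(matrix, axis=0)
    sims = (matrix.T @ matrix) / np.outer(norms, norms)
    np.fill_diagonal(sims, -1.0)                 # exclude self-similarity
    table = {}
    for item in range(matrix.shape[1]):
        neighbors = np.argsort(-sims[item])[:k]
        table[item] = [(int(j), float(sims[item, j])) for j in neighbors]
    return table

similar_items = build_similar_items(purchases)

def recommend(user_items, top_n=3):
    """Online: aggregate neighbors of the user's items; cost depends only on the user's items."""
    scores = {}
    for item in user_items:
        for neighbor, sim in similar_items[item]:
            if neighbor not in user_items:
                scores[neighbor] = scores.get(neighbor, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend({0, 1}))   # -> [2, 4] with this toy data
```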
At the time, Amazon had more than 29 million customers and several million catalog items.