Wu, Lili, Sam Shah, Sean Choi, Mitul Tiwari, and Christian Posse. "The Browsemaps: Collaborative Filtering at LinkedIn." In RSWeb@RecSys. 2014. [pdf]
This work describes a centralized system, Browsemaps, that provides collaborative filtering information for various LinkedIn products, such as “companies you may want to follow”, “similar companies”, “similar profiles”, and “suggested profile updates”.
Take “companies you may want to follow” for example. Browsemaps provides a list of companies that are similar to the companies a user already follows. The similarity is calculated from user activity data, in particular, co-follows.
The system calculates the similar items offline in Hadoop, and saves the similar item information to a key-value store.
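To make the co-follow idea concrete, here is a minimal sketch of computing similar companies in a batch and writing them to a key-value mapping. It is not the Browsemaps implementation: the cosine-style normalization, the function name `build_similar_items`, the `top_k` parameter, and the plain `dict` standing in for the key-value store are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_similar_items(follows, top_k=3):
    """Batch-compute similar items from co-follows.

    follows: dict mapping user id -> set of followed company ids.
    Returns a dict (a stand-in for a key-value store) mapping each
    company id to a ranked list of similar company ids.
    """
    item_counts = defaultdict(int)   # number of followers per company
    pair_counts = defaultdict(int)   # number of users following both companies
    for items in follows.values():
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1

    # Assumed scoring: co-follow count normalized by follower counts.
    scores = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        sim = n / sqrt(item_counts[a] * item_counts[b])
        scores[a][b] = sim
        scores[b][a] = sim

    store = {}
    for item, neighbours in scores.items():
        # Highest score first; ties broken by id for determinism.
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        store[item] = [other for other, _ in ranked[:top_k]]
    return store


# Toy data with made-up company ids.
follows = {
    "u1": {"acme", "globex"},
    "u2": {"acme", "globex", "initech"},
    "u3": {"acme", "initech"},
    "u4": {"globex"},
}
similar_store = build_similar_items(follows)
print(similar_store["acme"])
```

In a real deployment the loop over users would be a distributed job and the resulting mapping would be bulk-loaded into the store; the sketch only shows the shape of the computation.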
Product teams can combine the collaborative filtering information with content information for recommendations. For the “companies you may want to follow” product, a collaborative filtering feature would be whether a company appears among the companies similar to those the user already follows. Content features include how well the user and the company match on industry, location, and years of experience.
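As a rough illustration of combining the two kinds of signal, the sketch below builds a feature dictionary for one (user, company) candidate. The field names, the `experience_range` attribute on a company, and the binary encoding are hypothetical; the paper does not specify how these features are represented or how a downstream model consumes them.

```python
def candidate_features(user, company, similar_store):
    """Build collaborative filtering and content features for one candidate."""
    # Union of companies similar to anything the user already follows.
    similar_to_followed = set()
    for followed_id in user["follows"]:
        similar_to_followed.update(similar_store.get(followed_id, []))

    low, high = company["experience_range"]  # hypothetical company attribute
    return {
        # Collaborative filtering feature.
        "cf_similar_to_followed": int(company["id"] in similar_to_followed),
        # Content features.
        "same_industry": int(user["industry"] == company["industry"]),
        "same_location": int(user["location"] == company["location"]),
        "experience_match": int(low <= user["years_experience"] <= high),
    }


# Made-up data for illustration.
similar_store = {"acme": ["initech", "globex"]}
user = {
    "follows": {"acme"},
    "industry": "software",
    "location": "SF",
    "years_experience": 4,
}
company = {
    "id": "initech",
    "industry": "software",
    "location": "NYC",
    "experience_range": (2, 8),
}
features = candidate_features(user, company, similar_store)
print(features)
```

A product team could feed such a dictionary into whatever ranking model it uses; the point is only that the collaborative filtering lookup becomes one feature among several.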
Cold start is an issue for newly posted items when using a collaborative filtering approach, but one can leverage a user’s historical activity to make recommendations. For example, when a member has viewed several jobs and lands on a newly posted job that does not yet have similar jobs generated, it is useful to show jobs similar to those the member viewed before the new one.
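The fallback described above can be sketched as follows. The function name, the most-recent-first ordering, and the de-duplication policy are assumptions; the source only says that similar jobs of previously viewed jobs are used when the new job has none.

```python
def similar_jobs_with_fallback(job_id, viewed_history, similar_store, top_k=5):
    """Return similar jobs for job_id, falling back to the member's history.

    viewed_history: job ids the member viewed before job_id, oldest first.
    similar_store: dict mapping job id -> ranked list of similar job ids.
    """
    direct = similar_store.get(job_id)
    if direct:
        return direct[:top_k]

    # Cold start: borrow the similar jobs of previously viewed jobs,
    # most recent view first, skipping jobs the member has already seen.
    seen = set(viewed_history) | {job_id}
    results = []
    for previous in reversed(viewed_history):
        for candidate in similar_store.get(previous, []):
            if candidate not in seen:
                seen.add(candidate)
                results.append(candidate)
    return results[:top_k]


# Made-up job ids: "j9" is newly posted and has no entry in the store.
similar_store = {"j1": ["j2", "j7"], "j2": ["j1", "j8"]}
recommendations = similar_jobs_with_fallback("j9", ["j1", "j2"], similar_store)
print(recommendations)
```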
As the Browsemaps product is centralized, sometimes it takes time to discover data quality issues. For example, an issue in capturing an activity event can be hard to catch until it surfaces in downstream product applications.
A few steps made the system more robust:
- Transformed the LinkedIn data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system
- Added robust auditing to ensure correct, reliable per-hop data transfer, from the frontend all the way to the relevance systems
- Compared input and output coverage and offline metrics, alerting if there is significant deviation
- Added code-driven test automation for tracking events, so most regressions are caught as part of the continuous integration process, not after release
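The coverage comparison in the list above might look something like the sketch below. The definition of coverage (fraction of input items with a non-empty result), the `max_drop` threshold, and the function name are assumptions chosen for illustration, not details from the paper.

```python
def check_coverage(input_ids, output_store, previous_coverage, max_drop=0.05):
    """Compare output coverage against the previous run.

    Coverage is the fraction of input items that have at least one
    similar item in the output. Returns (coverage, alert), where alert
    is True if coverage dropped by more than max_drop.
    """
    if not input_ids:
        return 0.0, previous_coverage > max_drop
    covered = sum(1 for item in input_ids if output_store.get(item))
    coverage = covered / len(input_ids)
    alert = (previous_coverage - coverage) > max_drop
    return coverage, alert


# Made-up run: only two of four input items got results this time.
input_ids = ["a", "b", "c", "d"]
output_store = {"a": ["b"], "b": ["a"], "c": []}
coverage, alert = check_coverage(input_ids, output_store, previous_coverage=0.9)
print(coverage, alert)
```

A check like this would run after each batch job, so a broken tracking event shows up as a coverage drop before downstream products consume the data.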