Added embedding based retrieval blog post

This commit is contained in:
Div 2021-04-12 13:22:50 -07:00 committed by R. Tyler Croy
parent b107b9252a
commit 6f06f5f0e1
7 changed files with 308 additions and 0 deletions


@@ -0,0 +1,4 @@
---
team: Recommendations
permalink: "/blog/category/recommendations"
---


@@ -125,3 +125,10 @@ ajhofmann:
trupin:
name: Theo Rupin
github: trupin
div:
name: Div Dasani
github: divdasani
about: |
Div is a Machine Learning Engineer on the Recommendations team, working on personalized
recommendations and online faceted search.


@@ -44,3 +44,13 @@ Security Engineering:
Internal Tools:
lever: 'Internal Tools'
Recommendations:
lever: 'Recommendations'
about: |
The Recommendations team at Scribd wants to inspire users to read more and discover
new content and topics. Our team comprises Machine Learning and Software Engineers,
Product Managers, Data Scientists, and QA and Project Managers, all of whom have the
shared passion of building the world's best recommendation engine for books. We pride
ourselves on using a variety of open-source technologies to develop and productionize
state-of-the-art machine learning solutions.


@@ -0,0 +1,287 @@
---
layout: post
title: "Embedding-based Retrieval at Scribd"
author: div
tags:
- machinelearning
- real-time
- search
team: Recommendations
---
Building recommendation systems like those implemented at large companies such as
[Facebook](https://arxiv.org/pdf/2006.11632.pdf) and
[Pinterest](https://labs.pinterest.com/user/themes/pin_labs/assets/paper/pintext-kdd2019.pdf)
can be accomplished using off-the-shelf tools like Elasticsearch. Many modern recommendation systems implement
*embedding-based retrieval*, a technique that uses embeddings to represent documents, and then converts the
recommendations retrieval problem into a [similarity search](https://en.wikipedia.org/wiki/Similarity_search) problem
in the embedding space. This post details our approach to “embedding-based retrieval” with Elasticsearch.
### Context
Recommendations play an integral part in helping users discover content that delights them on the Scribd platform,
which hosts millions of premium ebooks, audiobooks, and more, along with over a hundred million user-uploaded items.
![](/post-images/2021-04-ebr-scribd/f1.png)
*Figure One: An example of a row on Scribd's home page that is generated by the recommendations service*
Currently, Scribd uses a collaborative-filtering-based approach to recommend content, but this model limits our ability
to personalize recommendations for each user. This is our primary motivation for rethinking the way we recommend content,
and has resulted in us shifting to [Transformer](http://jalammar.github.io/illustrated-transformer/)-based sequential
recommendations. While the model architecture and details won't be discussed in this post, the key takeaway is that our
implementation outputs *embeddings*: vector representations of items and users that capture semantic information such
as the genre of an audiobook or the reading preferences of a user. Thus, the challenge is now how to utilize these
millions of embeddings to serve recommendations in an online, reliable, and low-latency manner to users as they
use Scribd. We built an embedding-based retrieval system to solve this use case.
### Recommendations as a Faceted Search Problem
There are many technologies capable of performing fast, reliable nearest neighbors search across a large number of
document vectors. However, our system has the additional challenge of requiring support for
[faceted search](https://en.wikipedia.org/wiki/Faceted_search), that is, being able to retrieve the most relevant
documents over a subset of the corpus defined by user-specific business rules (e.g. language of the item or geographic
availability) at query time. At a high level, we desired a system capable of fulfilling the following requirements:
1. The system should be able to prefilter results over one or more given facets. This facet can be defined as a filter
over numerical, string, or category fields
2. The system should support one or more exact distance metrics (e.g. dot product, euclidean distance)
3. The system should allow updates to data without downtime
4. The system should be highly available, and be able to respond to a query quickly. We targeted a service-level
objective (SLO) with p95 of <100ms
5. The system should have helpful monitoring and alerting capabilities, or provide support for external solutions
After evaluating several candidates for this system, we found Elasticsearch to be the most suitable for our use case.
In addition to satisfying all the requirements above, it has the following benefits:
- Widely used, with a large community and thorough documentation, which makes long-term maintenance and onboarding easier
- Updating schemas can easily be automated using pre-specified templates, which makes ingesting new data and maintaining
indices a breeze (see the template sketch just after this list)
- Supports custom plugin integrations
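To make the template point concrete, here is a minimal sketch of what a legacy index template for our item index could look like. The template name, index pattern, shard counts, and embedding dimension (256) are illustrative assumptions, and the sketch assumes an Elasticsearch 7.x cluster where embeddings are mapped as `dense_vector`, the field type that enables the `dotProduct` scoring used later in this post:
```
# Hypothetical legacy index template: any new index matching the pattern
# (e.g. a nightly item_index_2021_04_12) picks up these settings and mappings.
curl -X PUT -H 'Content-Type: application/json' \
<URL to cluster>:9200/_template/item_index_template \
-d \
'
{"index_patterns": ["item_index*"],
 "settings": {"number_of_shards": 6, "number_of_replicas": 1},
 "mappings": {"properties": {
    "item_format": {"type": "keyword"},
    "language": {"type": "keyword"},
    "country": {"type": "keyword"},
    "categories": {"type": "keyword"},
    "item_embed": {"type": "dense_vector", "dims": 256}}}}
'
```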
However, Elasticsearch also has some drawbacks, the most notable of which is the lack of true in-memory partial updates.
This is a dealbreaker if updates to the system happen frequently and in real-time, but our use case only requires support
for nightly batch updates, so this is a tradeoff we are willing to accept.
We also looked into a few other systems as potential solutions. While
[Open Distro for Elasticsearch](https://opendistro.github.io/for-elasticsearch/) (aka AWS Managed Elasticsearch) was
originally considered due to its simplicity in deployment and maintenance, we decided not to move forward with this
solution due to its lack of support for prefiltering. [Vespa](https://vespa.ai/) is also a promising candidate with a
number of additional useful features, such as true in-memory partial updates and support for TensorFlow integration
for advanced, ML-based ranking. We did not proceed with Vespa due to maintenance concerns: deploying to
multiple nodes is challenging since EKS support is lacking and documentation is sparse. Additionally, Vespa requires the
entire application package containing all indices and their schemas to be deployed at once, which makes working in a
distributed fashion (i.e. working with teammates and using a VCS) challenging.
### How to Set Up Elasticsearch as a Faceted Search Solution
![](/post-images/2021-04-ebr-scribd/f2.png)
*Figure Two: A high level diagram illustrating how the Elasticsearch system fetches recommendations*
Elasticsearch stores data as JSON documents within indices, which are logical namespaces with data mappings and shard
configurations. For our use case, we defined two indices, a `user_index` and an `item_index`. The former is essentially
a key-value store that maps a user ID to a corresponding user embedding. A sample document in the `user_index` looks like:
```
{"_id": 4243913,
"user_embed": [-0.5888184, ..., -0.3882332]}
```
Notice here we use Elasticsearch's built-in `_id` field rather than creating a custom field. This is so we can fetch user
embeddings with a `GET` request rather than having to search for them, like this:
```
curl <URL to cluster>:9200/user_index/_doc/4243913
```
Now that we have the user embedding, we can use it to query the `item_index`, which stores each item's metadata
(which we will use to perform faceted search) and embedding. Here's what a document in this index could look like:
```
{"_id": 13375,
"item_format": "audiobook",
"language": "english",
"country": "Australia",
"categories": ["comedy", "fiction", "adventure"],
"item_embed": [0.51400936,...,0.0892048]}
```
We want to accomplish two goals in our query: retrieving the documents most relevant to the user (which in our model
corresponds to the dot product between the user and item embeddings), and ensuring that all retrieved documents have the
same filter values as those requested by the user. This is where Elasticsearch shines:
```
curl -H 'Content-Type: application/json' \
<URL to cluster>:9200/item_index/_search \
-d \
'
{"_source": ["_id"],
"size": 30,
"query": {"script_score": {"query": {"bool":
{"must_not": {"term": {"categories": "adventure"}},
"filter": [{"term": {"language": "english"}},
{"term": {"country": "Australia"}}]}},
"script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
return Math.max(0, value+10000);",
"params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
'
```
Let's break this query down to understand what's going on:
1. Line 2: Here we are querying the `item_index` using Elasticsearch's `_search` API
2. Lines 5,6: We specify which attributes of the item documents we'd like returned (in this case, only `_id`), and how many
results (`30`)
3. Line 7: Here we are querying using the `script_score` feature; this is what allows us to first prefilter our corpus and
then rank the remaining subset using a custom script
4. Lines 8-10: Elasticsearch has various boolean query types for filtering. In this example we specify that we
are interested only in `english` items that can be viewed in `Australia` and which are not categorized as `adventure`
5. Lines 11-12: Here is where we define our custom scoring script. Elasticsearch has a built-in `dotProduct` function we
can employ, which is optimized to speed up computation. Note that our embeddings are not normalized, and Elasticsearch
prohibits negative scores. For this reason, we had to include the score transformation in line 12 to ensure our scores
were positive
6. Line 13: Here we can add parameters which are passed to the scoring script
This query will retrieve recommendations based on one set of filters. However, in addition to user filters, each row on
Scribd's homepage also has row-specific filters (for example, the row “Audiobooks Recommended for You” would have a
row-specific filter of `"item_format": "audiobook"`). Rather than making multiple queries to the Elasticsearch cluster
with each combination of user and row filters, we can conduct multiple independent searches in a single query using the
`_msearch` API. The following example query generates recommendations for hypothetical “Audiobooks Recommended for You”
and “Comedy Titles Recommended for You” rows:
```
curl -H 'Content-Type: application/json' \
<URL to cluster>:9200/_msearch \
-d \
'
{"index": "item_index"}
{"_source": ["_id"],
"size": 30,
"query": {"script_score": {"query": {"bool":
{"must_not": {"term": {"categories": "adventure"}},
"filter": [{"term": {"language": "english"}},
{"term": {"item_format": "audiobook"}},
{"term": {"country": "Australia"}}]}},
"script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
return Math.max(0, value+10000);",
"params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
{"index": "item_index"}
{"_source": ["_id"],
"size": 30,
"query": {"script_score": {"query": {"bool":
{"must_not": {"term": {"categories": "adventure"}},
"filter": [{"term": {"language": "english"}},
{"term": {"categories": "comedy"}},
{"term": {"country": "Australia"}}]}},
"script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
return Math.max(0, value+10000);",
"params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
'
```
#### Shard Configuration
Elasticsearch partitions each index into shards and stores replica copies of them across multiple nodes for resilience
and increased query performance. The number of primary shards is configurable only at index creation time, while the
number of replicas can be adjusted later. Here are some things to consider regarding shards:
1. Try out various shard configurations to see what works best for each use case.
[Elastic](https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster) recommends 20-40GB of data per
shard, while [eBay](https://tech.ebayinc.com/engineering/elasticsearch-performance-tuning-practice-at-ebay/) likes to keep
their shard size below 30GB. However, these values did not work for us, and we found much smaller shard sizes (<5GB) to
boost performance in the form of reduced latency at query time.
2. When updating data, do not update documents within the existing index. Instead, create a new index, ingest updated
documents into this index, and [re-alias](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html)
from the old index to the new one (a minimal sketch of this flow follows the list). This process allows you to retain
older data in case an update needs to be reverted, and allows shards to be reconfigured at update time.
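Here is a minimal sketch of that flow, assuming a hypothetical new index named `item_index_v2`, an old one named `item_index_v1`, and an alias named `item_index` that queries point at:
```
# Create the new index; shard and replica counts can differ from the old index's.
curl -X PUT -H 'Content-Type: application/json' \
<URL to cluster>:9200/item_index_v2 \
-d '{"settings": {"number_of_shards": 6, "number_of_replicas": 1}}'

# (Bulk-ingest the refreshed documents into item_index_v2 at this point.)

# Atomically swap the alias so queries never see a partially updated index;
# the old index sticks around in case the update needs to be reverted.
curl -X POST -H 'Content-Type: application/json' \
<URL to cluster>:9200/_aliases \
-d \
'
{"actions": [
  {"remove": {"index": "item_index_v1", "alias": "item_index"}},
  {"add": {"index": "item_index_v2", "alias": "item_index"}}]}
'
```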
#### Cluster Configuration
We deployed our cluster across multiple data and primary nodes to enjoy the benefits of data redundancy, increased
availability, and the ability to scale horizontally as our service grows. We found that deploying the cluster across
multiple availability zones results in an increased latency during query time, but this is a tradeoff we accepted in the
interest of availability.
As for hardware specifics, we use AWS EC2 instances to host our cluster. In production, we have 3 `t3a.small` primary-only
nodes, 3 `c5d.12xlarge` data nodes, and 1 `t3.micro` Kibana node. The primary-only nodes are utilized only in a coordinating
role (to route requests, distribute bulk indexing, etc), essentially acting as smart load balancers. This is why these
nodes are much smaller than the data nodes, which handle the bulk of storage and computational costs. Kibana is a data
visualization and monitoring tool; however, in production we use Datadog for our monitoring and alerting responsibilities,
which is why we do not allocate many resources for the Kibana node.
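As a rough sketch of how these roles translate into node-level settings, an `elasticsearch.yml` along the following lines would describe the two node types (this assumes the boolean role settings of Elasticsearch 7.x; the exact values on our nodes may differ):
```
# Primary-only node: manages cluster state and routes requests, holds no shard data.
node.master: true
node.data: false
node.ingest: false

# Data node: holds the shards and does the heavy lifting for storage and queries.
node.master: false
node.data: true
node.ingest: true
```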
### What Generating Recommendations Looks Like
![](/post-images/2021-04-ebr-scribd/f3.png)
*Figure Three: a diagram illustrating the system design for Personalization's Embedding-based Retrieval Service*
Step by step, this is how recommendations are generated for a user when they request the home page:
1. The Scribd app passes the user's information to the recommendations service
2. The recommendations service queries Elasticsearch with the user's ID to retrieve their user embedding, which is stored
in a user index
3. The recommendations service once again queries Elasticsearch, this time with the user's embedding along with their
user query filters. This query is a multi-search request to the item index: one for every desired row (a minimal sketch
of this flow follows the list)
4. Elasticsearch returns these recommendations to the service, which are postprocessed and generated into rows before
being sent to the client
5. The client renders these recommendations and displays them to the user
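Putting steps 2 through 4 together, a minimal sketch of what the service does against the cluster might look like this. It is purely illustrative (our service is not a shell script), shows a single row's query rather than the full `_msearch` request, and uses `jq` to pull the embedding out of the first response:
```
# Step 2: look up the user's embedding by document ID (the sample ID from earlier).
USER_EMBED=$(curl -s '<URL to cluster>:9200/user_index/_doc/4243913' | jq -c '._source.user_embed')

# Step 3: score and prefilter items with that embedding.
curl -s -H 'Content-Type: application/json' \
'<URL to cluster>:9200/item_index/_search' \
--data-binary @- <<EOF
{"_source": ["_id"],
 "size": 30,
 "query": {"script_score": {
   "query": {"bool": {"filter": [{"term": {"language": "english"}},
                                 {"term": {"country": "Australia"}}]}},
   "script": {"source": "double value = dotProduct(params.user_embed, 'item_embed'); return Math.max(0, value+10000);",
              "params": {"user_embed": ${USER_EMBED}}}}}}
EOF

# Step 4: the returned hits are post-processed into rows before being sent to the client.
```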
With this approach, Elasticsearch will serve two purposes: acting as a key-value store, and retrieving recommendations.
Elasticsearch is, of course, a slower key-value store than traditional databases, but we found the increase in latency
to be insignificant (~5ms) for our use case. Furthermore, the benefit of this approach is that it only requires
maintaining data in one system; using multiple systems to store data would create a consistency challenge.
The underlying Personalization model is very large, which makes retraining it an expensive process. Thus, it needs to be
retrained often enough to account for factors like user preference drift, but not so often that it wastes
computational resources. We found that retraining the model weekly worked well for us. Item embeddings, which typically update
only incrementally, are also recomputed weekly. However, user embeddings are recomputed daily to provide fresh
recommendations based on changing user interests. These embeddings along with relevant metadata are ingested into the
Elasticsearch index in a batch process using [Apache Spark](https://spark.apache.org/) and are scheduled through
[Apache Airflow](https://airflow.apache.org/). We monitor this ingest process along with real-time serving metrics
through Datadog.
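The Spark job handles this at scale, but conceptually each batch boils down to newline-delimited `_bulk` requests like the sketch below, using the sample documents from earlier in the post (embedding vectors elided as before; the real job writes far more documents per request):
```
curl -H 'Content-Type: application/x-ndjson' \
<URL to cluster>:9200/_bulk \
--data-binary '{"index": {"_index": "item_index", "_id": 13375}}
{"item_format": "audiobook", "language": "english", "country": "Australia", "categories": ["comedy", "fiction", "adventure"], "item_embed": [0.51400936, ..., 0.0892048]}
{"index": {"_index": "user_index", "_id": 4243913}}
{"user_embed": [-0.5888184, ..., -0.3882332]}
'
```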
#### Load Testing
Our primary goal during load testing was to ensure that our system was able to reliably respond to a “reasonably large”
number of requests per second and deliver a sufficient number of relevant recommendations, even with
multiple facets applied within each query. We also took this opportunity to experiment with various aspects of our system to
understand their impact on performance. These include:
- Shard and replica configuration: We found that increasing the number of shards increased performance, but only to a
point; if a cluster is over-sharded, the overhead of each shard outweighs the marginal performance gain of the
additional partition
- Dataset size: We artificially increased the size of our corpus several times to ensure the system's performance would remain
sufficient even as our catalog continues to grow
- Filter and mapping configurations: Some filters (like `range` inequalities) are more expensive than traditional
categorical filters. Additionally, increasing the number of fields in each document also has a negative impact on latency.
Our use case calls for several filters across hundreds of document fields, so we experimented with several document and query
configurations to find the one that performed best for our system
Our system is currently deployed to production and serves ~50rps with a p95 latency <60ms.
### Results
Using Scribd's internal A/B testing platform, we conducted an experiment comparing the existing recommendations service
against the new personalization model with its embedding-based retrieval architecture across the home and discover page surfaces.
The test ran for approximately a month with >1M Scribd users (trialers or subscribers) assigned as participants. After
careful analysis of results, we saw the following statistically significant (p<0.01) improvements in the personalization
variant compared to the control experience:
- Increase in the number of users who clicked on a recommended item
- Increase in the average number of clicks per user
- Increase in the number of users with a read time of at least 10 minutes (in a three day window)
These increases represent significant business impact on key performance metrics. The personalization model currently
generates recommendations for every (signed-in) Scribd user's home and discover pages.
### Next Steps
Now that the infrastructure and model are in place, we are looking to add a slew of improvements to the existing system.
Our immediate efforts will focus on expanding the scope of this system to include more surfaces and row modules within
the Scribd experience. Additional long-term projects include adding an online contextual reranker to increase
the relevance and freshness of recommendations, and potentially integrating our system with an infrastructure-as-code
tool to more easily manage and scale compute resources.
Thank you for reading! We hope you found this post useful and informative.
### Thank You 🌮
Thank you to
[Snehal Mistry](https://www.linkedin.com/in/snehal-mistry-b986b53/),
[Jeffrey Nguyen](https://www.linkedin.com/in/jnguyenfc/),
[Natalie Medvedchuk](https://www.linkedin.com/in/natalie-medvedchuk/),
[Dimitri Theoharatos](https://www.linkedin.com/in/dimitri-theoharatos/),
[Adrian Lienhard](https://www.linkedin.com/in/adrianlienhard/),
and countless others, all of whom provided invaluable guidance and assistance throughout this project.
(giving tacos 🌮 is how we show appreciation here at Scribd)

Three binary image files added (406 KiB, 179 KiB, and 220 KiB; not shown).