Added embedding based retrieval blog post

This commit is contained in:
Div 2021-04-12 13:22:50 -07:00 committed by R. Tyler Croy
parent b107b9252a
commit 6f06f5f0e1
7 changed files with 308 additions and 0 deletions


@@ -0,0 +1,4 @@
---
team: Recommendations
permalink: "/blog/category/recommendations"
---


@@ -125,3 +125,10 @@ ajhofmann:
trupin:
name: Theo Rupin
github: trupin
div:
name: Div Dasani
github: divdasani
about: |
Div is a Machine Learning Engineer on the Recommendations team, working on personalized
recommendations and online faceted search.


@@ -44,3 +44,13 @@ Security Engineering:
Internal Tools:
lever: 'Internal Tools'
Recommendations:
lever: 'Recommendations'
about: |
The Recommendations team at Scribd wants to inspire users to read more and discover
new content and topics. Our team comprises Machine Learning and Software Engineers,
Product Managers, Data Scientists, and QA and Project Managers, all of whom have the
shared passion of building the world's best recommendation engine for books. We pride
ourselves on using a variety of open-source technologies to develop and productionize
state-of-the-art machine learning solutions.


@@ -0,0 +1,287 @@
---
layout: post
title: "Embedding-based Retrieval at Scribd"
author: div
tags:
- machinelearning
- real-time
- search
team: Recommendations
---
Building recommendation systems like those implemented at large companies such as
[Facebook](https://arxiv.org/pdf/2006.11632.pdf) and
[Pinterest](https://labs.pinterest.com/user/themes/pin_labs/assets/paper/pintext-kdd2019.pdf)
can be accomplished using off-the-shelf tools like Elasticsearch. Many modern recommendation systems implement
*embedding-based retrieval*, a technique that uses embeddings to represent documents, and then converts the
recommendations retrieval problem into a [similarity search](https://en.wikipedia.org/wiki/Similarity_search) problem
in the embedding space. This post details our approach to “embedding-based retrieval” with Elasticsearch.
### Context
Recommendations play an integral part in helping users discover content that delights them on the Scribd platform,
which hosts millions of premium ebooks, audiobooks, and more, along with over a hundred million user-uploaded items.
![](/post-images/2021-04-ebr-scribd/f1.png)
*Figure One: An example of a row on Scribd's home page that is generated by the recommendations service*
Currently, Scribd uses a collaborative-filtering-based approach to recommend content, but this model limits our ability
to personalize recommendations for each user. This is our primary motivation for rethinking the way we recommend content,
and has resulted in us shifting to [Transformer](http://jalammar.github.io/illustrated-transformer/)-based sequential
recommendations. While the model architecture and details won't be discussed in this post, the key takeaway is that our
implementation outputs *embeddings*: vector representations of items and users that capture semantic information such
as the genre of an audiobook or the reading preferences of a user. Thus, the challenge is now how to utilize these
millions of embeddings to serve recommendations in an online, reliable, and low-latency manner to users as they
use Scribd. We built an embedding-based retrieval system to solve this use case.
### Recommendations as a Faceted Search Problem
There are many technologies capable of performing fast, reliable nearest neighbors search across a large number of
document vectors. However, our system has the additional challenge of requiring support for
[faceted search](https://en.wikipedia.org/wiki/Faceted_search), that is, being able to retrieve the most relevant
documents over a subset of the corpus defined by user-specific business rules (e.g. language of the item or geographic
availability) at query time. At a high level, we desired a system capable of fulfilling the following requirements:
1. The system should be able to prefilter results over one or more given facets. This facet can be defined as a filter
over numerical, string, or category fields
2. The system should support one or more exact distance metrics (e.g. dot product, euclidean distance)
3. The system should allow updates to data without downtime
4. The system should be highly available, and be able to respond to a query quickly. We targeted a service-level
objective (SLO) with p95 of <100ms
5. The system should have helpful monitoring and alerting capabilities, or provide support for external solutions
After evaluating several candidates for this system, we found Elasticsearch to be the most suitable for our use case.
In addition to satisfying all the requirements above, it has the following benefits:
- Widely used, with a large community and thorough documentation, which makes long-term maintenance and onboarding easier
- Updating schemas can easily be automated using pre-specified templates, which makes ingesting new data and maintaining
indices a breeze (see the template sketch just after this list)
- Supports custom plugin integrations
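To make the template point concrete, here is a minimal sketch of what a legacy index template for our item index could look like. The template name, index pattern, shard counts, and embedding dimension (256) are illustrative assumptions, and the sketch assumes an Elasticsearch 7.x cluster where embeddings are mapped as `dense_vector`, the field type that enables the `dotProduct` scoring used later in this post:
```
# Hypothetical legacy index template: any new index matching the pattern
# (e.g. a nightly item_index_2021_04_12) picks up these settings and mappings.
curl -X PUT -H 'Content-Type: application/json' \
<URL to cluster>:9200/_template/item_index_template \
-d \
'
{"index_patterns": ["item_index*"],
 "settings": {"number_of_shards": 6, "number_of_replicas": 1},
 "mappings": {"properties": {
    "item_format": {"type": "keyword"},
    "language": {"type": "keyword"},
    "country": {"type": "keyword"},
    "categories": {"type": "keyword"},
    "item_embed": {"type": "dense_vector", "dims": 256}}}}
'
```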
However, Elasticsearch also has some drawbacks, the most notable of which is the lack of true in-memory partial updates.
This is a dealbreaker if updates to the system happen frequently and in real-time, but our use case only requires support
for nightly batch updates, so this is a tradeoff we are willing to accept.
We also looked into a few other systems as potential solutions. While
[Open Distro for Elasticsearch](https://opendistro.github.io/for-elasticsearch/) (aka AWS Managed Elasticsearch) was
originally considered due to its simplicity in deployment and maintenance, we decided not to move forward with this
solution due to its lack of support for prefiltering. [Vespa](https://vespa.ai/) is also a promising candidate with a
number of additional useful features, such as true in-memory partial updates and support for TensorFlow integration
for advanced, ML-based ranking. We did not proceed with Vespa due to maintenance concerns: deploying to
multiple nodes is challenging since EKS support is lacking and documentation is sparse. Additionally, Vespa requires the
entire application package containing all indices and their schemas to be deployed at once, which makes working in a
distributed fashion (i.e. working with teammates and using a VCS) challenging.
### How to Set Up Elasticsearch as a Faceted Search Solution
![](/post-images/2021-04-ebr-scribd/f2.png)
*Figure Two: A high level diagram illustrating how the Elasticsearch system fetches recommendations*
Elasticsearch stores data as JSON documents within indices, which are logical namespaces with data mappings and shard
configurations. For our use case, we defined two indices, a `user_index` and an `item_index`. The former is essentially
a key-value store that maps a user ID to a corresponding user embedding. A sample document in the `user_index` looks like:
```
{"_id": 4243913,
"user_embed": [-0.5888184, ..., -0.3882332]}
```
Notice here we use Elasticsearch's built-in `_id` field rather than creating a custom field. This is so we can fetch user
embeddings with a `GET` request rather than having to search for them, like this:
```
curl <URL to cluster>:9200/user_index/_doc/4243913
```
Now that we have the user embedding, we can use it to query the `item_index`, which stores each item's metadata
(which we will use to perform faceted search) and embedding. Here's what a document in this index could look like:
```
{"_id": 13375,
"item_format": "audiobook",
"language": "english",
"country": "Australia",
"categories": ["comedy", "fiction", "adventure"],
"item_embed": [0.51400936,...,0.0892048]}
```
We want to accomplish two goals in our query: retrieving the documents most relevant to the user (which in our model
corresponds to the dot product between the user and item embeddings), and ensuring that all retrieved documents have the
same filter values as those requested by the user. This is where Elasticsearch shines:
```
curl -H 'Content-Type: application/json' \
<URL to cluster>:9200/item_index/_search \
-d \
'
{"_source": ["_id"],
"size": 30,
"query": {"script_score": {"query": {"bool":
{"must_not": {"term": {"categories": "adventure"}},
"filter": [{"term": {"language": "english"}},
{"term": {"country": "Australia"}}]}},
"script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
return Math.max(0, value+10000);",
"params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
'
```
Let's break this query down to understand what's going on:
1. Line 2: Here we are querying the `item_index` using Elasticsearch's `_search` API
2. Lines 5,6: We specify which attributes of the item documents we'd like returned (in this case, only `_id`), and how many
results (`30`)
3. Line 7: Here we are querying using the `script_score` feature; this is what allows us to first prefilter our corpus and
then rank the remaining subset using a custom script
4. Lines 8-10: Elasticsearch has various boolean query types for filtering. In this example we specify that we
are interested only in `english` items that can be viewed in `Australia` and which are not categorized as `adventure`
5. Lines 11-12: Here is where we define our custom scoring script. Elasticsearch has a built-in `dotProduct` function we
can employ, which is optimized to speed up computation. Note that our embeddings are not normalized, and Elasticsearch
prohibits negative scores. For this reason, we had to include the score transformation in line 12 to ensure our scores
were positive
6. Line 13: Here we can add parameters which are passed to the scoring script
This query will retrieve recommendations based on one set of filters. However, in addition to user filters, each row on
Scribd's homepage also has row-specific filters (for example, the row “Audiobooks Recommended for You” would have a
row-specific filter of `"item_format": "audiobook"`). Rather than making multiple queries to the Elasticsearch cluster
with each combination of user and row filters, we can conduct multiple independent searches in a single query using the
`_msearch` API. The following example query generates recommendations for hypothetical “Audiobooks Recommended for You”
and “Comedy Titles Recommended for You” rows:
```
curl -H 'Content-Type: application/json' \
<URL to cluster>:9200/_msearch \
-d \
'
{"index": "item_index"}
{"_source": ["_id"],
"size": 30,
"query": {"script_score": {"query": {"bool":
{"must_not": {"term": {"categories": "adventure"}},
"filter": [{"term": {"language": "english"}},
{"term": {"item_format": "audiobook"}},
{"term": {"country": "Australia"}}]}},
"script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
return Math.max(0, value+10000);",
"params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
{"index": "item_index"}
{"_source": ["_id"],
"size": 30,
"query": {"script_score": {"query": {"bool":
{"must_not": {"term": {"categories": "adventure"}},
"filter": [{"term": {"language": "english"}},
{"term": {"categories": "comedy"}},
{"term": {"country": "Australia"}}]}},
"script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
return Math.max(0, value+10000);",
"params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
'
```
#### Shard Configuration
Elasticsearch partitions each index into shards and stores replica copies of them across multiple nodes for resilience
and increased query performance. The number of primary shards is configurable only at index creation time, while the
number of replicas can be adjusted later. Here are some things to consider regarding shards:
1. Try out various shard configurations to see what works best for each use case.
[Elastic](https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster) recommends 20-40GB of data per
shard, while [eBay](https://tech.ebayinc.com/engineering/elasticsearch-performance-tuning-practice-at-ebay/) likes to keep
their shard size below 30GB. However, these values did not work for us, and we found much smaller shard sizes (<5GB) to
boost performance in the form of reduced latency at query time.
2. When updating data, do not update documents within the existing index. Instead, create a new index, ingest updated
documents into this index, and [re-alias](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html)
from the old index to the new one (a minimal sketch of this flow follows the list). This process allows you to retain
older data in case an update needs to be reverted, and allows shards to be reconfigured at update time.
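Here is a minimal sketch of that flow, assuming a hypothetical new index named `item_index_v2`, an old one named `item_index_v1`, and an alias named `item_index` that queries point at:
```
# Create the new index; shard and replica counts can differ from the old index's.
curl -X PUT -H 'Content-Type: application/json' \
<URL to cluster>:9200/item_index_v2 \
-d '{"settings": {"number_of_shards": 6, "number_of_replicas": 1}}'

# (Bulk-ingest the refreshed documents into item_index_v2 at this point.)

# Atomically swap the alias so queries never see a partially updated index;
# the old index sticks around in case the update needs to be reverted.
curl -X POST -H 'Content-Type: application/json' \
<URL to cluster>:9200/_aliases \
-d \
'
{"actions": [
  {"remove": {"index": "item_index_v1", "alias": "item_index"}},
  {"add": {"index": "item_index_v2", "alias": "item_index"}}]}
'
```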
#### Cluster Configuration
We deployed our cluster across multiple data and primary nodes to enjoy the benefits of data redundancy, increased
availability, and the ability to scale horizontally as our service grows. We found that deploying the cluster across
multiple availability zones results in an increased latency during query time, but this is a tradeoff we accepted in the
interest of availability.
As for hardware specifics, we use AWS EC2 instances to host our cluster. In production, we have 3 `t3a.small` primary-only
nodes, 3 `c5d.12xlarge` data nodes, and 1 `t3.micro` Kibana node. The primary-only nodes are utilized only in a coordinating
role (to route requests, distribute bulk indexing, etc), essentially acting as smart load balancers. This is why these
nodes are much smaller than the data nodes, which handle the bulk of storage and computational costs. Kibana is a data
visualization and monitoring tool; however, in production we use Datadog for our monitoring and alerting responsibilities,
which is why we do not allocate many resources for the Kibana node.
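As a rough sketch of how these roles translate into node-level settings, an `elasticsearch.yml` along the following lines would describe the two node types (this assumes the boolean role settings of Elasticsearch 7.x; the exact values on our nodes may differ):
```
# Primary-only node: manages cluster state and routes requests, holds no shard data.
node.master: true
node.data: false
node.ingest: false

# Data node: holds the shards and does the heavy lifting for storage and queries.
node.master: false
node.data: true
node.ingest: true
```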
### What Generating Recommendations Looks Like
![](/post-images/2021-04-ebr-scribd/f3.png)
*Figure Three: a diagram illustrating the system design for Personalization's Embedding-based Retrieval Service*
Step by step, this is how recommendations are generated for a user when they request the home page:
1. The Scribd app passes the user's information to the recommendations service
2. The recommendations service queries Elasticsearch with the user's ID to retrieve their user embedding, which is stored
in a user index
3. The recommendations service once again queries Elasticsearch, this time with the user's embedding along with their
user query filters. This query is a multi-search request to the item index: one for every desired row (a minimal sketch
of this flow follows the list)
4. Elasticsearch returns these recommendations to the service, which are postprocessed and generated into rows before
being sent to the client
5. The client renders these recommendations and displays them to the user
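Putting steps 2 through 4 together, a minimal sketch of what the service does against the cluster might look like this. It is purely illustrative (our service is not a shell script), shows a single row's query rather than the full `_msearch` request, and uses `jq` to pull the embedding out of the first response:
```
# Step 2: look up the user's embedding by document ID (the sample ID from earlier).
USER_EMBED=$(curl -s '<URL to cluster>:9200/user_index/_doc/4243913' | jq -c '._source.user_embed')

# Step 3: score and prefilter items with that embedding.
curl -s -H 'Content-Type: application/json' \
'<URL to cluster>:9200/item_index/_search' \
--data-binary @- <<EOF
{"_source": ["_id"],
 "size": 30,
 "query": {"script_score": {
   "query": {"bool": {"filter": [{"term": {"language": "english"}},
                                 {"term": {"country": "Australia"}}]}},
   "script": {"source": "double value = dotProduct(params.user_embed, 'item_embed'); return Math.max(0, value+10000);",
              "params": {"user_embed": ${USER_EMBED}}}}}}
EOF

# Step 4: the returned hits are post-processed into rows before being sent to the client.
```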
With this approach, Elasticsearch will serve two purposes: acting as a key-value store, and retrieving recommendations.
Elasticsearch is, of course, a slower key-value store than traditional databases, but we found the increase in latency
to be insignificant (~5ms) for our use case. Furthermore, the benefit of this approach is that it only requires
maintaining data in one system; using multiple systems to store data would create a consistency challenge.
The underlying Personalization model is very large, which makes retraining it an expensive process. Thus, it needs to be
retrained often enough to account for factors like user preference drift, but not so often that it wastes
computational resources. We found that retraining the model weekly worked well for us. Item embeddings, which typically update
only incrementally, are also recomputed weekly. However, user embeddings are recomputed daily to provide fresh
recommendations based on changing user interests. These embeddings along with relevant metadata are ingested into the
Elasticsearch index in a batch process using [Apache Spark](https://spark.apache.org/) and are scheduled through
[Apache Airflow](https://airflow.apache.org/). We monitor this ingest process along with real-time serving metrics
through Datadog.
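The Spark job handles this at scale, but conceptually each batch boils down to newline-delimited `_bulk` requests like the sketch below, using the sample documents from earlier in the post (embedding vectors elided as before; the real job writes far more documents per request):
```
curl -H 'Content-Type: application/x-ndjson' \
<URL to cluster>:9200/_bulk \
--data-binary '{"index": {"_index": "item_index", "_id": 13375}}
{"item_format": "audiobook", "language": "english", "country": "Australia", "categories": ["comedy", "fiction", "adventure"], "item_embed": [0.51400936, ..., 0.0892048]}
{"index": {"_index": "user_index", "_id": 4243913}}
{"user_embed": [-0.5888184, ..., -0.3882332]}
'
```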
#### Load Testing
Our primary goal during load testing was to ensure that our system was able to reliably respond to a “reasonably large”
number of requests per second and deliver a sufficient number of relevant recommendations, even with
multiple facets applied within each query. We also took this opportunity to experiment with various aspects of our system to
understand their impact on performance. These include:
- Shard and replica configuration: We found that increasing the number of shards increased performance, but only to a
point; if a cluster is over-sharded, the overhead of each shard outweighs the marginal performance gain of the
additional partition
- Dataset size: We artificially increased the size of our corpus several times to ensure the system's performance would remain
sufficient even as our catalog continues to grow
- Filter and mapping configurations: Some filters (like `range` inequalities) are more expensive than traditional
categorical filters. Additionally, increasing the number of fields in each document also has a negative impact on latency.
Our use case calls for several filters across hundreds of document fields, so we experimented with several document and query
configurations to find the one that performed best for our system
Our system is currently deployed to production and serves ~50rps with a p95 latency <60ms.
### Results
Using Scribd's internal A/B testing platform, we conducted an experiment comparing the existing recommendations service
against the new personalization model with its embedding-based retrieval architecture across the home and discover page surfaces.
The test ran for approximately a month with >1M Scribd users (trialers or subscribers) assigned as participants. After
careful analysis of results, we saw the following statistically significant (p<0.01) improvements in the personalization
variant compared to the control experience:
- Increase in the number of users who clicked on a recommended item
- Increase in the average number of clicks per user
- Increase in the number of users with a read time of at least 10 minutes (in a three day window)
These increases represent significant business impact on key performance metrics. The personalization model currently
generates recommendations for every (signed-in) Scribd user's home and discover pages.
### Next Steps
Now that the infrastructure and model are in place, we are looking to add a slew of improvements to the existing system.
Our immediate efforts will focus on expanding the scope of this system to include more surfaces and row modules within
the Scribd experience. Additional long-term projects include adding an online contextual reranker to increase
the relevance and freshness of recommendations, and potentially integrating our system with an infrastructure-as-code
tool to more easily manage and scale compute resources.
Thank you for reading! We hope you found this post useful and informative.
### Thank You 🌮
Thank you to
[Snehal Mistry](https://www.linkedin.com/in/snehal-mistry-b986b53/),
[Jeffrey Nguyen](https://www.linkedin.com/in/jnguyenfc/),
[Natalie Medvedchuk](https://www.linkedin.com/in/natalie-medvedchuk/),
[Dimitri Theoharatos](https://www.linkedin.com/in/dimitri-theoharatos/),
[Adrian Lienhard](https://www.linkedin.com/in/adrianlienhard/),
and countless others, all of whom provided invaluable guidance and assistance throughout this project.
(giving tacos 🌮 is how we show appreciation here at Scribd)

Three binary image files added (406 KiB, 179 KiB, and 220 KiB; not shown).