Added embedding based retrieval blog post
parent b107b9252a
commit 6f06f5f0e1

@@ -0,0 +1,4 @@
---
team: Recommendations
permalink: "/blog/category/recommendations"
---
@@ -125,3 +125,10 @@ ajhofmann:
trupin:
  name: Theo Rupin
  github: trupin

div:
  name: Div Dasani
  github: divdasani
  about: |
    Div is a Machine Learning Engineer on the Recommendations team, working on personalized
    recommendations and online faceted search.
@@ -44,3 +44,13 @@ Security Engineering:

Internal Tools:
  lever: 'Internal Tools'

Recommendations:
  lever: 'Recommendations'
  about: |
    The Recommendations team at Scribd wants to inspire users to read more and discover
    new content and topics. Our team comprises Machine Learning and Software Engineers,
    Product Managers, Data Scientists, and QA and Project Managers, all of whom share a
    passion for building the world's best recommendation engine for books. We pride
    ourselves on using a variety of open-source technologies to develop and productionize
    state-of-the-art machine learning solutions.
@@ -0,0 +1,287 @@
---
layout: post
title: "Embedding-based Retrieval at Scribd"
author: div
tags:
- machinelearning
- real-time
- search
team: Recommendations
---
Recommendation systems like those built at
[Facebook](https://arxiv.org/pdf/2006.11632.pdf) and
[Pinterest](https://labs.pinterest.com/user/themes/pin_labs/assets/paper/pintext-kdd2019.pdf)
can be implemented with off-the-shelf tools such as Elasticsearch. Many modern recommendation systems use
*embedding-based retrieval*, a technique that represents documents as embeddings and converts the
recommendations retrieval problem into a [similarity search](https://en.wikipedia.org/wiki/Similarity_search) problem
in the embedding space. This post details our approach to “embedding-based retrieval” with Elasticsearch.
### Context
Recommendations play an integral part in helping users discover content that delights them on the Scribd platform,
which hosts millions of premium ebooks, audiobooks, and other titles, along with over a hundred million user-uploaded items.

![](/post-images/2021-04-ebr-scribd/f1.png)

*Figure One: An example of a row on Scribd’s home page that is generated by the recommendations service*

Currently, Scribd uses a collaborative filtering based approach to recommend content, but this model limits our ability
to personalize recommendations for each user. This is our primary motivation for rethinking the way we recommend content,
and it has led us to shift to [Transformer](http://jalammar.github.io/illustrated-transformer/)-based sequential
recommendations. While the model architecture and details won’t be discussed in this post, the key takeaway is that our
implementation outputs *embeddings* – vector representations of items and users that capture semantic information such
as the genre of an audiobook or the reading preferences of a user. The challenge, then, is how to use these
millions of embeddings to serve recommendations in an online, reliable, and low-latency manner to users as they
use Scribd. We built an embedding-based retrieval system to solve this problem.
### Recommendations as a Faceted Search Problem
There are many technologies capable of performing fast, reliable nearest-neighbor search across a large number of
document vectors. However, our system has the additional challenge of requiring support for
[faceted search](https://en.wikipedia.org/wiki/Faceted_search) – that is, being able to retrieve the most relevant
documents over a subset of the corpus defined by user-specific business rules (e.g. language of the item or geographic
availability) at query time. At a high level, we wanted a system capable of fulfilling the following requirements:

1. The system should be able to prefilter results over one or more given facets. A facet can be defined as a filter
over numerical, string, or category fields
2. The system should support one or more exact distance metrics (e.g. dot product, Euclidean distance)
3. The system should allow updates to data without downtime
4. The system should be highly available and able to respond to queries quickly. We targeted a service-level
objective (SLO) with a p95 of <100ms
5. The system should have helpful monitoring and alerting capabilities, or provide support for external solutions

After evaluating several candidates for this system, we found Elasticsearch to be the most suitable for our use case.
In addition to satisfying all the requirements above, it has the following benefits:

- Widely used, with a large community and thorough documentation, which makes long-term maintenance and onboarding easier
- Updating schemas can easily be automated using pre-specified templates, which makes ingesting new data and maintaining
indices a breeze
- Supports custom plugin integrations
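
As a concrete illustration of the template point, here is a minimal sketch of a composable index template for the item index. This is our own example, not Scribd’s actual configuration: the shard counts and the embedding dimension (`256`) are placeholder values, and the field names follow the sample item document shown later in this post.

```python
# A sketch of a composable index template (Elasticsearch 7.8+), registered
# via `PUT _index_template/item_index_template`. Values are illustrative.
item_index_template = {
    "index_patterns": ["item_index*"],
    "template": {
        "settings": {"number_of_shards": 4, "number_of_replicas": 1},
        "mappings": {
            "properties": {
                "item_format": {"type": "keyword"},
                "language": {"type": "keyword"},
                "country": {"type": "keyword"},
                "categories": {"type": "keyword"},
                # dense_vector is what enables dotProduct() in script_score
                "item_embed": {"type": "dense_vector", "dims": 256},
            }
        },
    },
}
```

With a template like this in place, any newly created index whose name matches `item_index*` picks up the mappings automatically, which is what makes nightly index swaps painless.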

However, Elasticsearch also has some drawbacks, the most notable of which is the lack of true in-memory partial updates.
This would be a dealbreaker if updates to the system happened frequently and in real time, but our use case only requires
support for nightly batch updates, so this is a tradeoff we are willing to accept.

We also looked into a few other systems as potential solutions. While
[Open Distro for Elasticsearch](https://opendistro.github.io/for-elasticsearch/) (aka AWS Managed Elasticsearch) was
originally considered due to its simplicity of deployment and maintenance, we decided not to move forward with this
solution due to its lack of support for prefiltering. [Vespa](https://vespa.ai/) was also a promising candidate with a
number of additional useful features, such as true in-memory partial updates and support for TensorFlow integration
for advanced, ML-based ranking. We did not proceed with Vespa due to maintenance concerns: deploying to
multiple nodes is challenging since EKS support is lacking and documentation is sparse. Additionally, Vespa requires the
entire application package containing all indices and their schemas to be deployed at once, which makes working in a
distributed fashion (i.e. with teammates and a VCS) challenging.
### How to Set Up Elasticsearch as a Faceted Search Solution

![](/post-images/2021-04-ebr-scribd/f2.png)

*Figure Two: A high-level diagram illustrating how the Elasticsearch system fetches recommendations*

Elasticsearch stores data as JSON documents within indices, which are logical namespaces with data mappings and shard
configurations. For our use case, we defined two indices, a `user_index` and an `item_index`. The former is essentially
a key-value store that maps a user ID to a corresponding user embedding. A sample document in the `user_index` looks like:

```
{"_id": 4243913,
 "user_embed": [-0.5888184, ..., -0.3882332]}
```

Notice here that we use Elasticsearch’s built-in `_id` field rather than creating a custom field. This lets us fetch user
embeddings with a `GET` request rather than having to search for them, like this:

```
curl <URL to cluster>:9200/user_index/_doc/4243913
```
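
In application code, the same lookup and response parsing can be sketched with two small helpers. These are our own illustrative names, not Scribd’s code, and `base_url` stands in for the `<URL to cluster>` placeholder above:

```python
def user_embed_url(base_url, user_id):
    # A document GET by _id is a direct key-value lookup:
    # no _search round trip, no query parsing overhead.
    return f"{base_url}/user_index/_doc/{user_id}"

def parse_user_embed(get_response):
    # A GET response wraps the stored document fields under "_source"
    return get_response["_source"]["user_embed"]
```

A caller would fetch `user_embed_url(base_url, 4243913)` with any HTTP client and pass the decoded JSON to `parse_user_embed`.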

Now that we have the user embedding, we can use it to query the `item_index`, which stores each item’s metadata
(which we will use to perform faceted search) and embedding. Here’s what a document in this index could look like:

```
{"_id": 13375,
 "item_format": "audiobook",
 "language": "english",
 "country": "Australia",
 "categories": ["comedy", "fiction", "adventure"],
 "item_embed": [0.51400936, ..., 0.0892048]}
```

We want to accomplish two goals in our query: retrieving the documents most relevant to the user (which in our model
corresponds to the dot product between the user and item embeddings), and ensuring that all retrieved documents have the
same filter values as those requested by the user. This is where Elasticsearch shines:

```
curl -H 'Content-Type: application/json' \
  <URL to cluster>:9200/item_index/_search \
  -d \
'
{"_source": ["_id"],
 "size": 30,
 "query": {"script_score": {"query": {"bool":
   {"must_not": {"term": {"categories": "adventure"}},
    "filter": [{"term": {"language": "english"}},
               {"term": {"country": "Australia"}}]}},
  "script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
                        return Math.max(0, value+10000);",
   "params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
'
```

Let’s break this query down to understand what’s going on:

1. Line 2: We query the `item_index` using Elasticsearch’s `_search` API
2. Lines 5,6: We specify which attributes of the item documents we’d like returned (in this case, only `_id`), and how many
results we want (`30`)
3. Line 7: We query using the `script_score` feature; this is what allows us to first prefilter our corpus and
then rank the remaining subset using a custom script
4. Lines 8-10: Elasticsearch has several boolean query types for filtering. In this example we specify that we
are interested only in `english` items that can be viewed in `Australia` and that are not categorized as `adventure`
5. Lines 11-12: Here we define our custom script. Elasticsearch has a built-in `dotProduct` method we
can employ, which is optimized to speed up computation. Note that our embeddings are not normalized, and Elasticsearch
prohibits negative scores. For this reason, we include the score transformation in line 12 to ensure our scores
are positive
6. Line 13: Here we add the parameters that are passed to the scoring script

This query will retrieve recommendations based on one set of filters. However, in addition to user filters, each row on
Scribd’s homepage also has row-specific filters (for example, the row “Audiobooks Recommended for You” would have a
row-specific filter of `"item_format": "audiobook"`). Rather than making multiple queries to the Elasticsearch cluster
with each combination of user and row filters, we can conduct multiple independent searches in a single query using the
`_msearch` API. The following example query generates recommendations for hypothetical “Audiobooks Recommended for You”
and “Comedy Titles Recommended for You” rows:

```
curl -H 'Content-Type: application/json' \
  <URL to cluster>:9200/_msearch \
  -d \
'
{"index": "item_index"}
{"_source": ["_id"],
 "size": 30,
 "query": {"script_score": {"query": {"bool":
   {"must_not": {"term": {"categories": "adventure"}},
    "filter": [{"term": {"language": "english"}},
               {"term": {"item_format": "audiobook"}},
               {"term": {"country": "Australia"}}]}},
  "script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
                        return Math.max(0, value+10000);",
   "params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
{"index": "item_index"}
{"_source": ["_id"],
 "size": 30,
 "query": {"script_score": {"query": {"bool":
   {"must_not": {"term": {"categories": "adventure"}},
    "filter": [{"term": {"language": "english"}},
               {"term": {"categories": "comedy"}},
               {"term": {"country": "Australia"}}]}},
  "script": {"source": "double value = dotProduct(params.user_embed, 'item_embed');
                        return Math.max(0, value+10000);",
   "params": {"user_embed": [-0.5888184, ..., -0.3882332]}}}}}
'
```
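
Since the per-row bodies differ only in their row-specific filter, they can be generated rather than templated by hand. Here is a sketch of such a builder (our own helper, not Scribd’s code; for brevity it folds everything into `filter` clauses and omits the `must_not` exclusion shown above):

```python
import json

def build_msearch_body(user_embed, user_filters, row_filters, size=30):
    """Emit a newline-delimited _msearch body: one search per homepage row,
    combining the shared user filters with each row's specific filter."""
    script = {
        "source": ("double value = dotProduct(params.user_embed, 'item_embed'); "
                   "return Math.max(0, value+10000);"),
        "params": {"user_embed": user_embed},
    }
    lines = []
    for row_filter in row_filters:
        lines.append({"index": "item_index"})  # header line for this search
        lines.append({
            "_source": ["_id"],
            "size": size,
            "query": {"script_score": {
                "query": {"bool": {"filter": user_filters + [row_filter]}},
                "script": script,
            }},
        })
    # _msearch requires newline-delimited JSON with a trailing newline
    return "\n".join(json.dumps(line) for line in lines) + "\n"
```

Calling it with the shared user filters and one entry per row (e.g. `{"term": {"item_format": "audiobook"}}` and `{"term": {"categories": "comedy"}}`) produces a body equivalent to the two-row query above.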

#### Shard Configuration
Elasticsearch splits each index into shards and stores copies of those shards (replicas) across multiple nodes for
resilience and increased query performance. The number of primary and replica shards is configurable only at index
creation time. Here are some things to consider regarding shards:

1. Try out various shard configurations to see what works best for each use case.
[Elastic](https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster) recommends 20-40GB of data per
shard, while [eBay](https://tech.ebayinc.com/engineering/elasticsearch-performance-tuning-practice-at-ebay/) likes to keep
their shard size below 30GB. However, these values did not work for us; we found that much smaller shards (<5GB)
boosted performance in the form of reduced latency at query time.
2. When updating data, do not update documents within the existing index. Instead, create a new index, ingest updated
documents into this index, and [re-alias](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html)
from the old index to the new one. This process allows you to retain older data in case an update needs to be reverted,
and lets you re-configure shards at update time.

#### Cluster Configuration
We deployed our cluster across multiple data and primary nodes to enjoy the benefits of data redundancy, increased
availability, and the ability to scale horizontally as our service grows. We found that deploying the cluster across
multiple availability zones increases latency at query time, but this is a tradeoff we accepted in the
interest of availability.

As for hardware specifics, we use AWS EC2 instances to host our cluster. In production, we have 3 `t3a.small` primary-only
nodes, 3 `c5d.12xlarge` data nodes, and 1 `t3.micro` Kibana node. The primary-only nodes are used only in a coordinating
role (to route requests, distribute bulk indexing, etc.), essentially acting as smart load balancers. This is why these
nodes are much smaller than the data nodes, which handle the bulk of the storage and computational costs. Kibana is a data
visualization and monitoring tool; however, in production we use Datadog for our monitoring and alerting,
which is why we do not allocate many resources to the Kibana node.

### What Generating Recommendations Looks Like

![](/post-images/2021-04-ebr-scribd/f3.png)

*Figure Three: a diagram illustrating the system design for Personalization’s Embedding-based Retrieval Service*

Step by step, this is how recommendations are generated for a user when they request the home page:
1. The Scribd app passes the user’s information to the recommendations service
2. The recommendations service queries Elasticsearch with the user’s ID to retrieve their user embedding, which is stored
in the user index
3. The recommendations service queries Elasticsearch once again, this time with the user’s embedding along with their
user query filters. This query is a multi-search request to the item index: one search for every desired row
4. Elasticsearch returns these recommendations to the service, where they are postprocessed and assembled into rows before
being sent to the client
5. The client renders these recommendations and displays them to the user

With this approach, Elasticsearch serves two purposes: acting as a key-value store and retrieving recommendations.
Elasticsearch is, of course, a slower key-value store than traditional databases, but we found the increase in latency
to be insignificant (~5ms) for our use case. Furthermore, the benefit of this approach is that it only requires
maintaining data in one system; using multiple systems to store data would create a consistency challenge.

The underlying Personalization model is very large, making retraining it an expensive process. Thus, it needs to be
retrained often enough to account for factors like user preference drift, but not so often that it wastes
computational resources. We found that retraining the model weekly worked well for us. Item embeddings, which typically
change only incrementally, are also recomputed weekly. However, user embeddings are recomputed daily to provide fresh
recommendations based on changing user interests. These embeddings, along with relevant metadata, are ingested into the
Elasticsearch index in a batch process using [Apache Spark](https://spark.apache.org/) and are scheduled through
[Apache Airflow](https://airflow.apache.org/). We monitor this ingest process along with real-time serving metrics
through Datadog.

#### Load Testing
Our primary goal during load testing was to ensure that our system could reliably respond to a “reasonably large”
number of requests per second and deliver a sufficient number of relevant recommendations, even under the confines of
multiple facets within each query. We also took this opportunity to experiment with various aspects of our system to
understand their impact on performance. These include:
- Shard and replica configuration: We found that increasing the number of shards increased performance, but only to a
point; if a cluster is over-sharded, the overhead of each shard outweighs the marginal performance gain of the
additional partition
- Dataset size: We artificially increased the size of our corpus several times to ensure the system’s performance would remain
sufficient even as our catalog continues to grow
- Filter and mapping configurations: Some filters (like `range` inequalities) are more expensive than traditional
categorical filters. Additionally, increasing the number of fields in each document has a negative impact on latency.
Our use case calls for several filters across hundreds of document fields, so we experimented with several document and
query configurations to find the one that performed best for our system
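
When comparing these configurations against a latency SLO, percentile math is easy to get subtly wrong. For reference, a minimal nearest-rank p95 over measured request latencies (a generic sketch, not Scribd’s load-testing tooling):

```python
import math

def p95(latencies_ms):
    # Nearest-rank percentile: sort, then take the ceil(0.95 * n)-th sample.
    # With n=100 samples this is the 95th smallest latency.
    ranked = sorted(latencies_ms)
    k = math.ceil(0.95 * len(ranked))
    return ranked[k - 1]
```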

Our system is currently deployed to production and serves ~50rps with a p95 latency of <60ms.

### Results
Using Scribd’s internal A/B testing platform, we conducted an experiment comparing the existing recommendations service
against the new personalization model with its embedding-based retrieval architecture across the home and discover page
surfaces. The test ran for approximately a month with >1M Scribd users (trialers or subscribers) assigned as participants.
After careful analysis of the results, we saw the following statistically significant (p<0.01) improvements in the
personalization variant compared to the control experience:
- Increase in the number of users who clicked on a recommended item
- Increase in the average number of clicks per user
- Increase in the number of users with a read time of at least 10 minutes (in a three-day window)

These increases represent significant business impact on key performance metrics. The personalization model currently
generates recommendations for every (signed-in) Scribd user’s home and discover pages.
### Next Steps
Now that the infrastructure and model are in place, we are looking to add a slew of improvements to the existing system.
Our immediate efforts will focus on expanding the scope of this system to include more surfaces and row modules within
the Scribd experience. Additional long-term projects include adding an online contextual reranker to increase
the relevance and freshness of recommendations, and potentially integrating our system with an infrastructure-as-code
tool to more easily manage and scale compute resources.

Thank you for reading! We hope you found this post useful and informative.

### Thank You 🌮
Thank you to
[Snehal Mistry](https://www.linkedin.com/in/snehal-mistry-b986b53/),
[Jeffrey Nguyen](https://www.linkedin.com/in/jnguyenfc/),
[Natalie Medvedchuk](https://www.linkedin.com/in/natalie-medvedchuk/),
[Dimitri Theoharatos](https://www.linkedin.com/in/dimitri-theoharatos/),
[Adrian Lienhard](https://www.linkedin.com/in/adrianlienhard/),
and countless others, all of whom provided invaluable guidance and assistance throughout this project.

(giving tacos 🌮 is how we show appreciation here at Scribd)