updated data-science team to applied-research. Job recommendations remain the same

This commit is contained in:
Jonathan Ramkissoon 2021-07-08 16:50:34 -04:00 committed by R. Tyler Croy
parent 731400f302
commit 9d1a28e34e
11 changed files with 33 additions and 38 deletions

View File

@@ -1,4 +1,4 @@
---
-team: Data Science
+team: Applied Research
permalink: "/blog/category/data-science"
---

View File

@@ -26,9 +26,9 @@
    about titles in our library by analyzing content and user behavior and
    building predictive models.
-- team: Data Science
+- team: Applied Research
  description: |
-    The Data Science team drives decisions by creating insights into the product
+    The Applied Research team drives decisions by creating insights into the product
    and improves the user experience with machine learning.
- team: Core Platform

View File

@@ -8,6 +8,9 @@ iOS:
Android:
  lever: 'Mobile'
+Applied Research:
+  lever: 'Data Science'
Data Science:
  lever: 'Data Science'

View File

@@ -5,7 +5,7 @@ author: mattr
tags:
- seq2seq
- data
-team: Data Science
+team: Applied Research
---
Introduction

View File

@@ -5,7 +5,7 @@ author: mattr
tags:
- search
- data
-team: Data Science
+team: Applied Research
---
Scribd has a variety of content to offer, and connecting our users with their desired content is a crucial aspect of our product. One of the main ways that users find content on Scribd is through search. In this post I want to delve into an analysis we did on parsing valuable information out of a user's query, both to serve more relevant results and to learn more about what users are searching for.

View File

@@ -5,7 +5,7 @@ author: dfeldman
tags:
- testing
- data
-team: Data Science
+team: Applied Research
---
What is A/B testing?

View File

@@ -5,7 +5,7 @@ author: dfeldman
tags:
- testing
- data
-team: Data Science
+team: Applied Research
---
We love A/B testing at Scribd. What follows is a specific example to give you an inside look at the process from idea to implementation for an algorithm test.

View File

@@ -6,7 +6,7 @@ tags:
- seo
- testing
- data
-team: Data Science
+team: Applied Research
---
Months ago, your friends convinced you to sign up for a half marathon. With three weeks to go, you haven't even started training. In a growing panic, you turn to the internet for answers.

View File

@@ -5,7 +5,7 @@ author: bclearly
tags:
- ltr
- data
-team: Data Science
+team: Applied Research
---
Why LTR? (Lifetime Revenue)

View File

@@ -6,7 +6,7 @@ tags:
- machinelearning
- seq2seq
- data
-team: Data Science
+team: Applied Research
---
How much data do you need to train a seq2seq model? Let's say that you want to translate sentences from one language to another. You probably need a bigger dataset to translate longer sentences than if you wanted to translate shorter ones. How does the need for data grow as the sentence length increases?

View File

@@ -4,12 +4,12 @@ title: "Identifying Document Types at Scribd"
tags:
- machinelearning
- data
-team: Data Science
+team: Applied Research
author: jonathanr
---
-[User-uploaded documents](https://www.scribd.com/docs) have been a core component of Scribd's business from the very beginning. Users can upload and share documents, analogous to YouTube and videos. Consequently, our document corpus has become much larger and more diverse over the years. Understanding what we have in the document corpus unlocks many opportunities for discovery and recommendations. Over the past year, one of the missions of the Applied Research team has been to build a system to extract key document meta-data with the goal of enriching downstream discovery systems. Our approach combines semantic understanding with user behaviour in a multi-component machine learning system. This is part 1 in a series of blog posts explaining the challenges faced by the team and solutions explored while building this system. In this post, we present the limitations, challenges and solutions encountered when developing a model to classify arbitrary user-uploaded documents.
+[User-uploaded documents](https://www.scribd.com/docs) have been a core component of Scribd's business from the very beginning. Users can upload and share documents, analogous to YouTube and videos. Consequently, our document corpus has become much larger and more diverse over the years. Understanding what we have in the document corpus unlocks many opportunities for discovery and recommendations. Over the past year, one of the missions of the Applied Research team has been to build a system to extract key document meta-data with the goal of enriching downstream discovery systems. Our approach combines semantic understanding with user behaviour in a multi-component machine learning system. This is part 1 in a series of blog posts explaining the challenges faced by the team and solutions explored while building this system. In this post, we present the limitations, challenges and solutions encountered when developing a model to visually classify arbitrary user-uploaded documents.
## Initial Constraints
@@ -47,12 +47,10 @@ As mentioned in the introduction, we need an approach that is language and conte
Before the model training started, we faced an interesting data gathering problem. Our goal is to classify documents, so we must gather labelled documents. However, in order to train the page classifier mentioned above, we must also gather labelled pages. Naively, it might seem appropriate to gather labelled documents and use the document label for each of its pages. This isn't appropriate as a single document can contain multiple types of pages. As an example, consider the pages in this document.
-<cetner>
-<figure>
-<img width="996" alt="Three pages from the same document" src="https://user-images.githubusercontent.com/9146894/124964050-8adecd80-dfee-11eb-83fb-a3afbde1fc14.png">
-<figcaption> Figure 2: Three different pages from the same document to demonstrate why we can't take the document label and assign it to each page. </figcaption>
-</figure>
-</cetner>
+<figure>
+<img width="996" alt="Three pages from the same document" src="https://user-images.githubusercontent.com/9146894/124964050-8adecd80-dfee-11eb-83fb-a3afbde1fc14.png">
+<figcaption> Figure 2: Three different pages from the same document to demonstrate why we can't take the document label and assign it to each page. </figcaption>
+</figure>
The first and third pages can be considered text-heavy, but definitely not the second. Taking all the pages of this document and labelling them as text-heavy would severely pollute our training and testing data. The same logic applies to each of our 6 classes.
@@ -60,35 +58,31 @@ The first and third pages can be considered text-heavy, but definitely not the s
To circumvent this challenge, we took an active learning approach to data gathering. We started with a small set of hand-labelled pages for each class and trained binary classifiers iteratively. The binary classification problem is simpler than the multi-class problem so requires less hand-labelled data to obtain reliable results. At each iteration, we evaluated the most confident and least confident predictions of the model to get a sense of its inductive biases. Judging from these, we supplemented the training data for the next iteration to tweak the inductive biases and have confidence in the resulting model and labels. The sheet music class is a prime example of tweaking inductive biases. Below is an example of a page that can cause a sheet music misclassification if the model learns that sheet music is any page with horizontal lines. Supplementing the training data at each iteration helps get rid of inductive biases like this.
-<cetner>
-<figure>
-<img width="662" alt="Example of possible sheet music misclassification from wrong inductive bias" src="https://user-images.githubusercontent.com/9146894/124964644-40118580-dfef-11eb-8d24-d6e0a6460ca9.png">
-<figcaption> Figure 3: Example of possible sheet music misclassification due to wrong inductive biases. </figcaption>
-</figure>
-</cetner>
+<figure>
+<img width="662" alt="Example of possible sheet music misclassification from wrong inductive bias" src="https://user-images.githubusercontent.com/9146894/124964644-40118580-dfef-11eb-8d24-d6e0a6460ca9.png">
+<figcaption> Figure 3: Example of possible sheet music misclassification due to wrong inductive biases. </figcaption>
+</figure>
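To make one round of this loop concrete, here is a minimal sketch. It is an illustration rather than our actual pipeline: it assumes pages have already been turned into feature vectors and uses a scikit-learn logistic regression as a stand-in for the image classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(X_labelled, y_labelled, X_pool, n_review=50):
    """One iteration of the loop described above: fit a binary classifier
    on the current hand-labelled set, then surface the pool pages whose
    predictions are most and least confident for human review."""
    clf = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)
    proba = clf.predict_proba(X_pool)[:, 1]  # P(page belongs to the class)
    confidence = np.abs(proba - 0.5)         # distance from the decision boundary
    order = np.argsort(confidence)
    least_confident = order[:n_review]       # ambiguous pages the model can't place
    most_confident = order[-n_review:]       # confident pages that expose inductive biases
    return clf, least_confident, most_confident
```

Reviewing both ends of the confidence spectrum is what catches shortcuts like the horizontal-lines example: a page the model is very sure is sheet music, but isn't, tells you exactly which bias to correct with supplemental training data.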
After creating these binary classifiers for each class, we have a large set of reliable labels and classifiers that can be used to gather more data if necessary.
### Building a Page Classifier
-The page classification problem is very similar to ImageNet classification, so we can leverage pre-trained ImageNet models. We used transfer learning in fast.ai and PyTorch to fine-tune pre-trained computer vision architectures for the page-classifier. After initial experiments, it was clear that models with very high ImageNet accuracy, such as EfficientNet, did not perform much better on our dataset. While it's difficult to pinpoint exactly why this is the case, we believe it is because of the nature of the classification task, the page resolutions and our data.
+The page classification problem is very similar to ImageNet classification, so we can leverage pre-trained ImageNet models. We used transfer learning in [fast.ai](https://www.fast.ai/) and [PyTorch](https://pytorch.org/) to fine-tune pre-trained computer vision architectures for the page-classifier. After initial experiments, it was clear that models with very high ImageNet accuracy, such as EfficientNet, did not perform much better on our dataset. While it's difficult to pinpoint exactly why this is the case, we believe it is because of the nature of the classification task, the page resolutions and our data.
We found SqueezeNet, a relatively established lightweight architecture, to be the best balance between accuracy and inference time. Because models such as ResNets and DenseNets are so large, they take a lot of time to train and iterate on. However, SqueezeNet is an order of magnitude smaller than these models, which opens up more possibilities in our training scheme. Now we can train the entire model and are not limited to using the pre-trained architecture as a feature-extractor, which is the case for larger models.
-<cetner>
-<figure>
-<img width="450" alt="Figure 4: SqueezeNet architectures taken from the paper. Left: SqueezeNet; Middle: SqueezeNet with simple bypass; Right: SqueezeNet with complex bypass." src="https://user-images.githubusercontent.com/9146894/124964923-91217980-dfef-11eb-9553-13bf296ced10.png">
-<figcaption> Figure 4: SqueezeNet architectures taken from the paper. Left: SqueezeNet; Middle: SqueezeNet with simple bypass; Right: SqueezeNet with complex bypass. </figcaption>
-</figure>
-</cetner>
+<figure>
+<img width="450" alt="Figure 4: SqueezeNet architectures taken from the paper. Left: SqueezeNet; Middle: SqueezeNet with simple bypass; Right: SqueezeNet with complex bypass." src="https://user-images.githubusercontent.com/9146894/124964923-91217980-dfef-11eb-9553-13bf296ced10.png">
+<figcaption> Figure 4: SqueezeNet architectures taken from the <a href="https://arxiv.org/pdf/1602.07360.pdf">paper</a>. Left: SqueezeNet; Middle: SqueezeNet with simple bypass; Right: SqueezeNet with complex bypass. </figcaption>
+</figure>
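The post doesn't include our training code, but a minimal fast.ai fine-tuning setup along these lines would look roughly as follows; the folder layout, image sizes and epoch count are illustrative assumptions, not our production configuration.

```python
from fastai.vision.all import *
from torchvision.models import squeezenet1_1

# Page images organised one folder per class, e.g. pages/sheet_music/...
# (the layout and hyperparameters here are assumptions for illustration).
dls = ImageDataLoaders.from_folder(
    Path("pages"), valid_pct=0.2,
    item_tfms=Resize(460), batch_tfms=aug_transforms(size=224),
)

learn = cnn_learner(dls, squeezenet1_1, metrics=accuracy)

# fine_tune trains the new head first, then unfreezes and trains the whole
# network, which is practical here because SqueezeNet is so small.
learn.fine_tune(5)
```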
Additionally, for this particular model, low inference time is key because it has to run on hundreds of millions of documents. Inference time is also directly tied to cost, so a slower model would need significantly better accuracy to justify the extra processing time.
-### Ensembling Pages for Document Classification
+### Ensembled Pages for Document Classification
We now have a model to classify document pages and need to turn its page-level predictions into a single prediction per document, ideally combining these classifications with additional meta-data such as total page count and page dimensions. However, our experiments showed that a simple ensemble of the page classifications provided an extremely strong baseline that was difficult to beat with meta-data.
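As a sketch of what such a baseline can look like (the exact aggregation rule isn't spelled out in this post), averaging the per-page class probabilities and taking the argmax already yields a document-level prediction:

```python
import numpy as np

def classify_document(page_probs: np.ndarray) -> int:
    """page_probs: (n_pages, n_classes) array of softmax outputs from the
    page classifier. Mean-pooling over pages and taking the argmax is one
    simple ensemble; it is illustrative, not the exact production rule."""
    return int(np.argmax(page_probs.mean(axis=0)))
```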
@@ -103,12 +97,10 @@ While there are different ways of dealing with this, our approach involved two s
### Where do we go from here?
-<cetner>
-<figure>
-<img width="400" alt="Figure 5: Diagram of the overall document understanding system. The red box is what we talked about in this post" src="https://user-images.githubusercontent.com/9146894/124965219-da71c900-dfef-11eb-9d12-4bf9a9772f4c.png">
-<figcaption> Figure 5: Diagram of the overall document understanding system. The red box is what we talked about in this post </figcaption>
-</figure>
-</cetner>
+<figure>
+<img width="400" alt="Figure 5: Diagram of the overall document understanding system. The red box is what we talked about in this post" src="https://user-images.githubusercontent.com/9146894/124965219-da71c900-dfef-11eb-9d12-4bf9a9772f4c.png">
+<figcaption> Figure 5: Diagram of the overall document understanding system. The red box is what we talked about in this post. </figcaption>
+</figure>
Now that we have a model to filter documents based on visual cues, we can build dedicated information extraction models for each document type: sheet music, text-heavy, comics, tables. This is exactly how we proceed from here, starting with extracting information from text-heavy documents. Part 2 in this series will dive deeper into the challenges and solutions our team encountered while building these models.