From 9d1a28e34edc21a1ce3245ac2ff9774eea66e9ab Mon Sep 17 00:00:00 2001
From: Jonathan Ramkissoon
Date: Thu, 8 Jul 2021 16:50:34 -0400
Subject: [PATCH] updated data-science team to applied-research. Job
 recommendations remain the same

---
 _category/data-science.md                     |  2 +-
 _data/team-structure.yml                      |  4 +-
 _data/teams.yml                               |  3 ++
 .../2018-01-05-neural-spelling-corrections.md |  2 +-
 _posts/2018-02-12-search-query-parsing.md     |  2 +-
 _posts/2018-03-20-scribds-ab-testing.md       |  2 +-
 _posts/2018-04-18-bandits-for-the-win.md      |  2 +-
 _posts/2018-05-31-non-random-seo-test.md      |  2 +-
 ...7-calculating-customer-lifetime-revenue.md |  2 +-
 _posts/2019-03-04-experiments-with-seq2seq.md |  2 +-
 ...08-identifying-document-types-at-scribd.md | 48 ++++++++-----------
 11 files changed, 33 insertions(+), 38 deletions(-)

diff --git a/_category/data-science.md b/_category/data-science.md
index aa400d2..073e6db 100644
--- a/_category/data-science.md
+++ b/_category/data-science.md
@@ -1,4 +1,4 @@
 ---
-team: Data Science
+team: Applied Research
 permalink: "/blog/category/data-science"
 ---
diff --git a/_data/team-structure.yml b/_data/team-structure.yml
index 86950f5..08f28ea 100644
--- a/_data/team-structure.yml
+++ b/_data/team-structure.yml
@@ -26,9 +26,9 @@
   about titles in our library by analyzing content and user behavior
   and building predictive models.

-- team: Data Science
+- team: Applied Research
   description: |
-    The Data Science team drives decisions by creating insights into the product
+    The Applied Research team drives decisions by creating insights into the product
     and improve the user experience with machine learning.

 - team: Core Platform
diff --git a/_data/teams.yml b/_data/teams.yml
index df9870e..56a4665 100644
--- a/_data/teams.yml
+++ b/_data/teams.yml
@@ -8,6 +8,9 @@ iOS:
 Android:
   lever: 'Mobile'

+Applied Research:
+  lever: 'Data Science'
+
 Data Science:
   lever: 'Data Science'
diff --git a/_posts/2018-01-05-neural-spelling-corrections.md b/_posts/2018-01-05-neural-spelling-corrections.md
index 46205af..9bbbc99 100644
--- a/_posts/2018-01-05-neural-spelling-corrections.md
+++ b/_posts/2018-01-05-neural-spelling-corrections.md
@@ -5,7 +5,7 @@ author: mattr
 tags:
 - seq2seq
 - data
-team: Data Science
+team: Applied Research
 ---

 Introduction
diff --git a/_posts/2018-02-12-search-query-parsing.md b/_posts/2018-02-12-search-query-parsing.md
index c3937e9..49e4498 100644
--- a/_posts/2018-02-12-search-query-parsing.md
+++ b/_posts/2018-02-12-search-query-parsing.md
@@ -5,7 +5,7 @@ author: mattr
 tags:
 - search
 - data
-team: Data Science
+team: Applied Research
 ---

 Scribd has a variety of content to offer and connecting our users with their desired content is a crucial aspect of our product. One of the main ways that users find content on Scribd is through search, and in this post I want to delve into an analysis we did regarding parsing out valuable information from a user’s query in order to better serve them relevant results, and also learn more about what they are searching for.
diff --git a/_posts/2018-03-20-scribds-ab-testing.md b/_posts/2018-03-20-scribds-ab-testing.md
index 2b3e3ad..59a21d0 100644
--- a/_posts/2018-03-20-scribds-ab-testing.md
+++ b/_posts/2018-03-20-scribds-ab-testing.md
@@ -5,7 +5,7 @@ author: dfeldman
 tags:
 - testing
 - data
-team: Data Science
+team: Applied Research
 ---

 What is A/B testing?
diff --git a/_posts/2018-04-18-bandits-for-the-win.md b/_posts/2018-04-18-bandits-for-the-win.md
index d0a0734..a70db49 100644
--- a/_posts/2018-04-18-bandits-for-the-win.md
+++ b/_posts/2018-04-18-bandits-for-the-win.md
@@ -5,7 +5,7 @@ author: dfeldman
 tags:
 - testing
 - data
-team: Data Science
+team: Applied Research
 ---

 We love A/B testing at Scribd. What follows is a specific example to give you an inside look at the process from idea to implementation for an algorithm test.
diff --git a/_posts/2018-05-31-non-random-seo-test.md b/_posts/2018-05-31-non-random-seo-test.md
index b115427..262d008 100644
--- a/_posts/2018-05-31-non-random-seo-test.md
+++ b/_posts/2018-05-31-non-random-seo-test.md
@@ -6,7 +6,7 @@ tags:
 - seo
 - testing
 - data
-team: Data Science
+team: Applied Research
 ---

 Months ago, your friends convinced you to sign up for a half marathon. With three weeks to go, you haven’t even started training. In a growing panic, you turn to the internet for answers.
diff --git a/_posts/2019-02-07-calculating-customer-lifetime-revenue.md b/_posts/2019-02-07-calculating-customer-lifetime-revenue.md
index cbb708a..d4c679e 100644
--- a/_posts/2019-02-07-calculating-customer-lifetime-revenue.md
+++ b/_posts/2019-02-07-calculating-customer-lifetime-revenue.md
@@ -5,7 +5,7 @@ author: bclearly
 tags:
 - ltr
 - data
-team: Data Science
+team: Applied Research
 ---

 Why LTR? (Lifetime Revenue)
diff --git a/_posts/2019-03-04-experiments-with-seq2seq.md b/_posts/2019-03-04-experiments-with-seq2seq.md
index 8f3beac..ff10bb4 100644
--- a/_posts/2019-03-04-experiments-with-seq2seq.md
+++ b/_posts/2019-03-04-experiments-with-seq2seq.md
@@ -6,7 +6,7 @@ tags:
 - machinelearning
 - seq2seq
 - data
-team: Data Science
+team: Applied Research
 ---

 How much data do you need to train a seq2seq model? Let’s say that you want to translate sentences from one language to another. You probably need a bigger dataset to translate longer sentences than if you wanted to translate shorter ones. How does the need for data grow as the sentence length increases?
diff --git a/_posts/2021-07-08-identifying-document-types-at-scribd.md b/_posts/2021-07-08-identifying-document-types-at-scribd.md
index 0a922e0..4afb51f 100644
--- a/_posts/2021-07-08-identifying-document-types-at-scribd.md
+++ b/_posts/2021-07-08-identifying-document-types-at-scribd.md
@@ -4,12 +4,12 @@ title: "Identifying Document Types at Scribd"
 tags:
 - machinelearning
 - data
-team: Data Science
+team: Applied Research
 author: jonathanr
 ---

-[User-uploaded documents](https://www.scribd.com/docs) have been a core component of Scribd’s business from the very beginning. Users can upload and share documents, analogous to YouTube and videos. Consequently, our document corpus has become much larger and more diverse over the years. Understanding what we have in the document corpus unlocks many opportunities for discovery and recommendations. Over the past year, one of the missions of the Applied Research team has been to build a system to extract key document meta-data with the goal of enriching downstream discovery systems. Our approach combines semantic understanding with user behaviour in a multi-component machine learning system. This is part 1 in a series of blog posts explaining the challenges faced by the team and solutions explored while building this system. In this post, we present the limitations, challenges and solutions encountered when developing a model to classify arbitrary user-uploaded documents.
+[User-uploaded documents](https://www.scribd.com/docs) have been a core component of Scribd’s business from the very beginning. Users can upload and share documents, analogous to YouTube and videos. Consequently, our document corpus has become much larger and more diverse over the years. Understanding what we have in the document corpus unlocks many opportunities for discovery and recommendations. Over the past year, one of the missions of the Applied Research team has been to build a system to extract key document meta-data with the goal of enriching downstream discovery systems. Our approach combines semantic understanding with user behaviour in a multi-component machine learning system. This is part 1 in a series of blog posts explaining the challenges faced by the team and the solutions explored while building this system. In this post, we present the limitations, challenges and solutions encountered when developing a model to visually classify arbitrary user-uploaded documents.

## Initial Constraints

@@ -47,12 +47,10 @@ As mentioned in the introduction, we need an approach that is language and conte

Before the model training started, we faced an interesting data gathering problem. Our goal is to classify documents, so we must gather labelled documents. However, in order to train the page classifier mentioned above, we must also gather labelled pages. Naively, it might seem appropriate to gather labelled documents and use the document label for each of its pages. This isn't appropriate, as a single document can contain multiple types of pages. As an example, consider the pages in this document.
[Figure 2: Three different pages from the same document to demonstrate why we can't take the document label and assign it to each page.]

The first and third pages can be considered text-heavy, but definitely not the second. Taking all the pages of this document and labelling them as text-heavy would severely pollute our training and testing data. The same logic applies to each of our 6 classes.

@@ -60,35 +58,31 @@

To circumvent this challenge, we took an active learning approach to data gathering. We started with a small set of hand-labelled pages for each class and trained binary classifiers iteratively. The binary classification problem is simpler than the multi-class problem, so it requires less hand-labelled data to obtain reliable results. At each iteration, we evaluated the model's most confident and least confident predictions to get a sense of its inductive biases. Judging from these, we supplemented the training data for the next iteration to correct unwanted biases and build confidence in the resulting model and labels. One such labelling round is sketched below.
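The following is a minimal sketch of one such round, assuming a scikit-learn-style binary classifier. It is an illustration rather than our production code: `train_binary_classifier`, `review_and_label`, the unlabelled feature pool and the batch size are all hypothetical stand-ins for parts of the pipeline not shown in this post.

```python
# A sketch of one active-learning round, not our production pipeline.
# train_binary_classifier() and review_and_label() are hypothetical stand-ins.
import numpy as np

def pick_pages_for_review(model, pool_features, pool_ids, k=50):
    """Return the k most and k least confident pages in the unlabelled pool."""
    probs = model.predict_proba(pool_features)[:, 1]  # P(page belongs to the class)
    distance = np.abs(probs - 0.5)                    # distance from the decision boundary
    order = np.argsort(distance)
    least_confident = [pool_ids[i] for i in order[:k]]
    most_confident = [pool_ids[i] for i in order[-k:]]
    return most_confident, least_confident

labelled_pages = load_seed_labels()                   # small hand-labelled seed set
for round_number in range(5):
    model = train_binary_classifier(labelled_pages)   # e.g. "sheet music" vs. everything else
    most, least = pick_pages_for_review(model, pool_features, pool_ids)
    # Inspecting both extremes exposes bad inductive biases, so corrective
    # examples can be added to the training set before the next round.
    labelled_pages += review_and_label(most + least)
```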
The sheet music class is a prime example of tweaking inductive biases. Below is an example of a page that can cause a sheet music misclassification if the model learns that sheet music is any page with horizontal lines. Supplementing the training data at each iteration helps get rid of inductive biases like this.

[Figure 3: Example of possible sheet music misclassification due to wrong inductive biases.]

After creating these binary classifiers for each class, we have a large set of reliable labels, plus classifiers that can be used to gather more data if necessary.

### Building a Page Classifier

-The page classification problem is very similar to ImageNet classification, so we can leverage pre-trained ImageNet models. We used transfer learning in fast.ai and PyTorch to fine-tune pre-trained computer vision architectures for the page-classifier. After initial experiments, it was clear that models with very high ImageNet accuracy, such as EfficientNet, did not perform much better on our dataset. While it’s difficult to pinpoint exactly why this is the case, we believe it is because of the nature of the classification task, the page resolutions and our data.
+The page classification problem is very similar to ImageNet classification, so we can leverage pre-trained ImageNet models. We used transfer learning in [fast.ai](https://www.fast.ai/) and [PyTorch](https://pytorch.org/) to fine-tune pre-trained computer vision architectures for the page classifier. After initial experiments, it was clear that models with very high ImageNet accuracy, such as EfficientNet, did not perform much better on our dataset. While it’s difficult to pinpoint exactly why this is the case, we believe it is because of the nature of the classification task, the page resolutions and our data.

We found SqueezeNet, a relatively established lightweight architecture, to offer the best balance between accuracy and inference time. Models such as ResNets and DenseNets are so large that they take a lot of time to train and iterate on. SqueezeNet, however, is an order of magnitude smaller, which opens up more possibilities in our training scheme: we can fine-tune the entire network rather than being limited to using the pre-trained architecture as a fixed feature extractor, as is the case for larger models. A minimal sketch of such a fine-tuning setup is shown below.
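For illustration only, a fastai fine-tuning loop in this spirit might look like the sketch below. The folder layout, image size, epoch counts and learning rates are assumptions made for the example, not our actual settings.

```python
# Sketch of fine-tuning an ImageNet-pretrained SqueezeNet on page images with
# fastai. Paths, image size, epochs and learning rates are illustrative only.
from fastai.vision.all import *

# Assumes one sub-folder of rendered page images per class, e.g. pages/sheet_music/
dls = ImageDataLoaders.from_folder("pages/", valid_pct=0.2, item_tfms=Resize(224))

learn = cnn_learner(dls, squeezenet1_1, metrics=accuracy)  # torchvision SqueezeNet 1.1
learn.fit_one_cycle(3)   # first train only the new classification head
learn.unfreeze()         # SqueezeNet is small enough to then train end to end
learn.fit_one_cycle(10, lr_max=slice(1e-5, 1e-3))
```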
[Figure 4: SqueezeNet architectures taken from the paper. Left: SqueezeNet; Middle: SqueezeNet with simple bypass; Right: SqueezeNet with complex bypass.]

Additionally, for this particular model, low inference time is key in order to run it on hundreds of millions of documents. Inference time is also directly tied to cost, so a slower model would need significantly higher accuracy to justify the extra processing time.

-### Ensembling Pages for Document Classification
+### Ensembled Pages for Document Classification

We now have a model to classify individual pages, and we need to turn its page-level predictions into a single document-level prediction, ideally combining them with additional meta-data such as total page count, page dimensions, etc. However, our experiments showed that a simple ensemble of the page classifications provided an extremely strong baseline that was difficult to beat with meta-data. A sketch of such an ensemble follows.
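As a concrete illustration, the sketch below averages per-page class probabilities and defers when the result is not confident. The label set, the mean aggregation and the threshold value are assumptions made for the example, not the exact production rules.

```python
# Sketch of a simple page-level ensemble. The page classifier is assumed to
# emit one softmax probability vector per page; the class list, the mean
# aggregation and the 0.6 threshold are illustrative assumptions.
import numpy as np

CLASSES = ["text-heavy", "sheet-music", "comics", "tables", "forms", "other"]  # hypothetical 6-class set

def classify_document(page_probs: np.ndarray, threshold: float = 0.6):
    """page_probs: array of shape (n_pages, n_classes) from the page classifier."""
    doc_probs = page_probs.mean(axis=0)   # ensemble: average the per-page predictions
    best = int(doc_probs.argmax())
    if doc_probs[best] < threshold:
        return None, doc_probs            # low confidence: defer to fallback handling
    return CLASSES[best], doc_probs
```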
@@ -103,12 +97,10 @@ While there are different ways of dealing with this, our approach involved two s

### Where do we go from here?

[Figure 5: Diagram of the overall document understanding system. The red box is what we talked about in this post.]

Now that we have a model to filter documents based on visual cues, we can build dedicated information extraction models for each document type – sheet music, text-heavy, comics, tables. That is exactly how we proceed from here, starting with extracting information from text-heavy documents. Part 2 in this series will dive deeper into the challenges and solutions our team encountered while building these models.