---
layout: post
title: Defining the Real-time Data Platform
tags:
- kafka
- scribd
- aws
---
One of the harder parts about building new platform infrastructure at a company
which has been around a while is figuring out exactly _where_ to
begin. At [Scribd](https://www.scribd.com/about/engineering) we have built a
good product and curated a large corpus of written content, but where next? As
I alluded to in [my previous
post](/2019/08/22/platform-engineering-at-scribd.html) about the Platform
Engineering organization, our "platform" components should help scale out,
accelerate, or open up entirely new avenues of development. In this article, I
want to describe one such project we have been working on and share some of the
thought process behind its inception and prioritization: the Real-time Data
Platform. (Sounds fancy, huh?)

My first couple of weeks at the company were intense. The idea of "Core
Platform" was sketched out as a team "to scale apps and data," but that was
about the extent of it. The task I took on was to learn as much as I could, as
quickly as I could, in order to get the recruiting and hiring machine started.
Basically, I needed to point Core Platform in a direction that was correct
enough at a high level to know what skills my future colleagues should have.
While I had _tons_ of discussions and did plenty of reading, I feel almost
sheepish admitting that much of our direction was heavily influenced by two
conversations, each of which took less than an hour.

The first was with [Kevin Perko](https://www.linkedin.com/in/kperko) (KP), the
head of our [Data Science team](https://www.scribd.com/about/data_science).
His team interacts the most with our current data platform (HDFS, Spark, Hive,
etc.); in essence, Data Science would be considered one of our customers. I
asked some variant of "what's wrong with the data infrastructure?" and KP
unloaded what must have been months of pent-up frustrations shared by his
entire team. The themes that emerged were:

* Developers don't think about the consumers of the data. Garbage in, garbage
out!
* Many nightly tasks spend a _lot_ of time performing unnecessary pre-processing of data.
* The performance of the system is generally poor. Depending on the time of
  day, ad-hoc queries from data scientists are competing for resources with
  automated tasks.
* Everything has to be done in a nightly dependency graph of tasks, and when
  something goes wrong, recovering from errors is a very manual process which
  typically ruins somebody's day.

After I assured KP that these were problems we would be solving, his next
statement became a mainstay of our relationship moving forward: "_when will it
be ready?_"

My second influential conversation was with [Mike
Lewis](https://twitter.com/mikkelewis), the head of Product. This conversation
was quite simple and didn't involve as much trauma counseling as the previous
one. I asked "what can't you do today because of our technology limitations?"
This is a good question to ask product teams every now and again: they are
frequently optimizing within their current constraints, and one role of
platform and infrastructure teams is to remove those constraints. We discussed
the way in which users convert from passersby, to trial, to paid subscribers.
He also highlighted the importance of our recommendations and search results in
this funnel, and lamented the speed at which we can highlight relevant content
to new users. The maxim goes: the faster a new user sees relevant and
interesting content, the more likely they are to stick around.

Pattern matching between the current problems and the technology needed to
enable new product initiatives, I named and defined the high-level objective
for the **Real-time Data Platform** as follows:

> _To provide a streaming data platform for collecting and acting upon behavioral data
> in near real-time, with the ultimate goal of enabling day-zero personalization in
> Scribd's products._

In more concrete terms, the platform is a collection of cloud-based services
(in AWS, more on that later) for ingesting, processing, and storing behavioral
events from frontend, backend, and mobile clients. The scope of the Real-time
Data Platform extends from event definition and schemas, through the layout of
events persisted into long-term queryable storage, to the tooling which sits
on top of that queryable storage.
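
To make "event definition and schema" a little more concrete, here is a
minimal sketch of what an ingestion client might look like: a hypothetical
"document viewed" behavioral event validated against a schema before being
produced to Kafka. The topic name, field names, and schema here are my
illustrative assumptions, not our actual definitions.

```python
import json
import time

from confluent_kafka import Producer  # pip install confluent-kafka
from jsonschema import validate       # pip install jsonschema

# Hypothetical contract for a "document viewed" behavioral event; the
# fields and types are illustrative, not Scribd's actual schema.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {"type": "string"},
        "user_id": {"type": "integer"},
        "document_id": {"type": "integer"},
        "occurred_at": {"type": "number"},
    },
    "required": ["event_type", "user_id", "document_id", "occurred_at"],
    "additionalProperties": False,
}

producer = Producer({"bootstrap.servers": "localhost:9092"})


def publish(event: dict) -> None:
    """Validate the event against the contract, then produce it to Kafka."""
    validate(instance=event, schema=EVENT_SCHEMA)  # raises ValidationError
    producer.produce("behavioral-events", value=json.dumps(event).encode())


publish({
    "event_type": "document_viewed",
    "user_id": 42,
    "document_id": 1001,
    "occurred_at": time.time(),
})
producer.flush()
```

Rejecting malformed events at the producer, rather than somewhere downstream,
is what keeps "garbage in, garbage out" from ever reaching the pipeline.
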
As the nominal "product owner" for the effort, I aimed to prescribe less about
which tools and technologies should be used and instead forced myself to
define tech-agnostic requirements, thereby leaving the discovery work to the
team I would ultimately hire.

The Real-time Data Platform must:

* Offer a high, nearing-100% data SLA, meaning we must design in such a way as
  to reduce data loss or corruption at every point of the pipeline.
* Maintain data provenance through the pipeline, from data creation to usage.
  In essence, a Data Scientist should be able to easily trace data back to
  where it originated and understand the transformative steps along the way.
* Treat event streams as API contracts, with schemas suggested or enforced
  wherever possible. A consumer of an event stream should be able to trust the
  quality of the events in that stream.
* Perform data processing and transformation as close to ingestion as
  possible. Events which arrive in long-term storage must be structured and
  partitioned for optimal query performance, with zero or minimal
  post-processing required for most use-cases (see the sketch after this
  list).
* Scale as the data volume grows, without requiring significant redesign or
  rework.

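As a sketch of the "process close to ingestion" requirement above, consider
routing each event to a date-partitioned location in long-term storage as it
arrives. The path layout and partition columns below are hypothetical, not our
actual scheme, but they show how downstream queries can prune partitions
instead of post-processing raw data:

```python
from datetime import datetime, timezone


def storage_key(event: dict) -> str:
    """Map an event to a Hive-style partitioned path in object storage."""
    ts = datetime.fromtimestamp(event["occurred_at"], tz=timezone.utc)
    return (
        f"events/event_type={event['event_type']}/"
        f"date={ts:%Y-%m-%d}/hour={ts:%H}/"
        f"{event['user_id']}-{int(event['occurred_at'] * 1000)}.json"
    )


print(storage_key({
    "event_type": "document_viewed",
    "user_id": 42,
    "occurred_at": 1567014691.0,
}))
# events/event_type=document_viewed/date=2019-08-28/hour=17/42-1567014691000.json
```

A query scoped to one event type and one day touches only those partitions,
which is exactly the "zero or minimal post-processing" property we are after.
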
In essence, we need to change a number of foundational ways in which we
generate, transfer, and think about the data which Scribd uses. As Core
Platform has peeled back layer after layer of this onion, we have been able to
affirm at each step of the way that we're moving in the right direction, which
is by itself quite exciting.

The design of the Real-time Data Platform which we're currently building out is
something I will share at a high level in a subsequent blog post.

I want to finish this one with some parting thoughts. If you are building
_anything_ foundational in a technology organization, you **must** talk to the
product team. You must also talk to your customers, but I don't like to ask
them what they want; I like to ask what they don't like and don't want. Listen
to that negative feedback and understand what lies beneath the frustrations.
Finally, have a vision for the future, but build and deliver incrementally.

When I first sketched this out, I was forthcoming in stating "this is a 2020
project." I made sure to clarify that this did not mean we wouldn't deliver
anything to the business for 18 months. Instead, I made sure to explain that
executing on this overall vision would be a long journey with milestones along
the way.

If you have never watched a skyscraper being built, you would be amazed at how
much of the time is spent digging a great big hole, sinking steel into
bedrock, and pouring concrete: months of people working in a city-block-sized
hole before anything takes shape that even resembles a skyscraper. Building
strong foundations takes time, but that is in essence the role of any platform
and infrastructure organization. The challenge is to keep the business moving
forward today while _also_ building those fundamental components upon which
the business will stand in a year or two.

It is tough, but that's exactly what I signed up for. :)