Real time data platform woooooo
parent
8227dc72b6
commit
5ce8ab8289
|
@ -0,0 +1,143 @@
|
|||
---
|
||||
layout: post
|
||||
title: Defining the Real-time Data Platform
|
||||
tags:
|
||||
- kafka
|
||||
- scribd
|
||||
- aws
|
||||
---
|
||||
|
||||
One of the harder parts about building new platform infrastructure at a company
|
||||
which has been around a while is figuring out exactly _where_ to
|
||||
begin. At [Scribd](https://www.scribd.com/about/engineering) the company has
|
||||
built a good product and curated a large corpus of written content, but
|
||||
where next? As I alluded to in [my previous
|
||||
post](/2019/08/22/platform-engineering-at-scribd.html) about the Platform
|
||||
Engineering organization, our "platform" components should help scale out,
|
||||
accelerate, or open up entirely new avenues of development. In this article, I
|
||||
want to describe one such project we have been working on and share some of the
|
||||
thought process behind its inception and prioritization: the Real-time Data
|
||||
Platform.
|
||||
|
||||
(sounds fancy huh?)
|
||||
|
||||
My first couple weeks at the company were intense.
|
||||
The idea of "Core Platform" was sketched out as a team "to scale apps and data" but that
|
||||
was about the extent of it. The task I took on was to learn as much as I could,
|
||||
as quickly as I could, in order to get the recruiting and hiring machine
|
||||
started. Basically, I
|
||||
needed to point Core Platform in a direction that was correct enough at a high
|
||||
level in order to know what skills my future colleagues should have. While I
|
||||
had _tons_ of discussions and did plenty of reading, I almost feel sheepish to
|
||||
admit this, but much of our direction was heavily influenced by two
|
||||
conversations, both of which took less than an hour.
|
||||
|
||||
The first was with [Kevin Perko](https://www.linkedin.com/in/kperko) (KP), the head
|
||||
of our [Data Science team](https://www.scribd.com/about/data_science). His team
|
||||
interacts the most with our current data platform (HDFS, Spark, Hive, etc.); in
|
||||
essence Data Science would be considered one of our customers. I asked some
|
||||
variant of "what's wrong with the data infrastructure?" and KP unloaded what
|
||||
must have been months of pent up frustrations shared by his entire team. The
|
||||
themes that emerged were:
|
||||
|
||||
* Developers don't think about the consumers of the data. Garbage in, garbage
|
||||
out!
|
||||
* Many nightly tasks spend a _lot_ of time performing unnecessary pre-processing of data.
|
||||
* The performance of the system is generally poor. Ad-hoc queries from data
|
||||
scientists, depending on the time of day, are competing for resources with
|
||||
automated tasks.
|
||||
* Everything has to be done in this nightly dependency graph of tasks, and when
|
||||
something goes wrong, recovering from errors is very manual and typically
|
||||
ruins somebody's day.
|
||||
|
||||
|
||||
After I assured KP that these were problems we would be solving, his next statement
|
||||
would become a mainstay of our relationship moving forward: "_when will it be
|
||||
ready?_"
|
||||
|
||||
My second influential conversation was with [Mike
|
||||
Lewis](https://twitter.com/mikkelewis), the head of Product. This conversation
|
||||
was quite simple and didn't involve as much trauma counseling as the previous.
|
||||
I asked "what can't you do today because of our technology limitations?" This
|
||||
is a good question to ask product teams every now and again. They frequently
|
||||
are optimizing within their current constraints. One role of
|
||||
platform and infrastructure teams is to remove those constraints. We discussed
|
||||
the way in which users convert from passersby, to trial, to paid subscribers.
|
||||
He also highlighted the importance of our recommendations and search results in
|
||||
this funnel, and lamented the speed at which we can highlight relevant content
|
||||
to new users. The maxim goes: the faster a new user sees relevant and
|
||||
interesting content, the more likely they are to stick around.
|
||||
|
||||
|
||||
Pattern matching between the current problems and the technology needed to
|
||||
enable new product initiatives, I named and defined the high-level objective for
|
||||
the **Real-time Data Platform** as follows:
|
||||
|
||||
> _To provide a streaming data platform for collecting and acting upon behavioral data
|
||||
> in near real-time with the ultimate goal to enable day zero personalization in
|
||||
> Scribd's products._
|
||||
|
||||
|
||||
In more concrete terms, the platform is a collection of cloud-based services
|
||||
(in AWS, more on that later) for ingesting, processing, and storing behavioral
|
||||
events from frontend, backend, and mobile clients. The scope of the Real-time
|
||||
Data Platform extends from event definition and schema, to the layout of events
|
||||
persisted into long-term queryable storage, and the tooling which sits on
|
||||
top of that queryable storage.
|
||||
|
||||
As the nominal "product owner" for the effort, I aimed to describe less about
|
||||
what tools and technologies should be used, and instead forced myself to define
|
||||
tech-agnostic requirements, thereby leaving the discovery work for the team I
|
||||
would ultimately hire.
|
||||
|
||||
The Real-time Data Platform must meet the following requirements:
|
||||
|
||||
* A high data SLA, nearing 100%. This means we must design in such a way as to reduce
|
||||
data loss or corruption at every point of the pipeline.
|
||||
* Maintain data provenance through the pipeline from data creation to usage. In
|
||||
essence, a Data Scientist should be able to easily track data from where it
|
||||
originated, and understand the transformative steps along the way.
|
||||
* Event streams should be considered API contracts, with schemas suggested or
|
||||
enforced when possible. A consumer from an event stream should be able to
|
||||
trust the quality of the events in that stream.
|
||||
* Data processing and transformation must happen as close to ingestion as
|
||||
possible. Events which arrive in long-term storage must be structured and
|
||||
partitioned for optimal query performance with zero or minimal post-processing
|
||||
required for most use-cases.
|
||||
* The platform must scale as the data volume grows without requiring
|
||||
significant redesign or rework.
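
To make a few of these requirements a little more concrete, here is a minimal sketch in Python. It is not how our platform is implemented, and none of the tooling has been chosen yet. The `Event` class, the `PAGE_VIEW_SCHEMA` fields, and the `dt=`/`hour=` storage layout are all hypothetical names made up for illustration. The sketch shows an event being checked against a schema contract at ingestion, a provenance trail being recorded at each step, and a partition path being derived so that the stored event needs no later reshuffling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical contract for one event type: field name -> required Python type.
PAGE_VIEW_SCHEMA = {"user_id": str, "document_id": str, "occurred_at": str}


@dataclass
class Event:
    name: str
    payload: dict
    # Ordered record of the pipeline steps that have touched this event.
    provenance: list = field(default_factory=list)


def validate(event: Event, schema: dict) -> Event:
    """Reject events that break the contract rather than pass garbage downstream."""
    for key, expected_type in schema.items():
        if key not in event.payload:
            raise ValueError(f"{event.name}: missing field {key!r}")
        if not isinstance(event.payload[key], expected_type):
            raise ValueError(
                f"{event.name}: field {key!r} is not {expected_type.__name__}"
            )
    event.provenance.append("validated")
    return event


def partition_path(event: Event) -> str:
    """Derive a date- and hour-partitioned prefix from the event's own timestamp.

    Assumes `occurred_at` is an ISO 8601 string with an explicit UTC offset.
    """
    occurred = datetime.fromisoformat(event.payload["occurred_at"])
    occurred = occurred.astimezone(timezone.utc)
    event.provenance.append("partitioned")
    return f"events/{event.name}/dt={occurred:%Y-%m-%d}/hour={occurred:%H}"


event = Event(
    "page_view",
    {
        "user_id": "u1",
        "document_id": "d9",
        "occurred_at": "2019-09-03T17:45:00+00:00",
    },
)
validate(event, PAGE_VIEW_SCHEMA)
path = partition_path(event)
```

In this sketch, a consumer can trust that every event under `path` has the fields the schema promises, and `event.provenance` lists the steps the event passed through on the way there.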
|
||||
|
||||
|
||||
In essence, we need to change a number of foundational ways in which we
|
||||
generate, transfer, and consider the data which Scribd uses. As Core Platform
|
||||
has peeled back layer after layer of this onion, we have been able to affirm at
|
||||
each step of the way that we're moving in the right direction, which is by
|
||||
itself quite exciting.
|
||||
|
||||
The design of the Real-time Data Platform which we're currently building out is
|
||||
something I will share at a high level in a subsequent blog post.
|
||||
|
||||
I want to finish this one with some parting thoughts. If you are building
|
||||
_anything_ foundational in a technology organization, you **must** talk to the
|
||||
product team. You must also talk to your customers, but I don't like to ask
|
||||
them what they want; I like to ask what they don't like and don't want. Listen
|
||||
to that negative feedback, and understand what lies beneath the frustrations.
|
||||
Finally, have a vision for the future, but build and deliver incrementally.
|
||||
When I first sketched this out, I was forthcoming in stating "this is a 2020
|
||||
project." I made sure to clarify that this did not mean we wouldn't deliver anything
|
||||
to the business for 18 months. Instead, I made sure to explain that to
|
||||
execute on this overall vision would be a long journey with milestones along
|
||||
the way.
|
||||
|
||||
If you haven't ever watched a skyscraper being built, you would be amazed at
|
||||
how much of the time is spent digging a great big hole, sinking steel into
|
||||
bedrock, and pouring concrete. Months of people working in a city block-sized
|
||||
hole before anything takes shape that even resembles a skyscraper. Building
|
||||
strong foundations takes time, but that is in essence the role of any platform
|
||||
and infrastructure organization. The challenge is to keep the business moving
|
||||
forward today while _also_ building those fundamental components upon which the
|
||||
business will stand in a year or two.
|
||||
|
||||
|
||||
It is tough, but that's exactly what I signed up for. :)
|
||||
|