Perform some copy editing and transitions on the sections
I hope this helps the post flow a bit more cleanly, this change also updates the title and publication date
This commit is contained in:
parent
ebd3e9cb7f
commit
0150200490
|
@ -1,156 +0,0 @@
|
|||
---
|
||||
layout: post
|
||||
title: "Splitting Development Work Around Kafka Topics"
|
||||
author: christianw
|
||||
tags:
|
||||
- featured
|
||||
- kafka
|
||||
- msk-series
|
||||
team: Core Platform
|
||||
---
|
||||
|
||||
|
||||
Kafka is often lauded for its exceptionally high performance, but another
|
||||
valuable implication of its design is the seams it enables between work streams.
|
||||
We treat our Kafka topics like APIs rather than cogs in a single pipeline.
|
||||
That implies a lot of possible metaphors. One of the first that always springs
|
||||
to my mind is enabling organic usage growth of a specific topic (read: api endpoint)
|
||||
from future-as-of-yet-unknown consumers. But another similarity Kafka topics
|
||||
share with API endpoints is creating a natural seam between producer development
|
||||
tasks and consumer development tasks. As long as producer and consumer developers
|
||||
agree on a contract up front, development can run in parallel rather than blocking
|
||||
consumer development until after the producer is deployed. In this post,
|
||||
I will talk about how we have leveraged this seam while developing streaming pipelines.
|
||||
|
||||
## Kafka Topic Message Schemas
|
||||
|
||||
Kafka topics provide a persistent storage medium where producer applications write
|
||||
data to be consumed by other applications. To consume data from a topic appropriately,
|
||||
applications must know how to deserialize the messages stored on these topics.
|
||||
We formalize the data formats shared by producers and consumers by defining message
|
||||
schemas for each of our topics. We happen to serialize our messages as JSON,
|
||||
and we use a yaml formatting of [JSON schema](https://json-schema.org/) to formalize our shared schema definitions.
|
||||
|
||||
The snippet below shows a general version of what one of these message schemas look like in yaml.
|
||||
This comes from one of our real schemas but with all of the interesting fields removed.
|
||||
|
||||
```yaml
|
||||
---
|
||||
"$schema": http://json-schema.org/draft-07/schema#
|
||||
"$id": some-topic/v1.yml
|
||||
|
||||
title: Some Topic
|
||||
description: Event sent by ...
|
||||
type: object
|
||||
properties:
|
||||
meta:
|
||||
description: Metadata fields added by systems processing the message.
|
||||
type: object
|
||||
properties:
|
||||
schema:
|
||||
description: The message-schema written to this topic by producers.
|
||||
type: string
|
||||
producer:
|
||||
description: Metadata fields populated by the producer.
|
||||
type: object
|
||||
properties:
|
||||
application_version:
|
||||
description: The version of the producer application.
|
||||
type: string
|
||||
timestamp:
|
||||
description: An ISO 8601 formatted date-time with offset representing the time when the event was handled by the producer. E.g. 2020-01-01T15:59:60-08:00.
|
||||
type: string
|
||||
format: date-time
|
||||
required:
|
||||
- version
|
||||
- timestamp
|
||||
required:
|
||||
- producer
|
||||
uuid:
|
||||
description: A unique identifier for the event.
|
||||
type: string
|
||||
user_agent:
|
||||
description: The user agent header of the request.
|
||||
type: string
|
||||
|
||||
# ... - All the other fields
|
||||
|
||||
required:
|
||||
- meta
|
||||
- uuid
|
||||
```
|
||||
|
||||
And an example message that satisfies this schema would look something like:
|
||||
|
||||
```json
|
||||
{
|
||||
"meta": {
|
||||
"schema": "some-topic/v1.yaml",
|
||||
"producer": {
|
||||
"application_version": "v1.0.1",
|
||||
"timestamp": "2020-01-01T15:59:60-08:00"
|
||||
}
|
||||
},
|
||||
"uuid": "3a0178fb-43e7-4340-9e47-9560b7962755",
|
||||
"user_agent": ""
|
||||
}
|
||||
```
|
||||
|
||||
## Kafka Topics in Streaming Pipelines
|
||||
|
||||
For many of our new streaming pipelines, a Kafka topic is the first data sink in the pipeline.
|
||||
The next application in the pipeline reads the topic as a streaming source, performs some transformations
|
||||
and sinks to a [Databricks Delta Lake table](https://databricks.com/product/delta-lake-on-databricks).
|
||||
|
||||
![Streaming Pipeline](/post-images/2020-03-kafka-series/kafka-player-flow.png)
|
||||
|
||||
Sometimes, the work effort on each side of the topic is significant. In one of our recent projects,
|
||||
we had to implement a new Docker image, provision a number of new AWS cloud resources, and do a
|
||||
considerable amount of cross-team coordination and testing before the
|
||||
producer could be deployed to production and start writing data. The work on the consumer side
|
||||
was also quite significant. We had to implement a few different Spark Structured
|
||||
Streaming jobs downstream from the topic.
|
||||
|
||||
## Dividing Labor
|
||||
|
||||
Fortunately, since we define our message schemas beforehand, we know exactly what the data should look like
|
||||
on any given topic. In our recent project, we were able to generate a large file
|
||||
containing new-line delimited JSON and then do full integration testing of all components downstream
|
||||
of the Kafka topic before any production data had actually been written by the producer.
|
||||
|
||||
To do this, we built a very simple tool called [kafka-player](https://github.com/scribd/kafka-player) [^1].
|
||||
All `kafka-player` does is play a file, line-by-line onto a target Kafka topic. It provides a couple of options
|
||||
that make this slightly more flexible than just piping the file to `kafkacat`. Most notably the ability to control
|
||||
message rate.
|
||||
|
||||
When we were just getting started in the local development of our Spark applications, we pointed
|
||||
`kafka-player` to a local Kafka Docker container and set message rate
|
||||
very low (i.e. one message every two seconds), so we could watch transformations and aggregations
|
||||
flow through the streams and build confidence in the business logic we were implementing.
|
||||
After we nailed the business logic and deployed our Spark applications, we pointed `kafka-player` at our
|
||||
development [MSK Cluster](https://aws.amazon.com/msk/) and cranked up the message rate
|
||||
to various thresholds so we could watch the impact on our Spark job resources.
|
||||
|
||||
## Future Extensions
|
||||
|
||||
Controlling message rate has been a very useful feature of `kafka-player` for us already, but the
|
||||
other nice thing about having `kafka-player` in our shared toolbox is that we have a hook in place where we can
|
||||
build in new capabilities as new needs arise.
|
||||
|
||||
For our recent projects, we have been able to generate files
|
||||
representing our message schemas pretty easily so it made sense to keep the tool
|
||||
as simple as possible, but this might not always be the case.
|
||||
As we mature in our usage of JSON schema and encounter cases where generating a large file representing our
|
||||
schemas is impractical, we may find it useful to enhance `kafka-player` so that it can generate random data
|
||||
according to a message schema.
|
||||
|
||||
Deployment tooling may also be on the horizon for `kafka-player`. The level of integration testing
|
||||
we've achieved so far is helpful, but with an image and some container configuration,
|
||||
we could push multiple instances of `kafka-player` writing to the same topic to a container
|
||||
service and create enough traffic to push our downstream consumers to their breaking points.
|
||||
|
||||
|
||||
[^1]: kafka-player is a simple utility for playing a text file onto a Kafka topic that has been [open sourced](https://github.com/scribd/kafka-player) by Scribd.
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,185 @@
|
|||
---
|
||||
layout: post
|
||||
title: "Streaming Development Work with Kafka"
|
||||
author: christianw
|
||||
tags:
|
||||
- featured
|
||||
- kafka
|
||||
- msk-series
|
||||
team: Core Platform
|
||||
---
|
||||
|
||||
We are using
|
||||
[Apache Kafka](https://kafka.apache.org)
|
||||
as the heart of more streaming applications which presents interesting
|
||||
design and development challenges: where does the data producer's
|
||||
responsibilities end, and the consumer's begin? How coupled should they be? And
|
||||
of course, can we accelerate development by building them in parallel? To help
|
||||
address these questions, we treat our Kafka topics like APIs rather than cogs
|
||||
in a single pipeline. Like RESTful API endpoints, Kafka topics create a
|
||||
natural seam between development of the producer and that of the consumer As
|
||||
long as producer and consumer developers both agree on the API contract up
|
||||
front, development can run in parallel. In this post I will share our
|
||||
stream-oriented development approach and the open source utility we developed to
|
||||
make it easier: [kafka-player](https://github.com/scribd/kafka-player).
|
||||
|
||||
|
||||
## APIs for Topics
|
||||
|
||||
Perhaps the most important part of developing producers and consumers in
|
||||
parallel is creating those "API contracts." Kafka topics provide a persistent
|
||||
storage medium where producer applications write data to be consumed by other
|
||||
applications. To consume data from a topic appropriately, applications must
|
||||
know how to deserialize the messages stored on these topics. We formalize the
|
||||
data formats shared by producers and consumers by defining message schemas for
|
||||
each of our topics. We happen to serialize our messages as JSON, and we use a
|
||||
yaml formatting of [JSON schema](https://json-schema.org/) to formalize our
|
||||
shared schema definitions, giving us the necessary API contract to enforce
|
||||
between producer and consumer.
|
||||
|
||||
The snippet below shows a general version of what one of these message schemas look like in yaml.
|
||||
This comes from one of our real schemas but with all of the interesting fields removed.
|
||||
|
||||
```yaml
|
||||
---
|
||||
"$schema": http://json-schema.org/draft-07/schema#
|
||||
"$id": some-topic/v1.yml
|
||||
|
||||
title: Some Topic
|
||||
description: Event sent by ...
|
||||
type: object
|
||||
properties:
|
||||
meta:
|
||||
description: Metadata fields added by systems processing the message.
|
||||
type: object
|
||||
properties:
|
||||
schema:
|
||||
description: The message-schema written to this topic by producers.
|
||||
type: string
|
||||
producer:
|
||||
description: Metadata fields populated by the producer.
|
||||
type: object
|
||||
properties:
|
||||
application_version:
|
||||
description: The version of the producer application.
|
||||
type: string
|
||||
timestamp:
|
||||
description: An ISO 8601 formatted date-time with offset representing the time when the event was handled by the producer. E.g. 2020-01-01T15:59:60-08:00.
|
||||
type: string
|
||||
format: date-time
|
||||
required:
|
||||
- version
|
||||
- timestamp
|
||||
required:
|
||||
- producer
|
||||
uuid:
|
||||
description: A unique identifier for the event.
|
||||
type: string
|
||||
user_agent:
|
||||
description: The user agent header of the request.
|
||||
type: string
|
||||
|
||||
# ... - All the other fields
|
||||
|
||||
required:
|
||||
- meta
|
||||
- uuid
|
||||
```
|
||||
|
||||
And an example message that satisfies this schema would look something like:
|
||||
|
||||
```json
|
||||
{
|
||||
"meta": {
|
||||
"schema": "some-topic/v1.yaml",
|
||||
"producer": {
|
||||
"application_version": "v1.0.1",
|
||||
"timestamp": "2020-01-01T15:59:60-08:00"
|
||||
}
|
||||
},
|
||||
"uuid": "3a0178fb-43e7-4340-9e47-9560b7962755",
|
||||
"user_agent": ""
|
||||
}
|
||||
```
|
||||
|
||||
With the API contract defined, we're one step closer to decoupled development
|
||||
of producer and consumer!
|
||||
|
||||
|
||||
## Working with a Streaming Pipeline
|
||||
|
||||
For many of our new streaming pipelines, a Kafka topic is the first data sink in the pipeline.
|
||||
The next application in the pipeline reads the topic as a streaming source, performs some transformations
|
||||
and sinks to a [Databricks Delta Lake table](https://databricks.com/product/delta-lake-on-databricks).
|
||||
|
||||
![Streaming Pipeline](/post-images/2020-03-kafka-series/kafka-player-flow.png)
|
||||
|
||||
|
||||
The effort to build applications on each side of those topics may be
|
||||
significant. In one of our recent projects, we had to implement a new Docker
|
||||
image, provision a number of new AWS cloud resources, and do a considerable
|
||||
amount of cross-team coordination before the producer could even be deployed to
|
||||
production. The work on the consumer side was also quite significant as we had
|
||||
to implement a number of Spark Structured Streaming jobs downstream from the
|
||||
topic.
|
||||
|
||||
Naturally we didn't want to block the development of the consumer-side while we
|
||||
waited for all the producer-side changes to be implemented and delivered.
|
||||
|
||||
## Dividing Labor
|
||||
|
||||
Fortunately, since we define our message schemas beforehand, we know exactly
|
||||
what the data should look like on any given topic. In that recent project, we
|
||||
were able to generate a large file containing new-line delimited JSON and then
|
||||
do full integration testing of all components downstream of the Kafka topic
|
||||
before any production data had actually been written by the producer.
|
||||
|
||||
To help with this, we built a very simple tool called [kafka-player](https://github.com/scribd/kafka-player) [^1].
|
||||
All `kafka-player` does is play a file, line-by-line onto a target Kafka topic.
|
||||
It provides a couple of options that make this slightly more flexible than just
|
||||
piping the file to `kafkacat`. Most notably the ability to control
|
||||
message rate.
|
||||
|
||||
When we were just getting started in the local development of our Spark applications, we pointed
|
||||
`kafka-player` to a local Kafka Docker container and set message rate
|
||||
very low (i.e. one message every two seconds), so we could watch transformations and aggregations
|
||||
flow through the streams and build confidence in the business logic we were implementing.
|
||||
After we nailed the business logic and deployed our Spark applications, we pointed `kafka-player` at our
|
||||
development [MSK Cluster](https://aws.amazon.com/msk/) and cranked up the message rate
|
||||
to various thresholds so we could watch the impact on our Spark job resources.
|
||||
|
||||
## Future Extensions
|
||||
|
||||
Controlling message rate has been a very useful feature of `kafka-player` for us already, but the
|
||||
other nice thing about having `kafka-player` in our shared toolbox is that we have a hook in place where we can
|
||||
build in new capabilities as new needs arise.
|
||||
|
||||
For our recent projects, we have been able to generate files
|
||||
representing our message schemas pretty easily so it made sense to keep the tool
|
||||
as simple as possible, but this might not always be the case.
|
||||
As we mature in our usage of JSON schema and encounter cases where generating a large file representing our
|
||||
schemas is impractical, we may find it useful to enhance `kafka-player` so that it can generate random data
|
||||
according to a message schema.
|
||||
|
||||
Deployment tooling may also be on the horizon for `kafka-player`. The level of integration testing
|
||||
we've achieved so far is helpful, but with an image and some container configuration,
|
||||
we could push multiple instances of `kafka-player` writing to the same topic to a container
|
||||
service and create enough traffic to push our downstream consumers to their breaking points.
|
||||
|
||||
|
||||
---
|
||||
|
||||
|
||||
Most of our data workloads at Scribd are still very batch-oriented, but
|
||||
streaming applications are already showing incredible potential. The ability to
|
||||
process, aggregate, and join data in real-time has already opened up avenues
|
||||
for our product and engineering teams. As we increase the amount of data which
|
||||
is streamed, I will look forward to sharing more of the tools, tips, and tricks
|
||||
we're adopting to move Scribd engineering into a more "real-time" world!
|
||||
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
[^1]: kafka-player is a simple utility for playing a text file onto a Kafka topic that has been [open sourced](https://github.com/scribd/kafka-player) by Scribd.
|
Loading…
Reference in New Issue