Perform some copy editing and add transitions between the sections

I hope this helps the post flow a bit more cleanly. This change also updates
the title and publication date.
R Tyler Croy 2020-03-24 09:23:29 -07:00
parent ebd3e9cb7f
commit 0150200490
GPG Key ID: E5C92681BEF6CEA2
2 changed files with 185 additions and 156 deletions

@ -1,156 +0,0 @@
---
layout: post
title: "Splitting Development Work Around Kafka Topics"
author: christianw
tags:
- featured
- kafka
- msk-series
team: Core Platform
---
Kafka is often lauded for its exceptionally high performance, but another
valuable implication of its design is the seams it enables between work streams.
We treat our Kafka topics like APIs rather than cogs in a single pipeline.
That implies a lot of possible metaphors. One of the first that always springs
to my mind is enabling organic usage growth of a specific topic (read: API endpoint)
from as-yet-unknown future consumers. But another similarity Kafka topics
share with API endpoints is creating a natural seam between producer development
tasks and consumer development tasks. As long as producer and consumer developers
agree on a contract up front, development can run in parallel rather than blocking
consumer development until after the producer is deployed. In this post,
I will talk about how we have leveraged this seam while developing streaming pipelines.
## Kafka Topic Message Schemas
Kafka topics provide a persistent storage medium where producer applications write
data to be consumed by other applications. To consume data from a topic appropriately,
applications must know how to deserialize the messages stored on these topics.
We formalize the data formats shared by producers and consumers by defining message
schemas for each of our topics. We happen to serialize our messages as JSON,
and we use a YAML representation of [JSON Schema](https://json-schema.org/) to express our shared schema definitions.
The snippet below shows a general version of what one of these message schemas looks like in YAML.
This comes from one of our real schemas but with all of the interesting fields removed.
```yaml
---
"$schema": http://json-schema.org/draft-07/schema#
"$id": some-topic/v1.yml
title: Some Topic
description: Event sent by ...
type: object
properties:
  meta:
    description: Metadata fields added by systems processing the message.
    type: object
    properties:
      schema:
        description: The message-schema written to this topic by producers.
        type: string
      producer:
        description: Metadata fields populated by the producer.
        type: object
        properties:
          application_version:
            description: The version of the producer application.
            type: string
          timestamp:
            description: An ISO 8601 formatted date-time with offset representing the time when the event was handled by the producer. E.g. 2020-01-01T15:59:60-08:00.
            type: string
            format: date-time
        required:
          - application_version
          - timestamp
    required:
      - producer
  uuid:
    description: A unique identifier for the event.
    type: string
  user_agent:
    description: The user agent header of the request.
    type: string
  # ... - All the other fields
required:
  - meta
  - uuid
```
And an example message that satisfies this schema would look something like:
```json
{
  "meta": {
    "schema": "some-topic/v1.yaml",
    "producer": {
      "application_version": "v1.0.1",
      "timestamp": "2020-01-01T15:59:60-08:00"
    }
  },
  "uuid": "3a0178fb-43e7-4340-9e47-9560b7962755",
  "user_agent": ""
}
```
## Kafka Topics in Streaming Pipelines
For many of our new streaming pipelines, a Kafka topic is the first data sink in the pipeline.
The next application in the pipeline reads the topic as a streaming source, performs some transformations
and sinks to a [Databricks Delta Lake table](https://databricks.com/product/delta-lake-on-databricks).
![Streaming Pipeline](/post-images/2020-03-kafka-series/kafka-player-flow.png)
Sometimes, the work effort on each side of the topic is significant. In one of our recent projects,
we had to implement a new Docker image, provision a number of new AWS cloud resources, and do a
considerable amount of cross-team coordination and testing before the
producer could be deployed to production and start writing data. The work on the consumer side
was also quite significant. We had to implement a few different Spark Structured
Streaming jobs downstream from the topic.
## Dividing Labor
Fortunately, since we define our message schemas beforehand, we know exactly what the data should look like
on any given topic. In our recent project, we were able to generate a large file
containing newline-delimited JSON and then do full integration testing of all components downstream
of the Kafka topic before any production data had actually been written by the producer.
To do this, we built a very simple tool called [kafka-player](https://github.com/scribd/kafka-player) [^1].
All `kafka-player` does is play a file, line by line, onto a target Kafka topic. It provides a couple of options
that make it slightly more flexible than just piping the file to `kafkacat`, most notably the ability to control
the message rate.
When we were just getting started in the local development of our Spark applications, we pointed
`kafka-player` at a local Kafka Docker container and set the message rate
very low (e.g., one message every two seconds), so we could watch transformations and aggregations
flow through the streams and build confidence in the business logic we were implementing.
After we nailed the business logic and deployed our Spark applications, we pointed `kafka-player` at our
development [MSK Cluster](https://aws.amazon.com/msk/) and cranked up the message rate
to various thresholds so we could watch the impact on our Spark job resources.
## Future Extensions
Controlling message rate has been a very useful feature of `kafka-player` for us already, but the
other nice thing about having `kafka-player` in our shared toolbox is that we have a hook in place where we can
build in new capabilities as new needs arise.
For our recent projects, we have been able to generate files
representing our message schemas pretty easily, so it made sense to keep the tool
as simple as possible, but this might not always be the case.
As we mature in our usage of JSON schema and encounter cases where generating a large file representing our
schemas is impractical, we may find it useful to enhance `kafka-player` so that it can generate random data
according to a message schema.
Deployment tooling may also be on the horizon for `kafka-player`. The level of integration testing
we've achieved so far is helpful, but with an image and some container configuration,
we could push multiple instances of `kafka-player` writing to the same topic to a container
service and create enough traffic to push our downstream consumers to their breaking points.
[^1]: kafka-player is a simple utility for playing a text file onto a Kafka topic that has been [open sourced](https://github.com/scribd/kafka-player) by Scribd.

@ -0,0 +1,185 @@
---
layout: post
title: "Streaming Development Work with Kafka"
author: christianw
tags:
- featured
- kafka
- msk-series
team: Core Platform
---
We are using
[Apache Kafka](https://kafka.apache.org)
as the heart of more and more of our streaming applications, which presents
interesting design and development challenges: where do the data producer's
responsibilities end, and the consumer's begin? How coupled should they be? And
of course, can we accelerate development by building them in parallel? To help
address these questions, we treat our Kafka topics like APIs rather than cogs
in a single pipeline. Like RESTful API endpoints, Kafka topics create a
natural seam between development of the producer and that of the consumer. As
long as producer and consumer developers both agree on the API contract up
front, development can run in parallel. In this post, I will share our
stream-oriented development approach and the open source utility we developed to
make it easier: [kafka-player](https://github.com/scribd/kafka-player).
## APIs for Topics
Perhaps the most important part of developing producers and consumers in
parallel is creating those "API contracts." Kafka topics provide a persistent
storage medium where producer applications write data to be consumed by other
applications. To consume data from a topic appropriately, applications must
know how to deserialize the messages stored on these topics. We formalize the
data formats shared by producers and consumers by defining message schemas for
each of our topics. We happen to serialize our messages as JSON, and we use a
YAML representation of [JSON Schema](https://json-schema.org/) to express our
shared schema definitions, giving us the necessary API contract to enforce
between producer and consumer.
The snippet below shows a general version of what one of these message schemas looks like in YAML.
This comes from one of our real schemas but with all of the interesting fields removed.
```yaml
---
"$schema": http://json-schema.org/draft-07/schema#
"$id": some-topic/v1.yml
title: Some Topic
description: Event sent by ...
type: object
properties:
  meta:
    description: Metadata fields added by systems processing the message.
    type: object
    properties:
      schema:
        description: The message-schema written to this topic by producers.
        type: string
      producer:
        description: Metadata fields populated by the producer.
        type: object
        properties:
          application_version:
            description: The version of the producer application.
            type: string
          timestamp:
            description: An ISO 8601 formatted date-time with offset representing the time when the event was handled by the producer. E.g. 2020-01-01T15:59:60-08:00.
            type: string
            format: date-time
        required:
          - application_version
          - timestamp
    required:
      - producer
  uuid:
    description: A unique identifier for the event.
    type: string
  user_agent:
    description: The user agent header of the request.
    type: string
  # ... - All the other fields
required:
  - meta
  - uuid
```
And an example message that satisfies this schema would look something like:
```json
{
  "meta": {
    "schema": "some-topic/v1.yaml",
    "producer": {
      "application_version": "v1.0.1",
      "timestamp": "2020-01-01T15:59:60-08:00"
    }
  },
  "uuid": "3a0178fb-43e7-4340-9e47-9560b7962755",
  "user_agent": ""
}
```
With the API contract defined, we're one step closer to decoupled development
of producer and consumer!
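Because the contract is just JSON Schema, it is also easy to check messages against it in code. The sketch below shows one way to do that in Python, assuming the `PyYAML` and `jsonschema` packages and a hypothetical local copy of the schema file; none of these names are prescribed by our actual tooling.

```python
import yaml                      # PyYAML: load the YAML-formatted JSON Schema
from jsonschema import validate  # jsonschema: check instances against it

# Load the topic's message schema (hypothetical local path).
with open("schemas/some-topic/v1.yml") as schema_file:
    schema = yaml.safe_load(schema_file)

# A candidate message, e.g. one a producer is about to write.
message = {
    "meta": {
        "schema": "some-topic/v1.yaml",
        "producer": {
            "application_version": "v1.0.1",
            "timestamp": "2020-01-01T15:59:59-08:00",
        },
    },
    "uuid": "3a0178fb-43e7-4340-9e47-9560b7962755",
    "user_agent": "",
}

# Raises jsonschema.exceptions.ValidationError if the contract is broken.
validate(instance=message, schema=schema)
```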
## Working with a Streaming Pipeline
For many of our new streaming pipelines, a Kafka topic is the first data sink in the pipeline.
The next application in the pipeline reads the topic as a streaming source, performs some transformations
and sinks to a [Databricks Delta Lake table](https://databricks.com/product/delta-lake-on-databricks).
![Streaming Pipeline](/post-images/2020-03-kafka-series/kafka-player-flow.png)
The effort to build applications on each side of those topics may be
significant. In one of our recent projects, we had to implement a new Docker
image, provision a number of new AWS cloud resources, and do a considerable
amount of cross-team coordination before the producer could even be deployed to
production. The work on the consumer side was also quite significant as we had
to implement a number of Spark Structured Streaming jobs downstream from the
topic.
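For a rough idea of the consumer side, a minimal Spark Structured Streaming job of this shape could look something like the sketch below. This is a PySpark sketch of the pattern rather than one of our production jobs; the broker address, topic name, schema fields, and table paths are all placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("some-topic-consumer").getOrCreate()

# Spark schema mirroring the fields we care about from the message schema.
message_schema = StructType([
    StructField("uuid", StringType()),
    StructField("user_agent", StringType()),
])

# Read the Kafka topic as a streaming source (placeholder broker and topic).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "some-topic")
    .load()
)

# Kafka values arrive as bytes; parse the JSON payload per the contract.
events = (
    raw.select(from_json(col("value").cast("string"), message_schema).alias("event"))
    .select("event.*")
)

# Sink to a Delta Lake table (placeholder paths).
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/some_topic")
    .start("/delta/some_topic")
)
```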
Naturally, we didn't want to block development of the consumer side while we
waited for all the producer-side changes to be implemented and delivered.
## Dividing Labor
Fortunately, since we define our message schemas beforehand, we know exactly
what the data should look like on any given topic. In that recent project, we
were able to generate a large file containing newline-delimited JSON and then
do full integration testing of all components downstream of the Kafka topic
before any production data had actually been written by the producer.
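Generating that file can be as simple as a small script. Here is a sketch of the idea in Python; the field values and file name are made up for illustration and are not our real payload.

```python
import json
import uuid
from datetime import datetime, timezone

# Write a few thousand schema-conforming test events as newline-delimited JSON.
with open("some-topic-test-data.json", "w") as out:
    for _ in range(10_000):
        event = {
            "meta": {
                "schema": "some-topic/v1.yaml",
                "producer": {
                    "application_version": "v0.0.0-test",
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                },
            },
            "uuid": str(uuid.uuid4()),
            "user_agent": "kafka-player-test",
        }
        out.write(json.dumps(event) + "\n")
```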
To help with this, we built a very simple tool called [kafka-player](https://github.com/scribd/kafka-player) [^1].
All `kafka-player` does is play a file, line by line, onto a target Kafka topic.
It provides a couple of options that make it slightly more flexible than just
piping the file to `kafkacat`, most notably the ability to control the message
rate.
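To make the idea concrete, here is a rough sketch of that play-a-file-with-rate-control loop using the `kafka-python` client. This is only an illustration of the concept, not `kafka-player`'s actual implementation or command-line interface.

```python
import time

from kafka import KafkaProducer  # kafka-python client


def play_file(path, topic, bootstrap_servers, messages_per_second=1.0):
    """Play a newline-delimited file onto a Kafka topic at a fixed rate."""
    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    delay = 1.0 / messages_per_second
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            producer.send(topic, value=line)
            time.sleep(delay)  # crude rate control between messages
    producer.flush()


# e.g. one message every two seconds against a local Kafka container
play_file("some-topic-test-data.json", "some-topic",
          ["localhost:9092"], messages_per_second=0.5)
```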
When we were just getting started in the local development of our Spark applications, we pointed
`kafka-player` at a local Kafka Docker container and set the message rate
very low (e.g., one message every two seconds), so we could watch transformations and aggregations
flow through the streams and build confidence in the business logic we were implementing.
After we nailed the business logic and deployed our Spark applications, we pointed `kafka-player` at our
development [MSK Cluster](https://aws.amazon.com/msk/) and cranked up the message rate
to various thresholds so we could watch the impact on our Spark job resources.
## Future Extensions
Controlling message rate has been a very useful feature of `kafka-player` for us already, but the
other nice thing about having `kafka-player` in our shared toolbox is that we have a hook in place where we can
build in new capabilities as new needs arise.
For our recent projects, we have been able to generate files
representing our message schemas pretty easily, so it made sense to keep the tool
as simple as possible, but this might not always be the case.
As we mature in our usage of JSON schema and encounter cases where generating a large file representing our
schemas is impractical, we may find it useful to enhance `kafka-player` so that it can generate random data
according to a message schema.
Deployment tooling may also be on the horizon for `kafka-player`. The level of integration testing
we've achieved so far is helpful, but with an image and some container configuration,
we could push multiple instances of `kafka-player` writing to the same topic to a container
service and create enough traffic to push our downstream consumers to their breaking points.
---
Most of our data workloads at Scribd are still very batch-oriented, but
streaming applications are already showing incredible potential. The ability to
process, aggregate, and join data in real-time has already opened up avenues
for our product and engineering teams. As we increase the amount of data we
stream, I look forward to sharing more of the tools, tips, and tricks
we're adopting to move Scribd engineering into a more "real-time" world!
---
[^1]: kafka-player is a simple utility for playing a text file onto a Kafka topic that has been [open sourced](https://github.com/scribd/kafka-player) by Scribd.