An AWS Lambda for automatically loading JSON files as they're created into Delta tables
R Tyler Croy d26d27950c Playing around with reading the file from S3
This structure I don't like, I need to think a bit more about how to structure
the processor code
2021-06-13 09:43:26 -07:00
data Add some properly structured test data and a destination audit_log delta table 2021-04-21 10:30:35 -07:00
src Playing around with reading the file from S3 2021-06-13 09:43:26 -07:00
.gitignore Load environment variables optionally from .env 2021-06-12 09:48:34 -07:00
Cargo.lock Playing around with reading the file from S3 2021-06-13 09:43:26 -07:00
Cargo.toml Playing around with reading the file from S3 2021-06-13 09:43:26 -07:00
LICENSE.txt Initial commit 2021-04-14 20:00:27 -07:00
README.adoc Sketching out the design and implementation of Delta S3 Loader 2021-06-11 11:50:22 -07:00

README.adoc

<html lang="en"> <head> </head>

Delta S3 Loader

Delta S3 Loader is a project to quickly and cheaply bring JSON files added to S3 buckets into Delta Lake. This can be highly useful for legacy or external processes which rely on uploading JSON to an S3 bucket and cannot be properly updated to write directly to Delta Lake.

Modes

Delta S3 Loader can be built as a standalone binary or as an AWS Lambda function. Both modes are functionally identical, but their configuration requirements differ, as described below.

Standalone

A standalone instance of Delta S3 Loader requires:

  • Destination Delta table path

  • SQS queue ARN
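
The SQS receive APIs take a queue URL rather than an ARN, so a standalone process given an ARN would need to translate it at some point. The following is a minimal sketch of that translation using only the standard library. The names QueueArn, parse_queue_arn and queue_url are hypothetical and not taken from Delta S3 Loader, and the URL layout assumes the standard "aws" partition:

```rust
/// Pieces of an SQS queue ARN of the form
/// arn:aws:sqs:<region>:<account-id>:<queue-name>
#[derive(Debug, PartialEq)]
struct QueueArn<'a> {
    region: &'a str,
    account_id: &'a str,
    queue_name: &'a str,
}

/// Split an ARN into its parts, returning None for anything that is
/// not an SQS ARN in the "aws" partition (assumption: other partitions
/// use different endpoint domains and are out of scope for this sketch).
fn parse_queue_arn(arn: &str) -> Option<QueueArn<'_>> {
    let mut parts = arn.splitn(6, ':');
    if parts.next()? != "arn" || parts.next()? != "aws" || parts.next()? != "sqs" {
        return None;
    }
    let region = parts.next()?;
    let account_id = parts.next()?;
    let queue_name = parts.next()?;
    if region.is_empty() || account_id.is_empty() || queue_name.is_empty() {
        return None;
    }
    Some(QueueArn { region, account_id, queue_name })
}

/// Build the regional queue URL for a parsed ARN.
fn queue_url(arn: &QueueArn<'_>) -> String {
    format!(
        "https://sqs.{}.amazonaws.com/{}/{}",
        arn.region, arn.account_id, arn.queue_name
    )
}
```

Asking SQS for the URL with its GetQueueUrl operation is the more robust route; the string manipulation above only illustrates how the two identifiers relate.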

Lambda

When deployed with AWS Lambda, the Lambda function should be configured with an AWS SQS trigger. This causes AWS to manage the queue on behalf of Delta S3 Loader. Learn more in the AWS Lambda trigger documentation.

Design

This project is designed to run either as a Lambda function or packaged into a container. It relies on S3 Event Notifications, which are delivered into an Amazon SQS queue. The notifications should be configured so that each table has exactly one SQS queue: Delta S3 Loader takes all messages from a single queue and inserts them into a single table.
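
Each message on the queue carries an S3 event notification naming the bucket and object key. AWS documents that the object key in these notifications is URL-encoded, so a key may not arrive exactly as it was uploaded (for example, a space arrives as "+"). The following is a minimal sketch of decoding such a key with only the standard library. decode_s3_key is a hypothetical helper, not a function from Delta S3 Loader:

```rust
/// Decode an object key as it appears in an S3 event notification.
/// Returns None when a percent escape is malformed or the decoded
/// bytes are not valid UTF-8.
fn decode_s3_key(encoded: &str) -> Option<String> {
    let bytes = encoded.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // "+" stands for a space in the encoded key.
            b'+' => {
                out.push(b' ');
                i += 1;
            }
            // "%XX" is one byte written as two hex digits.
            b'%' => {
                let hex = encoded.get(i + 1..i + 3)?;
                if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
                    return None;
                }
                out.push(u8::from_str_radix(hex, 16).ok()?);
                i += 3;
            }
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    String::from_utf8(out).ok()
}
```

Decoding before fetching the object avoids "no such key" errors for keys that contain spaces or other escaped characters.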

Additionally, for source buckets that contain multiple types of data, you can filter event notifications by object key, for example by prefix. Consider a bucket named audit_logs that has data under the following prefixes:

  • databricks/workspaceId=123/*.json

  • tableau/*.json

  • admin_console/domain=github.com/*.json

A deployment of Delta S3 Loader that processes only the admin_console events into a Delta table would require the following event configuration:

SQS Event Configuration
<NotificationConfiguration>
  <QueueConfiguration>
      <Id>1</Id>
      <Filter>
          <S3Key>
              <FilterRule>
                  <Name>prefix</Name>
                  <Value>admin_console/</Value>
              </FilterRule>
              <FilterRule>
                  <Name>suffix</Name>
                  <Value>json</Value>
              </FilterRule>
          </S3Key>
      </Filter>
      <Queue>arn:aws:sqs:us-west-2:444455556666:admin_console_audit_queue</Queue>
      <Event>s3:ObjectCreated:Put</Event>
  </QueueConfiguration>
</NotificationConfiguration>
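
To check which keys a prefix/suffix pair like the one above would select, the two rules can be modelled as plain string tests. This is a sketch under the assumption that a key matches when it starts with the prefix and ends with the suffix. KeyFilter is a hypothetical type used only for illustration:

```rust
/// A prefix/suffix pair mirroring the two FilterRule entries.
struct KeyFilter<'a> {
    prefix: &'a str,
    suffix: &'a str,
}

impl KeyFilter<'_> {
    /// True when the key starts with the prefix and ends with the suffix.
    fn matches(&self, key: &str) -> bool {
        key.starts_with(self.prefix) && key.ends_with(self.suffix)
    }
}
```

With prefix "admin_console/" and suffix "json", a key such as admin_console/domain=github.com/events.json matches and tableau/events.json does not. Under the same model a key such as admin_console/export.ndjson also matches, so a suffix of ".json" may be the safer choice when only that extension is wanted.
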
Caution

Always use different source and destination S3 buckets to avoid infinite loops!
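
The loop arises because a Delta table records its commits as JSON files under _delta_log/, so a table written into the source bucket can emit new object-created events that match the same filter. A deployment script could guard against this with a check like the sketch below. bucket_of and same_bucket are hypothetical helpers, not part of Delta S3 Loader:

```rust
/// Extract the bucket name from an s3://bucket/path URI.
fn bucket_of(uri: &str) -> Option<&str> {
    let rest = uri.strip_prefix("s3://")?;
    let bucket = rest.split('/').next()?;
    if bucket.is_empty() {
        None
    } else {
        Some(bucket)
    }
}

/// True when the destination table lives in the bucket that also
/// produces the notifications, which risks an infinite loop.
fn same_bucket(source_bucket: &str, table_uri: &str) -> bool {
    bucket_of(table_uri) == Some(source_bucket)
}
```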

A standalone Delta S3 Loader invocation for the above queue might look something like:

delta-s3-loader -t s3://warehouse/audit_logs_raw/databricks \ # (1)
                -p domain \ # (2)
                -q "arn:aws:sqs:us-west-2:444455556666:admin_console_audit_queue" # (3)
  1. Specify a destination Delta Lake table path in S3.

  2. Annotate the partition columns to help the loader partition data properly.

  3. Specify the input SQS queue by ARN.
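
The README does not say how the loader derives values for a partition column. One possibility, shown purely as an assumption, is to read Hive-style column=value segments from the object key, which matches the key layout in the example bucket. partition_value is a hypothetical helper:

```rust
/// Find the value of a Hive-style `column=value` path segment in an
/// object key. Returns None when the key has no such segment.
fn partition_value<'a>(key: &'a str, column: &str) -> Option<&'a str> {
    key.split('/').find_map(|segment| {
        let (name, value) = segment.split_once('=')?;
        (name == column).then_some(value)
    })
}
```

For the key admin_console/domain=github.com/events.json and the column domain, this yields github.com. A loader could equally take partition values from a field inside each JSON record, so treat this only as an illustration of the key layout.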

Environment Variables

When running in AWS Lambda, Delta S3 Loader should be configured solely with environment variables. In standalone mode, the daemon can be configured with command-line options or environment variables.

Name      Required  Description
RUST_LOG  No        Define the log level for the process: error, warn, info, debug.
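
As an illustration of the four values listed above, the sketch below maps a bare RUST_LOG value onto a level. It is not the loader's logging setup: LogLevel and parse_level are hypothetical, and the per-module filter syntax that Rust logging crates commonly accept in RUST_LOG is not handled here:

```rust
/// The four levels named in the table, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum LogLevel {
    Error,
    Warn,
    Info,
    Debug,
}

/// Parse a bare level name, ignoring case and surrounding whitespace.
/// Returns None for anything else so the caller can choose a default.
fn parse_level(raw: &str) -> Option<LogLevel> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "error" => Some(LogLevel::Error),
        "warn" => Some(LogLevel::Warn),
        "info" => Some(LogLevel::Info),
        "debug" => Some(LogLevel::Debug),
        _ => None,
    }
}
```

A caller might combine it with the environment as std::env::var("RUST_LOG").ok().and_then(|v| parse_level(&v)).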

Authentication/Authorization

Delta S3 Loader assumes that the appropriate AWS environment variables, such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, are defined in the environment. The loader does not handle authentication/authorization itself, so consult the Rusoto AWS credentials documentation for more information.
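
A wrapper script or startup routine could report which of the two variables named above are unset before starting the loader. The sketch below is one way to do that. missing_credentials is a hypothetical helper, and the lookup is passed in as a closure so the check can be exercised without touching the real environment:

```rust
/// The two variables named in this section.
const CREDENTIAL_VARS: [&str; 2] = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"];

/// Return the names of the credential variables that are unset or empty
/// according to the supplied lookup function.
fn missing_credentials<F>(lookup: F) -> Vec<&'static str>
where
    F: Fn(&str) -> Option<String>,
{
    CREDENTIAL_VARS
        .iter()
        .copied()
        .filter(|name| lookup(*name).map_or(true, |value| value.is_empty()))
        .collect()
}
```

Against the real environment this would be called as missing_credentials(|name| std::env::var(name).ok()). A non-empty result is best treated as a warning rather than an error, since Rusoto's default credential chain can also find credentials in other places, such as a profile file or instance metadata.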

</html>