137 lines
4.3 KiB
Plaintext
137 lines
4.3 KiB
Plaintext
ifdef::env-github[]
|
|
:tip-caption: :bulb:
|
|
:note-caption: :information_source:
|
|
:important-caption: :heavy_exclamation_mark:
|
|
:caution-caption: :fire:
|
|
:warning-caption: :warning:
|
|
endif::[]
|
|
|
|
:toc: macro
|
|
|
|
= Delta S3 Loader
|
|
|
|
Delta S3 Loader is a project to quickly and cheaply bring JSON files added to
|
|
S3 buckets into Delta Lake. This can be highly useful for legacy or external
|
|
processes which rely on uploading JSON to an S3 bucket and cannot be properly
|
|
updated to write directly to link:https://delta.io[Delta Lake].
|
|
|
|
toc::[]
|
|
|
|
== Modes
|
|
|
|
Delta S3 Loader can be built into a standalone binary or an
|
|
link:https://aws.amazon.com/lambda/[AWS Lambda]. While both modes are
|
|
functionally identical they have different configuration requirements as
|
|
mentioned below.
|
|
|
|
=== Standalone
|
|
|
|
A standalone instance of Delta S3 Loader requires:
|
|
|
|
* Destination Delta table path
|
|
* SQS queue ARN
|
|
|
|
=== Lambda
|
|
|
|
When deployed with AWS Lambda, the Lambda function should be configured with an
|
|
AWS SQS trigger. This causes AWS to manage the queue on behalf of Delta S3
|
|
Loader. Learn more in the
|
|
link:https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-configure-lambda-function-trigger.html[AWS
|
|
Lambda trigger documentation].
|
|
|
|
== Design
|
|
|
|
This project is designed to work in a Lambda or packaged up into a container.
|
|
It relies on
|
|
link:https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html[S3
|
|
Event Notifications] which are delivered into an
|
|
link:http://aws.amazon.com/sqs/[Amazon SQS] queue. The S3 Event Notifications
|
|
should be configured to funnel events to a single SQS queue per table. The
|
|
Delta S3 Loader will take **all** messages from a single queue and insert those
|
|
into a single table.
|
|
|
|
Additionally, for source buckets which have _multiple_ types of data in them you may use link:https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html[filtering on event notifications] to specify different object prefixes, etc.
|
|
For example, in a bucket named `audit_logs` bucket that has data prefixed with:
|
|
|
|
* `databricks/workspaceId=123/*.json`
|
|
* `tableau/*.json`
|
|
* `admin_console/domain=github.com/*.json`
|
|
|
|
A deployment of Delta S3 Loader to _only_ process the `admin_console` events
|
|
into an Delta table would require the following event configuration:
|
|
|
|
.SQS Event Configuration
|
|
[source,xml]
|
|
----
|
|
<NotificationConfiguration>
|
|
<QueueConfiguration>
|
|
<Id>1</Id>
|
|
<Filter>
|
|
<S3Key>
|
|
<FilterRule>
|
|
<Name>prefix</Name>
|
|
<Value>admin_console/</Value>
|
|
</FilterRule>
|
|
<FilterRule>
|
|
<Name>suffix</Name>
|
|
<Value>json</Value>
|
|
</FilterRule>
|
|
</S3Key>
|
|
</Filter>
|
|
<Queue>arn:aws:sqs:us-west-2:444455556666:admin_console_audit_queue</Queue>
|
|
<Event>s3:ObjectCreated:Put</Event>
|
|
</QueueConfiguration>
|
|
</NotificationConfiguration>
|
|
----
|
|
|
|
[CAUTION]
|
|
====
|
|
Always use different source and destination S3 buckets to avoid infinite loops!
|
|
====
|
|
|
|
|
|
A standalone Delta S3 Loader invocation for the above queue might look something like:
|
|
|
|
|
|
[source,bash]
|
|
----
|
|
delta-s3-loader -t s3://warehouse/audit_logs_raw/databricks \ # <1>
|
|
-p domain \ # <2>
|
|
-q "arn:aws:sqs:us-west-2:444455556666:admin_console_audit_queue" # <3>
|
|
----
|
|
<1> Specify a destination Delta Lake table path in S3.
|
|
<2> Annotate the partition columns to help the loader partition data properly.
|
|
<3> Specify the input SQS queue by ARN
|
|
|
|
|
|
== Environment Variables
|
|
|
|
When running in an AWS Lambda, Delta S3 Loader should be configured solely with environment variables. In a standalone mode the daemon can be configured with command line options _or_ environment variables
|
|
|
|
|
|
|===
|
|
| Name | Required | Description
|
|
|
|
| `RUST_LOG`
|
|
| No
|
|
| Define the log level for the process: `error`, `warn`, `info`, `debug`.
|
|
|
|
|===
|
|
|
|
|
|
=== Authentication/Authorization
|
|
|
|
Delta S3 Loader assumes that the right AWS environment variables, such as
|
|
`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are defined in the environment.
|
|
Under the hood the Delta S3 Loader is not responsible for
|
|
authentication/authorization so please consult the
|
|
link:https://github.com/rusoto/rusoto/blob/master/AWS-CREDENTIALS.md[Rusoto AWS
|
|
credentials documentation] for more information.
|
|
|
|
|
|
|
|
== Related work
|
|
|
|
link:https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html[S3 Auto-loader in Apache Spark]
|
|
|