A utility for turning a bunch of Apache Parquet files into a Delta Lake table
R Tyler Croy bb5caee2f7 Introduce glue-create to handle initial table creation with AWS Athena 2024-04-21 23:07:14 +00:00

<html lang="en"> <head> </head>

Oxbow

Oxbow is a project to turn an existing storage location which contains Apache Parquet files into a Delta Lake table. It is intended to run either as an AWS Lambda function or as a command line application.

The project is named after Oxbow lakes to keep with the lake theme.

Using

Command Line

Executing cargo build --release from a clone of this repository will build the command line binary oxbow which can be used directly to convert a directory full of .parquet files into a Delta table.

This is an in-place operation: it converts the specified table location itself into a Delta table!

Simple local files
% oxbow --table ./path/to/my/parquet-files
Files on AWS
% export AWS_REGION=us-west-2
% export AWS_SECRET_ACCESS_KEY=xxxx
# Set other AWS environment variables
% oxbow --table s3://my-bucket/prefix/to/parquet

Lambda

The deployment/ directory contains the necessary Terraform to provision the function, a DynamoDB table for locking, an S3 bucket, and IAM permissions.

After configuring the necessary authentication for Terraform, the following steps can be used to provision:

cargo lambda build --release --output-format zip --bin oxbow-lambda
terraform init
terraform plan
terraform apply
Note

Terraform configures the Lambda to run with the smallest amount of memory allowed. For bucket locations with massive .parquet files, this may need to be tuned.

Advanced

To help ameliorate concurrency challenges for Delta Lake on AWS with the DynamoDB lock, the deployment/ directory also contains an "advanced" pattern which uses the group-events Lambda to help serialize S3 Bucket Notifications into an AWS SQS FIFO queue with Message Group IDs.

To build all the necessary code locally for the Advanced pattern, run make build-release.
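
The serialization idea behind the Advanced pattern can be illustrated with a minimal sketch. It is not taken from the group-events source: the function name message_group_id is hypothetical, and it assumes that the table location is the key's directory path up to the first hive-style partition segment (name=value) and that the key has already been URL-decoded. Using one group ID per table means a FIFO queue delivers that table's events in order, while different tables can still be processed in parallel.

```rust
/// Derive an SQS FIFO message group ID for an S3 object.
///
/// Hypothetical helper for illustration only. Assumptions:
/// - `key` has already been URL-decoded (S3 notifications encode keys),
/// - the table root is everything before the first `name=value`
///   partition directory, or the file's parent directory if there is none.
pub fn message_group_id(bucket: &str, key: &str) -> String {
    let mut segments: Vec<&str> = key.split('/').collect();
    // The last segment is the file name, which is not part of the table path.
    segments.pop();

    // Keep leading directories until a hive-style partition such as `ds=2024-01-01`.
    let table: Vec<&str> = segments
        .into_iter()
        .take_while(|segment| !segment.contains('='))
        .collect();

    format!("{}/{}", bucket, table.join("/"))
}
```

Two files under the same table, for example prefix/to/table/ds=2024-01-01/part-0.parquet and prefix/to/table/part-1.parquet, map to the same group ID. SQS caps the length of a group ID, so a real implementation may need to hash long prefixes.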

Development

Building and testing can be done with cargo: cargo test.

In order to deploy this in AWS Lambda, it must first be built with the cargo lambda command line tool, e.g.:

cargo lambda build --release --output-format zip

This will produce the file target/lambda/oxbow-lambda/bootstrap.zip, which can be uploaded directly in the web console or referenced in the Terraform (see deployment.tf).

Design

Command Line

When running oxbow via the command line, it is a one-time operation. It will take an existing directory or location full of .parquet files and create a Delta table out of it.
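
As a rough illustration of the first half of that operation, the sketch below walks a local directory and collects the .parquet files that would become part of the table. It uses only the Rust standard library; find_parquet_files is a hypothetical name, not a function from oxbow, and the real tool also has to read each file's schema and write the Delta log, which is not shown here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every file under `root` whose extension is `parquet`.
/// Hypothetical helper for illustration; not taken from the oxbow source.
pub fn find_parquet_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];

    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                // Partitioned tables nest their files in subdirectories.
                pending.push(path);
            } else if path.extension().map_or(false, |ext| ext == "parquet") {
                found.push(path);
            }
        }
    }

    // Sort so the result does not depend on directory iteration order.
    found.sort();
    Ok(found)
}
```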

Lambda

When running oxbow inside of an AWS Lambda function, it should be configured with an S3 Event Trigger. It will then create new commits to a Delta Lake table any time a .parquet file is added to the bucket/prefix.
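
The decision the function has to make for each notification can be sketched as below. This is an assumption-laden illustration rather than oxbow's actual code: TableAction and classify are hypothetical names, the event names follow the ObjectCreated:* and ObjectRemoved:* naming that S3 uses in event records, and keys under _delta_log/ are skipped because Delta checkpoints are themselves Parquet files.

```rust
/// What a single S3 notification should do to the Delta table.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub enum TableAction {
    /// Commit the object at this key as a new data file.
    Add(String),
    /// Commit the removal of the data file at this key.
    Remove(String),
    /// Not a data file, or not an event that changes the table.
    Ignore,
}

/// Map an S3 event name and object key to a table action.
pub fn classify(event_name: &str, key: &str) -> TableAction {
    // Only Parquet data files matter; the Delta log is written by the
    // table itself and its checkpoints also end in `.parquet`.
    if !key.ends_with(".parquet") || key.contains("_delta_log/") {
        return TableAction::Ignore;
    }

    if event_name.starts_with("ObjectCreated:") {
        TableAction::Add(key.to_string())
    } else if event_name.starts_with("ObjectRemoved:") {
        TableAction::Remove(key.to_string())
    } else {
        TableAction::Ignore
    }
}
```

Filtering inside the function is a second line of defense; the S3 trigger itself can also be restricted to a prefix and suffix so that unrelated objects never invoke the Lambda.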

Licensing

This repository is licensed under the AGPL 3.0. If your organization is interested in re-licensing this function for re-use, contact me via email for commercial licensing terms: rtyler@buoyantdata.com

</html>