ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]
:toc: macro

= Oxbow

Oxbow is a project to convert an existing storage location that contains
link:https://parquet.apache.org[Apache Parquet] files into a
link:https://delta.io[Delta Lake table].
It is intended to run either as an AWS Lambda function or as a command line
application.

The project is named after
link:https://en.wikipedia.org/wiki/Oxbow_lake[Oxbow lakes] to keep with the
lake theme.

toc::[]

== Using

=== Command Line

Executing `cargo build --release` from a clone of this repository will build
the command line binary `oxbow`, which can be used directly to convert a
directory full of `.parquet` files into a Delta table.

This is an _in place_ operation and will convert the specified table location
into a Delta table!

.Simple local files
[source,bash]
----
% oxbow --table ./path/to/my/parquet-files
----

.Files on AWS
[source,bash]
----
% export AWS_REGION=us-west-2
% export AWS_SECRET_ACCESS_KEY=xxxx
# Set other AWS environment variables
% oxbow --table s3://my-bucket/prefix/to/parquet
----

=== Lambda

The `deployment/` directory contains the necessary Terraform to provision the
function, a DynamoDB table for locking, an S3 bucket, and IAM permissions.

After configuring the necessary authentication for Terraform, the following
steps can be used to provision:

[source,bash]
----
cargo lambda build --release --output-format zip --bin oxbow-lambda
terraform init
terraform plan
terraform apply
----

[NOTE]
====
Terraform configures the Lambda to run with the smallest amount of memory
allowed. For bucket locations with massive `.parquet` files, this may need to
be tuned.
====

==== Advanced

To help ameliorate
link:https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html[concurrency challenges for Delta Lake on AWS]
with the DynamoDB lock, the `deployment/` directory also contains an
"advanced" pattern which uses the `group-events` Lambda to help serialize S3
Bucket Notifications into an AWS SQS FIFO with Message Group IDs.

To build all the necessary code locally for the Advanced pattern, please run
`make build-release`.

== Development

Building and testing can be done with cargo: `cargo test`.

In order to deploy this in AWS Lambda, it must first be built with the `cargo
lambda` command line tool, e.g.:

[source,bash]
----
cargo lambda build --release --output-format zip
----

This will produce the file `target/lambda/oxbow-lambda/bootstrap.zip`, which
can be uploaded directly in the web console or referenced in the Terraform
(see `deployment.tf`).

=== Design

==== Command Line

When running `oxbow` via the command line, it is a _one time operation_. It
will take an existing directory or location full of `.parquet` files and
create a Delta table out of it.

==== Lambda

When running `oxbow` inside an AWS Lambda function, it should be configured
with an S3 Event Trigger so that it creates new commits to a Delta Lake table
any time a `.parquet` file is added to the bucket/prefix.

== Licensing

This repository is licensed under the
link:https://www.gnu.org/licenses/agpl-3.0.en.html[AGPL 3.0]. If your
organization is interested in re-licensing this function for re-use, contact
me via email for commercial licensing terms: `rtyler@buoyantdata.com`