
= Delta Optimize Lambda

This Lambda function can be used with a periodic trigger to optimize a configured Delta Lake table. Consult the deployment.tf file for an example of how to provision the function in AWS.

== Building

Building and testing the Lambda can be done with cargo: `cargo build` and `cargo test`.

In order to deploy this in AWS Lambda, it must first be built with the `cargo lambda` command line tool, e.g.:

[source,bash]
----
cargo lambda build --release --output-format zip
----

This will produce the file `target/lambda/lambda-delta-optimize/bootstrap.zip`.

== Infrastructure

The deployment.tf file contains the necessary Terraform to provision the function, a DynamoDB table for locking, and IAM permissions. This Terraform does not provision an S3 bucket to optimize.
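As a rough sketch of what such a deployment can look like (the resource names, lock-table schema, and IAM role reference below are assumptions for illustration; the deployment.tf in this repository is authoritative):

```terraform
# Sketch only; consult deployment.tf for the actual definitions.
# The DynamoDB locking convention used by the deltalake crate expects a
# table with a single "key" string attribute as the partition key.
resource "aws_dynamodb_table" "locking" {
  name         = "delta_rs_lock_table" # assumed name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "key"

  attribute {
    name = "key"
    type = "S"
  }
}

resource "aws_lambda_function" "optimize" {
  function_name = "delta-optimize" # assumed name
  filename      = "target/lambda/lambda-delta-optimize/bootstrap.zip"
  handler       = "bootstrap"
  runtime       = "provided.al2"
  role          = aws_iam_role.lambda.arn # assumed to be defined alongside
  memory_size   = 128

  environment {
    variables = {
      DATALAKE_LOCATION       = "s3://my-bucket-name/databases/bronze/http"
      AWS_S3_LOCKING_PROVIDER = "dynamodb"
      OPTIMIZE_DS             = "yesterday"
    }
  }
}
```

The periodic trigger (e.g. an EventBridge schedule rule) and IAM permissions would be provisioned alongside these resources.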

After configuring the necessary authentication for Terraform, the following steps can be used to provision:

[source,bash]
----
cargo lambda build --release --output-format zip
terraform init
terraform plan
terraform apply
----
NOTE: Terraform configures the Lambda to run with the smallest amount of memory Lambda allows (128 MB), which may not be sufficient for larger tables.

== Environment variables

The following environment variables must be set for the function to run properly:

[cols="1,2,3",options="header"]
|===
|Name |Value |Notes

|`DATALAKE_LOCATION`
|`s3://my-bucket-name/databases/bronze/http`
|The `s3://` URL of the desired table to optimize.

|`AWS_S3_LOCKING_PROVIDER`
|`dynamodb`
|Instructs the deltalake crate to use DynamoDB for locking to provide consistent writes into S3.

|`OPTIMIZE_DS`
|`yesterday`
|Only apply optimizations to the `ds` partition (`YYYY-mm-dd`). The `yesterday` value uses the previous day in UTC.
|===
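For local testing (e.g. with `cargo lambda watch`), the same variables can be exported in the shell before invoking the function; the bucket path below is a placeholder:

```shell
# Placeholder table location; substitute your own bucket and path.
export DATALAKE_LOCATION="s3://my-bucket-name/databases/bronze/http"
# Use DynamoDB-based locking for consistent S3 writes.
export AWS_S3_LOCKING_PROVIDER="dynamodb"
# Restrict optimization to yesterday's ds partition (UTC).
export OPTIMIZE_DS="yesterday"
```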

== Licensing

This repository is intentionally licensed under the AGPL 3.0. If your organization is interested in re-licensing this function for re-use, contact me via email for commercial licensing terms: rtyler@brokenco.de
