lambda-delta-optimize/README.adoc

80 lines
2.2 KiB
Plaintext
Raw Permalink Normal View History

2023-03-27 19:46:52 +00:00
ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]
:toc: macro
= Delta Optimize Lambda
This Lambda function can be used with a periodic trigger to optimize a
configured link:https://delta.io[Delta Lake] table. Consult the `deployment.tf`
file for an example of how to provision the function in AWS.
2023-03-27 19:46:52 +00:00
toc::[]
== Building
Building and testing the Lambda can be done with cargo: `cargo test build`.
In order to deploy this in AWS Lambda, it must first be built with the `cargo
lambda` command line tool, e.g.:
[source,bash]
----
cargo lambda build --release --output-format zip
----
This will produce the file: `target/lambda/lambda-delta-optimize/bootstrap.zip`
== Infrastructure
The `deployment.tf` file contains the necessary Terraform to provision the
function, a DynamoDB table for locking, and IAM permissions. This Terraform
does *not* provision an S3 bucket to optimize.
After configuring the necessary authentication for Terraform, the following
steps can be used to provision:
[source,bash]
----
cargo lambda build --release --output-format zip
terraform init
terraform plan
terraform apply
----
[NOTE]
====
Terraform configures the Lambda to run with the smallest amount of memory allowed. For a sizable table, this may not be sufficient for larger tables.
====
=== Environment variables
The following environment variables must be set for the function to run properly
|===
| Name | Value | Notes
| `DATALAKE_LOCATION`
| `s3://my-bucket-name/databases/bronze/http`
| The `s3://` URL of the desired table to optimize.
| `AWS_S3_LOCKING_PROVIDER`
| `dynamodb`
| This instructs the `deltalake` crate to use DynamoDB for locking to provide consistent writes into s3.
| `OPTIMIZE_DS`
| `yesterday`
| Only apply optimizations to the `ds` partition (`YYYY-mm-dd`), the `yesterday` value will use the previous day UTC
|===
2023-03-27 05:53:10 +00:00
== Licensing
This repository is intentionally licensed under the link:https://www.gnu.org/licenses/agpl-3.0.en.html[AGPL 3.0]. If your organization is interested in re-licensing this function for re-use, contact me via email for commercial licensing terms: `rtyler@brokenco.de`