Using actions directly for the commit also ensures that adds and removes
happen in the same commit, rather than in two separate commits as was
done previously.
The webhook and sqs-ingest lambdas both effectively need to take strings
of data and append them to a configured Delta Lake table, so the shared
code moves "up" into the oxbow crate.
In essence the Oxbow and Auto-tag lambdas should still be triggered by
SQS, but in order to allow them to rely on the exact same bucket
notifications, an SNS topic must be configured upstream:
S3 Event Notifications -> SNS -> Oxbow SQS -> Oxbow
                           `---> Auto tag SQS -> Auto tag
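One consequence of the fanout above: unless raw message delivery is enabled on the SNS subscriptions, each lambda's SQS record body carries an SNS envelope with the original S3 event nested inside the "Message" field. A minimal sketch of unwrapping it (the payload shown is illustrative, not a real event):

```python
import json

def unwrap_s3_event(sqs_body: str) -> dict:
    """Extract the original S3 event notification from an SQS record
    whose queue is subscribed to an SNS topic. Without raw message
    delivery, the event is nested in the envelope's "Message" field."""
    envelope = json.loads(sqs_body)
    return json.loads(envelope["Message"])

# Illustrative SQS body: an SNS envelope wrapping an S3 event notification
body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "Records": [{"eventName": "ObjectCreated:Put",
                     "s3": {"object": {"key": "tables/foo/part-0001.parquet"}}}]
    }),
})
event = unwrap_s3_event(body)
print(event["Records"][0]["s3"]["object"]["key"])
# -> tables/foo/part-0001.parquet
```

Enabling raw message delivery on both subscriptions would avoid the double-decode entirely, at the cost of losing the SNS metadata.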
This change will handle deleted files correctly, but will also ensure
that removed files don't incorrectly show up as additions.
With this change S3 Lifecycle configurations should _just work_ with
Delta tables.
Fixes #10
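The core of the fix can be sketched as a classification step: map the incoming S3 event name onto the Delta action it should produce, treating deletions (including lifecycle expirations) as `remove` actions rather than silently mishandling them. The function name is illustrative, not oxbow's actual API:

```python
def classify_event(event_name: str) -> str:
    """Map an S3 event name onto the Delta action it should produce.
    ObjectRemoved and LifecycleExpiration events become 'remove'
    actions; ObjectCreated events become 'add' actions; anything
    else is ignored."""
    if event_name.startswith(("ObjectRemoved:", "LifecycleExpiration:")):
        return "remove"
    if event_name.startswith("ObjectCreated:"):
        return "add"
    return "ignore"

print(classify_event("ObjectCreated:Put"))           # -> add
print(classify_event("ObjectRemoved:Delete"))        # -> remove
print(classify_event("LifecycleExpiration:Delete"))  # -> remove
```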
In some scenarios BigQuery can inline a partition column in output
parquet files, and some deduplication needs to happen on columns before
the initial commit on the table is created.
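The deduplication itself is straightforward: drop repeated column names while preserving first-seen order, so the inlined partition column does not appear twice in the initial schema. A minimal sketch (the function name is hypothetical):

```python
def dedupe_columns(columns: list[str]) -> list[str]:
    """Drop duplicate column names while preserving first-seen order,
    e.g. when BigQuery inlines a partition column that is already
    present in the file's schema."""
    seen = set()
    out = []
    for name in columns:
        if name not in seen:
            seen.add(name)
            out.append(name)
    return out

# 'ds' appears both as a data column and as the inlined partition column
print(dedupe_columns(["id", "ds", "value", "ds"]))
# -> ['id', 'ds', 'value']
```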
Sponsored-by: Scribd, Inc.
This approach should help address some problems identified in [this blog
post](https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html).
In real-world scenarios lock acquisition timeouts will happen if a large sync
results in a substantial number of parquet files being dropped into the same S3
table prefix.
The simple oxbow deployment is:
S3 Events -> SQS -> oxbow
This approach sequences events into a FIFO queue which should help avoid lock
contention:
S3 Events -> SQS -> group-events -> SQS FIFO -> oxbow
The use of the table prefix as the message group ID ensures that the oxbow
lambda will not be invoked concurrently for the same table prefix.
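One way the group-events step could derive that message group ID is to truncate the object key back to the table root, e.g. by cutting at the first Hive-style partition segment or, failing that, at the file's parent prefix. This is a sketch of the idea, not the actual group-events implementation:

```python
def message_group_id(object_key: str) -> str:
    """Derive a FIFO message group ID from an S3 object key by taking
    the table prefix: everything before the first Hive-style partition
    segment (a path component containing '='), otherwise the parent
    prefix of the file itself."""
    parts = object_key.split("/")
    for i, part in enumerate(parts):
        if "=" in part:
            return "/".join(parts[:i])
    return "/".join(parts[:-1])

print(message_group_id("databases/raw/events/ds=2023-11-27/part-0001.parquet"))
# -> databases/raw/events
print(message_group_id("databases/raw/users/part-0001.parquet"))
# -> databases/raw/users
```

With this group ID, SQS FIFO guarantees in-order, one-at-a-time delivery per table prefix, so two syncs into the same table can no longer race for the lock, while different tables still process in parallel.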