Commit Graph

127 Commits

Author SHA1 Message Date
R Tyler Croy bb5caee2f7 Introduce glue-create to handle initial table creation with AWS Athena 2024-04-21 23:07:14 +00:00
R Tyler Croy d8cdbdaf95 Update the GitHub Action to include glue-sync 2024-04-21 22:36:32 +00:00
R Tyler Croy 30f65c4713 Introduce a glue table updating lambda which is triggered by S3 Event Notifications
This introduces some logic to attempt to add additional columns which
appear on the Delta table into the Glue Data Catalog schema.

Unfortunately AWS Glue Data Catalog is eventually consistent so
operations are not immediate.

Gee whiz is this kind of a pain to test.
2024-04-21 22:33:13 +00:00
R Tyler Croy 9d88d396f1 Add the scaffolding for glue-sync 2024-04-21 17:29:52 +00:00
R Tyler Croy 13167d7b18 Refactor the build scripts to all point at the same origin 2024-04-21 17:25:01 +00:00
R Tyler Croy f922e93d30 Add a simple Jenkinsfile 2024-04-21 16:59:56 +00:00
R Tyler Croy a5c501665e Handle odd timestamp types when doing schema evolution
This refactors the code for supporting goofy Airbyte-generated types
on both table create and schema evolution
2024-04-20 21:27:50 +00:00
R Tyler Croy 924cb6855b Implement rudimentary schema evolution based on parquet file schema discovery 2024-04-20 09:16:18 -07:00
R Tyler Croy 46a1c10835 Refactor commits to accept Actions directly and prepare for schema evolution
Using actions directly for the commit also ensures that adds and removes
happen in the same commit rather than the two separate commits as was
done prior.
2024-04-20 08:17:48 -07:00
R Tyler Croy 795f095094 Enable the group-events lambda to use UNWRAP_SNS_ENVELOPE like the others 2024-04-10 15:27:01 -07:00
R Tyler Croy 24edaba873 Add the necessary github workflow code for release of sqs-ingest 2024-04-08 12:30:43 -07:00
R Tyler Croy 6d1bc34b86 Introduce the bulk of sqs-ingest with some refactorings for the webhook
The webhook and sqs-ingest lambdas both effectively need to take strings
of data and append them to a configured Delta Lake table, so the shared
code comes "up" into the oxbow crate
2024-04-08 08:40:05 -07:00
R Tyler Croy 84eba62314 Create the scaffolding for sqs-ingest
Similar to kafka-delta-ingest, but just for SQS!
2024-04-05 15:43:01 -07:00
R Tyler Croy a39e49697a Update the webhook documentation with important settings for use 2024-04-05 15:42:43 -07:00
R Tyler Croy 85f6b1fc22 Properly create checkpoints on every 10th commit.
Missed some logic previously in my haste!
2024-04-02 15:57:35 -07:00
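
The "every 10th commit" rule above can be sketched as a small predicate. This is only an illustration: the constant name, the function name, and the choice to skip version 0 are assumptions, not the code from this commit.

```rust
/// Hypothetical interval, taken from the "every 10th commit" wording.
const CHECKPOINT_INTERVAL: i64 = 10;

/// Returns true when the given table version should get a checkpoint.
/// Skipping version 0 is an assumption made for this sketch.
fn should_checkpoint(version: i64) -> bool {
    version > 0 && version % CHECKPOINT_INTERVAL == 0
}
```

A caller would check `should_checkpoint(version)` right after a successful commit and write the checkpoint only when it returns true.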
R Tyler Croy cc9cdc299f Properly checkpoint writes via the webhook lambda
Missed a spot in my haste last week, oops!
2024-04-01 11:12:32 -07:00
R Tyler Croy 137209d0de Enhance the webhook lambda to augment with a `ds` column for partitioning 2024-03-21 14:47:52 -07:00
R Tyler Croy c751641ef4 Update the version and add some useful debug flags 2024-03-20 14:49:01 -07:00
R Tyler Croy 85661aeb7e Add release build for github actions on the webhook lambda 2024-03-12 11:25:39 -07:00
R Tyler Croy f163abb52e
Merge pull request #19 from buoyant-data/webhook-support
Add a webhook lambda for appending JSONL
2024-03-12 11:11:41 -07:00
R Tyler Croy a4da7ca032 Add a webhook lambda for appending JSONL 2024-03-12 11:10:42 -07:00
R Tyler Croy 145d1109e1 Properly url decode keys for the auto-tag lambda 2024-02-06 16:28:29 -08:00
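
S3 event notifications deliver object keys in URL-encoded form, so a partitioned key such as `ds=2024-02-06` arrives as `ds%3D2024-02-06` and spaces arrive as `+`. The following is a minimal standard-library sketch of the decoding step; the function names are hypothetical and the actual fix may well use a URL-decoding crate instead.

```rust
/// Convert one ASCII hex digit to its value, if it is one.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Decode a URL-encoded S3 object key: `+` becomes a space and `%XX`
/// becomes the byte 0xXX. Malformed escapes are passed through unchanged.
fn decode_s3_key(encoded: &str) -> String {
    let bytes = encoded.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'+' {
            out.push(b' ');
            i += 1;
        } else if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
            } else {
                // Not a valid escape: keep the literal percent sign.
                out.push(b'%');
                i += 1;
            }
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    // Invalid UTF-8 sequences are replaced rather than causing a failure.
    String::from_utf8_lossy(&out).into_owned()
}
```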
R Tyler Croy 805aef3854
Merge pull request #17 from buoyant-data/sns-envelope
Introduce UNWRAP_SNS_ENVELOPE which allows SNS to be introduced upstream
2024-01-26 19:31:04 -08:00
R Tyler Croy 13f88075e7 Introduce UNWRAP_SNS_ENVELOPE which allows SNS to be introduced upstream
In essence the Oxbow and Auto-tag lambda should still be triggered by
SQS, but in order to allow them to rely on the same exact bucket
notifications an SNS topic must be configured upstream.

        S3 Event Notifications -> SNS -> Oxbow SQS -> Oxbow
                                   `---> Auto tag SQS -> Auto tag
2024-01-26 18:05:20 -08:00
R Tyler Croy 15c03540b2
Merge pull request #15 from buoyant-data/auto-tagging
Introduce the simple auto-tag Lambda for adding some tags for lifecyc…
2023-12-21 13:13:41 -08:00
R Tyler Croy ac59e4edc3 Update release workflow 2023-12-21 13:13:12 -08:00
R Tyler Croy 68fc9f7c98 Introduce the simple auto-tag Lambda for adding some tags for lifecycle policies
This will make it easier to set up lifecycle policies on parquet files
but not on the delta table itself.
2023-12-21 13:06:49 -08:00
R Tyler Croy 1875ad1ed6
Merge pull request #13 from buoyant-data/handle-delete-events-10
Handle ObjectRemoved:Delete events and translate those into Delta table removals
2023-12-18 12:21:34 -08:00
R Tyler Croy 0613a6b9d8 Make the release supporting deletions count as 0.9.0 2023-12-18 12:15:22 -08:00
R Tyler Croy b3f45b2b2d Handle ObjectRemoved:Delete events and translate those into Delta table removals
This change will handle deleted files correctly, but will also ensure
that removed files don't incorrectly show up as additions.

With this change S3 LifeCycle configurations should _just work_ with
Delta tables

Fixes #10
2023-12-18 11:49:04 -08:00
R Tyler Croy 6c34f4ed81 Bump to 0.8.4 2023-12-12 15:52:30 -08:00
R Tyler Croy 4eb1068aca Bump version for next release 2023-12-12 13:21:05 -08:00
R Tyler Croy 9205ca8d77
Merge pull request #12 from buoyant-data/duplicate-ds-columns
Prevent duplicate columns in the schema when partitions are present in parquet files
2023-12-12 13:15:48 -08:00
R Tyler Croy 114c1b6b51 Prevent duplicate column definitions showing up in the delta schema
In some scenarios BigQuery can inline a partition column in output
parquet files and some deduplication needs to happen on columns before
the initial commit on the table gets created

Sponsored-by: Scribd, Inc.
2023-12-12 13:07:28 -08:00
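
The deduplication described above can be illustrated with column names alone. This sketch assumes that the first definition of a column wins and ignores column types, which the real schema handling would have to carry along; `dedupe_columns` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Keep the first occurrence of each column name, preserving order.
/// "First one wins" is an assumption made for this sketch.
fn dedupe_columns<'a>(columns: &[&'a str]) -> Vec<&'a str> {
    let mut seen: HashSet<&str> = HashSet::new();
    columns
        .iter()
        .copied()
        // `insert` returns false when the name was already seen.
        .filter(|name| seen.insert(*name))
        .collect()
}
```

For example, a parquet schema of `id, ds, name` combined with a partition column `ds` would reduce to `id, ds, name`.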
R Tyler Croy bfac6c2dd0 Switch the DynamoDB provisioning to on-demand for examples
No sense in these being provisioned
2023-12-12 10:31:24 -08:00
R Tyler Croy 6448997080 Clean up the version inclusion in the release artifacts 2023-12-05 17:05:29 -08:00
R Tyler Croy 292cda1e2a Update the release workflow and clean up the README for a 0.8.1 release 2023-12-05 16:58:55 -08:00
R Tyler Croy 0197de233d Ignore s3:TestEvent in the SQS event processing pipeline
Fixes #8
2023-12-05 16:52:32 -08:00
R Tyler Croy c2d6f27b0c On table creation modify the timestamp data type for simplicity's sake
The `deltalake` crate should likely be improved to avoid having issues
with Timestamps with millisecond precision since the protocol supports
them
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc)
but this unblocks behavior now. 🤔
2023-12-05 16:52:32 -08:00
R Tyler Croy 03caba50f8 Shorten the deployment target 2023-12-04 08:01:49 -08:00
R Tyler Croy 66fa770e93 Ensure that the group-id is always valid for SQS
S3 URLs that exceed the maximum length of a group id (128) cause
problems
2023-12-04 08:00:50 -08:00
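
One way to keep a group id within the 128-character limit mentioned above is to fall back to a fixed-length hash when the table URL is too long. This is a sketch of that idea only; the commit may truncate or hash differently, and `group_id_for` and `MAX_GROUP_ID_LEN` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Length limit from the commit message above.
const MAX_GROUP_ID_LEN: usize = 128;

/// Return the table URL itself when it fits, otherwise a 16-character
/// hex digest of it. Note: `DefaultHasher` output is not guaranteed to be
/// stable across Rust releases, so a real deployment would likely prefer
/// a fixed algorithm such as SHA-256.
fn group_id_for(table_url: &str) -> String {
    if table_url.len() <= MAX_GROUP_ID_LEN {
        return table_url.to_string();
    }
    let mut hasher = DefaultHasher::new();
    table_url.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}
```

Hashing rather than truncating keeps two long URLs that share a common prefix from collapsing into the same group.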
R Tyler Croy 45584233ba Introduce the meat of the group-events function
This helps in situations where singular tables are receiving a large influx of
events
2023-12-01 21:12:44 -08:00
R Tyler Croy 3e27d1c014 Implement the bulk of the group-events Lambda which will help sequence writes
This approach should help address some problems identified in [this blog
post](https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html).
In real-world scenarios lock acquisition timeouts will happen if a large sync
results in a substantial number of parquet files being dropped into the same S3
table prefix.

The simple oxbow deployment is:

    S3 Events -> SQS -> oxbow

This approach sequences events into a FIFO queue which should help avoid lock
contention:

    S3 Events -> SQS -> group-events -> SQS FIFO -> oxbow

The use of the table prefix as the message group ID ensures that the oxbow
lambda will not be invoked concurrently for the table prefix
2023-11-30 17:37:39 -08:00
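
To illustrate how a table prefix could serve as the message group ID, the sketch below derives the prefix from an object key. It assumes Hive-style partition directories (`name=value`) beneath the table root; that layout and the name `table_prefix` are assumptions, not the logic from this commit.

```rust
/// Derive a table prefix from an S3 object key by dropping the file name
/// and any Hive-style partition segments such as `ds=2023-11-30`.
fn table_prefix(key: &str) -> String {
    let mut segments: Vec<&str> = key.split('/').collect();
    // The last segment is the file name, not part of the table path.
    segments.pop();
    let table: Vec<&str> = segments
        .into_iter()
        .take_while(|segment| !segment.contains('='))
        .collect();
    table.join("/")
}
```

Keys from different partitions of one table map to the same prefix, so they would land in the same FIFO message group and be handled one at a time.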
R Tyler Croy d379600461 Introduce a simple GNU/Makefile to make common development tasks easier 2023-11-27 17:55:13 -08:00
R Tyler Croy e9e7f82ca3 Fix race on lock acquisition
Was not paying attention to the dynamodb-lock-rs documentation when I originally
created try_acquire_lock() which does _not_ have the retry behavior. This means
that if a lock is taken by another invocation it will fail the function and
result in DLQ'ing messages unnecessarily
2023-11-27 17:36:56 -08:00
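
The difference between a single lock attempt and one with retries can be sketched generically. The closure below stands in for a non-retrying acquire call; it does not use the dynamodb-lock-rs API, and the function name, attempt count, and fixed delay are all assumptions for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `try_acquire` up to `attempts` times, sleeping `delay` between
/// failed attempts. Returns the lock if any attempt succeeds.
fn acquire_with_retry<L, F>(mut try_acquire: F, attempts: u32, delay: Duration) -> Option<L>
where
    F: FnMut() -> Option<L>,
{
    for attempt in 0..attempts {
        if let Some(lock) = try_acquire() {
            return Some(lock);
        }
        // Do not sleep after the final failed attempt.
        if attempt + 1 < attempts {
            sleep(delay);
        }
    }
    None
}
```

With a wrapper like this, a lock held briefly by another invocation delays the function instead of failing it and sending the message to the dead-letter queue.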
R Tyler Croy 4d12f066db Pushing common code that will be needed in the grouping lambda into the shared crate 2023-11-27 17:36:56 -08:00
R Tyler Croy f5b7c98cd0 Restructure the workspace a bit more to pave the way for shared tooling 2023-11-24 15:09:42 -08:00
R Tyler Croy ff295e05af Restructure project to be a workspace
This makes way for building another lambda or two in here.
2023-11-22 23:12:00 -08:00
R Tyler Croy b3119e7b7e Fix missing string formatting for the lock key
The lack of proper formatting here leads to unnecessary lock contention in high
concurrency setups
2023-11-22 21:15:50 -08:00
R Tyler Croy 9f0867488e Upgrade to the latest deltalake 0.16 release 2023-11-15 08:20:14 -08:00