Commit Graph

16 Commits

Author SHA1 Message Date
R Tyler Croy 13167d7b18 Refactor the build scripts to all point at the same origin 2024-04-21 17:25:01 +00:00
R Tyler Croy a5c501665e Handle odd timestamp types when doing schema evolution
This refactors the code for supporting goofy Airbyte-generated types
during both table creation and schema evolution.
2024-04-20 21:27:50 +00:00
R Tyler Croy 924cb6855b Implement rudimentary schema evolution based on parquet file schema discovery 2024-04-20 09:16:18 -07:00
R Tyler Croy 46a1c10835 Refactor commits to accept Actions directly and prepare for schema evolution
Using actions directly for the commit also ensures that adds and removes
happen in the same commit rather than in two separate commits, as was
done previously.
2024-04-20 08:17:48 -07:00
R Tyler Croy 6d1bc34b86 Introduce the bulk of sqs-ingest with some refactorings for the webhook
The webhook and sqs-ingest lambdas both effectively need to take strings
of data and append them to a configured Delta Lake table, so the shared
code moves "up" into the oxbow crate.
2024-04-08 08:40:05 -07:00
R Tyler Croy 137209d0de Enhance the webhook lambda to augment with a `ds` column for partitioning 2024-03-21 14:47:52 -07:00
R Tyler Croy a4da7ca032 Add a webhook lambda for appending JSONL 2024-03-12 11:10:42 -07:00
R Tyler Croy 13f88075e7 Introduce UNWRAP_SNS_ENVELOPE which allows SNS to be introduced upstream
In essence, the Oxbow and Auto-tag lambdas should still be triggered by
SQS, but to let them rely on the exact same bucket notifications, an
SNS topic must be configured upstream:

        S3 Event Notifications -> SNS -> Oxbow SQS -> Oxbow
                                   `---> Auto tag SQS -> Auto tag
2024-01-26 18:05:20 -08:00
R Tyler Croy 68fc9f7c98 Introduce the simple auto-tag Lambda for adding some tags for lifecycle policies
This will make it easier to set up lifecycle policies on parquet files
but not on the delta table itself.
2023-12-21 13:06:49 -08:00
R Tyler Croy b3f45b2b2d Handle ObjectRemoved:Delete events and translate those into Delta table removals
This change will handle deleted files correctly, but will also ensure
that removed files don't incorrectly show up as additions.

With this change, S3 Lifecycle configurations should _just work_ with
Delta tables.

Fixes #10
2023-12-18 11:49:04 -08:00
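
The event handling described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the oxbow implementation: `FileAction`, `action_for`, and `actions_for` are hypothetical names, and the sketch assumes events arrive as `(event name, object key)` pairs using the S3 notification event names (`ObjectCreated:*`, `ObjectRemoved:Delete`).

```rust
/// Hypothetical, simplified stand-in for a Delta log action.
/// This is not a type from the `deltalake` crate.
#[derive(Debug, PartialEq)]
enum FileAction {
    Add(String),
    Remove(String),
}

/// Map a single S3 event name and object key to a table action.
/// Unrecognized events produce no action at all, so a deletion can
/// never be mistaken for an addition.
fn action_for(event_name: &str, key: &str) -> Option<FileAction> {
    if event_name.starts_with("ObjectCreated:") {
        Some(FileAction::Add(key.to_string()))
    } else if event_name == "ObjectRemoved:Delete" {
        Some(FileAction::Remove(key.to_string()))
    } else {
        None
    }
}

/// Translate a batch of (event name, key) pairs into actions,
/// preserving the order in which the events were received.
fn actions_for(events: &[(&str, &str)]) -> Vec<FileAction> {
    events
        .iter()
        .filter_map(|(name, key)| action_for(name, key))
        .collect()
}
```

Returning `Option` keeps the "ignore what we do not understand" case explicit, which is one way to avoid treating every incoming event as an add.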
R Tyler Croy 114c1b6b51 Prevent duplicate column definitions showing up in the delta schema
In some scenarios BigQuery can inline a partition column in output
parquet files, so columns need to be deduplicated before the initial
commit on the table gets created.

Sponsored-by: Scribd, Inc.
2023-12-12 13:07:28 -08:00
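
As a rough sketch of the deduplication described in the commit above: the `Column` struct and `dedupe_columns` function below are hypothetical stand-ins rather than types from the `deltalake` crate, and the sketch assumes that the first definition of a column name is the one to keep.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    data_type: String,
}

/// Keep the first occurrence of each column name and drop later
/// duplicates, preserving the original column order.
fn dedupe_columns(columns: Vec<Column>) -> Vec<Column> {
    let mut seen = HashSet::new();
    columns
        .into_iter()
        // `insert` returns false when the name was already present.
        .filter(|column| seen.insert(column.name.clone()))
        .collect()
}
```

For example, a column list of `id`, `ds`, `ds` (the partition column appearing once from the file and once from the partitioning) would come back as `id`, `ds`.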
R Tyler Croy 0197de233d Ignore s3:TestEvent in the SQS event processing pipeline
Fixes #8
2023-12-05 16:52:32 -08:00
R Tyler Croy c2d6f27b0c On table creation modify the timestamp data type for simplicity's sake
The `deltalake` crate should likely be improved to avoid having issues
with Timestamps with millisecond precision since the protocol supports
them
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc)
but this unblocks behavior now. 🤔
2023-12-05 16:52:32 -08:00
R Tyler Croy 3e27d1c014 Implement the bulk of the group-events Lambda which will help sequence writes
This approach should help address some problems identified in [this blog
post](https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html).
In real-world scenarios lock acquisition timeouts will happen if a large sync
results in a substantial number of parquet files being dropped into the same S3
table prefix.

The simple oxbow deployment is:

    S3 Events -> SQS -> oxbow

This approach sequences events into a FIFO queue which should help avoid lock
contention:

    S3 Events -> SQS -> group-events -> SQS FIFO -> oxbow

The use of the table prefix as the message group ID ensures that the oxbow
lambda will not be invoked concurrently for the same table prefix.
2023-11-30 17:37:39 -08:00
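
The grouping idea in the commit above might look something like the sketch below. It is illustrative only: `table_prefix` and `group_by_table` are hypothetical helpers, and the sketch assumes that object keys follow a `<table prefix>/<column=value partitions>/<file>` layout. In a real deployment each group's key would be sent as the message group ID on the SQS FIFO queue.

```rust
use std::collections::BTreeMap;

/// Derive a table prefix from an object key by dropping the file name
/// and everything from the first hive-style `column=value` partition
/// directory onward. The key layout is an assumption of this sketch.
fn table_prefix(key: &str) -> String {
    let mut segments: Vec<&str> = key.split('/').collect();
    segments.pop(); // drop the file name
    segments
        .into_iter()
        .take_while(|segment| !segment.contains('='))
        .collect::<Vec<&str>>()
        .join("/")
}

/// Bucket object keys by table prefix. Each map key plays the role of a
/// message group ID, so all files for one table land in the same group.
fn group_by_table(keys: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for &key in keys {
        groups
            .entry(table_prefix(key))
            .or_default()
            .push(key.to_string());
    }
    groups
}
```

Because a FIFO queue delivers messages within a message group in order, keying the group on the table prefix is what serializes writes per table while still letting different tables proceed in parallel.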
R Tyler Croy 4d12f066db Pushing common code that will be needed in the grouping lambda into the shared crate 2023-11-27 17:36:56 -08:00
R Tyler Croy f5b7c98cd0 Restructure the workspace a bit more to pave the way for shared tooling 2023-11-24 15:09:42 -08:00