R Tyler Croy
bb5caee2f7
Introduce glue-create to handle initial table creation with AWS Athena
2024-04-21 23:07:14 +00:00
R Tyler Croy
d8cdbdaf95
Update the GitHub Action to include glue-sync
2024-04-21 22:36:32 +00:00
R Tyler Croy
30f65c4713
Introduce a glue table updating lambda which is triggered by S3 Event Notifications
...
This introduces some logic to attempt to add additional columns which
appear on the Delta table into the Glue Data Catalog schema.
Unfortunately AWS Glue Data Catalog is eventually consistent so
operations are not immediate.
Gee whiz is this kind of a pain to test.
2024-04-21 22:33:13 +00:00
R Tyler Croy
9d88d396f1
Add the scaffolding for glue-sync
2024-04-21 17:29:52 +00:00
R Tyler Croy
13167d7b18
Refactor the build scripts to all point at the same origin
2024-04-21 17:25:01 +00:00
R Tyler Croy
f922e93d30
Add a simple Jenkinsfile
2024-04-21 16:59:56 +00:00
R Tyler Croy
a5c501665e
Handle odd timestamp types when doing schema evolution
...
This refactors the code for supporting goofy Airbyte-generated types
on both table create and schema evolution
2024-04-20 21:27:50 +00:00
R Tyler Croy
924cb6855b
Implement rudimentary schema evolution based on parquet file schema discovery
2024-04-20 09:16:18 -07:00
R Tyler Croy
46a1c10835
Refactor commits to accept Actions directly and prepare for schema evolution
...
Using actions directly for the commit also ensures that adds and removes
happen in the same commit rather than the two separate commits as was
done prior.
2024-04-20 08:17:48 -07:00
R Tyler Croy
795f095094
Enable the group-events lambda to use UNWRAP_SNS_ENVELOPE like the others
2024-04-10 15:27:01 -07:00
R Tyler Croy
24edaba873
Add the necessary github workflow code for release of sqs-ingest
2024-04-08 12:30:43 -07:00
R Tyler Croy
6d1bc34b86
Introduce the bulk of sqs-ingest with some refactorings for the webhook
...
The webhook and sqs-ingest lambdas both effectively need to take strings
of data and append them to a configured Delta Lake table, so the shared
code comes "up" into the oxbow crate
2024-04-08 08:40:05 -07:00
R Tyler Croy
84eba62314
Create the scaffolding for sqs-ingest
...
Similar to kafka-delta-ingest, but just for SQS!
2024-04-05 15:43:01 -07:00
R Tyler Croy
a39e49697a
Update the webhook documentation with important settings for use
2024-04-05 15:42:43 -07:00
R Tyler Croy
85f6b1fc22
Properly create checkpoints on every 10th commit.
...
Missed some logic previously in my haste!
2024-04-02 15:57:35 -07:00
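
The "every 10th commit" rule in the commit above can be sketched as a small predicate. This is only an illustration, not the code from the commit: the names `CHECKPOINT_INTERVAL` and `should_checkpoint` are hypothetical, and skipping version 0 is an assumption.

```rust
/// Number of commits between checkpoints. The value 10 comes from the
/// commit subject; the constant name is hypothetical.
const CHECKPOINT_INTERVAL: i64 = 10;

/// Returns true when a checkpoint should be written for the table
/// `version` that was just committed.
///
/// Assumption: version 0 (the initial table creation) is skipped.
fn should_checkpoint(version: i64) -> bool {
    version > 0 && version % CHECKPOINT_INTERVAL == 0
}
```

A caller would evaluate `should_checkpoint(version)` right after each successful commit, so versions 10, 20, 30 and so on produce a checkpoint.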
R Tyler Croy
cc9cdc299f
Properly checkpoint writes via the webhook lambda
...
Missed a spot in my haste last week, oops!
2024-04-01 11:12:32 -07:00
R Tyler Croy
137209d0de
Enhance the webhook lambda to augment with a `ds` column for partitioning
2024-03-21 14:47:52 -07:00
R Tyler Croy
c751641ef4
Update the version and add some useful debug flags
2024-03-20 14:49:01 -07:00
R Tyler Croy
85661aeb7e
Add release build for github actions on the webhook lambda
2024-03-12 11:25:39 -07:00
R Tyler Croy
f163abb52e
Merge pull request #19 from buoyant-data/webhook-support
...
Add a webhook lambda for appending JSONL
2024-03-12 11:11:41 -07:00
R Tyler Croy
a4da7ca032
Add a webhook lambda for appending JSONL
2024-03-12 11:10:42 -07:00
R Tyler Croy
145d1109e1
Properly url decode keys for the auto-tag lambda
2024-02-06 16:28:29 -08:00
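
Object keys in S3 Event Notifications arrive URL-encoded (a space is sent as `+`, and other reserved bytes as `%XX`), which is why the keys need decoding before use. A minimal standard-library sketch of that decoding follows; the function names are hypothetical, and the actual lambda may well rely on a crate for this instead.

```rust
/// Convert one ASCII hex digit to its numeric value.
fn hex_value(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Decode an object key as it appears in an S3 event notification.
/// `+` becomes a space and `%XX` becomes the byte 0xXX. Malformed
/// escapes are kept literally rather than rejected (an assumption).
fn decode_s3_key(raw: &str) -> String {
    let bytes = raw.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_value(bytes[i + 1]), hex_value(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out.push(if b == b'+' { b' ' } else { b });
        i += 1;
    }
    // Decoded bytes may form multi-byte UTF-8 sequences; invalid ones
    // are replaced instead of causing a panic.
    String::from_utf8_lossy(&out).into_owned()
}
```

Partitioned tables are the common case where this matters, since the `=` in a `ds=2024-02-06` path segment is delivered as `%3D`.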
R Tyler Croy
805aef3854
Merge pull request #17 from buoyant-data/sns-envelope
...
Introduce UNWRAP_SNS_ENVELOPE which allows SNS to be introduced upstream
2024-01-26 19:31:04 -08:00
R Tyler Croy
13f88075e7
Introduce UNWRAP_SNS_ENVELOPE which allows SNS to be introduced upstream
...
In essence the Oxbow and Auto-tag lambda should still be triggered by
SQS, but in order to allow them to rely on the same exact bucket
notifications an SNS topic must be configured upstream.
S3 Event Notifications -> SNS -> Oxbow SQS -> Oxbow
                           `---> Auto tag SQS -> Auto tag
2024-01-26 18:05:20 -08:00
R Tyler Croy
15c03540b2
Merge pull request #15 from buoyant-data/auto-tagging
...
Introduce the simple auto-tag Lambda for adding some tags for lifecyc…
2023-12-21 13:13:41 -08:00
R Tyler Croy
ac59e4edc3
Update release workflow
2023-12-21 13:13:12 -08:00
R Tyler Croy
68fc9f7c98
Introduce the simple auto-tag Lambda for adding some tags for lifecycle policies
...
This will make it easier to set up lifecycle policies on parquet files
but not on the delta table itself.
2023-12-21 13:06:49 -08:00
R Tyler Croy
1875ad1ed6
Merge pull request #13 from buoyant-data/handle-delete-events-10
...
Handle ObjectRemoved:Delete events and translate those into Delta table removals
2023-12-18 12:21:34 -08:00
R Tyler Croy
0613a6b9d8
Make the release supporting deletions count as 0.9.0
2023-12-18 12:15:22 -08:00
R Tyler Croy
b3f45b2b2d
Handle ObjectRemoved:Delete events and translate those into Delta table removals
...
This change will handle deleted files correctly, but will also ensure
that removed files don't incorrectly show up as additions.
With this change S3 LifeCycle configurations should _just work_ with
Delta tables
Fixes #10
2023-12-18 11:49:04 -08:00
R Tyler Croy
6c34f4ed81
Bump to 0.8.4
2023-12-12 15:52:30 -08:00
R Tyler Croy
4eb1068aca
Bump version for next release
2023-12-12 13:21:05 -08:00
R Tyler Croy
9205ca8d77
Merge pull request #12 from buoyant-data/duplicate-ds-columns
...
Prevent duplicate columns in the schema when partitions are present in parquet files
2023-12-12 13:15:48 -08:00
R Tyler Croy
114c1b6b51
Prevent duplicate column definitions showing up in the delta schema
...
In some scenarios BigQuery can inline a partition column in output
parquet files, so columns need to be deduplicated before the initial
commit on the table gets created
Sponsored-by: Scribd, Inc.
2023-12-12 13:07:28 -08:00
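
The deduplication described in the commit above could look roughly like the following sketch. It is not the code from the commit: the function name and the `(name, type)` tuple representation are hypothetical, and keeping the first occurrence of a name is an assumption.

```rust
use std::collections::HashSet;

/// Drop repeated column names, keeping the first occurrence and the
/// original column order. Each column is a hypothetical
/// `(name, data_type)` pair.
fn dedupe_columns<'a>(columns: &[(&'a str, &'a str)]) -> Vec<(&'a str, &'a str)> {
    let mut seen = HashSet::new();
    columns
        .iter()
        .copied()
        // `insert` returns false when the name was already present.
        .filter(|&(name, _)| seen.insert(name))
        .collect()
}
```

With a schema where `ds` appears once from the parquet file and once from the partition path, only a single `ds` column survives.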
R Tyler Croy
bfac6c2dd0
Switch the DynamoDB provisioning to on-demand for examples
...
No sense in these being provisioned
2023-12-12 10:31:24 -08:00
R Tyler Croy
6448997080
Clean up the version inclusion in the release artifacts
2023-12-05 17:05:29 -08:00
R Tyler Croy
292cda1e2a
Update the release workflow and clean up the README for a 0.8.1 release
2023-12-05 16:58:55 -08:00
R Tyler Croy
0197de233d
Ignore s3:TestEvent in the SQS event processing pipeline
...
Fixes #8
2023-12-05 16:52:32 -08:00
R Tyler Croy
c2d6f27b0c
On table creation modify the timestamp data type for simplicity's sake
...
The `deltalake` crate should likely be improved to avoid having issues
with Timestamps with millisecond precision since the protocol supports
them
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc)
but this unblocks behavior now. 🤔
2023-12-05 16:52:32 -08:00
R Tyler Croy
03caba50f8
Shorten the deployment target
2023-12-04 08:01:49 -08:00
R Tyler Croy
66fa770e93
Ensure that the group-id is always valid for SQS
...
S3 URLs that exceed the maximum length of a group id (128) cause
problems
2023-12-04 08:00:50 -08:00
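
One way to keep a group id within the SQS limits (at most 128 characters, alphanumeric or ASCII punctuation) is to fall back to a digest when the table URL does not qualify. The sketch below is an assumption about the approach, not the code from the commit, and the function names are hypothetical. It uses the standard library's `DefaultHasher` only to stay dependency-free; its output is 64 bits wide and not guaranteed stable across Rust releases, so a real deployment would more likely use a cryptographic digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maximum length SQS accepts for a MessageGroupId.
const MAX_GROUP_ID_LEN: usize = 128;

/// True when `candidate` is non-empty, at most 128 characters, and made
/// only of ASCII alphanumerics and punctuation.
fn is_valid_group_id(candidate: &str) -> bool {
    !candidate.is_empty()
        && candidate.len() <= MAX_GROUP_ID_LEN
        && candidate
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b.is_ascii_punctuation())
}

/// Use the table URL as the group id when it is already valid,
/// otherwise substitute a fixed-width hex digest of it.
fn group_id_for(table_url: &str) -> String {
    if is_valid_group_id(table_url) {
        return table_url.to_string();
    }
    let mut hasher = DefaultHasher::new();
    table_url.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}
```

The same URL always maps to the same id within one build, which is what matters for keeping all events of a table in one message group.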
R Tyler Croy
45584233ba
Introduce the meat of the group-events function
...
This helps in situations where singular tables are receiving a large influx of
events
2023-12-01 21:12:44 -08:00
R Tyler Croy
3e27d1c014
Implement the bulk of the group-events Lambda which will help sequence writes
...
This approach should help address some problems identified in [this blog
post](https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html).
In real-world scenarios lock acquisition timeouts will happen if a large sync
results in a substantial number of parquet files being dropped into the same S3
table prefix.
The simple oxbow deployment is:
S3 Events -> SQS -> oxbow
This approach sequences events into a FIFO queue which should help avoid lock
contention:
S3 Events -> SQS -> group-events -> SQS FIFO -> oxbow
The use of the table prefix as the message group ID ensures that the oxbow
lambda will not be invoked concurrently for the table prefix
2023-11-30 17:37:39 -08:00
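
The grouping step described in the commit above can be sketched as follows: derive a table prefix from each object key, then bucket keys by that prefix so each bucket can be sent with the prefix as its message group ID. How the real lambda derives the prefix is not stated in the log, so `table_prefix` here is a hypothetical rule (drop the file name and any trailing hive-style `name=value` partition segments), as is `group_by_table`.

```rust
use std::collections::BTreeMap;

/// Hypothetical rule for finding the table prefix of an object key:
/// remove the file name, then any trailing `name=value` segments.
fn table_prefix(key: &str) -> String {
    let mut parts: Vec<&str> = key.split('/').collect();
    parts.pop(); // the file name
    while parts.last().map_or(false, |p| p.contains('=')) {
        parts.pop();
    }
    parts.join("/")
}

/// Bucket object keys by table prefix, preserving arrival order within
/// each bucket. Each map key would serve as the message group ID.
fn group_by_table<'a>(keys: &[&'a str]) -> BTreeMap<String, Vec<&'a str>> {
    let mut groups: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &key in keys {
        groups.entry(table_prefix(key)).or_default().push(key);
    }
    groups
}
```

Because a FIFO queue delivers messages of one group in order, files landing under the same table prefix are then processed one after another rather than concurrently.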
R Tyler Croy
d379600461
Introduce a simple GNU/Makefile to make common development tasks easier
2023-11-27 17:55:13 -08:00
R Tyler Croy
e9e7f82ca3
Fix race on lock acquisition
...
Was not paying attention to the dynamodb-lock-rs documentation when I originally
created try_acquire_lock() which does _not_ have the retry behavior. This means
that if a lock is taken by another invocation it will fail the function and
result in DLQ'ing messages unnecessarily
2023-11-27 17:36:56 -08:00
R Tyler Croy
4d12f066db
Pushing common code that will be needed in the grouping lambda into the shared crate
2023-11-27 17:36:56 -08:00
R Tyler Croy
f5b7c98cd0
Restructure the workspace a bit more to pave the way for shared tooling
2023-11-24 15:09:42 -08:00
R Tyler Croy
ff295e05af
Restructure project to be a workspace
...
This makes way for building another lambda or two in here.
2023-11-22 23:12:00 -08:00
R Tyler Croy
b3119e7b7e
Fix missing string formatting for the lock key
...
The lack of proper formatting here leads to unnecessary lock contention in high
concurrency setups
2023-11-22 21:15:50 -08:00
R Tyler Croy
9f0867488e
Upgrade to the latest deltalake 0.16 release
2023-11-15 08:20:14 -08:00