16 Commits

Author SHA1 Message Date
R Tyler Croy 114c1b6b51 Prevent duplicate column definitions showing up in the delta schema
In some scenarios BigQuery can inline a partition column in output
parquet files, so columns need to be deduplicated before
the initial commit on the table gets created

Sponsored-by: Scribd, Inc.
2023-12-12 13:07:28 -08:00
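
A minimal sketch of the kind of column deduplication this commit describes, assuming the first occurrence of a column name wins. The `Field` struct and `dedupe_fields` function are hypothetical stand-ins for illustration; the actual change presumably operates on the schema types of the `deltalake` crate.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a schema field (not the `deltalake` type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

/// Keep only the first field seen for each column name, preserving order.
fn dedupe_fields(fields: Vec<Field>) -> Vec<Field> {
    let mut seen = HashSet::new();
    fields
        .into_iter()
        // `insert` returns false when the name was already present
        .filter(|field| seen.insert(field.name.clone()))
        .collect()
}
```

Keeping the first occurrence is an assumption made here; a real implementation might prefer the partition column definition instead.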
R Tyler Croy 0197de233d Ignore s3:TestEvent in the SQS event processing pipeline
Fixes #8
2023-12-05 16:52:32 -08:00
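
As a rough illustration of the filtering this commit describes: S3 sends a test message containing `"Event":"s3:TestEvent"` when bucket notifications are configured, and that message carries no object records. The sketch below uses a crude substring check on the raw SQS message body; `is_s3_test_event` and `processable_bodies` are hypothetical names, and real code would more likely deserialize the JSON first.

```rust
/// Crude check for the S3 test notification, assuming the body is the raw
/// JSON text delivered through SQS. Whitespace is stripped so that
/// pretty-printed bodies match as well.
fn is_s3_test_event(body: &str) -> bool {
    let compact: String = body.chars().filter(|c| !c.is_whitespace()).collect();
    compact.contains("\"Event\":\"s3:TestEvent\"")
}

/// Drop test events and keep everything else for further processing.
fn processable_bodies<'a>(bodies: &[&'a str]) -> Vec<&'a str> {
    bodies
        .iter()
        .copied()
        .filter(|body| !is_s3_test_event(body))
        .collect()
}
```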
R Tyler Croy c2d6f27b0c On table creation modify the timestamp data type for simplicity's sake
The `deltalake` crate should likely be improved to avoid having issues
with Timestamps with millisecond precision since the protocol supports
them
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc)
but this unblocks behavior now. 🤔
2023-12-05 16:52:32 -08:00
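
A sketch of what coercing the timestamp type at table creation could look like. The commit does not say which type it rewrites to; microsecond precision is assumed here because that is what the Delta protocol's timestamp type stores. The `TimeUnit` and `DataType` enums are simplified, hypothetical stand-ins for the Arrow types.

```rust
/// Simplified stand-in for Arrow's time units (hypothetical, for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Simplified stand-in for a column data type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    Timestamp(TimeUnit),
}

/// Rewrite every timestamp to microsecond precision (an assumption) and
/// leave all other types untouched.
fn coerce_for_delta(data_type: DataType) -> DataType {
    match data_type {
        DataType::Timestamp(_) => DataType::Timestamp(TimeUnit::Microsecond),
        other => other,
    }
}
```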
R Tyler Croy a4e44c26c5 Behave a bit more defensively to avoid processing checkpoint parquet files
I am fairly confident that checkpoints are being ignored by this processing, but
I am not 100% certain so this defensive programming will help a bit
2023-11-13 12:10:09 -08:00
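
One way such a defensive check might look, assuming object keys are plain `/`-separated strings. Delta checkpoints live under `_delta_log/` and carry `.checkpoint.` in their file names, so the hypothetical `should_process` helper below rejects both; it is a sketch, not the check the commit actually added.

```rust
/// Return true only for parquet files that look like table data.
fn should_process(key: &str) -> bool {
    // Anything inside the transaction log directory is never table data.
    let in_delta_log = key.split('/').any(|segment| segment == "_delta_log");
    // Checkpoint files are named like `<version>.checkpoint.parquet` or
    // `<version>.checkpoint.<part>.<parts>.parquet`.
    let file_name = key.rsplit('/').next().unwrap_or(key);
    let is_checkpoint = file_name.contains(".checkpoint.");
    key.ends_with(".parquet") && !in_delta_log && !is_checkpoint
}
```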
R Tyler Croy e5af87ae6f Remove the redundant set of dynamodb locks and rely on deltalake's built-in
I believe the redundant set of locks is not serving enough utility and can be
set aside. I do not yet have enough confidence that the `convert` path behaves
correctly on a new table; that requires more testing.
2023-11-13 07:44:23 -08:00
R Tyler Croy 0e090415c8 Fix a number of minor clippy warnings
2023-11-12 13:48:48 -08:00
R Tyler Croy 2d8659b1cf Upgrade to the 0.15 version of the deltalake crate
This brings in newer arrow and datafusion dependencies with some other fixes
2023-09-07 23:48:09 -07:00
R Tyler Croy aa8f2df8f2 Url decoded keys are needed much earlier in the processing of events
This commit incorporates a hack for aws_lambda_events not filling out
url_decoded_key to make everything easier downstream of the event loop
2023-05-07 19:09:12 -07:00
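
For context, object keys in S3 event notifications arrive URL-encoded (for example `=` as `%3D` and spaces as `+`), which matters for hive-style paths. The following is a std-only sketch of decoding such a key early; `url_decode_key` is a hypothetical helper and not the hack the commit applies to `aws_lambda_events`.

```rust
/// Convert a single ASCII hex digit to its numeric value.
fn hex_val(byte: u8) -> Option<u8> {
    (byte as char).to_digit(16).map(|digit| digit as u8)
}

/// Decode `%XX` escapes and `+` (space) in an S3 notification key.
/// Malformed escapes are passed through unchanged.
fn url_decode_key(key: &str) -> String {
    let bytes = key.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'+' => {
                out.push(b' ');
                i += 1;
            }
            b'%' if i + 2 < bytes.len() => {
                match (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                    (Some(hi), Some(lo)) => {
                        out.push(hi * 16 + lo);
                        i += 3;
                    }
                    // Not a valid escape: keep the literal '%'
                    _ => {
                        out.push(b'%');
                        i += 1;
                    }
                }
            }
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    String::from_utf8_lossy(&out).into_owned()
}
```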
R Tyler Croy dbf5882a9b Introduce the main lambda functionality of creating or appending to a table
2023-05-07 16:21:46 -07:00
R Tyler Croy 455329c8c2 Scaffolding of the minimum terraform and lambda code to receive the bucket notifications
This is not functioning in the true sense of `oxbow` yet, but at least is
ready for test cycles with real use-cases in AWS
2023-05-07 14:18:54 -07:00
R Tyler Croy 0192d04f69 Add an integration test for validating all the golden tables
This currently fails because a parquet file's schema is not delta compatible
somehow:

thread 'test_all_tables' panicked at 'Failed to convert the schema for creating the table: SchemaError("Invalid data type for Delta Lake: Timestamp(Nanosecond, None)")', /usr/home/tyler/source/github/buoyant-data/oxbow/src/lib.rs:118:10

I have a hunch that this might be similar to delta-io/delta-rs#1286
2023-05-06 14:52:50 -07:00
R Tyler Croy 5df34ed5f3 Clean up some suggestions from clippy
2023-05-06 14:31:07 -07:00
R Tyler Croy 61e3e98a4b Support creating delta tables from storage with hive style partitioning schemes
2023-05-06 14:29:16 -07:00
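
Hive-style partitioning encodes partition values as `name=value` directory segments in the object path. A small sketch of extracting them is shown below; `hive_partitions` is a hypothetical helper written for illustration and makes no claim about how `oxbow` implements this.

```rust
/// Extract `(name, value)` partition pairs from the directory portion of a
/// hive-style key such as `events/ds=2023-01-01/hour=03/part-0000.parquet`.
fn hive_partitions(key: &str) -> Vec<(String, String)> {
    // Ignore the file name so an '=' in it is not mistaken for a partition.
    let dirs = match key.rsplit_once('/') {
        Some((dirs, _file)) => dirs,
        None => return Vec::new(),
    };
    dirs.split('/')
        .filter_map(|segment| segment.split_once('='))
        .filter(|(name, _)| !name.is_empty())
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect()
}
```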
R Tyler Croy 0055b693bc Sync the hive/ test data with the connectors repository
I forgot that I had removed the _delta_log/ originally when testing. I'll need
these to compare the results in the integration tests
2023-05-06 12:04:48 -07:00
R Tyler Croy b45f11f163 Add an integration test to perform the most simple validation of conversion
This replicates what I was doing in the command line and ensures that there
won't be regressions as I refactor now
2023-05-06 09:17:21 -07:00
R Tyler Croy b9bc10ec56 Add a slice of the golden data set from delta-io/connectors
2023-05-06 09:06:47 -07:00