R Tyler Croy
114c1b6b51
Prevent duplicate column definitions showing up in the delta schema
...
In some scenarios BigQuery can inline a partition column in its output
parquet files, so columns need to be deduplicated before the initial
commit on the table is created.
Sponsored-by: Scribd, Inc.
2023-12-12 13:07:28 -08:00
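The deduplication described in this commit can be sketched roughly as follows. This is a minimal illustration only, not the code from the commit: the function name `dedupe_columns` and the modeling of columns as `(name, type)` string pairs are assumptions made for the example, and the real change presumably operates on the schema types used by the `deltalake` crate.

```rust
use std::collections::HashSet;

/// Keep the first occurrence of each column name and preserve the original order.
/// Columns are modeled as (name, type) string pairs purely for illustration.
pub fn dedupe_columns(columns: Vec<(String, String)>) -> Vec<(String, String)> {
    let mut seen = HashSet::new();
    columns
        .into_iter()
        // `insert` returns false when the name was already present,
        // so later duplicates are filtered out.
        .filter(|(name, _)| seen.insert(name.clone()))
        .collect()
}
```

Keeping the first occurrence is a design choice for the sketch; whether the inlined partition column or the declared one should win is not stated in the commit message.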
R Tyler Croy
0197de233d
Ignore s3:TestEvent in the SQS event processing pipeline
...
Fixes #8
2023-12-05 16:52:32 -08:00
R Tyler Croy
c2d6f27b0c
On table creation modify the timestamp data type for simplicity's sake
...
The `deltalake` crate should likely be improved to avoid issues
with timestamps with millisecond precision, since the protocol supports
them
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc)
but this unblocks behavior now. 🤔
2023-12-05 16:52:32 -08:00
R Tyler Croy
a4e44c26c5
Behave a bit more defensively to avoid processing checkpoint parquet files
...
I am fairly confident that checkpoints are already being ignored by this
processing, but I am not 100% certain, so this defensive programming will help a bit.
2023-11-13 12:10:09 -08:00
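A defensive check of the kind this commit describes might look like the sketch below. It is an assumption-laden illustration rather than the actual change: the function name `is_data_parquet` is hypothetical, and the check relies only on the Delta convention that checkpoints live under `_delta_log/` and carry `.checkpoint.` in their file names.

```rust
/// Returns true when an object key looks like a data file worth processing.
/// Anything under `_delta_log/`, and anything named like a checkpoint, is skipped.
pub fn is_data_parquet(key: &str) -> bool {
    // Any path segment named `_delta_log` marks transaction log content.
    let in_delta_log = key.split('/').any(|segment| segment == "_delta_log");
    // The file name is the last path segment (or the whole key if there is no '/').
    let file_name = key.rsplit('/').next().unwrap_or(key);
    // Covers both single-file and multi-part checkpoint names.
    let is_checkpoint = file_name.contains(".checkpoint.");
    key.ends_with(".parquet") && !in_delta_log && !is_checkpoint
}
```

Checking both the directory and the file name is deliberately redundant, in the same spirit as the commit: either test alone should be enough for a well-formed table.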
R Tyler Croy
e5af87ae6f
Remove the redundant set of dynamodb locks and rely on deltalake's built-in
...
I believe the redundant set of locks is not providing enough utility and can be
set aside. I do not yet have enough confidence that the `convert` path will
exhibit the right behavior on a new table; that requires more testing.
2023-11-13 07:44:23 -08:00
R Tyler Croy
0e090415c8
Fix a number of minor clippy warnings
2023-11-12 13:48:48 -08:00
R Tyler Croy
2d8659b1cf
Upgrade to the 0.15 version of the deltalake crate
...
This brings in newer arrow and datafusion dependencies with some other fixes
2023-09-07 23:48:09 -07:00
R Tyler Croy
aa8f2df8f2
URL-decoded keys are needed much earlier in the processing of events
...
This commit incorporates a hack to work around `aws_lambda_events` not filling
out `url_decoded_key`, which makes everything easier downstream of the event loop.
2023-05-07 19:09:12 -07:00
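For context, object keys in S3 event notifications arrive URL-encoded, with spaces encoded as `+`. A standalone decoder can be sketched as below. This is not the hack from the commit: the function names `url_decode_key` and `hex_value` are hypothetical, and the sketch uses only the standard library rather than whatever crate the project relies on.

```rust
/// Convert a single ASCII hex digit to its numeric value.
fn hex_value(b: u8) -> Option<u8> {
    (b as char).to_digit(16).map(|d| d as u8)
}

/// Percent-decode an S3 event object key, treating '+' as a space.
/// Malformed escapes are passed through unchanged.
pub fn url_decode_key(key: &str) -> String {
    let bytes = key.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // Try to read a `%XY` escape starting at position i.
        let decoded = match bytes[i] {
            b'%' if i + 2 < bytes.len() => hex_value(bytes[i + 1])
                .zip(hex_value(bytes[i + 2]))
                .map(|(hi, lo)| hi * 16 + lo),
            _ => None,
        };
        match decoded {
            Some(byte) => {
                out.push(byte);
                i += 3;
            }
            None => {
                out.push(if bytes[i] == b'+' { b' ' } else { bytes[i] });
                i += 1;
            }
        }
    }
    // Invalid UTF-8 is replaced rather than rejected, to keep the sketch total.
    String::from_utf8_lossy(&out).into_owned()
}
```

Decoding once, right where the event is received, is what lets later stages treat the key as a plain path, for example when looking for `=` in hive-style partition segments.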
R Tyler Croy
dbf5882a9b
Introduce the main lambda functionality of creating or appending to a table
2023-05-07 16:21:46 -07:00
R Tyler Croy
455329c8c2
Scaffolding of the minimum terraform and lambda code to receive the bucket notifications
...
This is not yet functioning in the true sense of `oxbow`, but it is at least
ready for test cycles with real use-cases in AWS
2023-05-07 14:18:54 -07:00
R Tyler Croy
0192d04f69
Add an integration test for validating all the golden tables
...
This currently fails because a parquet file's schema is not delta compatible
somehow:
thread 'test_all_tables' panicked at 'Failed to convert the schema for creating the table: SchemaError("Invalid data type for Delta Lake: Timestamp(Nanosecond, None)")', /usr/home/tyler/source/github/buoyant-data/oxbow/src/lib.rs:118:10
I have a hunch that this might be similar to delta-io/delta-rs#1286
2023-05-06 14:52:50 -07:00
R Tyler Croy
5df34ed5f3
Clean up some suggestions from clippy
2023-05-06 14:31:07 -07:00
R Tyler Croy
61e3e98a4b
Support creating delta tables from storage with hive style partitioning schemes
2023-05-06 14:29:16 -07:00
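Hive-style partitioning encodes partition values in the directory names as `column=value` segments. Recovering them from an object key can be sketched as follows; the function name `hive_partitions` is hypothetical and this is an illustration of the layout, not the code introduced by the commit.

```rust
/// Extract `name=value` partition segments from the directory part of a key,
/// in the order they appear. The file name itself is never inspected.
pub fn hive_partitions(key: &str) -> Vec<(String, String)> {
    // Drop the final path segment (the file name); a key without '/' has no directories.
    let dir = key.rsplit_once('/').map(|(dir, _)| dir).unwrap_or("");
    dir.split('/')
        .filter_map(|segment| segment.split_once('='))
        // Ignore segments such as "=value" that have no column name.
        .filter(|(name, _)| !name.is_empty())
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect()
}
```

For a key such as `tables/events/year=2023/month=05/part-0000.parquet`, this yields the pairs `year`/`2023` and `month`/`05`, which would then become the partition columns of the table.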
R Tyler Croy
0055b693bc
Sync the hive/ test data with the connectors repository
...
I forgot that I had removed the _delta_log/ originally when testing. I'll need
these to compare the results in the integration tests
2023-05-06 12:04:48 -07:00
R Tyler Croy
b45f11f163
Add an integration test to perform the simplest validation of conversion
...
This replicates what I was doing in the command line and ensures that there
won't be regressions as I refactor now
2023-05-06 09:17:21 -07:00
R Tyler Croy
b9bc10ec56
Add a slice of the golden data set from delta-io/connectors
2023-05-06 09:06:47 -07:00