Commit Graph

130 Commits

Author SHA1 Message Date
R Tyler Croy 24834b9d13 Checkpoint every 10 commits.
Technically I suppose you could use the checkpoint lambda, but it's literally
one function call here :P
2023-05-07 19:16:43 -07:00
R Tyler Croy aa8f2df8f2 Url decoded keys are needed much earlier in the processing of events
This commit incorporates a hack for aws_lambda_events not filling out
url_decoded_key to make everything easier downstream of the event loop
2023-05-07 19:09:12 -07:00
R Tyler Croy 3d0b48eb3a Handle url-encoded hive partition key names
S3Object.url_decoded_key has let me down
2023-05-07 18:40:28 -07:00
R Tyler Croy 0fcb951063 The `lambda` feature is required to build the lambda 2023-05-07 18:25:01 -07:00
R Tyler Croy a6d8121d42 Refactor the append_to_table function into the crate
This also should (untested) properly create new transactions with partition
information
2023-05-07 18:24:07 -07:00
R Tyler Croy b565624df1 Update the readme to include details about using the oxbow lambda function 2023-05-07 18:14:04 -07:00
R Tyler Croy 92bd91beed Properly prepare the ObjectMeta records to be added to the table
Basically the location of these objects needs to be relative to the _delta_log/
table, therefore the prune_prefix argument makes it easier to convert locations

It might make sense for this to be non-optional in the future but I'm erring on
the side of less refactoring
2023-05-07 18:05:23 -07:00
R Tyler Croy b0bba64ada Allow the lambda function to log to CloudWatch logs
This is pretty key for debugging 😸
2023-05-07 18:00:09 -07:00
R Tyler Croy dbf5882a9b Introduce the main lambda functionality of creating or appending to a table 2023-05-07 16:21:46 -07:00
R Tyler Croy bddb3bdb8e Properly build with both features (lambda/cli)
I'm still not entirely pleased with this approach. I'm going to play with it
some more. I'm keen to try this rather than having two different binaries built
2023-05-07 14:51:12 -07:00
R Tyler Croy 856bfac746 infer_log_path_from to help the lambda place the _delta_log correctly 2023-05-07 14:45:52 -07:00
R Tyler Croy 80e1edbb8d Add stub directory tree for examples using oxbow 2023-05-07 14:20:52 -07:00
R Tyler Croy 455329c8c2 Scaffolding of the minimum terraform and lambda code to receive the bucket notifications
This is not functioning in the true sense of `oxbow` yet, but at least is
ready for test cycles with real use-cases in AWS
2023-05-07 14:18:54 -07:00
R Tyler Croy e88f21ab5c Correct parsing error in readme 2023-05-07 14:18:14 -07:00
R Tyler Croy 0192d04f69 Add an integration test for validating all the golden tables
This currently fails because a parquet file's schema is not delta compatible
somehow:

thread 'test_all_tables' panicked at 'Failed to convert the schema for creating the table: SchemaError("Invalid data type for Delta Lake: Timestamp(Nanosecond, None)")', /usr/home/tyler/source/github/buoyant-data/oxbow/src/lib.rs:118:10

I have a hunch that this might be similar to delta-io/delta-rs#1286
2023-05-06 14:52:50 -07:00
R Tyler Croy 5df34ed5f3 Clean up some suggestions from clippy 2023-05-06 14:31:07 -07:00
R Tyler Croy 61e3e98a4b Support creating delta tables from storage with hive style partitioning schemes 2023-05-06 14:29:16 -07:00
R Tyler Croy 0055b693bc Sync the hive/ test data with the connectors repository
I forgot that I had removed the _delta_log/ originally when testing. I'll need
these to compare the results in the integration tests
2023-05-06 12:04:48 -07:00
R Tyler Croy 7b31ec42e3 Create rust.yml 2023-05-06 11:58:44 -07:00
R Tyler Croy 7fe1204145 Fix code fencing 2023-05-06 09:26:20 -07:00
R Tyler Croy 1355f8a34a Add some CLI usage information to the README
Before I forget!
2023-05-06 09:25:02 -07:00
R Tyler Croy b45f11f163 Add an integration test to perform the most simple validation of conversion
This replicates what I was doing in the command line and ensures that there
won't be regressions as I refactor now
2023-05-06 09:17:21 -07:00
R Tyler Croy b9bc10ec56 Add a slice of the golden data set from delta-io/connectors 2023-05-06 09:06:47 -07:00
R Tyler Croy b5dd6e77d0 Implement the most simple use-case for a command line invocation
This will convert a single non-partitioned directory into a delta table
2023-05-06 08:32:11 -07:00
R Tyler Croy bc2d2ccc4c Move the lib functions into the lib module 2023-05-02 21:05:30 -07:00
R Tyler Croy ba82fa93d7 Working on discover_parquet_files() for identifying parquet files to import 2023-05-02 21:02:25 -07:00
R Tyler Croy d3f1c85fa7 Starting implementation of the CLI version with local files for testing 2023-05-02 19:09:24 -07:00
R Tyler Croy 1fa65baecb Start structuring the project to support a CLI and in the future a lambda mode 2023-05-01 21:30:21 -07:00
R Tyler Croy 97efd6a37c Add deltalake with the license before starting implementation 2023-05-01 21:01:28 -07:00
R Tyler Croy a64bc4c69d initial commit 2023-05-01 20:56:59 -07:00