Commit Graph

1423 Commits

Author SHA1 Message Date
dependabot[bot] 81593e9194 chore(deps): update sqlparser requirement from 0.44 to 0.46
Updates the requirements on [sqlparser](https://github.com/sqlparser-rs/sqlparser-rs) to permit the latest version.
- [Changelog](https://github.com/sqlparser-rs/sqlparser-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sqlparser-rs/sqlparser-rs/compare/v0.44.0...v0.46.0)

---
updated-dependencies:
- dependency-name: sqlparser
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-05-07 08:53:54 -07:00
emcake 35664c0ef0 fix: Return unsupported error for merging schemas in the presence of partition columns 2024-05-07 07:12:38 -07:00
KyJah Keys cfb20f1795 applied cargo fmt 2024-05-07 06:47:12 -07:00
KyJah Keys 7192997604 fix(python, rust): region lookup wasn't working correctly for dynamo 2024-05-07 06:47:12 -07:00
Yijie Shen e370d34571
fix(rust): unable to read delta table when table contains both null and non-null add stats (#2476)
# Description
Fixes reading a delta table that contains an Add action with
`stats_parsed: null`.

As shown in the test case, `001.json` contains an Add action with stats,
while `002.json` contains an Add action with `stats_parsed: null`.
Before this fix, reading the table fails with:

```
Arrow { source: InvalidArgumentError("all columns in a record batch must have the same length") }
```

The issue is that the array for `num_records` has two values, while for
other stats such as null_count, the None value is filtered out by
`flat_map`, so there is only one value in the array.
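
The mismatch described above can be reproduced in miniature. The following is only an illustrative sketch, not the delta-rs code: the `stats` list and the `same_length` helper are hypothetical, and Python list comprehensions stand in for the Rust `flat_map` call.

```python
# Hypothetical parsed stats for two Add actions; the second has none.
stats = [{"numRecords": 3, "nullCount": {"foo": 0}}, None]

# This column keeps a slot for the missing entry, so it has two values.
num_records = [s["numRecords"] if s is not None else None for s in stats]

# This column drops the missing entry, as a flat_map over optional
# values would, so it has only one value.
null_count_dropped = [s["nullCount"]["foo"] for s in stats if s is not None]

# Emitting a null placeholder instead keeps every column the same length.
null_count_aligned = [
    s["nullCount"]["foo"] if s is not None else None for s in stats
]


def same_length(*columns):
    """Return True when all columns could form one record batch."""
    return len({len(column) for column in columns}) == 1
```

Here `same_length(num_records, null_count_dropped)` is false, which corresponds to the Arrow error, while the aligned variant passes the check.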


# Related Issue(s)
closes #2477 

2024-05-06 21:47:48 +00:00
R Tyler Croy d7165cfef8
fix: check to see if the file exists before attempting to rename (#2482)
In the case of /tmp existing on tmpfs with musl, the prior version of
this would fail with a cross-device link error before bubbling up a not
found error
2024-05-06 21:28:48 +00:00
Ion Koutsouris e25aed70a0
fix(python, rust): use new schema for stats parsing instead of old (#2480)
# Description
In some edge cases involving schema evolution, the stats were parsed with
the old schema, resulting in errors like this:
`Exception: Json error: whilst decoding field 'minValues': whilst
decoding field 'foo': failed to parse 1000000000000 as Int8`

```python
import polars as pl
from deltalake import write_deltalake

pl.DataFrame({
    "foo": [1]
}, schema={"foo": pl.Int8}).write_delta("TEST_TABLE_BUG")


write_deltalake("TEST_TABLE_BUG", data = pl.DataFrame({
    "foo": [1000000000000]
}, schema={"foo": pl.Int64}).to_arrow(), mode='overwrite', overwrite_schema=True,engine='rust')
```

Instead of taking the old schema, I added an optional schema that can be
passed to the logMapper.
2024-05-06 16:39:13 +00:00
Adrian Garcia Badaracco d0617b5ca1
feat(python): add parameter to DeltaTable.to_pyarrow_dataset() (#2465)
Otherwise there is no way to union this with another dataset.
2024-05-05 22:14:37 +00:00
R Tyler Croy e7af965abc chore: update the deltalake-aws version and clippy for release of #2452 2024-05-04 09:03:40 -07:00
Peter Ke ad89cc3caf format 2024-05-04 09:03:40 -07:00
Peter Ke 6ef3caa79a abort commit 2024-05-04 09:03:40 -07:00
R Tyler Croy 85089b1c74 chore: update the changelog to include rust-v0.17.3 2024-05-01 23:09:01 -07:00
R Tyler Croy f6d110815c chore: update the python version and dependencies for release 2024-05-01 14:32:06 -07:00
R Tyler Croy 716acc31b7 chore: bump the metacrate and correct some of the version ranges for patch releases 2024-05-01 14:32:06 -07:00
R Tyler Croy 92128eb7c6 chore: bump deltalake-azure for release 2024-05-01 14:32:06 -07:00
R Tyler Croy 55a0c6ea0d chore: update the deltalake-azure number for release 2024-05-01 14:32:06 -07:00
R Tyler Croy b54cf99605 chore: increment the patch version for deltalake-gcp 2024-05-01 14:32:06 -07:00
R Tyler Croy 27c1e48cd9 chore: bump the deltalake-aws version for release 2024-05-01 14:32:06 -07:00
Michele Vigilante 0c8e5d56d3
feat(python, rust): add OBJECT_STORE_CONCURRENCY_LIMIT setting for ObjectStoreFactory (#2458)
# Description
This PR adds a configuration to control concurrent access to the
underlying object store. It also includes a visibility change to the
S3LogStoreFactory to align it with all other provider implementations.
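
The general idea of capping concurrent requests to a store can be sketched with a semaphore. This is a minimal illustration under stated assumptions, not the delta-rs or `object_store` implementation: `LimitedStore`, `fake_fetch`, and `demo` are hypothetical names invented for this example.

```python
import asyncio


class LimitedStore:
    """Hypothetical wrapper that caps concurrent calls to a fetch function."""

    def __init__(self, fetch, max_concurrency):
        self._fetch = fetch
        self._semaphore = asyncio.Semaphore(max_concurrency)

    async def get(self, path):
        # At most max_concurrency callers are inside this block at once.
        async with self._semaphore:
            return await self._fetch(path)


async def demo(max_concurrency=2, requests=6):
    in_flight = 0
    peak = 0

    async def fake_fetch(path):
        # Stand-in for a real object store request; tracks peak concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)
        in_flight -= 1
        return path

    store = LimitedStore(fake_fetch, max_concurrency)
    paths = [f"file-{i}" for i in range(requests)]
    results = await asyncio.gather(*(store.get(p) for p in paths))
    return peak, results
```

Running `asyncio.run(demo())` returns the peak number of simultaneous fetches alongside the results, which makes the effect of the limit easy to observe.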

# Related Issue(s)
- closes #2457 
- resolves #2353

# Documentation

https://docs.rs/object_store/latest/object_store/limit/struct.LimitStore.html

---------

Co-authored-by: Michele Vigilante <michele.vigilante@radancy.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
2024-05-01 16:04:41 +00:00
R Tyler Croy 68965afa8d chore: bump the core and python crate for its next release 2024-05-01 08:37:43 -07:00
Stephen Carman 4dce000f02
feat: cdf reader for delta tables (#2048)
# Description
This PR is the initial work for Change Data Feed (CDF) readers for delta
tables. This PR looks a lot larger than it really is because a physical
test table is checked in with this which will be removed once the loop
is closed on CDF reading/writing.

# Related Issue(s)

# Documentation

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-data-files
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-cdc-file

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
2024-04-30 18:15:39 -07:00
R Tyler Croy 41cb9d7b73 chore: increment the version of the deltalake-gcp crate
This change also loosens the meta-crate version dependency to allow
easier upgrades in the future
2024-04-30 11:19:40 -07:00
Adrian Garcia Badaracco 2aa4571a21 Update mod.rs 2024-04-30 11:19:40 -07:00
Adrian Garcia Badaracco fb0a2decfa Update crates/gcp/src/storage.rs 2024-04-30 11:19:40 -07:00
Adrian Garcia Badaracco 771991393d add debug 2024-04-30 11:19:40 -07:00
Adrian Garcia Badaracco 5a301288e2 Add file 2024-04-30 11:19:40 -07:00
Adrian Garcia Badaracco 99a4121681 Handle 429 from GCS 2024-04-30 11:19:40 -07:00
Ion Koutsouris 28ad3950d9
feat(rust): advance state in post commit (#2396)
# Description
We advance the state in the post commit now, so it's done in a single
location as per suggestion from @Blajda here:
https://github.com/delta-io/delta-rs/pull/2391#issuecomment-2041500757

This PR also supersedes this one:
https://github.com/delta-io/delta-rs/pull/2280

# Related Issue(s)
- fixes #2279
- fixes #2262
2024-04-27 13:08:43 -04:00
Luis 9d3ecbeb62
chore(rust): bump arrow v51 and datafusion v37.1 (#2395)
# Description
Update the arrow and datafusion dependencies.

# Related Issue(s)
- closes #2328

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
2024-04-26 18:28:50 +00:00
Ion Koutsouris 6a7c684d9b
fix(python): reuse table state in write engine (#2453)
# Description
Instead of reusing the table state, it was being instantiated every time
you call write with the rust engine.

- https://github.com/delta-io/delta-rs/discussions/2448
2024-04-25 06:37:31 -07:00
R Tyler Croy f55ddc64a3 Introduce the `Operation` trait for all operations to implement
Currently this is nothing but a shim that ensures everything implements
IntoFuture, which they all already do, but in the future this will help
enforce consistency as well as provide common behaviors.
2024-04-25 01:25:08 -07:00
KyJah Keys dd358ef8e8
fix(python, rust): remove imds calls from profile auth and region (#2442)
# Description
The AWS SDK uses EC2 instance metadata (IMDS) in the default provider
chain, the profile chain and the region provider. This change removes the
IMDS calls from profile auth and region lookup.

# Related Issue(s)
- closes #2377 
2024-04-23 16:23:52 +00:00
Ion Koutsouris 12979dd881
fix(python, rust): check timestamp_ntz in nested fields, add check_can_writestamp_ntz in pyarrow writer (#2443)
# Description
The nested fields weren't checked, which meant you could get a
timestampNtz in your schema but not have the reader/writer features set.
This check is now done recursively.
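
A recursive check of this kind can be sketched as follows. This is an illustrative assumption rather than the delta-rs implementation: the schema is modeled as plain dictionaries following the Delta protocol's JSON schema layout (`struct`/`fields`, `array`/`elementType`, `map`/`keyType`/`valueType`), and `contains_timestamp_ntz` is a hypothetical helper name.

```python
def contains_timestamp_ntz(data_type):
    """Return True if the type, or anything nested inside it, is timestamp_ntz."""
    if data_type == "timestamp_ntz":
        return True
    if isinstance(data_type, dict):
        kind = data_type.get("type")
        if kind == "struct":
            # Recurse into every field of the struct.
            return any(
                contains_timestamp_ntz(field["type"])
                for field in data_type["fields"]
            )
        if kind == "array":
            return contains_timestamp_ntz(data_type["elementType"])
        if kind == "map":
            return contains_timestamp_ntz(
                data_type["keyType"]
            ) or contains_timestamp_ntz(data_type["valueType"])
    return False


# Hypothetical schema: the timestamp_ntz column is hidden inside a nested struct.
nested_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long"},
        {
            "name": "event",
            "type": {
                "type": "struct",
                "fields": [{"name": "at", "type": "timestamp_ntz"}],
            },
        },
    ],
}
```

A check that only inspects top-level field types would miss `event.at` here, which is the gap the recursion closes.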
2024-04-23 15:43:14 +00:00
Ion Koutsouris da6ed7b39d
fix(python, rust): use from_name during column projection creation (#2441)
# Description
@Blajda I don't think `from_qualified_name_ignore_case` was needed here
since the delta_fields don't have relation information, they are just
the column names.

`from_qualified_name_ignore_case` will try to parse `__delta_rs_c_y--1`,
which results in `__delta_rs_c_y`, while `from_name` just keeps the
column name as-is, which is preferred.


# Related Issue(s)
- closes https://github.com/delta-io/delta-rs/issues/2438
2024-04-22 22:24:58 -04:00
Ion Koutsouris 15abe448dc
chore: bump python for 0.17 release (#2439)
2024-04-22 17:24:31 +00:00
Ion Koutsouris f12834e22f
fix(python,rust): missing remove actions during `create_or_replace` (#2437)
# Description
The overwrite mode never added the remove actions, which left the
table in an invalid state.
2024-04-22 17:02:00 +00:00
Igor 5f137ca8af
fix(python): load_as_version with datetime object with no timezone specified (#2429)
# Description
Upon attempting to retrieve the version with a datetime object, the
`load_as_version` method throws a `ValueError: Failed to parse datetime
string: premature end of input`.

Datetime objects without a specified timezone will be treated as UTC
datetimes.
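
The stated rule, that a datetime without a timezone is interpreted as UTC, can be sketched in a few lines. The `as_utc` helper is a hypothetical name used for illustration only; it is not part of the `deltalake` API.

```python
from datetime import datetime, timezone


def as_utc(dt):
    """Interpret naive datetimes as UTC; convert aware ones to UTC."""
    if dt.tzinfo is None:
        # No timezone given: attach UTC without shifting the clock time.
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


naive = datetime(2024, 4, 21, 12, 0, 0)
normalized = as_utc(naive)
```

After normalization, `normalized.isoformat()` carries an explicit `+00:00` offset, so a downstream parser no longer sees a string that ends before the timezone.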
2024-04-21 21:38:09 +00:00
Jonas Irgens Kylling ebbdd69274
feat: implement repartitioned for DeltaScan (#2421)
# Description
This implements repartitioned from the ExecutionPlan trait of DeltaScan.
Currently, Delta tables without partitions are read with all their files
in a single file group of the underlying `ParquetExec`. This seems to
mean that Delta tables without partitions are read without concurrency.
With repartitioned we can repartition the DeltaScan to get concurrency
when reading.
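
The effect of splitting one file group into several can be sketched as below. This is a simplified round-robin illustration under stated assumptions, not the DataFusion or delta-rs algorithm, which may split differently (for example by byte ranges); `repartition` here is a hypothetical standalone function.

```python
def repartition(file_groups, target_partitions):
    """Spread files from the existing groups across up to target_partitions groups."""
    files = [path for group in file_groups for path in group]
    # Never create more groups than there are files, and always at least one.
    count = max(1, min(target_partitions, len(files)))
    groups = [[] for _ in range(count)]
    for index, path in enumerate(files):
        groups[index % count].append(path)
    return groups


# A table without partitions: every file sits in one group.
single_group = [["a.parquet", "b.parquet", "c.parquet", "d.parquet", "e.parquet"]]
two_groups = repartition(single_group, 2)
```

With one group, a scan has a single unit of work; with two groups, each can be read by a separate task.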
2024-04-16 18:05:04 +00:00
Ion Koutsouris 9736522b87
feat: lazy static runtime in python (#2424)
# Description
Creates a lazy static runtime, as suggested by @wjones127. Supersedes
this PR: https://github.com/delta-io/delta-rs/pull/1950
2024-04-16 16:35:33 +00:00
Yijie Shen aa8f4d5390
fix(rust): stats_parsed has different number of records with stats (#2405)
# Description
- `stats_parsed` is a StructArray instead of StringArray
- Parsing an `Add` action's `stats` into `stats_parsed` would panic due to
the use of `slice.array_data()`.

# Related Issue(s)
closes #2312 

# Documentation

https://docs.rs/arrow/51.0.0/arrow/array/struct.GenericByteArray.html#method.value_data

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
2024-04-15 08:55:17 +02:00
Ion Koutsouris faa743a6f1
fix(rust): timestamp deserialization format not following protocol + missing timestampNtz deserialization (#2383)
# Description
Our timestamp deserialization format didn't include the %6f to decode
this value: 1970-01-01 00:00:00.123456. Additionally, when adding
timestampNtz I didn't add deserialization of that primitive type :)
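
The role of the fractional-seconds specifier can be illustrated in Python. This is a sketch by analogy, not the Rust code from the commit: `%6f` is assumed to be a chrono-style specifier for six fractional digits, and Python's `%f` is used as the nearest equivalent. The format strings below are assumptions chosen for the example.

```python
from datetime import datetime

VALUE = "1970-01-01 00:00:00.123456"

# Format with a fractional-seconds specifier: the microseconds are decoded.
WITH_FRACTION = "%Y-%m-%d %H:%M:%S.%f"
parsed = datetime.strptime(VALUE, WITH_FRACTION)

# Format without it: the trailing ".123456" cannot be consumed.
WITHOUT_FRACTION = "%Y-%m-%d %H:%M:%S"
try:
    datetime.strptime(VALUE, WITHOUT_FRACTION)
    failed_without_fraction = False
except ValueError:
    failed_without_fraction = True
```

The second attempt raises `ValueError` because unconverted data remains, which mirrors a deserializer whose format lacks the fractional part.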


- fixes https://github.com/delta-io/delta-rs/issues/2380
- fixes https://github.com/delta-io/delta-rs/issues/2381
2024-04-14 21:29:04 -07:00
Avril Aysha d49d95ba4b
docs: add Daft integration (#2402)
This adds an integration page for using Delta Lake with Daft.

---------

Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
2024-04-13 08:09:39 +02:00
Ion Koutsouris 133941afdb
fix: time travel when checkpointed and logs removed (#2389)
# Description
It first sets a proper lower boundary instead of always assuming 0,
since we can also have checkpointed tables which had logRetention that
caused logs to be removed before a checkpoint.
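
One way to derive a lower boundary from the log contents is to look at the smallest commit version that still exists. This is only a sketch under stated assumptions, not the logic used in delta-rs: `lower_bound_version` is a hypothetical helper, and it assumes commit files are named with a 20-digit zero-padded version followed by `.json`.

```python
import re

# Assumed commit file naming: 20-digit zero-padded version plus ".json".
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def lower_bound_version(log_file_names):
    """Return the smallest commit version still present, or 0 if none is found."""
    versions = [
        int(match.group(1))
        for name in log_file_names
        if (match := COMMIT_FILE.match(name))
    ]
    return min(versions) if versions else 0


# Hypothetical log directory after old commits were cleaned up.
log_files = [
    "00000000000000000010.checkpoint.parquet",
    "00000000000000000010.json",
    "00000000000000000011.json",
    "_last_checkpoint",
]
```

For this listing the search for a version by timestamp would start at 10 rather than 0, avoiding lookups for commit files that no longer exist.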


- closes https://github.com/delta-io/delta-rs/issues/521
2024-04-12 05:57:46 +00:00
KyJah Keys 64b3e54126 added missing file 2024-04-11 22:36:30 -07:00
KyJah Keys 0d1790306b feat: added configuration variables to handle EC2 metadata service 2024-04-11 22:36:30 -07:00
Erdem Sarili 3094bd28ce
fix: return error when checkpoints and metadata get out of sync (#2406)
# Description
When a table is corrupted and the `_last_checkpoint` file points to a
version that has been deleted, the `list_log_files_with_checkpoint`
function panics. With this change, `list_log_files_with_checkpoint`
returns an error, allowing callers to react to such issues.

# Related Issue(s)
- https://github.com/delta-io/delta-rs/issues/2290
2024-04-11 12:52:37 +00:00
Ion Koutsouris 5eade5e1f0
feat(rust): post commit hook (v2), create checkpoint hook (#2391)
# Description
Introduces a post commit hook, which can perform additional actions before
returning the FinalizedCommit.

The current commit hook creates a checkpoint if the commit meets the
interval condition.

Also bumps the default interval to 100 commits. 10 commits can be a
bit aggressive.
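
An interval condition of this kind might look like the sketch below. It is an assumption for illustration, not the delta-rs implementation: `should_checkpoint` is a hypothetical helper, and the modulo rule is one plausible reading of "meets the condition of the interval".

```python
DEFAULT_CHECKPOINT_INTERVAL = 100  # the new default mentioned in this commit


def should_checkpoint(version, interval=DEFAULT_CHECKPOINT_INTERVAL):
    """Return True when a commit version lands on a checkpoint boundary."""
    # Assumed rule: checkpoint every `interval` commits, skipping version 0.
    return interval > 0 and version > 0 and version % interval == 0
```

Under this rule, raising the interval from 10 to 100 reduces how often a commit triggers checkpoint creation by a factor of ten.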

# Related Issue(s)
- closes https://github.com/delta-io/delta-rs/issues/913
2024-04-07 12:26:07 -04:00
Alessandro Rinaldi fef111c129
docs: document required aws permissions (#2393)
# Description
This documents the required AWS permissions on S3 and DynamoDB to
interact with deltalakes.

# Related Issue(s)
- mentions #1091
2024-04-06 22:05:48 +02:00
Peter Ke 69317f821e
fix(rust): remove flush after writing every batch (#2387)
# Description

Reverts https://github.com/delta-io/delta-rs/pull/2318 by removing
`flush` after writing each batch since it was causing smaller than
expected row groups to be written during compaction.

# Related Issue(s)
- closes #2386
2024-04-05 00:46:33 +00:00
Ion Koutsouris 6f81b8034d
fix(python, rust): expr parsing date/timestamp (#2357)
# Description
We weren't parsing all scalar values yet; date32/64 and
timestampmicros are now parsed as well.

# Related Issue(s)
- fixes https://github.com/delta-io/delta-rs/issues/2344
2024-04-02 08:08:39 +02:00