# Description
Fixes an issue where a Delta table contains an Add action with
`stats_parsed: null`.
As shown in the test case, `001.json` contains an Add action with stats,
while `002.json` contains an Add action with `stats_parsed: null`.
Before this fix, it would fail with:
```
Arrow { source: InvalidArgumentError("all columns in a record batch must have the same length") }
```
The issue is that the array for `num_records` has two values, while for
other stats such as `null_count` the `None` value is filtered out by
`flat_map`, leaving only one value in the array.
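A minimal, self-contained sketch of the mismatch (illustrative names only, not the actual delta-rs code): `flat_map` silently drops the `None` entry, so the derived column ends up shorter than `num_records`.
```rust
// Illustrative only: two Add actions, the second with `stats_parsed: null`.
fn main() {
    let stats: Vec<Option<i64>> = vec![Some(10), None];

    // `num_records` keeps one slot per action: length 2.
    let num_records: Vec<Option<i64>> = stats.clone();

    // Building another stat with `flat_map` silently drops the `None`,
    // so its column only has length 1 and the record batch columns
    // no longer agree.
    let null_count: Vec<i64> = stats.iter().flat_map(|s| *s).collect();

    assert_eq!(num_records.len(), 2);
    assert_eq!(null_count.len(), 1);
}
```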
# Related Issue(s)
closes #2477
# Documentation
In the case where /tmp exists on tmpfs with musl, the prior version of
this would fail with a cross-device link error before bubbling up a
not-found error.
# Description
In some edge cases where we schema evolve, the stats would be parsed with
the old schema, resulting in errors like:
`Exception: Json error: whilst decoding field 'minValues': whilst
decoding field 'foo': failed to parse 1000000000000 as Int8`
```python
import polars as pl
from deltalake import write_deltalake

pl.DataFrame({"foo": [1]}, schema={"foo": pl.Int8}).write_delta("TEST_TABLE_BUG")

write_deltalake(
    "TEST_TABLE_BUG",
    data=pl.DataFrame({"foo": [1000000000000]}, schema={"foo": pl.Int64}).to_arrow(),
    mode="overwrite",
    overwrite_schema=True,
    engine="rust",
)
```
Instead of taking the old schema, I added an optional schema that can be
passed to the logMapper.
# Description
This PR adds a configuration to control concurrent access to the
underlying object store. It also includes a visibility change to the
S3LogStoreFactory to align it with all other provider implementations.
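For context, a minimal sketch of the `object_store` building block involved (`LimitStore`), using an in-memory store and an arbitrary limit purely for illustration; this is not the new delta-rs configuration surface itself.
```rust
// Sketch only: wrap any ObjectStore in a LimitStore to cap the number
// of concurrent requests it will issue. The limit of 16 is arbitrary.
use object_store::{limit::LimitStore, memory::InMemory, ObjectStore};

fn main() {
    let inner = InMemory::new();
    let limited: LimitStore<InMemory> = LimitStore::new(inner, 16);
    // `limited` implements ObjectStore and can be handed to anything
    // that expects one.
    let _store: &dyn ObjectStore = &limited;
}
```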
# Related Issue(s)
- closes #2457
- resolves #2353
# Documentation
https://docs.rs/object_store/latest/object_store/limit/struct.LimitStore.html
---------
Co-authored-by: Michele Vigilante <michele.vigilante@radancy.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Currently this is nothing but a shim that ensures everything implements
IntoFuture (which they all already do), but in the future it will help
enforce consistency as well as provide common behaviors.
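A hedged sketch of what such a shim enables, using a hypothetical builder rather than an actual delta-rs operation: implementing `IntoFuture` lets every operation builder be `.await`ed directly.
```rust
use std::future::{Future, IntoFuture};
use std::pin::Pin;

// Hypothetical operation builder, not an actual delta-rs type.
struct CompactBuilder {
    target_size: u64,
}

impl CompactBuilder {
    async fn execute(self) -> Result<u64, String> {
        // Stand-in for the real work.
        Ok(self.target_size)
    }
}

// The shim: every operation builder resolves the same way via `.await`.
impl IntoFuture for CompactBuilder {
    type Output = Result<u64, String>;
    type IntoFuture = Pin<Box<dyn Future<Output = Self::Output> + Send>>;

    fn into_future(self) -> Self::IntoFuture {
        Box::pin(self.execute())
    }
}

async fn run() -> Result<u64, String> {
    // Thanks to IntoFuture, the builder itself is awaitable.
    CompactBuilder { target_size: 128 }.await
}

fn main() {
    let size = futures::executor::block_on(run()).unwrap();
    println!("compacted to {size}");
}
```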
# Description
The AWS SDK uses EC2 instance metadata in the default provider chain,
the profile chain, and the region provider.
# Related Issue(s)
- closes #2377
# Documentation
# Description
The nested fields weren't checked, which meant you could get a
timestampNtz in your schema but not have the reader/writer features set.
This check is now done recursively.
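A simplified sketch of that recursive walk, using a made-up schema enum rather than delta-rs's real `DataType`/`StructField` types:
```rust
// Simplified, hypothetical schema types for illustration only; the real
// check walks deltalake's own schema definitions.
enum DataType {
    TimestampNtz,
    Integer,
    Struct(Vec<DataType>),
    Array(Box<DataType>),
    Map(Box<DataType>, Box<DataType>),
}

// Recurse into nested struct, array, and map types instead of only
// inspecting the top-level fields.
fn contains_timestamp_ntz(dt: &DataType) -> bool {
    match dt {
        DataType::TimestampNtz => true,
        DataType::Struct(fields) => fields.iter().any(contains_timestamp_ntz),
        DataType::Array(inner) => contains_timestamp_ntz(inner),
        DataType::Map(k, v) => contains_timestamp_ntz(k) || contains_timestamp_ntz(v),
        _ => false,
    }
}

fn main() {
    let schema = DataType::Struct(vec![
        DataType::Integer,
        DataType::Struct(vec![DataType::TimestampNtz]), // nested field
    ]);
    // If this is true, the reader/writer features must include timestampNtz.
    assert!(contains_timestamp_ntz(&schema));
}
```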
# Description
@Blajda I don't think `from_qualified_name_ignore_case` was needed here,
since the delta_fields don't have relation information; they are just
the column names.
`from_qualified_name_ignore_case` will try to parse `__delta_rs_c_y--1`
and turn it into `__delta_rs_c_y`, while `from_name` just keeps the
column name as-is, which is preferred.
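A small illustration of the general difference using DataFusion's `Column`, with a generic dotted name rather than the actual `__delta_rs_c_*` columns (assumes the `datafusion_common` crate):
```rust
use datafusion_common::Column;

fn main() {
    // `from_name` keeps whatever string it is given as the column name.
    let as_is = Column::from_name("t.c");
    assert_eq!(as_is.name, "t.c");
    assert!(as_is.relation.is_none());

    // The qualified-name constructors parse the string into a relation
    // plus column, which is the kind of reinterpretation to avoid here.
    let parsed = Column::from_qualified_name("t.c");
    assert_eq!(parsed.name, "c");
}
```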
# Related Issue(s)
- closes https://github.com/delta-io/delta-rs/issues/2438
# Description
When attempting to retrieve a version with a datetime object, the
`load_as_version` method throws a `ValueError: Failed to parse datetime
string: premature end of input`.
Datetime objects without a specified timezone are now treated as UTC
datetimes.
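A minimal chrono sketch of that interpretation (an assumed-equivalent illustration, not the actual binding code): a naive timestamp is read as UTC and rendered as RFC 3339 before being used to resolve the version.
```rust
use chrono::{NaiveDate, TimeZone, Utc};

fn main() {
    // A timestamp with no timezone attached.
    let naive = NaiveDate::from_ymd_opt(2024, 4, 1)
        .unwrap()
        .and_hms_opt(12, 0, 0)
        .unwrap();

    // No timezone given, so assume UTC (the behavior described above).
    let as_utc = Utc.from_utc_datetime(&naive);
    assert_eq!(as_utc.to_rfc3339(), "2024-04-01T12:00:00+00:00");
}
```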
# Description
This implements `repartitioned` from the `ExecutionPlan` trait for `DeltaScan`.
Currently, Delta tables without partitions are read with all their files
in a single file group of the underlying `ParquetExec`. This seems to
mean that Delta tables without partitions are read without concurrency.
With `repartitioned` we can repartition the `DeltaScan` to get concurrency
when reading.
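A toy sketch of the idea (round-robin over file paths; the real `repartitioned` works on DataFusion file groups and may split differently):
```rust
// Hypothetical helper: split one big file group into `target_partitions`
// groups so the scan can execute them concurrently.
fn repartition(files: Vec<String>, target_partitions: usize) -> Vec<Vec<String>> {
    let mut groups: Vec<Vec<String>> = vec![Vec::new(); target_partitions];
    for (i, f) in files.into_iter().enumerate() {
        groups[i % target_partitions].push(f);
    }
    groups.retain(|g| !g.is_empty());
    groups
}

fn main() {
    let files: Vec<String> = (0..10).map(|i| format!("part-{i}.parquet")).collect();
    // One file group becomes four, enabling four concurrent readers.
    let groups = repartition(files, 4);
    assert_eq!(groups.len(), 4);
}
```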
# Description
- `stats_parsed` is a `StructArray` instead of a `StringArray`.
- Parsing an `Add` action's `stats` into `stats_parsed` would panic due to
the use of `slice.array_data()`.
# Related Issue(s)
closes #2312
# Documentation
https://docs.rs/arrow/51.0.0/arrow/array/struct.GenericByteArray.html#method.value_data
---------
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
# Description
It first sets a proper lower boundary instead of always assuming 0,
since we can also have checkpointed tables where logRetention caused
logs to be removed before a checkpoint.
- closes https://github.com/delta-io/delta-rs/issues/521
# Description
When a table is corrupted and the `_last_checkpoint` file points to a
version that has been deleted, the `list_log_files_with_checkpoint`
function panics. With this change, `list_log_files_with_checkpoint`
returns an error, allowing callers to react to such issues.
# Related Issue(s)
- https://github.com/delta-io/delta-rs/issues/2290
# Description
Introduces a post-commit hook, which can perform additional actions before
returning the FinalizedCommit.
The current commit hook creates a checkpoint if the interval condition
is met.
Also bumps the default interval to 100 commits, since 10 commits can be a
bit aggressive.
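A tiny illustrative sketch of the interval condition described above (the constant and function names are hypothetical, not the actual hook API):
```rust
// Illustrative only: the condition a post-commit checkpoint hook might
// apply. 100 matches the new default described above.
const DEFAULT_CHECKPOINT_INTERVAL: i64 = 100;

fn should_create_checkpoint(version: i64, interval: i64) -> bool {
    version > 0 && version % interval == 0
}

fn main() {
    assert!(should_create_checkpoint(100, DEFAULT_CHECKPOINT_INTERVAL));
    assert!(!should_create_checkpoint(50, DEFAULT_CHECKPOINT_INTERVAL));
}
```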
# Related Issue(s)
- closes https://github.com/delta-io/delta-rs/issues/913
# Description
Reverts https://github.com/delta-io/delta-rs/pull/2318 by removing
`flush` after writing each batch since it was causing smaller than
expected row groups to be written during compaction.
# Related Issue(s)
- closes #2386