Add a link and more details around the Buoyant Data concurrency post
parent
d8b565cd49
commit
4e95924fd7
51
_posts/2023-11-29-locking-with-deltalake.md
Normal file
@ -0,0 +1,51 @@
---
layout: post
title: "Improving lock performance for delta-rs"
tags:
- buoyantdata
- deltalake
- rust
---

I have had the good fortune this year to help a number of organizations develop
and deploy native data applications in Python and Rust using a project I helped
found: [delta-rs](https://github.com/delta-io/delta-rs). At a high level,
delta-rs is a Rust implementation of the [Delta Lake
protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md), which
offers ACID-like transactions for data lake use-cases. One of my main areas of
focus has been evaluating and improving performance in highly concurrent
runtime environments on AWS.

To help others understand the problem domain, I spent some time earlier in the
week documenting the challenges in AWS on the Buoyant Data blog: [Concurrency
limitations for Delta Lake on
AWS](https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html):

> In the case of AWS S3's consistency model many operations are strongly
> consistent, but concurrent operations on the same key are not. AWS encourages
> application-level object locking, which the delta-rs implements using AWS
> DynamoDB.

AWS S3 is an incredible piece of technology that washes away a myriad of common
storage problems, and has been jokingly referred to as "the 8th wonder of the
world" by [Corey Quinn](https://www.lastweekinaws.com/). The lack of a
"putIfAbsent"-like semantic is, however, _very_ annoying for the Delta Lake
protocol, adding the need for an application-wide *lock* for Delta users:

> The dynamodb-lock approach allows for some sensible cooperation between
> concurrent writers but the key limitation is that all concurrent operations
> must synchronize on the table itself. There is no smaller division of
> concurrency than a table operation

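To make the table-level limitation concrete, here is a minimal in-memory
sketch, assuming a store that only supports blind overwrites. `TableLockedLog`
and `run_writers` are hypothetical names invented for illustration, and a
`threading.Lock` stands in for the external DynamoDB-backed lock; none of this
is the delta-rs API.

```python
import threading


class TableLockedLog:
    """Toy stand-in for a table's _delta_log on a store without put-if-absent."""

    def __init__(self):
        self.objects = {}  # key -> payload, blind-overwrite semantics
        # Assumption: one lock per table, standing in for the external lock.
        self._table_lock = threading.Lock()

    def commit(self, payload):
        # Every writer to this table queues on the same lock, even when the
        # writers touch unrelated data.
        with self._table_lock:
            version = len(self.objects)
            self.objects[f"_delta_log/{version:020d}.json"] = payload
            return version


def run_writers(log, count):
    """Start `count` concurrent writers and return the versions they committed."""
    versions = []

    def work(i):
        versions.append(log.commit(f"writer-{i}".encode()))

    threads = [threading.Thread(target=work, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(versions)
```

The writers never clobber each other's commits, but only because they are
fully serialized: throughput for the whole table is bounded by how quickly a
single writer can take the lock, commit, and release it.
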
In the blog post I offer some potential approaches to mitigate the weakness of
needing a table-level lock for concurrent Delta Lake writers on AWS, but the
problem will unfortunately remain in some form or fashion until S3 introduces a
"putIfAbsent" semantic which allows writers to atomically "put" a file only if
it doesn't already exist.

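For a rough idea of what a "putIfAbsent" semantic would buy, here is a small
in-memory sketch. `PutIfAbsentStore` and `commit` are hypothetical names for
illustration only; this is not an S3 or delta-rs API, and the internal guard
merely models the store performing the check-and-write atomically on its own.

```python
import threading


class PutIfAbsentStore:
    """Toy object store offering an atomic put-if-absent operation."""

    def __init__(self):
        self._objects = {}
        # Models the store's internal atomicity, not a client-side lock.
        self._guard = threading.Lock()

    def put_if_absent(self, key, value):
        """Write `value` only if `key` does not exist; report whether we won."""
        with self._guard:
            if key in self._objects:
                return False
            self._objects[key] = value
            return True

    def get(self, key):
        return self._objects.get(key)


def commit(store, start_version, payload):
    """Claim the first free log version at or after `start_version`."""
    version = start_version
    while not store.put_if_absent(f"_delta_log/{version:020d}.json", payload):
        # Lost the race. A real writer would re-read the log and check for
        # conflicts before retrying; this sketch simply tries the next version.
        version += 1
    return version
```

With a primitive like this, two writers that race for the same version resolve
the conflict at the store itself: one `put_if_absent` succeeds, the other
fails and retries, and no external table-level lock is involved.
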
For concurrent Delta writers I can offer some advice, but unfortunately
effective cooperative distributed concurrency at scale remains a challenging
problem! :)