Add a link and more details around the Buoyant Data concurrency post

R Tyler Croy 2023-11-29 08:16:06 -08:00
parent d8b565cd49
commit 4e95924fd7
1 changed files with 51 additions and 0 deletions


@@ -0,0 +1,51 @@
layout: post
title: "Improving lock performance for delta-rs"
- buoyantdata
- deltalake
- rust
I have had the good fortune this year to help a number of organizations develop
and deploy native data applications in Python and Rust using a project I helped
found: delta-rs. At a high level,
delta-rs is a Rust implementation of the Delta Lake
protocol, which
offers ACID-like transactions for data lake use cases. One of my big areas of
focus has been evaluating and improving performance in highly concurrent
runtime environments on AWS.
To help others understand the problem domain I spent some time earlier in the
week documenting the challenges in AWS on the Buoyant Data blog: [Concurrency
limitations for Delta Lake on
> In the case of AWS S3's consistency model many operations are strongly
> consistent, but concurrent operations on the same key are not. AWS encourages
> application-level object locking, which the delta-rs implements using AWS
> DynamoDB.
AWS S3 is an incredible piece of technology that washes away a myriad of common
storage problems, and has been jokingly referred to as "the 8th wonder of the
world" by Corey Quinn. The lack of a
"putIfAbsent"-like semantic is, however, _very_ annoying for the Delta Lake
protocol, adding the need for an application-wide *lock* for Delta users:
> The dynamodb-lock approach allows for some sensible cooperation between
> concurrent writers but the key limitation is that all concurrent operations
> must synchronize on the table itself. There is no smaller division of
> concurrency than a table operation
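
To make the table-level limitation concrete, here is a minimal in-memory
sketch. It is not the dynamodb-lock implementation: `InMemoryLockTable`, the
writer names, and the table URIs are all hypothetical, and a real lock would
also need leases and expiry. It only illustrates that the unit of contention is
the whole table.

```python
import threading


class InMemoryLockTable:
    """Hypothetical stand-in for a lock table: one lock entry per Delta table."""

    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}  # table URI -> id of the writer holding the lock

    def try_acquire(self, table_uri, owner):
        # Mimics a conditional write: succeed only if no entry exists yet.
        with self._guard:
            if table_uri in self._owners:
                return False
            self._owners[table_uri] = owner
            return True

    def release(self, table_uri, owner):
        # Only the current holder may release the lock.
        with self._guard:
            if self._owners.get(table_uri) == owner:
                del self._owners[table_uri]


locks = InMemoryLockTable()

# Two writers target the same table: the second one has to wait.
first = locks.try_acquire("s3://bucket/events", "writer-a")
second = locks.try_acquire("s3://bucket/events", "writer-b")

# A writer on a different table is unaffected.
other = locks.try_acquire("s3://bucket/users", "writer-c")

# Once the first writer releases, the second can proceed.
locks.release("s3://bucket/events", "writer-a")
retry = locks.try_acquire("s3://bucket/events", "writer-b")
```

Even if `writer-a` and `writer-b` were touching completely unrelated
partitions of `events`, they would still be serialized, since nothing smaller
than the table can be locked.
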
In the blog post I offer some potential approaches to mitigate the weakness of
needing a table-level lock for concurrent Delta Lake writers on AWS, but the
problem will unfortunately remain in some form or fashion until S3
introduces a "putIfAbsent" semantic which allows writers to atomically "put" a
file only if it doesn't already exist.
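
As a rough illustration of why that semantic matters, the sketch below
simulates an object store with an atomic `put_if_absent` and a commit loop that
claims the next free log version. Everything here is an assumption made for
illustration: `InMemoryStore`, `put_if_absent`, and `commit` are hypothetical
names, not S3 or delta-rs APIs, and a real writer would re-read the log and
check for conflicts before retrying.

```python
import threading


class InMemoryStore:
    """Hypothetical object store offering an atomic put-if-absent."""

    def __init__(self):
        self._guard = threading.Lock()
        self._objects = {}

    def put_if_absent(self, key, body):
        # The check and the write happen under one lock, so they are atomic.
        with self._guard:
            if key in self._objects:
                return False
            self._objects[key] = body
            return True

    def keys(self):
        with self._guard:
            return sorted(self._objects)


def commit(store, body, start_version=0):
    """Claim the first free commit version and return it."""
    version = start_version
    while True:
        # Delta log entries are zero-padded, 20-digit version numbers.
        key = f"_delta_log/{version:020d}.json"
        if store.put_if_absent(key, body):
            return version
        version += 1  # another writer won this version, try the next one


store = InMemoryStore()
results = []
results_guard = threading.Lock()


def writer(name):
    claimed = commit(store, name.encode())
    with results_guard:
        results.append(claimed)


threads = [threading.Thread(target=writer, args=(f"writer-{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With an atomic primitive like this, no two writers can ever claim the same
version, so no external lock is needed to protect the log itself.
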
For concurrent Delta writers I can offer some advice, but unfortunately
effective cooperative distributed concurrency at scale remains a challenging
problem! :)