Add a backup and recovery with Delta Lake post
This is just a high-level overview; I think the tech.scribd.com post will have a lot more detail, but I wanted to put something out while this was fresh in my brain.
---
layout: post
title: Recovering from disasters with Delta Lake
tags:
- deltalake
- scribd
---

Entering into the data platform space with a lot of experience in more
traditional production operations is a _lot_ of fun, especially when you ask
questions like "what if `X` goes horribly wrong?" My favorite scenario to
consider is: "how much damage could one accidentally cause with our existing
policies and controls?" At [Scribd](https://tech.scribd.com) we have made
[Delta Lake](https://delta.io) a cornerstone of our data platform, and as such
I've spent a lot of time thinking about what could go wrong and how we would
defend against it.

To start I recommend reading this recent post from Databricks: [Attack of the
Delta
Clones](https://databricks.com/blog/2021/04/20/attack-of-the-delta-clones-against-disaster-recovery-availability-complexity.html),
which provides a good overview of the `CLONE` operation in Delta and some
patterns for "undoing" mistaken operations. Their blog post does a fantastic
job demonstrating the power of clones in Delta Lake, for example:

```sql
-- Creating a new cloned table from loan_details_delta
CREATE OR REPLACE TABLE loan_details_delta_clone
DEEP CLONE loan_details_delta;

-- Original view of data
SELECT addr_state, funded_amnt FROM loan_details_delta GROUP BY addr_state, funded_amnt

-- Clone view of data
SELECT addr_state, funded_amnt FROM loan_details_delta_clone GROUP BY addr_state, funded_amnt
```

For my disaster recovery needs, the clone-based approach is insufficient as I detailed in [this post](https://groups.google.com/g/delta-users/c/2WOymkv4KgI/m/zvqKkQwJDwAJ) on the delta-users mailing list:

> Our requirements are basically to prevent catastrophic loss of business critical data via:
>
> * Erroneous rewriting of data by an automated job
> * Inadvertent table drops through metastore automation
> * Overaggressive use of VACUUM command
> * Failed manual sync/cleanup operations by Data Engineering staff
>
> It's important to consider whether you're worried about the transaction log
> getting corrupted, files in storage (e.g. ADLS) disappearing, or both.

Generally speaking, I'm less concerned about malicious actors so much as
_incompetent_ ones. It is **far** more likely that a member of the team
accidentally deletes data than that somebody kicks in a few layers of
cloud-based security and deletes it for us.

My preference is to work at a layer _below_ Delta Lake to provide disaster
recovery mechanisms, in essence at the object store layer (S3). Relying
strictly on `CLONE` gets you copies of data, which can definitely be
beneficial, _but_ the downside is that whatever is running the query has
access to both the "source" and the "backup" data.

The concern is that if some mistake was able to delete my source data, there's
nothing actually standing in its way of deleting the backup data as well.

In my mailing list post, I posited a potential solution:

> For example, with a simple nightly rclone(.org) based snapshot of an S3 bucket, the
> "restore" might mean copying the transaction log and new parquet files _back_ to
> the originating S3 bucket and *losing* up to 24 hours of data, since the
> transaction logs would basically be rewound to the last backup point.
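
To make that concrete, here is a minimal sketch of what such a nightly
snapshot job could look like. It assumes two rclone remotes, `prod` and
`backup`, each configured with separate credentials; the bucket and table
names are illustrative, not our real layout, and the command is echoed
rather than executed so the shape of the job is visible:

```shell
# Hypothetical nightly snapshot of a Delta table's S3 prefix. The parquet
# files and the _delta_log directory are copied together, so each snapshot
# is a consistent, restorable table. All names here are made up.
TABLE="warehouse/loan_details_delta"
SNAPSHOT_DATE=$(date +%Y-%m-%d)

# The "prod" and "backup" rclone remotes would be configured with different
# IAM credentials, so a prod-side job cannot touch the backup bucket.
SYNC_CMD="rclone sync prod:prod-data/${TABLE} backup:delta-backups/${TABLE}/${SNAPSHOT_DATE}"

# Echoed as a dry-run style sketch; a real job would execute it instead.
echo "$SYNC_CMD"
```

Restoring would be the same sync in reverse, at the cost of losing whatever
was written after the most recent snapshot, as the quote above describes.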
Since that email we have deployed our Delta Lake backup solution,
which operates strictly at the S3 layer and allows us to impose hard walls (IAM)
between writers of the source and backup data.
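
As an illustration of the kind of "hard wall" I mean, a bucket policy along
these lines could deny object deletion on the backup bucket to everything
except a dedicated backup-writer role. The account ID, role name, and bucket
name below are all hypothetical, not our actual configuration:

```shell
# Hypothetical bucket policy for the backup bucket: deny deletes to every
# principal except an assumed backup-writer role, so a job holding prod
# credentials cannot wipe the backups. All identifiers here are made up.
cat > deny-backup-deletes.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDeleteExceptBackupWriter",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
      "Resource": "arn:aws:s3:::delta-backups/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/delta-backup-writer"
        }
      }
    }
  ]
}
EOF
# Applying it would be something like:
#   aws s3api put-bucket-policy --bucket delta-backups \
#     --policy file://deny-backup-deletes.json
```

With object versioning enabled on the backup bucket as well, even an
accidental overwrite leaves the prior version recoverable.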
One of my colleagues is writing that blog post up for
[tech.scribd.com](https://tech.scribd.com) and I hope to see it published later
this week, so make sure you follow us on Twitter
[@ScribdTech](https://twitter.com/scribdtech) or subscribe to the [RSS
feed](https://tech.scribd.com/feed.xml)!