---
layout: post
title: Recovering from disasters with Delta Lake
tags:
- deltalake
- scribd
---

Entering into the data platform space with a lot of experience in more
traditional production operations is a _lot_ of fun, especially when you ask
questions like "what if `X` goes horribly wrong?" My favorite scenario to
consider is: "how much damage could one accidentally cause with our existing
policies and controls?" At [Scribd](https://tech.scribd.com) we have made
[Delta Lake](https://delta.io) a cornerstone of our data platform, and as such
I've spent a lot of time thinking about what could go wrong and how we would
defend against it.

To start I recommend reading this recent post from Databricks: [Attack of the
Delta
Clones](https://databricks.com/blog/2021/04/20/attack-of-the-delta-clones-against-disaster-recovery-availability-complexity.html)
which provides a good overview of the `CLONE` operation in Delta and some
patterns for "undoing" mistaken operations. Their blog post does a fantastic
job demonstrating the power of clones in Delta Lake, for example:

```sql
-- Creating a new cloned table from loan_details_delta
CREATE OR REPLACE TABLE loan_details_delta_clone
DEEP CLONE loan_details_delta;

-- Original view of data
SELECT addr_state, funded_amnt FROM loan_details_delta GROUP BY addr_state, funded_amnt

-- Clone view of data
SELECT addr_state, funded_amnt FROM loan_details_delta_clone GROUP BY addr_state, funded_amnt
```

For my disaster recovery needs, the clone-based approach is insufficient as I detailed in [this post](https://groups.google.com/g/delta-users/c/2WOymkv4KgI/m/zvqKkQwJDwAJ) on the delta-users mailing list:

> Our requirements are basically to prevent catastrophic loss of business critical data via:
>
> * Erroneous rewriting of data by an automated job
> * Inadvertent table drops through metastore automation.
> * Overaggressive use of VACUUM command
> * Failed manual sync/cleanup operations by Data Engineering staff
>
> It's important to consider whether you're worried about the transaction log
> getting corrupted, files in storage (e.g. ADLS) disappearing, or both.

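To make the first and third of those failure modes concrete, here is a minimal
SQL sketch, reusing the `loan_details_delta` table from the clone example above
and a hypothetical `loan_details_staging` table:

```sql
-- An automated job overwrites the table with the contents of a bad upstream
-- table (hypothetical). Time travel can still undo this, as long as the files
-- behind the older table versions remain in storage.
INSERT OVERWRITE loan_details_delta
SELECT * FROM loan_details_staging;

-- An overaggressive VACUUM permanently deletes the files behind those older
-- versions (Delta only allows RETAIN 0 HOURS when the
-- spark.databricks.delta.retentionDurationCheck.enabled safety check is
-- disabled), after which time travel can no longer undo the mistake.
VACUUM loan_details_delta RETAIN 0 HOURS;
```
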
Generally speaking, I'm less concerned about malicious actors than
_incompetent_ ones. It is **far** more likely that a member of the team
accidentally deletes data than that somebody kicks in a few layers of
cloud-based security and deletes it for us.

My preference is to work at a layer _below_ Delta Lake to provide disaster
recovery mechanisms, in essence at the object store layer (S3). Relying
strictly on `CLONE` gets you copies of data, which can definitely be
beneficial, _but_ the downside is that whatever is running the query has
access to both the "source" and the "backup" data.

The concern is that if some mistake was able to delete my source data, there's
nothing actually standing in the way of it deleting the backup data as well.

In my mailing list post, I posited a potential solution:

> For example, with a simple nightly rclone(.org) based snapshot of an S3 bucket, the
> "restore" might mean copying the transaction log and new parquet files _back_ to
> the originating S3 bucket and *losing* up to 24 hours of data, since the
> transaction logs would basically be rewound to the last backup point.

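As a rough illustration of that idea, and not the exact tooling we ended up
deploying, the nightly snapshot and its restore could look something like this
(the `prod-s3:`/`backup-s3:` remotes, bucket names, and table path are all
hypothetical):

```bash
# Nightly snapshot: mirror the table's prefix, including _delta_log, into a
# backup bucket that production credentials cannot write to or delete from.
rclone sync prod-s3:data-lake/warehouse/loan_details_delta \
            backup-s3:data-lake-backup/warehouse/loan_details_delta

# Restore: copy the parquet files and transaction log back to the originating
# bucket, effectively rewinding the table to the last backup point.
rclone sync backup-s3:data-lake-backup/warehouse/loan_details_delta \
            prod-s3:data-lake/warehouse/loan_details_delta
```
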
Since that email we have deployed our Delta Lake backup solution, which
operates strictly at the S3 layer and allows us to impose hard walls (IAM)
between writers of the source and backup data.

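The details will be in the upcoming tech.scribd.com post, but as a sketch of
one way such a "hard wall" could be built (not necessarily how we built ours),
a bucket policy on the backup bucket might deny destructive actions to every
principal except a dedicated backup-writer role. All ARNs, account IDs, and
bucket names below are hypothetical:

```bash
# Deny destructive S3 actions on the (hypothetical) backup bucket to everything
# except the dedicated backup-writer role, so a mistake that wipes the source
# bucket cannot also take out the backup.
aws s3api put-bucket-policy --bucket data-lake-backup --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyWritesExceptBackupRole",
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:PutObject", "s3:DeleteObject", "s3:DeleteObjectVersion"],
    "Resource": "arn:aws:s3:::data-lake-backup/*",
    "Condition": {
      "ArnNotEquals": {
        "aws:PrincipalArn": "arn:aws:iam::123456789012:role/delta-backup-writer"
      }
    }
  }]
}'
```
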
One of my colleagues is writing that blog post up for
[tech.scribd.com](https://tech.scribd.com) and I hope to see it published later
this week, so make sure you follow us on Twitter
[@ScribdTech](https://twitter.com/scribdtech) or subscribe to the [RSS
feed](https://tech.scribd.com/feed.xml)!