---
layout: post
title: Recovering from disasters with Delta Lake
tags:
- deltalake
- scribd
---
Entering into the data platform space with a lot of experience in more
traditional production operations is a _lot_ of fun, especially when you ask
questions like "what if `X` goes horribly wrong?" My favorite scenario to
consider is: "how much damage could one accidentally cause with our existing
policies and controls?" At [Scribd](https://tech.scribd.com) we have made
[Delta Lake](https://delta.io) a cornerstone of our data platform, and as such
I've spent a lot of time thinking about what could go wrong and how we would
defend against it.

To start, I recommend reading this recent post from Databricks: [Attack of the
Delta Clones](https://databricks.com/blog/2021/04/20/attack-of-the-delta-clones-against-disaster-recovery-availability-complexity.html),
which provides a good overview of the `CLONE` operation in Delta and some
patterns for "undoing" mistaken operations. Their blog post does a fantastic
job demonstrating the power of clones in Delta Lake, for example:

```sql
-- Creating a new cloned table from loan_details_delta
CREATE OR REPLACE TABLE loan_details_delta_clone
  DEEP CLONE loan_details_delta;

-- Original view of data
SELECT addr_state, funded_amnt FROM loan_details_delta GROUP BY addr_state, funded_amnt;

-- Clone view of data
SELECT addr_state, funded_amnt FROM loan_details_delta_clone GROUP BY addr_state, funded_amnt;
```
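
The same mechanism can run in reverse: if a bad write later lands in
`loan_details_delta`, the table can be overwritten from the clone. A minimal
sketch, assuming the clone above is still intact and was taken before the
mistake:

```sql
-- Recover the damaged source table by overwriting it
-- from the (still healthy) deep clone taken earlier.
CREATE OR REPLACE TABLE loan_details_delta
  DEEP CLONE loan_details_delta_clone;
```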

For my disaster recovery needs, the clone-based approach is insufficient, as I
detailed in [this post](https://groups.google.com/g/delta-users/c/2WOymkv4KgI/m/zvqKkQwJDwAJ)
on the delta-users mailing list:
> Our requirements are basically to prevent catastrophic loss of business critical data via:
>
> * Erroneous rewriting of data by an automated job
> * Inadvertent table drops through metastore automation.
> * Overaggressive use of VACUUM command
> * Failed manual sync/cleanup operations by Data Engineering staff
>
> It's important to consider whether you're worried about the transaction log
> getting corrupted, files in storage (e.g. ADLS) disappearing, or both.
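
That "overaggressive `VACUUM`" bullet is worth a concrete illustration.
`VACUUM` permanently deletes data files no longer referenced by the current
table version, so an aggressive retention window silently destroys the history
that time travel and `CLONE`-based recovery both depend on. A hedged sketch,
reusing the example table name from above:

```sql
-- Delta normally refuses retention windows under 7 days; disabling
-- that safety check is exactly the kind of footgun described above.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Permanently deletes every data file not referenced by the current
-- table version: all older versions become unrecoverable.
VACUUM loan_details_delta RETAIN 0 HOURS;
```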

Generally speaking, I'm less concerned about malicious actors than about
_incompetent_ ones. It is **far** more likely that a member of the team
accidentally deletes data than that somebody kicks through a few layers of
cloud-based security and deletes it for us.

My preference is to provide disaster recovery mechanisms at a layer _below_
Delta Lake: in essence, at the object store (S3) layer. Relying strictly on
`CLONE` gets you copies of data, which can definitely be beneficial, _but_ the
downside is that whatever is running the query has access to both the "source"
and the "backup" data.

The concern is that if some mistake was able to delete my source data, nothing
actually stands in the way of it deleting the backup data as well.

In my mailing list post, I posited a potential solution:

> For example, with a simple nightly rclone(.org) based snapshot of an S3 bucket, the
> "restore" might mean copying the transaction log and new parquet files _back_ to
> the originating S3 bucket and *losing* up to 24 hours of data, since the
> transaction logs would basically be rewound to the last backup point.
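
After a restore like that, the table should simply look the way it did at the
last backup point. A quick sanity check, again using the hypothetical table
name from above:

```sql
-- The restored table's newest commit should be at, or before, the
-- last backup point; any commits after that window are gone.
DESCRIBE HISTORY loan_details_delta;
```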

Since that email we have deployed our Delta Lake backup solution, which
operates strictly at the S3 layer and allows us to impose hard walls (IAM)
between the writers of the source data and the backup data.

One of my colleagues is writing up that blog post for
[tech.scribd.com](https://tech.scribd.com), and I hope to see it published
later this week, so make sure you follow us on Twitter
[@ScribdTech](https://twitter.com/scribdtech) or subscribe to the [RSS
feed](https://tech.scribd.com/feed.xml)!