diff --git a/_posts/2021-04-26-disaster-recovery-with-delta-lake.md b/_posts/2021-04-26-disaster-recovery-with-delta-lake.md
index 645d1f8..f26d6ff 100644
--- a/_posts/2021-04-26-disaster-recovery-with-delta-lake.md
+++ b/_posts/2021-04-26-disaster-recovery-with-delta-lake.md
@@ -13,7 +13,7 @@ consider is: "how much damage could one accidentally cause with our existing
 policies and controls?" At [Scribd](https://tech.scribd.com) we have made
 [Delta Lake](https://delta.io) a cornerstone of our data platform, and as such
 I've spent a lot of time thinking about what could go wrong and how we would
-defend against it. 
+defend against it.
 
 
 To start I recommend reading this recent post from Databricks: [Attack of the
@@ -41,12 +41,12 @@ For my disaster recovery needs, the clone-based approach is insufficient as I de
 
 
 > Our requirements are basically to prevent catastrophic loss of business critical data via:
-> 
+>
 > * Erroneous rewriting of data by an automated job
 > * Inadvertent table drops through metastore automation.
 > * Overaggressive use of VACUUM command
 > * Failed manual sync/cleanup operations by Data Engineering staff
-> 
+>
 > It's important to consider whether you're worried about the transaction log
 > getting corrupted, files in storage (e.g. ADLS) disappearing, or both.
 
@@ -85,4 +85,9 @@ this week so make sure you follow us on Twitter
 feed](https://tech.scribd.com/feed.xml)!
 
 
 
+---
+
+**Update**: my colleague Kuntal wrote [this blog post on backing up Delta Lake with AWS S3 Batch Operations](https://tech.scribd.com/blog/2021/backing-up-data-warehouse.html), which is what we're doing here at [Scribd](https://tech.scribd.com).
+
+
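
For readers unfamiliar with the "clone-based approach" the patched post deems insufficient, here is a minimal sketch of what it and a conservative `VACUUM` look like. This is illustrative only, not the approach the post recommends: it assumes a Databricks/Spark environment where Delta Lake's `DEEP CLONE` and `VACUUM ... RETAIN` are available, and the `prod.events` / `backups.events_snapshot` table names and S3 path are hypothetical placeholders.

```python
# Sketch: clone-based snapshot of a Delta table plus a guarded VACUUM.
# Assumes a Databricks runtime with Delta Lake; table names and the
# backup bucket below are made-up examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep the retention safety check enabled so VACUUM cannot delete files
# that are still inside the default retention window ("overaggressive
# use of VACUUM" is one of the failure modes listed in the post).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

# Take a point-in-time deep copy of a production table into a backup location.
spark.sql("""
  CREATE OR REPLACE TABLE backups.events_snapshot
  DEEP CLONE prod.events
  LOCATION 's3://example-backup-bucket/events_snapshot'
""")

# Clean up old files with an explicit, conservative retention period (7 days).
spark.sql("VACUUM prod.events RETAIN 168 HOURS")
```

A clone like this only covers some of the requirements quoted in the post (it is itself a table that can be dropped or vacuumed), which is part of why the update points to the S3 Batch Operations backup approach instead.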