---
layout: post
title: Recovering from disasters with Delta Lake
tags:
- deltalake
- scribd
---

Entering into the data platform space with a lot of experience in more traditional production operations is a _lot_ of fun, especially when you ask questions like "what if `X` goes horribly wrong?" My favorite scenario to consider is: "how much damage could one accidentally cause with our existing policies and controls?" At [Scribd](https://tech.scribd.com) we have made [Delta Lake](https://delta.io) a cornerstone of our data platform, and as such I've spent a lot of time thinking about what could go wrong and how we would defend against it.

To start I recommend reading this recent post from Databricks: [Attack of the Delta Clones](https://databricks.com/blog/2021/04/20/attack-of-the-delta-clones-against-disaster-recovery-availability-complexity.html), which provides a good overview of the `CLONE` operation in Delta and some patterns for "undoing" mistaken operations. Their blog post does a fantastic job demonstrating the power of clones in Delta Lake, for example:

```sql
-- Creating a new cloned table from loan_details_delta
CREATE OR REPLACE TABLE loan_details_delta_clone
  DEEP CLONE loan_details_delta;

-- Original view of data
SELECT addr_state, funded_amnt
  FROM loan_details_delta
  GROUP BY addr_state, funded_amnt

-- Clone view of data
SELECT addr_state, funded_amnt
  FROM loan_details_delta_clone
  GROUP BY addr_state, funded_amnt
```

For my disaster recovery needs, the clone-based approach is insufficient, as I detailed in [this post](https://groups.google.com/g/delta-users/c/2WOymkv4KgI/m/zvqKkQwJDwAJ) on the delta-users mailing list:

> Our requirements are basically to prevent catastrophic loss of business critical data via:
>
> * Erroneous rewriting of data by an automated job
> * Inadvertent table drops through metastore automation.
> * Overaggressive use of VACUUM command
> * Failed manual sync/cleanup operations by Data Engineering staff
>
> It's important to consider whether you're worried about the transaction log
> getting corrupted, files in storage (e.g. ADLS) disappearing, or both.

Generally speaking, I'm less concerned about malicious actors so much as _incompetent_ ones. It is **far** more likely that a member of the team accidentally deletes data than that somebody kicks in a few layers of cloud-based security and deletes it for us.

My preference is to work at a layer _below_ Delta Lake to provide disaster recovery mechanisms, in essence at the object store layer (S3). Relying strictly on `CLONE` gets you copies of data, which can definitely be beneficial, _but_ the downside is that whatever is running the query has access to both the "source" and the "backup" data. The concern is that if some mistake was able to delete my source data, there is nothing actually standing in the way of it deleting the backup data as well.

In my mailing list post, I posited a potential solution:

> For example, with a simple nightly rclone(.org) based snapshot of an S3 bucket, the
> "restore" might mean copying the transaction log and new parquet files _back_ to
> the originating S3 bucket and *losing* up to 24 hours of data, since the
> transaction logs would basically be rewound to the last backup point.

Since that email we have deployed our Delta Lake backup solution, which operates strictly at an S3 layer and allows us to impose hard walls (IAM) between writers of the source and backup data.
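To make those failure modes concrete, here is a minimal sketch, reusing the `loan_details_delta` table from the clone example above with hypothetical version numbers, of why recovery mechanisms that live _inside_ the table can only go so far: Delta's `RESTORE` can undo a bad write, but only while the transaction log and the underlying data files still exist, which is precisely what an overaggressive `VACUUM` or a bucket-level deletion takes away.

```sql
-- Hypothetical example: rewind the table after an erroneous automated rewrite.
-- RESTORE only works while the transaction log and the older parquet files are
-- still present in the bucket.
RESTORE TABLE loan_details_delta TO VERSION AS OF 42;

-- The table itself offers no protection against those files vanishing. An
-- overaggressive VACUUM (this one requires disabling the retention duration
-- safety check) deletes the parquet files that older versions reference:
VACUUM loan_details_delta RETAIN 0 HOURS;

-- After that, time travel and RESTORE to those versions fail, because the data
-- files they point at no longer exist in S3.
SELECT * FROM loan_details_delta VERSION AS OF 41;
```

An S3-level snapshot written by a separately-credentialed process sits outside that blast radius, which is exactly what the IAM separation described above is meant to guarantee.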
One of my colleagues is writing that work up as a blog post for [tech.scribd.com](https://tech.scribd.com), and I hope to see it published later this week, so make sure you follow us on Twitter [@ScribdTech](https://twitter.com/scribdtech) or subscribe to the [RSS feed](https://tech.scribd.com/feed.xml)!