Disaster Recovery for Data Platforms

Gareth Richards

Principal Consultant

25 June 2026

Why It’s Hard, and What Databricks Managed DR Changes

For every organisation operating a data platform in the cloud, the question of how to manage downtime is a key part of the architecture conversation. Regardless of the size of the platform or the organisation, the question has never been *whether* to plan for disaster recovery, but instead *how*, for which scenarios, and at what cost. Get it wrong and you either over-invest in infrastructure that never gets used, or face a multi-day crisis when a regional cloud outage hits.

This post covers the core DR concepts for data platforms, why the lakehouse architecture can make DR harder than it looks, and what Databricks’ Managed Disaster Recovery – now in Public Preview – changes for customers.

Part 1: Disaster Recovery Concepts

What Is DR, and What Isn’t It?

Disaster recovery is the set of policies and procedures that enable recovery of critical systems after a significant failure. It is distinct from High Availability (HA), which keeps systems running through redundancy and zone-level failover. Most cloud providers handle HA for you, provided you have the correct configuration. DR instead addresses the kinds of failure that HA cannot solve, whether that’s a regional failure or a targeted issue.

The Two Metrics That Matter

One of the most important first steps in the DR conversation is defining the requirements, and these usually come down to two key metrics:

RPO – Recovery Point Objective – the maximum acceptable data loss, measured in time. An RPO of one hour means that up to an hour of the most recent data may be lost.

RTO – Recovery Time Objective – the maximum acceptable downtime before service is restored.

Lower RPO and RTO mean less business impact but higher cost and complexity, so it is important to be realistic about the requirements. We often find that sensible targets lead to much more efficient DR solutions.

As with many architectural conversations, these ‘softer’ business requirements need to be understood before jumping into a technology and architecture conversation. There are many ways to handle disasters, and the right one for a given organisation will depend on the defined RPO and RTO.
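As a minimal sketch of how the two metrics are measured against a single incident, the following compares the data loss and downtime actually experienced with the agreed targets. The function name, the timestamps, and the 24-hour and 4-hour targets are all illustrative assumptions, not figures from any real platform.

```python
from datetime import datetime, timedelta


def assess_incident(last_good_copy, failure_at, restored_at, rpo, rto):
    """Compare the data loss and downtime of one incident with its targets."""
    data_loss = failure_at - last_good_copy  # changes made since the last recoverable copy
    downtime = restored_at - failure_at      # time the platform was unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: nightly backup at 02:00, failure at 09:30, service back at 13:00.
result = assess_incident(
    last_good_copy=datetime(2026, 6, 1, 2, 0),
    failure_at=datetime(2026, 6, 1, 9, 30),
    restored_at=datetime(2026, 6, 1, 13, 0),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=4),
)
print(result)
```

Running the same check with tighter targets is a quick way to see how much headroom a proposed RPO and RTO leave for a realistic incident.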

DR Strategies

Once you have an understanding of the key objectives that will meet the requirements of the business, you can start to consider what strategy you need to employ to handle disasters.

Disasters can take many forms. In a cloud context the most frequently discussed one is a regional outage, where all of your cloud resources in the region they are deployed to become unavailable. However, there may also be more targeted disasters that affect the quality and substance of the data in the platform. A full DR strategy will consider both the large-scale failures and the targeted ones, in order to establish a complete solution that enables business continuity.

So, the range of scenarios we are trying to handle might look like this:

Full regional outage – access to all infrastructure in a given region is lost, and access to the platform can only be restored by reverting to infrastructure in a different region

Storage intact, control plane down – data is untouched, but compute is unavailable and jobs fail

Workload-specific failure – issues with incoming data lead to pipeline failures or data corruption, preventing downstream processes and analytics from running

To handle these scenarios, we might deploy a number of different strategies to achieve the best RPO and RTO in any given situation, including:

Wait it out

This option is rarely discussed, but it can work for the majority of scenarios. Cloud infrastructure providers offer exceptionally high SLAs for service availability, with significant financial consequences for failing to meet them. Depending on the nature of your data platform and the resources within it, many outages will be resolved by the cloud provider before other DR solutions have stepped in, or before the RPO and RTO objectives have been missed. The wait-it-out approach therefore prompts a question worth answering honestly: how often will downtime genuinely exceed our tolerance?

Backup and Restore

Periodic snapshots are shipped to a secondary location, either with custom processes or a paid-for solution. This tends to be simple and relatively cheap day to day, but a failover process has to be built around it, and the RPO and RTO can be high if that process depends on key personnel who are only available inside business hours. Where data is backed up but infrastructure access is lost, new infrastructure will need to be stood up. For non-critical workloads, however, this can be very effective.

Secondary Infrastructure

This option is probably the most ‘traditional’ DR solution and feels very familiar for organisations with historic or current on-premises footprints. The creation of ‘shadow’ infrastructure that waits patiently for a DR scenario to occur is reassuring, but costly. That cost can be reduced depending on the tolerable RTO:

Pilot Light – Create minimal secondary infrastructure which is kept running, then scale it up on failover. Moderate cost, faster RTO than a pure backup/restore solution.

Warm Standby – Maintain a reduced-scale replica of the production environment running continuously, ready to be promoted. Faster RTO, higher ongoing cost.

Active-Active – A full production environment runs in multiple regions simultaneously. Near-zero RPO and RTO, but the highest cost and complexity.

Going back to the lack of ‘one size fits all’, a full DR plan is unlikely to settle on a single solution. Different failures will have different scopes and impacts. Deploying a secondary platform instance to deal with a single corrupted data feed is like hammering a nail using a planet, and your CFO is unlikely to thank you for duplicating cloud costs to meet a 24-hour RTO.

The best DR plans will consider each of the potential sources of failure, the needs of the business, and set out the expected solutions and steps to take when any given failure occurs.
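One way to make that trade-off concrete is to record, for each strategy, the worst-case RPO and RTO you believe it can deliver, and then pick the cheapest one that meets the targets for a given scenario. The sketch below does this in Python. Every number in it is a placeholder chosen for illustration (including treating Active-Active as exactly zero), and "wait it out" is left out because its recovery time depends entirely on the cloud provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo_hours: float    # assumed worst-case data loss
    rto_hours: float    # assumed worst-case downtime
    relative_cost: int  # 1 = cheapest; higher means more ongoing spend


# Placeholder estimates only; replace them with figures for your own platform.
STRATEGIES = [
    Strategy("Backup and Restore", rpo_hours=24, rto_hours=48, relative_cost=1),
    Strategy("Pilot Light", rpo_hours=4, rto_hours=8, relative_cost=2),
    Strategy("Warm Standby", rpo_hours=1, rto_hours=1, relative_cost=3),
    Strategy("Active-Active", rpo_hours=0, rto_hours=0, relative_cost=4),
]


def cheapest_strategy(rpo_hours, rto_hours):
    """Return the lowest-cost strategy that meets both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rpo_hours <= rpo_hours and s.rto_hours <= rto_hours
    ]
    return min(candidates, key=lambda s: s.relative_cost) if candidates else None


# With these placeholder figures, a 24-hour RTO does not call for a duplicated platform.
print(cheapest_strategy(rpo_hours=24, rto_hours=24).name)
```

Running the selection once per failure scenario, rather than once for the whole platform, is what keeps the plan from defaulting to the most expensive option everywhere.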

Why Data Platforms Are Different

At this point it is worth looking at why modern cloud data platforms don’t always fit well with traditional DR approaches. Existing DR tools were primarily designed for databases and application servers, each of which operated as a single complete unit. Cloud data platforms often operate across four distinct layers that must all be replicated and kept in sync:

Data – stored in open formats (Delta Lake, Parquet, Iceberg) in cloud object storage

Metadata – managed by Unity Catalog: schemas, lineage, governance properties

Workspace Assets – notebooks, jobs, SQL warehouses, pipelines, dashboards, ACLs

Governance configuration – permissions, column masks, row filters, secrets

Replicating storage alone does not produce a working secondary environment, and managed table data restored into a new environment will not necessarily be understood by the new platform.

Part 2: DR on Databricks — The Manual Approach and Its Limits

What Teams Had to Build

Until recently, Databricks DR was a DIY exercise. To cover the scenarios above, teams had to script Unity Catalog metadata replication across regions, sync workspace assets via Databricks REST APIs or Terraform, manage storage replication at the cloud provider layer, and maintain failover runbooks coordinated across multiple teams. Each layer required its own tooling, monitoring, and operational process. Even Deep Clone solutions, which could easily replicate data into another workspace, required secondary infrastructure to be maintained, with additional monitoring and further potential for failure.

Where Manual DR Breaks Down

The DIY approach is most exposed during partial failures. When replication is scripted rather than managed, a partial outage immediately raises hard questions: which components have successfully replicated? Is the catalog metadata current? Are jobs in sync? Can a selective failover even be executed given how the scripts are structured?

Without a unified replication pipeline, answering these questions takes time – often the first hour of an incident is spent assessing the secondary environment rather than recovering. The DIY approach tends to produce reasonable DR capability for the full-outage scenario, which is rare and well-rehearsed, while being far less reliable for partial failures, which are routine.

The accumulated pain points are consistent across teams: high operational overhead to build and maintain the solution; consistency risk as components fall silently out of sync; unknown replication health until it matters; untested runbooks that fail in scenarios that were never anticipated; and failover complexity that spans teams with different ownership and on-call responsibilities.

Part 3: Databricks Managed Disaster Recovery

Background

To address the challenges of building a full DR solution on Databricks, Databricks has developed Managed DR. Following a private preview, the feature has now reached Public Preview (as of June 2026).

The premise is straightforward: Databricks takes ownership of the replication pipeline so customers don’t have to build or maintain one.

The full detail is in the official Databricks documentation, but some key points are summarised below.

What It Does

Managed DR replicates your Databricks deployment — data, metadata, and workspace assets — to a secondary region on a continuous basis. The scope covers:

Unity Catalog: managed table data and metadata, external table metadata, views, functions, tags, permissions

Workspace assets: notebooks, jobs, SQL warehouses, cluster configs, AI/BI dashboards, files, folders, ACLs

Two capabilities stand out, and together they add polish to the solution:

Stable URL is a single connection string for JDBC, ODBC, REST API, and web UI access that survives failover – downstream systems require no reconfiguration when the active region changes.

Customer-controlled failover means you decide when to switch, whether for a scheduled DR test or a live incident. Fail-back uses the same process in reverse. You retain control over whether to initiate a failover or to wait for the cloud provider to resolve the underlying issue.

Setup Overview

Enable the Mission Critical add-on on both primary and secondary workspaces
Provision a secondary workspace and Unity Catalog metastore in the target region, matching the primary’s network, Private Link, and key configuration
Create a Failover Group in the account console (Resilience section), selecting catalogs and assets to replicate
Optionally configure a Stable URL
Monitor replication health and current RPO via `system.replication.states`

You may also need to sync users across regions, depending on how you manage identities within your own implementation.

The process first completes an initial sync of data and assets to the secondary region. This is a one-time event, but it can take up to two weeks for large workspaces with a significant amount of data. After that, replication runs continuously, so data in the secondary region workspace remains up to date (within the limits of the time it takes to sync data when updates occur).
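Because the secondary is only ever as current as the last completed sync, it helps to track observed replication lag against your RPO target over time. The sketch below summarises a series of lag readings in plain Python. It assumes you have already collected the readings, in minutes, from your own monitoring (for example by periodically sampling `system.replication.states`); the schema of that table is not assumed here, and the sample values and 15-minute target are invented for illustration.

```python
from statistics import median


def lag_summary(lag_minutes, rpo_target_minutes):
    """Summarise observed replication lag against an RPO target."""
    if not lag_minutes:
        raise ValueError("at least one lag sample is required")
    return {
        "worst": max(lag_minutes),
        "median": median(lag_minutes),
        # Number of samples where a failure at that moment would have exceeded the target
        "breaches": sum(1 for lag in lag_minutes if lag > rpo_target_minutes),
    }


# Hypothetical lag readings in minutes, one per sampling interval.
samples = [4, 6, 5, 12, 7, 35, 8, 5, 6, 9]
summary = lag_summary(samples, rpo_target_minutes=15)
print(summary)
```

The worst observed value, rather than the median, is the figure to compare with any formal RPO commitment, since a failure can land at any point in the cycle.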

When you want to initiate a failover:

For planned failovers (DR tests, maintenance), drain active workloads, wait for replication to reach a consistent point, and trigger from the account console.

For unplanned failovers, trigger immediately – data written after the last completed replication cycle may be lost. In both cases, failover completes in minutes.

Cost

Managed DR is delivered through the Mission Critical workspace add-on, which applies a compute rate uplift to all DBU consumption on both the primary and secondary workspaces. Databricks does not publish this rate. It is negotiated per account, so you should contact your Databricks account team for access to the solution and a definitive cost.

Additional costs to factor in: cross-region data egress, Private Link setup, and customer-managed key configuration on the secondary. If you already use the ESC add-on, Mission Critical consolidates the billing – you pay one rate, not both.

One important nuance: replication frequency is not configurable. Unlike a manual approach, where a fixed sync interval gives a predictable, budgetable cost, the cadence of continuous replication is managed by Databricks and may vary with workload churn. Monitor actual costs after enabling the add-on.

The total cost of the manual alternative is also not zero. Engineering time, replication tooling, monitoring, and ongoing maintenance carry real cost. For complex deployments, the managed option may prove cost-competitive once the DIY total cost is honestly accounted for.
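A rough like-for-like comparison can be sketched as below. This is a simplified model under stated assumptions: the uplift is treated as a flat fraction of DBU spend on both workspaces, and every figure (the 25% uplift, the spend, the engineering days, and the day rate) is a placeholder, since the real uplift is negotiated per account and is not published.

```python
def managed_dr_annual_cost(primary_dbu_spend, secondary_dbu_spend,
                           uplift_rate, egress_and_network):
    """Incremental annual cost of Managed DR under a flat-uplift assumption."""
    # The Mission Critical uplift applies to DBU consumption on both workspaces.
    uplift = (primary_dbu_spend + secondary_dbu_spend) * uplift_rate
    return uplift + egress_and_network


def diy_dr_annual_cost(engineering_days, day_rate,
                       tooling_and_monitoring, secondary_infrastructure):
    """Annual cost of building and running a scripted DR solution."""
    return (engineering_days * day_rate
            + tooling_and_monitoring
            + secondary_infrastructure)


# Placeholder inputs only; substitute your negotiated rate and real spend.
managed = managed_dr_annual_cost(
    primary_dbu_spend=400_000,
    secondary_dbu_spend=20_000,
    uplift_rate=0.25,           # hypothetical, not a published Databricks rate
    egress_and_network=15_000,
)
diy = diy_dr_annual_cost(
    engineering_days=120,
    day_rate=800,
    tooling_and_monitoring=20_000,
    secondary_infrastructure=30_000,
)
print(f"Managed DR: {managed:,.0f}  DIY: {diy:,.0f}")
```

The result is only as good as its inputs, so the value of the exercise lies in forcing the DIY line items into the open rather than in the totals these placeholders produce.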

Current Limitations

Managed DR is in gated Public Preview and has meaningful gaps:

Not yet replicated: materialised views, streaming tables, Lakeflow pipelines, ML models, model serving endpoints, vector search indexes, Unity Catalog secrets – supplementary approaches are needed for these
Secondary catalogs are read-only during replication; no compute runs on the secondary until failover
Initial bootstrap can take up to two weeks for large workspaces (one-time only)
Maximum 300 catalogs and 100 failover groups per account
Replication frequency cannot be configured. In a manual setup, a fixed sync interval gives a known worst-case RPO you can state in an SLA. With Managed DR, the effective RPO is the lag in Databricks’ continuous cycle at the moment of failure – observable via `system.replication.states`, but not a contractual value you set. Teams with formal RPO commitments should factor this into their compliance assessment.
Access is gated – contact your Databricks account team to enable the Public Preview
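Before relying on Managed DR, it is worth checking your own asset inventory against the list of types that are not yet replicated. The sketch below does this in Python. The set of asset types is taken from the limitations above, while the inventory format (a mapping of asset name to asset type) and the example entries are hypothetical and would come from your own tooling.

```python
# Asset types listed above as not yet replicated by Managed DR.
NOT_YET_REPLICATED = {
    "materialised view",
    "streaming table",
    "lakeflow pipeline",
    "ml model",
    "model serving endpoint",
    "vector search index",
    "unity catalog secret",
}


def coverage_gaps(inventory):
    """Return the inventory entries that need a supplementary DR approach."""
    return {
        name: kind
        for name, kind in inventory.items()
        if kind.lower() in NOT_YET_REPLICATED
    }


# Hypothetical inventory: asset name -> asset type.
inventory = {
    "sales_daily": "managed table",
    "clickstream": "streaming table",
    "churn_scorer": "model serving endpoint",
    "finance_report": "AI/BI dashboard",
}
gaps = coverage_gaps(inventory)
print(gaps)
```

Anything this returns needs its own recovery plan alongside the failover group, and the set should be revisited as the preview evolves.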

Conclusion

Databricks Managed DR removes the hardest part of lakehouse DR: building and maintaining a unified replication pipeline across data, metadata, workspace assets, and governance configuration. That has historically been a significant engineering burden, and transferring it to Databricks as a managed service is a meaningful change.

The stable URL is a practical win that is easy to underestimate – eliminating the need to reconfigure downstream connections after a failover matters most at 2am during a live incident.

That said, the limitations are real. Several asset types are not yet covered, replication frequency cannot be controlled, and costs require careful modelling. For workloads that depend on streaming tables or ML endpoints, or that carry strict contractual RPO requirements, Managed DR may need to complement rather than replace a broader resilience strategy.

For organisations that have been deferring a proper DR investment because the manual approach was too complex to build reliably, the calculus has changed. Now is a good time to evaluate it.

Fundamentally, DR still requires careful planning and consideration to identify the right solution for your business. Mapping out the scenarios and the business requirements will lead you to an understanding of which solutions will work, and how you can achieve a sensible, robust DR solution that enables business continuity.

To get started: contact your Databricks account team and review the official documentation. Once enabled, make a planned failover test the first priority – familiarity with the process matters as much as having the infrastructure in place.