From Google Cloud, Protect Data from Disasters Using New Asynchronous Replication

Blog written by David Seidman, product manager, and Dengkui Xi, engineering manager, Google Cloud

In today’s business landscape, data availability and integrity are paramount. Disasters, whether natural or man-made, can disrupt operations and pose a significant risk to critical information.

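To address this, we are introducing Persistent Disk Asynchronous Replication (PD Async Replication), which enables disaster recovery (DR) for Compute Engine workloads by replicating data between Google Cloud regions. It provides a recovery point objective (RPO) of under one minute and a low recovery time objective (RTO).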

Simplicity is central to the design of this solution. Replication is managed with a few API calls – there are no required VM agents, no dedicated replication VMs, no constraints on supported guest operating systems, and no performance overhead on the workload. PD Async Replication works at the block infrastructure level, delivering a fast, common infrastructure foundation to protect against disasters. It is designed to be simple to onboard, operate, and monitor, and can be used in combination with other forms of data protection, including disk clones, regional (synchronous replication) disks, and snapshots. Replication is performed directly from a primary disk in the region where you run your workload, to a recovery disk you create in a secondary region.

“Persistent Disk Asynchronous Replication delivers the solid foundation we need for infrastructure-based Disaster Recovery, and helps us meet our regulatory requirements for low RPO/RTO data protection. PD Async Replication supports our full lifecycle of DR testing, failover and failback,” said Nicola Carotti, head, cloud center of excellence, Intesa Sanpaolo.

Setting up replication
You can enable PD Async Replication on your existing Persistent Disks with just two calls in the API, gcloud, or Google Cloud console. First, create a new blank disk in the secondary region with a reference to the primary disk you want to protect. Then, start replication from the primary disk with a reference to the secondary disk. From that point on, data is automatically replicated between the disks, typically with an RPO of less than one minute, depending on the change rate of the disk. This setup workflow helps to ensure that an explicit action is taken in both regions before any data is transferred. You don’t need to reconfigure your network to use PD Async Replication.
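
As a rough illustration of the two-step setup described above, the following Python sketch builds the two gcloud invocations as argument lists and prints them. The disk names, zones, and size are hypothetical placeholders, and the flag names (--primary-disk, --primary-disk-zone, --secondary-disk, --secondary-disk-zone) reflect our reading of the gcloud reference, so check them against the current documentation before running anything.

```python
import shlex


def build_setup_commands(primary, primary_zone, secondary, secondary_zone, size_gb):
    """Return the two gcloud commands for the setup workflow, as argument lists.

    Flag names are assumptions based on the gcloud reference; verify before use.
    """
    # Step 1: create a blank secondary disk that references the primary disk.
    create_secondary = [
        "gcloud", "compute", "disks", "create", secondary,
        f"--zone={secondary_zone}",
        f"--size={size_gb}GB",
        f"--primary-disk={primary}",
        f"--primary-disk-zone={primary_zone}",
    ]
    # Step 2: start replication from the primary disk to the secondary disk.
    start_replication = [
        "gcloud", "compute", "disks", "start-async-replication", primary,
        f"--zone={primary_zone}",
        f"--secondary-disk={secondary}",
        f"--secondary-disk-zone={secondary_zone}",
    ]
    return [create_secondary, start_replication]


# Hypothetical names and zones, for illustration only.
setup_cmds = build_setup_commands(
    primary="app-data",
    primary_zone="us-central1-a",
    secondary="app-data-dr",
    secondary_zone="us-east1-b",
    size_gb=200,
)
for cmd in setup_cmds:
    print(shlex.join(cmd))
```

The sketch only prints the commands; each list could be passed to subprocess.run once the flags have been confirmed for your gcloud version.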

Once PD Async Replication is running, you can observe the time since last replication and the network bytes sent in Cloud Monitoring. It is up to the operations teams responsible for a workload to decide when a disaster has occurred in the primary region and when to initiate a failover. To begin the failover, stop replication between the disks and attach the secondary disk to a VM in the secondary region; you can accomplish this within minutes. To return the workload to the primary region after a failover, create a new replication pair back to the primary region so that the workload can fail back. By replicating data between data centers in different regions, you create resilient data replicas that safeguard against natural disasters and other disruptions confined to a single region.
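
The failover steps above can be sketched in the same style. This Python example builds the stop-replication and attach-disk commands in order. The disk, zone, and VM names are hypothetical, and the sketch assumes the stop command can be issued against the secondary disk (useful if the primary region is unreachable); confirm both the command and that behavior in the gcloud reference.

```python
import shlex


def build_failover_commands(secondary, secondary_zone, recovery_vm):
    """Return the gcloud commands for a failover, in the order they would run.

    Assumes stop-async-replication accepts the secondary disk as its target.
    """
    # Step 1: stop replication so the secondary disk can be used by a workload.
    stop_replication = [
        "gcloud", "compute", "disks", "stop-async-replication", secondary,
        f"--zone={secondary_zone}",
    ]
    # Step 2: attach the secondary disk to a VM in the secondary region.
    attach_disk = [
        "gcloud", "compute", "instances", "attach-disk", recovery_vm,
        f"--disk={secondary}",
        f"--zone={secondary_zone}",
    ]
    return [stop_replication, attach_disk]


# Hypothetical names, for illustration only.
failover_cmds = build_failover_commands(
    secondary="app-data-dr",
    secondary_zone="us-east1-b",
    recovery_vm="app-vm-dr",
)
for cmd in failover_cmds:
    print(shlex.join(cmd))
```

Failing back would reuse the setup workflow with the roles reversed: a new disk in the original region becomes the secondary of the disk now serving the workload.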

Use consistency groups for complex stateful workloads
In cases where workloads have distributed, dependent data across disks and VM instances, consistency groups allow for coordinated management of dependent data. With consistency groups, PD Async Replication enables simultaneous and atomic data replication by automatically synchronizing the replication period across all disks in the group. This helps to ensure that data is consistent between primary and secondary disks, for successful workload recovery in the event of a disaster.
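
To see why a shared replication point matters, consider a toy model in Python. It is not how PD Async Replication is implemented; it is a hypothetical illustration in which a database writes each transaction to a log disk before its data disk, and we compare replicas taken at independent cut times with replicas taken at a single common cut time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    t: int      # logical time at which the write landed on the primary
    disk: str   # "log" or "data"
    txn: int    # transaction the write belongs to


def replica_state(writes, cut_by_disk):
    """Return, per disk, the transactions replicated up to that disk's cut time."""
    state = {disk: set() for disk in cut_by_disk}
    for w in writes:
        if w.t <= cut_by_disk[w.disk]:
            state[w.disk].add(w.txn)
    return state


def is_recoverable(state):
    """Write-ahead rule: every transaction on the data disk must be in the log."""
    return state["data"] <= state["log"]


# Each transaction is written to the log first, then to the data disk.
writes = [
    Write(1, "log", 1), Write(2, "data", 1),
    Write(3, "log", 2), Write(4, "data", 2),
]

# Disks replicated independently: the data disk is ahead of the log disk.
independent = replica_state(writes, {"log": 2, "data": 4})
# Disks replicated as a group: both are cut at the same point in time.
grouped = replica_state(writes, {"log": 3, "data": 3})

print("independent cuts recoverable:", is_recoverable(independent))
print("common cut recoverable:", is_recoverable(grouped))
```

With independent cuts, the data disk replica holds transaction 2 while the log replica does not, so recovery cannot rely on the log. With a common cut, the replicas never violate the write ordering of the primary.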

Testing DR
To help ensure that recovery procedures will work in a real disaster, we recommend running tests periodically in the secondary region. You can do this without disrupting or disconnecting PD Async Replication by bulk-cloning the secondary disks with a consistency group applied, even while they are receiving new data.
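
A DR test along these lines might start with a single bulk-clone command, sketched below in Python. The project, region, consistency group, and zone names are hypothetical, and the command shape (gcloud compute disks bulk create with a --source-consistency-group-policy flag) is our assumption from the gcloud reference, so treat it as a starting point to verify rather than a confirmed recipe.

```python
import shlex


def build_dr_test_clone_command(project, region, consistency_group, clone_zone):
    """Return a gcloud command that bulk-clones the disks in a consistency group.

    The command and flag names are assumptions; verify against the gcloud docs.
    """
    # Full resource path of the consistency group policy (assumed format).
    policy = f"projects/{project}/regions/{region}/resourcePolicies/{consistency_group}"
    return [
        "gcloud", "compute", "disks", "bulk", "create",
        f"--source-consistency-group-policy={policy}",
        f"--zone={clone_zone}",
    ]


# Hypothetical names, for illustration only.
clone_cmd = build_dr_test_clone_command(
    project="my-project",
    region="us-east1",
    consistency_group="app-dr-group",
    clone_zone="us-east1-b",
)
print(shlex.join(clone_cmd))
```

The clones can then be attached to test VMs in the secondary region while the original secondary disks keep receiving replicated data.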

Deploy for HA and DR
Regional Persistent Disk (Regional PD) and PD Async Replication are designed to be used together for workloads that require both HA and DR. You can configure a Regional PD to be the primary or secondary async disk, which can work in combination with a zonal disk in the primary or secondary region. In a scenario where an outage occurs in just one zone in the primary region where Regional PD is configured, the disk will continue to replicate from the remaining healthy zone to the secondary region. Note that PD Async Replication is set up between 2 distinct disks, while each Regional PD is a single disk that stores data in 2 zones and is attachable in 2 zones.

Improve your HA and DR posture
By using PD Async Replication and consistency groups, businesses gain robust data protection and recovery capabilities. This approach helps you safeguard critical data against disasters, minimize downtime, maintain data consistency, and enhance fault tolerance. With asynchronous replication and consistency groups, you can build a resilient platform that keeps data accessible even in the face of unforeseen disruptions.

You can access PD Async Replication through the console, Compute Engine API, gcloud tool, Terraform, and Cloud Monitoring.

Learn more by reading the public documentation.