Scalable Ransomware Protection
Analysis of VMware Cloud DR and ransomware recovery service offerings
This is a Press Release edited by StorageNewsletter.com on September 27, 2023 at 2:01 pm
This report, published in September 2023, was authored by:
Russ Fellows, VP, The Futurum Labs, and
Krista Macomber, senior analyst, data protection and security, The Futurum Group LLC.
Scalable Ransomware Protection:
Analysis of VMware’s Cloud DR and Ransomware Recovery Service Offerings
Overview
As the need for cyber-resiliency grows more pressing by the day, the emphasis for data protection tools shifts towards recoverability. After all, what good is protecting data if it cannot be recovered? Cyber-attacks have become the most pressing “disasters” necessitating protection. In fact, in Futurum Group’s Trends in Enterprise Data Protection 2023, the ability to improve cyber-resiliency emerged as the top reason respondents gave for why they would change their primary data protection vendor, strategy or solution. Cyber-resiliency even surpassed lower TCO – which is always a leading concern in the data protection space, and which dipped from being the most prominent reason customers would make a change (more than 60%) in the 2019 iteration of this study.
In general terms, cyber-resiliency refers to the ability to withstand security incidents, such as cyber-attacks. Given that cyber-attacks are an inevitable reality for practically all organizations, “withstanding” a cyber-attack is becoming synonymous with minimizing its damage – that is, mitigating downtime, data loss, financial losses, and reputational damage (typically resulting from loss of sensitive data). Recoverability plays an important role from this standpoint. Underscoring this point, in our study, approximately two-thirds of respondents indicated using DR and replication technologies as a part of their data protection implementations.
Since resuming operations as quickly as possible from an incident plays a direct role in facilitating BC following a cyber-attack, the criticality of recoverability from the standpoint of cyber-resiliency is clear. Simply put, recoverability is critical to avoiding costly and prolonged downtime of critical business services, applications, systems, and data that could result from cyber-attacks, ultimately wreaking havoc on an organization’s operations. Specifically, IT operations requires quick recoverability, and the ability to prioritize critical resources that have been impacted to meet the requirement of restoring business-critical services as quickly as possible. This requires well-defined and tested backup and recovery processes, alongside the ability to identify and contain threats. About two-thirds (67%) of respondents in our study indicated that they are testing/validating their backups, while 58% indicated that they are testing DR processes. Although positive, these figures underscore further opportunity to expand testing.
Recoverability is also critical when it comes to avoiding data loss, including corruption and theft, which are common in cyber-attacks. This is a particular concern when it comes to the severe consequences that could occur if an organization’s “crown jewels” or customers’ sensitive information were to be lost. This is where best practices for data protection come in, including the 3-2-1 backup strategy, which prescribes regular backups stored on offsite storage. Tested and documented recovery processes also play a role, in order to ensure that these backups can be recovered following an attack.
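As an illustration, the 3-2-1 rule is commonly stated as keeping at least three copies of the data, on at least two different media types, with at least one copy offsite. The following is a minimal sketch of such a check; the `BackupCopy` record, its fields, and the sample copies are hypothetical and are not drawn from the study or from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """Hypothetical record describing one copy of a dataset."""
    name: str
    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool   # True if the copy is stored away from the primary site


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule as commonly stated."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Illustrative inventory: the production copy is counted as one of the three.
copies = [
    BackupCopy("production", "disk", offsite=False),
    BackupCopy("local-backup", "disk", offsite=False),
    BackupCopy("cloud-backup", "object-storage", offsite=True),
]
print(satisfies_3_2_1(copies))
```

Dropping the cloud copy from this inventory would fail all three conditions at once, which is why the offsite copy tends to carry the most weight in practice.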
It is implied but important to note that in minimizing downtime and data loss, recoverability also helps to mitigate potential financial losses, including ransoms and other fines, that can result from being hit with a cyber-attack. In this vein, cyber-attacks have raised the visibility of recoverability to the C-Suite, demonstrating that recoverability is increasingly a compliance requirement for data privacy legislation such as GDPR, CCPA, and HIPAA.
Finally, investing in recoverability can help to improve the organization’s overall security posture by uncovering and opening the door for IT Operations to address security vulnerabilities that could be exploited by attackers.
Ransomware recovery requirements
Ransomware presents a recovery scenario that is different and unique from more “traditional” scenarios such as recovering from a natural disaster. As such, a discussion of the specific requirements of ransomware recovery is warranted.
1. A comprehensive, documented, and regularly vetted ransomware protection, mitigation, recovery and response plan is a critical first step. This spans not just technology, but people and processes (for instance, identifying the specific incident response team). Although this is true for any DR scenario, ransomware presents a few unique factors. Not only are attackers continuously innovating their approaches, but attacks are continuing to make headlines, resulting in pressure on IT operations teams from the C-Suite to be as prepared as possible. Keeping up with the pace at which attacks are evolving is no small feat for IT Operations.
2. Having an air gap between production and retained protected copies is critical. Many ransomware attacks now target on-line protection points in the form of snapshots, or even traditional backups.
3. Attacks should be identified, using tools that can detect anomalous behavior, including network monitoring. Using the earlier example of a “traditional” disaster, IT knows when the earthquake, flood, etc. hit, and impacted IT services. Ransomware, by contrast, may penetrate the environment and lie dormant for hours, days, weeks, or longer.
4. Once the attack has been discovered, forensics need to be conducted to identify the spread or “blast radius” of the attack, and the characteristics of the attack. Forensic analysis guides the recovery process by helping identify the “last known good” backup copy that is available, specifically which files, or entire systems, can be recovered. These processes should be prioritized by business impact – with priority given to applications necessary for day-to-day business service operations.
5. During recovery, infected systems should be isolated and tested in a sandbox environment to avoid restoring malware back into production. There are many levels of possible isolation, but a minimal level would be logical network partitioning, with increasing preference given to solutions with physical isolation and multiple levels of network isolation.
6. After verified images are created, recovery operations are executed. With ransomware, it is important that critical data and systems be prioritized. A so-called “surgical” approach is likely required, whereby specific files are recovered in order to avoid losing good data changes that may have occurred after the attack. Service-level agreements (SLAs), which include RPO and RTO (recovery point and time objectives, or the amount of data loss and downtime that can be tolerated) are typically “best efforts” that depend on the nature of the specific attack.
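To make the “last known good” and RPO ideas in the steps above concrete, here is a small sketch. It assumes forensics has produced an estimated compromise time and that recovery points are simply timestamps; the 4-hour schedule, the dates, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def last_known_good(recovery_points, compromise_time):
    """Return the newest recovery point taken strictly before compromise_time, or None."""
    candidates = [rp for rp in recovery_points if rp < compromise_time]
    return max(candidates) if candidates else None


def data_loss_window(recovery_point, incident_time):
    """Time span of changes lost when restoring to recovery_point (the achieved RPO)."""
    return incident_time - recovery_point


# Hypothetical schedule: one recovery point every 4 hours for two days.
start = datetime(2023, 9, 1, 0, 0)
points = [start + timedelta(hours=4 * i) for i in range(12)]

compromise = datetime(2023, 9, 2, 9, 30)   # assumed estimate from forensics
detected = datetime(2023, 9, 2, 18, 0)     # assumed time the attack was noticed

good = last_known_good(points, compromise)
print(good)
print(data_loss_window(good, detected))
```

The gap between the selected point and the time of detection is the data that a whole-VM restore would lose, which is the motivation for the “surgical” file-level recovery described in step 6.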
A number of other best practices for data protection still apply and remain critical in a ransomware recovery scenario. These include encryption of data at rest, and creating immutable protection points as appropriate, to enable recovery. Additionally, creating “air-gapped” or isolated backups becomes more critical as backup repositories themselves are more regularly targeted. Of course, all backup copies should be tested regularly for integrity, and it is important to keep software and systems up to date with patches to mitigate vulnerabilities.
The Futurum Group has developed a Cyber Resiliency Framework for analyzing products within the broader category of providing cyber resiliency, which includes protection and recovery from ransomware. See Appendix for details on the Framework categories and VMware Cloud DR’s rating.
VMware Ransomware Recovery Solution overview
The VMware Cloud DR solution is a service offering available in multiple locations across the world, based upon VMware and AWS region availability. VMware Ransomware Recovery is a purpose-built additional service for ransomware resiliency that may be added to VMware Cloud DR. It offers accelerated ransomware recovery with minimal data loss, delivered as an integrated software-as-a-service (SaaS) solution.
The ransomware recovery-as-a-service solution uses an isolated analysis and recovery workflow that prevents reinfection of production workloads. Guided recovery workflows allow organizations to quickly identify recovery point candidates, validate restore points using embedded behavioral analysis, and recover data with minimal loss. VMware Cloud DR can be used to protect vSphere VMs by replicating them to the cloud and recovering them to a target Software Defined Data Center (SDDC) on VMware Cloud on AWS.
Figure 1: VMware Cloud DR Recovery Capabilities
Some of the key capabilities of VMware Cloud DR applicable to ransomware recovery situations include:
- Secure, immutable backup copies: VMware Cloud DR’s backup copies are operationally air-gapped – with copies retained in VMware’s Scale-out Cloud File System (SCFS) maintained in the VMware Cloud on AWS. No recovery point data is overwritten, with an immutable log structured filesystem used for storing recovery points.
- RBAC Roles and MFA Authentication: VMware Cloud console uses authentication and role-based access control (RBAC) along with optional multi-factor authentication (MFA).
- Policy Based Protection: The VMware Ransomware Recovery service relies upon policy-based protection plans, which are a part of VMware’s Cloud DR service. Existing or ransomware-specific policies may be used to specify the number, type and location of copies to protect VMs. Additionally, policies provide flexible DR to enable recovery of complex application environments.
- Deep Retention History: The SCFS provides recovery points that are minutes, months, even years old to enable organizations to better recover. The deeper retention helps mitigate situations where ransomware has been in the environment for a long time, and gives quick access to older recovery points when forensics or baseline restore selection is desired.
- Isolated Recovery Environment (IRE): VMware Cloud DR service utilizes an isolated environment that prevents reinfection during the validation process or when working with potentially compromised systems. The cloud IRE can be utilized to recover multiple locations, including both on-prem and other AWS SDDCs.
- Iterative Wizard Validation Process: The recovery workflow helps to rapidly validate multiple VMs by first choosing recovery points, then powering on VMs in the IRE to enable live behavioral analysis with validation scanning from Carbon Black and VMware NSX. The process enables security and IT teams to evaluate each VM independently prior to moving a known good copy back into production.
- Additional Recovery Options: Additional recovery options exist to enable the use of VMware Cloud DR to extract specific files or folders from more recent recovery points as part of the recovery process without bringing the associated VM into a production inventory and risking re-infection.
Figure 2: Partial Dashboard of VMware Cloud DR
Topology of VMware Cloud DR Environment
Shown above in Figure 2 is a portion of the VMware Cloud DR cloud console, showing a summary of the environment, the protected sites, along with the recovery environment and a graphical topology of the environment.
Additionally, a significant amount of information is available at the top or “Dashboard” view of the VMware Cloud DR console. The ability to quickly understand the status of multiple sites, VM protection status, along with operations in progress, outstanding issues and the topology are all important. The web-based UI felt familiar to FGL team members who utilize other VMware tools regularly, which is important to help reduce training and administrative overhead for companies adopting VMware Cloud DR protection.
In summary, the VMware Cloud DR Ransomware service provides a highly scalable recovery solution based upon secure copies via the Scale-out Cloud File System, along with customizable, automated DR workflows and orchestration.
Key findings
As a service offering, VMware Cloud DR Ransomware Recovery provides rapid setup and scalability capabilities that should meet the needs of companies of nearly any size, ranging from small enterprises up to large, globally dispersed multi-national companies.
The time to plan, deploy and establish an adequate ransomware recovery environment can often be a significant undertaking. Although planning may still be required, the VMware Cloud DR Ransomware Recovery service itself can be initiated in less than a day, without significant training by leveraging existing IT staff already familiar with vSphere tools and the VMware Cloud portal.
The service itself was found to provide a comprehensive set of ransomware protection features, along with well-thought-out wizards and workflows to help rapidly facilitate on-boarding and usage of the offering. The majority of areas were rated as meeting or exceeding expectations. There were no significant shortcomings identified, with only a few areas noted as having room for improvement or additional development. Although there is no ability to limit access to the underlying VM infrastructure, by design VMs may be completely reconstructed if they are properly protected: using air-gapped backups, it is possible to recreate a VM’s entire runtime environment. As a result, even if a VM is damaged or rendered unusable through malicious intent, a full recovery can occur as long as a usable, protected copy of the VM exists.
Audit overview
The Futurum Group Lab (FGL) audit process consisted of multiple sessions working with VMware technical engineering to review specific features, capabilities, and processes. The focus of FGL’s audit was to observe the end-to-end process of ransomware recovery.
At a high level, the process was as follows:
Preparation:
- Initially, we logged into the cloud console from the VMware Cloud portal
- We then reviewed the elements of the VMware Cloud DR Ransomware Recovery offering
- Next, we established an Isolated Recovery Environment (IRE) in the VMware on AWS Cloud environment
- Initially choosing a single host (pilot light) environment, then expanded to three hosts
- Next was to add a VMware DRaaS connector to each protected vCenter DC
- Following this we then defined protection policies using the included guided wizard
- Policies included the frequency of protection points created and retention of copies sent to the scale-out cloud filesystem SCFS
- Utilizing the defined policies, we assigned VMs to one or more protection policies in preparation for a ransomware or other malware attack
- Finally, we reviewed VMs and also the VMware Cloud DR console for any warnings or other notifications indicating potential issues
Attack:
- Note: we simulated a ransomware event by running a program that generated random data into a file
- We also added a potentially suspect file and deleted a file
Response:
- Responding to a potential attack was performed several times, evaluating different methods and processes that may be used
- The guided recovery workflow has many options for evaluating, verifying, and then restoring VMs back into production
Note 1: The audit process included many “What if” scenarios, in order to explore alternative methods of analyzing, recovering, verifying and then restoring VMs into production.
Note 2: The audit focused on the integrated VMware Cloud DR Ransomware Recovery service offering and not on standalone Carbon Black or VMware NSX features, both of which may help defend against, detect, or deter ransomware attacks.
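For readers who want to reproduce a comparable test event in a scratch area, the three changes described under “Attack” (random data written into a file, a file added, a file deleted) can be approximated as below. This is only a sketch using the Python standard library inside a temporary directory; it is not the program used in the audit, and the file names are hypothetical. Comparing content hashes before and after also shows, in miniature, how the scope of the changes can be identified.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def manifest(root):
    """Map each file's relative path to the SHA-256 digest of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(before, after):
    """Classify paths as added, deleted, or modified between two manifests."""
    added = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    modified = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return added, deleted, modified


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "report.txt").write_text("quarterly numbers\n")
    (root / "notes.txt").write_text("meeting notes\n")
    before = manifest(root)

    (root / "report.txt").write_bytes(os.urandom(4096))  # overwrite with random data
    (root / "suspect.bin").write_bytes(b"\x00" * 16)     # add a potentially suspect file
    (root / "notes.txt").unlink()                        # delete a file

    added, deleted, modified = diff_manifests(before, manifest(root))

print(added, deleted, modified)
```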
VMware Cloud DR Ransomware Recovery audit summary
Using The Futurum Group’s Cyber Resiliency framework as a basis, the Futurum Group Lab team analyzed the audited VMware Ransomware capabilities compared to the desired features and capabilities contained within the Cyber Resiliency framework.
A full version of the Framework is provided in the Appendix, with a summarized version of critical items presented below in Tables 1 and 2.
Table 1: Cyber Resiliency – Security Considerations
Table 2: Cyber Resiliency – Ransomware Considerations
VMware Cloud DR audit details
During the audit of VMware’s ransomware recovery service, Futurum Group Labs worked through typical use cases and scenarios, while noting the features provided to meet requirements presented earlier in The Futurum Group’s Cyber Resiliency framework, including Authentication & Access and Security & Audit Capabilities, along with the Ransomware Resiliency-specific features.
The VMware Cloud DR Ransomware Recovery service offering is based upon existing VMware services, including VMware Cloud on AWS and VMware’s existing DR service offering. The ransomware capability extends and enhances VM protection capabilities to support the needs of ransomware or other malware attacks that differ from traditional disasters in several ways. The technologies utilized for the VMware Cloud DR Ransomware Recovery service include an AWS SDDC, along with VMware’s scale-out cloud file system (SCFS), its Carbon Black malware detection and prevention along with VMware NSX for network isolation, deep packet inspection and other next gen firewall capabilities.
The tools provided will feel familiar to VM admins who regularly utilize vSphere, since concepts, terminology and even UI design elements are all heavily influenced by other VMware tools. The protection workflow entails identifying VMs for protection, and scheduling regular protection intervals using policy-based Protection Groups which then create point in time copies within the cloud-hosted file system in AWS that is associated with an AWS SDDC, known as an isolated recovery environment, or IRE.
Additionally, policy-based DR plans may be constructed to help plan for complex recovery scenarios that require specific VM network configurations, power-on order and other dependencies. The primary interaction with the VMware Cloud DR Ransomware Recovery service is via a console, available via the common VMware Cloud Console portal. The VMware Cloud DR console is designed around using wizard-guided workflows both for creating and editing Protection Groups and DR plans, and additionally facilitating the recovery process.
Authentication, Authorization and Security
The Authorization and Authentication capabilities within the VMware Cloud DR Ransomware Recovery service are built upon VMware’s existing Cloud console and vSphere authentication. Role-based access controls (RBAC) are foundational to vSphere administration with many roles pre-defined, and the ability to customize or create new roles and groups to limit access and administrative capabilities based upon a role’s defined allowed actions. Additionally, the Authorization features include the ability to maintain an internal set of users and groups or integrate with existing enterprise authorization frameworks if desired. Features like multi-factor authentication (MFA) are also present and well-integrated with the other authorization and authentication services.
With respect to security, VMware has been enhancing these capabilities over the past decade by enabling encryption of VMs both within vSphere and, if desired, separately at the vSAN layer. Perhaps the most critical aspect of encryption is not performing the encryption, but rather how well the key management features work and how they are integrated with existing standards such as OASIS KMIP and others. vSphere (which includes ESXi hypervisor and vCenter administration) supports internal key management or the use of external, KMIP-compliant key managers. It is worth noting, however, that while VMware supports various VM encryption scenarios, VMware Cloud DR currently does not support encrypted VMs, but does enable encryption of data at rest through vSAN encryption.
Policy Based Ransomware Protection
Organizations can use flexible policy defined Protection Groups, which set the context for the VM inventory that will be recovered in a DR plan. The Protection Groups define the inventory included in that group, along with the schedules for automated snapshots and replication to the SCFS. A snapshot is VMware Cloud DR’s construct for a point-in-time backup, which will become the system’s “recovery points” for use in both test and actual recoveries.
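As a rough way to reason about such schedules, the number of recovery points a schedule keeps at steady state is approximately its retention period divided by its snapshot interval. The sketch below assumes each schedule is independent and that snapshots are never shared between schedules; the `SnapshotSchedule` type and the sample values are hypothetical and do not reflect the service’s actual policy model.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SnapshotSchedule:
    """Hypothetical schedule: take a snapshot every `interval`, keep each for `retention`."""
    interval: timedelta
    retention: timedelta

    def retained_points(self):
        """Approximate steady-state count of recovery points kept by this schedule."""
        return self.retention // self.interval


# Illustrative Protection Group with a frequent short-term and a daily long-term schedule.
schedules = [
    SnapshotSchedule(interval=timedelta(hours=4), retention=timedelta(days=7)),
    SnapshotSchedule(interval=timedelta(days=1), retention=timedelta(days=90)),
]
total = sum(s.retained_points() for s in schedules)
print(total)
```

A calculation like this helps weigh recovery granularity against the number of recovery points that will accumulate in the cloud file system.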
When dealing with ransomware recovery, the organization will likely need to incorporate more control points in their DR plan and guide the recovery steps to prevent attack propagation and the threat of reinfection. The individual VMs being restored will need to be checked more closely than a more typical site DR failover. One of the new features and key differences between site DR and ransomware enabled DR plans is the granularity and individual VM recovery handling. We will go into more detail later in this guide on the guided recovery workflow applied to each VM identified in a DR plan that will be used for ransomware recovery.
The Ransomware Recovery features are chosen as one of the final steps in creating a DR plan, with the notice that choosing Ransomware protection will incur additional charges for protecting the VM and offering to install security and vulnerability scanning software when the VMs are restored into the recovery environment, which provides the necessary hooks for VMware Carbon Black to scan running VMs and analyze them for malware.
Creating a Recovery Plan within VMware Cloud DR is accomplished through a guided wizard which steps through identifying multiple VMs, mapping a source vCenter to failover vCenter server, assigning compute and storage resources, and then mapping the network and IP address settings required including new IP address networks, gateways, and DNS settings. Additional options exist to add custom scripting to a VM’s recovery, along with specific recovery steps that are often necessary to ensure that VMs are brought up in a specific sequence necessary for complex application environments.
Figure 3: VMware Cloud DR Ransomware Recovery Plan
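The requirement that VMs be brought up in a specific sequence is a dependency-ordering problem. One way to derive a valid power-on order is a topological sort, sketched here with Python’s standard `graphlib` module (Python 3.9 or later). The VM names and dependencies are hypothetical, and this is not a description of how VMware Cloud DR computes its recovery steps.

```python
from graphlib import TopologicalSorter

# Hypothetical application: each VM maps to the set of VMs it depends on.
depends_on = {
    "web-01": {"app-01"},
    "app-01": {"db-01", "dns-01"},
    "db-01": {"dns-01"},
    "dns-01": set(),
}

# static_order() yields every VM after all of the VMs it depends on.
power_on_order = list(TopologicalSorter(depends_on).static_order())
print(power_on_order)
```

If the dependencies contain a cycle, `static_order()` raises `graphlib.CycleError`, which is a useful signal that the plan itself needs to be revisited.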
Ransomware Recovery Workflow
The recovery process can be complex and may require an iterative process in order to trial, verify, and then recover VMs utilizing different recovery points in time. The VMware Cloud DR Ransomware Recovery service provides a workflow to aid in the iterative process and presents necessary information using a web console interface with multiple indicators showing protection points and rate of data change along with rate of data entropy. A rise in entropy, which is correlated with randomness, may indicate data encryption, particularly when it coincides with an increased rate of data change. Often, malware attacks lead to increased rates of both entropy and data change, indicating that a protection point prior to these events is likely a good recovery candidate. While a conservative approach from a security perspective would advocate for recovery points further back in time, this approach is the opposite of what an application owner would desire, which is the most recent data protection point in order to minimize data loss. Thus, finding the optimal point in time requires verifying multiple images for malware.
Figure 4: VM Validation
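To illustrate why entropy is a useful signal, the sketch below computes the Shannon entropy of byte samples and flags the first recovery point whose sample looks close to random. It is an illustration of the general idea only: the 7.5 bits-per-byte threshold, the sample data, and the function names are assumptions, and nothing here describes how the VMware console derives its entropy indicator.

```python
import math
import os
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def first_suspect_index(entropies, threshold=7.5):
    """Index of the first sample whose entropy exceeds threshold, or None."""
    for i, h in enumerate(entropies):
        if h > threshold:
            return i
    return None


# Samples taken from four consecutive recovery points (oldest first).
plain = b"The quick brown fox jumps over the lazy dog. " * 200
random_like = os.urandom(len(plain))  # stands in for encrypted content

samples = [plain, plain, random_like, random_like]
entropies = [shannon_entropy(s) for s in samples]
suspect = first_suspect_index(entropies)
print(suspect)
```

In this toy series the recovery point just before the flagged index would be the first candidate to validate. Compressed or already encrypted data also has high entropy, which is one reason candidates still need to be verified rather than trusted on this signal alone.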
The workflow was designed to accommodate multiple administrators, security teams, and other IT teams working together to triage and recover multiple VMs as rapidly as possible. VMware has developed a notion of “badging” VMs to clearly indicate their status, with badges visible to all admins throughout the workflow. These badges may be applied and updated as the validation workflow progresses.
The “Validation” process entails running a VM within the IRE for some period and allowing the instrumentation within Carbon Black and NSX to evaluate the VM behavior for any signs of malware. At this point, VMware Ransomware Recovery recommends using “Badging” to assign a review status to a VM which then becomes visible to all other VMware Cloud DR users who may also be actively reviewing and evaluating VMs for recovery. Badge statuses include “Encrypted,” “Compromised,” “Warning,” “Verified,” and “Unknown.” The validation process would typically require a VM to run for some period in complete isolation, as established through NSX’s micro-segmentation and other firewall capabilities.
Figure 5: VM Ransomware Recovery Badging
If validated in the quarantined state, the VM may then be allowed to receive in-bound network traffic, to enable time services, DNS or other common network services, while still restricting communications, again via another automatically established NSX segmentation policy. If this state of analysis does not indicate issues, a group of VMs may then be allowed to communicate with each other, which is particularly useful to evaluate multi-VM applications which require interdependent communications in order to operate properly. After this stage is successfully passed, the group of VMs may then be promoted to run with the full network access rights they would normally operate with, as a final check before deeming one or more VMs as “clean” or “validated” and able to be brought back into production.
Figure 6: VM Network Isolation During Recovery
Figure 7: VM Recovery Into Production
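The staged validation described above can be modeled as a VM moving through progressively less restrictive isolation levels, stopping at the first level where a problem is observed. The sketch below uses the badge names from the text; the level names, the `validate` function, and the choice to badge a failed VM as “Compromised” are assumptions made for illustration, not the product’s behavior.

```python
from enum import Enum, IntEnum


class IsolationLevel(IntEnum):
    """Illustrative stage names; the order follows the text."""
    QUARANTINED = 0        # complete isolation
    NETWORK_SERVICES = 1   # time, DNS and similar common services allowed
    GROUP = 2              # VMs in the group may communicate with each other
    FULL_ACCESS = 3        # normal network access rights, final check


class Badge(Enum):
    ENCRYPTED = "Encrypted"
    COMPROMISED = "Compromised"
    WARNING = "Warning"
    VERIFIED = "Verified"
    UNKNOWN = "Unknown"


def validate(vm, check):
    """Walk a VM through each isolation level and stop at the first failed check.

    `check(vm, level)` is a caller-supplied function returning True when no
    issues were observed at that level. Returns (badge, last level reached).
    """
    for level in IsolationLevel:
        if not check(vm, level):
            return Badge.COMPROMISED, level
    return Badge.VERIFIED, IsolationLevel.FULL_ACCESS


def demo_check(vm, level):
    """Hypothetical check: 'app-02' misbehaves once it can talk to its group."""
    return not (vm == "app-02" and level is IsolationLevel.GROUP)


print(validate("app-01", demo_check))
print(validate("app-02", demo_check))
```

Ordering the levels from most to least restrictive means a problem is always caught at the tightest level in which it can appear, which is the property the staged workflow relies on.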
Final thoughts
Companies of all sizes typically list ransomware attacks as one of their biggest potential threats, due to the high likelihood of an attack and the significant potential for damage. Despite this, many companies indicate that their ransomware protection is inadequate, due to lack of planning, inadequate resources such as immutable backup copies, lack of air-gapping or isolated recovery environments, or simply a lack of the staff or resources to establish ransomware protection. The VMware Cloud DR Ransomware Recovery offering is highly scalable, enabling companies of nearly any size to utilize the service for critical applications.
There are many advantages to utilizing a service-based offering for ransomware resiliency, including providing scalable offerings and rapid deployment while also helping to eliminate the ransomware resilience service itself as a point of attack. In the event a site is completely incapacitated due to a natural disaster, malware, or other attack, the VMware Cloud DR Ransomware Recovery service may be accessed via any web connected location. This natural isolation between production and the VMware Ransomware Recovery DR site provides a significant advantage compared to reliance upon on-premises or a centralized DR site.
Another advantage of a cloud-based service offering when it comes to cyber resiliency is the ability to rapidly initialize and on-board existing VM infrastructure along with the near instant scalability of the VMware Cloud DR service. For smaller enterprises, the VMware Cloud DR Ransomware Recovery service can be used to set up ransomware protection for a few critical VMs in less than a day, without the need for secondary sites, special equipment, or other costly dedicated resources. Likewise for large companies with multiple locations, the VMware Cloud DR service may also accommodate either centralized, or regional protection using VMware on AWS sites as required.
Clearly, the VMware Cloud DR Ransomware Recovery service was designed to address common ransomware scenarios, with well thought out workflows and wizards to help plan, mitigate and then recover from malware attacks. Overall, the Futurum Group Lab concludes that the VMware Cloud DR service meets or exceeds the requirements of the FG’s Cyber Resiliency framework for assessing cyber resiliency to ransomware or other malware attacks.
Infrastructure protected
The configuration utilized for the lab audit included three sites: one labeled as production, and two labeled as test site A and test site B. These may be seen in Figure 2.
VMware Sites Protected
- 3 sites, each with 1 vCenter instance
- A total of 112 VMs across the three sites
- At least 50 VMs were in active protection groups (number varied during the audit)
- Over 3,000 total VM snapshots in Cloud Backup
- Total of 4.8TB of protected capacity stored in Cloud Backup Filesystem
VMware IRE
Several sizes of isolated recovery environments (IRE) were utilized during the audit, in order to demonstrate the ability to change the IRE as required.
The sizes used were as follows:
1. A small, single node (aka “Pilot Light”) in order to perform audit tasks
- Initial time to deploy the SDDC was approximately 2.5 hours
2. A 2-node “Production IRE”
- Time to increase from 1 to 2 nodes was less than 30 minutes
3. This was then increased to a 3-node IRE
- Similarly, time to increase from 2 to 3 nodes was less than 30 minutes
Test VM environment
The VMs tested were a mix of Linux and Windows, with the primary systems used during demonstration being MS Windows Server systems. File-level recovery tools are OS dependent.