IT Disaster Recovery Testing

Amit Malhotra

General Manager Cyber and Resilience

Published Jul 23, 2018

A disaster recovery plan that you haven’t tested is worth only the paper that you write it on (or the hard drive that you store it on). Until you thoroughly test all the recovery procedures, the organization shouldn’t expect those procedures to save it from ruin if a disaster strikes.

You need to perform five types of testing on all disaster recovery procedures:

Paper: An individual reads through a recovery procedure and makes any annotations or suggested corrections.
Walkthrough: A recovery team reviews the recovery procedure, step by step. Expect the issues the team raises, and the discussion of them, to take up most of the session.
Simulation: A recovery team walks through a scripted simulation, discussing assessment and recovery procedures so they can determine whether a disaster recovery plan is reasonable.
Parallel: A recovery team tests recovery procedures by actually building or setting up recovery systems. The team also performs test transactions on the systems to see how well the procedures work and whether team members can actually build and operate the recovery systems.
Cutover: A recovery team performs a full cutover, in which recovery systems that the recovery team builds or prepares on short notice support live business processes. This is the ultimate test of a DR plan.

Testing Lifecycle:

Periodic testing: Test all DR procedures regularly, according to a schedule that fits the risks associated with the individual business processes being supported. For example, life-support processes probably deserve weekly or monthly cutover testing, but you can test less critical processes less often. This testing process includes not only repeated walkthroughs, but also scheduled simulations, parallel tests, and cutover tests. You should perform a parallel or cutover test at least once per year.
Periodic review: Have subject matter experts review disaster recovery procedures two to four times each year to ensure that those procedures are still relevant and accurate. Review emergency contact lists monthly.
Periodic revisions: Periodic testing and review indicate when you need to update recovery plans and emergency contact lists.
Business Impact Analysis and risk analysis review: Review the BIA and risk analysis documents at least once per year to ensure that key objectives, such as the Recovery Time Objective (RTO) and Recovery Point Objective (RPO), are still adequate.
Integration into business processes: Business activities such as system upgrades, mergers and acquisitions, and new product or service launches should include routine reviews of BIA, risk analysis, and other DR documents to ensure that they remain current and relevant.
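The scheduling and objective checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the Procedure record, its field names, the example intervals, and the sample dates are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Procedure:
    """One DR procedure and its testing policy (all fields are illustrative)."""
    name: str
    test_interval_days: int   # how often the procedure should be tested
    last_tested: date         # date of the most recent test
    rto: timedelta            # Recovery Time Objective taken from the BIA
    rpo: timedelta            # Recovery Point Objective taken from the BIA


def is_test_overdue(proc: Procedure, today: date) -> bool:
    """Return True when more than the allowed interval has passed since the last test."""
    return (today - proc.last_tested).days > proc.test_interval_days


def meets_objectives(proc: Procedure, recovery_time: timedelta, data_loss: timedelta) -> bool:
    """Return True when a test's measured recovery time and data-loss window fit the RTO and RPO."""
    return recovery_time <= proc.rto and data_loss <= proc.rpo


# Hypothetical procedures: a critical process tested monthly, a minor one tested yearly.
procedures = [
    Procedure("life-support monitoring", 30, date(2018, 7, 1),
              timedelta(minutes=15), timedelta(minutes=1)),
    Procedure("internal wiki", 365, date(2017, 5, 1),
              timedelta(days=2), timedelta(hours=24)),
]

today = date(2018, 7, 23)
overdue = [p.name for p in procedures if is_test_overdue(p, today)]
print(overdue)
```

A review meeting could run a check like this against the real procedure inventory, then record each parallel or cutover test's measured recovery time and data loss and compare them with meets_objectives.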

Preventing Technology-Related Disasters

Software and hardware failures aren’t wholly preventable, but you should do all that’s reasonable to prevent failures while still preparing for them. The following list contains many measures you can take to prepare for hardware and software failures when they do happen:

Perform regular data backups. Copying data from main hard drives to other hard drives or backup tape is the best insurance in cases of hard drive or related failure.
Perform regular data restores. Just because you can perform data backups doesn’t mean you can get that data back! Test your organization’s ability to restore data at least once per month to make sure that backups are working and that you can actually recover data from the backup media.
Keep spare systems. In some cases, you might more easily recover an application or database onto a different system than diagnose or repair a problem on a primary server. You might be able to use development servers, test servers, and servers for less-critical applications as spare systems.
Keep spare parts. Having spare disk drives, memory, motherboards, and power supplies gives you more choices when you experience a hardware failure.
Have service manuals. You never know who may need to open up one of your servers or storage systems. The usual experts may not be around when you need them.
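As one way to picture a restore test, the sketch below backs up a file, restores it to a separate location, and compares checksums. It is a simplified illustration: the file names are made up, and shutil.copy2 stands in for whatever backup and restore tooling your organization actually uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original: Path, backup: Path, restore_dir: Path) -> bool:
    """Restore `backup` into `restore_dir` and compare the result with `original`."""
    restored = restore_dir / original.name
    shutil.copy2(backup, restored)  # stand-in for a real restore job
    return sha256_of(restored) == sha256_of(original)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "orders.db"          # hypothetical data file
    original.write_bytes(b"order-1\norder-2\n")

    backup_dir = root / "backup"
    backup_dir.mkdir()
    backup = backup_dir / "orders.db.bak"
    shutil.copy2(original, backup)         # stand-in for the backup job

    restore_dir = root / "restore"
    restore_dir.mkdir()
    restore_ok = verify_restore(original, backup, restore_dir)

    # Simulate a backup that was silently damaged after it was written.
    backup.write_bytes(b"order-1\n")
    damaged_ok = verify_restore(original, backup, restore_dir)

print(restore_ok, damaged_ok)
```

On a live system the original data keeps changing after the backup runs, so a real check would more likely compare the restored copy against a checksum recorded at backup time.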

Resilient architecture

The methods for building a resilient architecture are:

Server clustering: By using special clustering software, you can apply an active/active configuration to two servers, in which both servers share the full application load. Or you can apply an active/passive configuration, in which one server processes application transactions and the other is ready to take over at a moment’s notice. You can locate the servers in a cluster in the same room, in the same city, or thousands of miles apart.
Data replication and mirroring: Copying transaction data from one storage system to another. If one storage system fails, the other has an up-to-date copy of all recent transactions.
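The active/passive arrangement described above can be modeled in a few lines. This is a toy sketch under stated assumptions, not how any particular clustering product works: the Node and ActivePassiveCluster names, the heartbeat check, and the three-missed-heartbeats threshold are all invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class ActivePassiveCluster:
    """Two-node cluster: one node serves traffic, the other stands by."""

    def __init__(self, primary: Node, standby: Node, max_missed: int = 3):
        self.active = primary
        self.passive = standby
        self.max_missed = max_missed  # missed heartbeats tolerated before failover
        self.missed = 0

    def heartbeat(self) -> None:
        """Check the active node once; fail over after too many missed checks."""
        if self.active.healthy:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.max_missed and self.passive.healthy:
            # Promote the standby and demote the failed node.
            self.active, self.passive = self.passive, self.active
            self.missed = 0

    def handle(self, transaction: str) -> str:
        """Route a transaction to whichever node is currently active."""
        return f"{self.active.name} processed {transaction}"


cluster = ActivePassiveCluster(Node("server-a"), Node("server-b"))
before = cluster.handle("txn-1")

cluster.active.healthy = False   # simulate a failure on the active node
for _ in range(3):
    cluster.heartbeat()
after = cluster.handle("txn-2")

print(before)
print(after)
```

Requiring several missed heartbeats before failing over is a common way to avoid switching on a single dropped check. The standby is only useful if replication or mirroring has kept its data current.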

Security incidents

A security incident can reach disaster levels in a number of ways:

Data corruption: If the incident causes data corruption, the organization may be forced to take systems offline until it can recover or rebuild the data. In large databases, this process can take several days, even on the fastest available computers.
Denial of Service (DoS): A concentrated attack, especially when it originates from large numbers of systems, can render a server or an entire network of servers unreachable to customers and partners. Such attacks can last for hours, days, or even weeks.
Forensics: Your organization (or law enforcement) may need to carry out forensic operations on affected systems to gather evidence for a possible prosecution. Trained personnel usually conduct forensics on quiescent systems (systems in which activity is halted), so those systems can’t support business processes while the investigation is under way.

Next Topic: Database - High Availability, Recovery and Restoration Options

More articles by this author

Insights from the community

Others also viewed

Systematic Approach to IT Disaster Recovery Plan

Three Steps to Mastering Disaster Recovery

Disaster Recovery testing, a must but planning is essential

Revamping Your Disaster Recovery Plan for PCI DSS v4.0

Disaster Recovery Overview

Is your Disaster Recovery Plan a disaster in waiting?

Disaster Recovery: Comprehensive disaster recovery plans to minimize downtime

Disaster Recovery Overview and considerations

15 Keys to an Effective IT Disaster Recovery Plan

Explore topics

IT Disaster Recovery Testing

Testing Lifecycle:

Preventing Technology-Related Disasters

Resilient architecture

Security incidents

Next Topic: Database - High Availability, Recovery and Restoration Options

Strengthening Cybersecurity and Availability: Lessons from the Azure Outage

Sep 18, 2023

Evolving IT Continuity: Building Resilience at the Heart of Cybersecurity

Sep 14, 2023

Failover Clustering and Always On Availability Groups (SQL Server)

Aug 16, 2018

IT Continuity - Gap Assessment and Implementation

Jul 16, 2018

BUSINESS CONTINUITY PLAN DEVELOPMENT AND IMPLEMENTATION and IT CONTINUITY MANAGEMENT AS PER ITIL V3

Jul 11, 2018

IT Disaster Recovery Strategy

Jul 4, 2018

Business Continuity Approach for Notifiable Data Breach (BCM@NDB)

Feb 2, 2018

We don’t need a IT Continuity Plan as we have our applications on the cloud – Think Again! “Amazon Web Services outage causes issues across the intern

Mar 2, 2017

Yahoo hack: Email accounts of Australian politicians, police and judges compromised in massive breach, dataset reveals

Jan 17, 2017

The Paris Attacks: “Physical and Environmental Security”

Nov 15, 2015

Insights from the community

Others also viewed

Systematic Approach to IT Disaster Recovery Plan

Three Steps to Mastering Disaster Recovery

Disaster Recovery testing, a must but planning is essential

Revamping Your Disaster Recovery Plan for PCI DSS v4.0

Disaster Recovery Overview

Is your Disaster Recovery Plan a disaster in waiting?

Disaster Recovery: Comprehensive disaster recovery plans to minimize downtime

Disaster Recovery Overview and considerations

15 Keys to an Effective IT Disaster Recovery Plan

Explore topics