IT Disaster Recovery Testing
A disaster recovery plan that you haven’t tested is worth only the paper that you write it on (or the hard drive that you store it on). Until you thoroughly test all the recovery procedures, the organization shouldn’t expect those procedures to save it from ruin if a disaster strikes.
You need to perform five types of testing on all disaster recovery procedures:
- Paper: An individual reads through a recovery procedure and makes any annotations or suggested corrections.
- Walkthrough: A recovery team reviews the recovery procedure step by step, discussing issues as they come up. Expect these discussions to fill the day.
- Simulation: A recovery team walks through a scripted simulation, discussing assessment and recovery procedures so they can determine whether a disaster recovery plan is reasonable.
- Parallel: A recovery team tests recovery procedures by actually building or setting up recovery systems. The team also performs test transactions on the systems to see how well the procedures work and whether team members can actually build and operate the recovery systems.
- Cutover: A recovery team performs a full cutover, in which recovery systems that the team builds or prepares on short notice support live business processes. This is the ultimate test of a DR plan.
Testing Lifecycle:
- Periodic testing: Test all DR procedures regularly, according to a schedule that fits the risks associated with the individual business processes being supported. For example, life-support processes probably deserve weekly or monthly cutover testing, but you can test less critical processes less often. This testing process includes not only repeated walkthroughs, but also scheduled simulations, parallel tests, and cutover tests. You should perform a parallel or cutover test at least once per year.
- Periodic review: Have subject matter experts review disaster recovery procedures at least two to four times each year to ensure that those procedures are still relevant and accurate. Review emergency contact lists monthly.
- Periodic revisions: Periodic testing and review indicate when you need to update recovery plans and emergency contact lists.
- Business Impact Analysis and risk analysis review: Review the BIA and risk analysis documents at least once per year to ensure that key objectives, such as the Recovery Time Objective (RTO) and Recovery Point Objective (RPO), are still adequate.
- Integration into business processes: Business activities such as system upgrades, mergers and acquisitions, and new product or service launches should include routine reviews of BIA, risk analysis, and other DR documents to ensure that they remain current and relevant.
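The schedule described in this list can be tracked with very little tooling. The following is a minimal sketch, not a prescribed implementation: the activity names, the `MAX_INTERVAL` table, and the `overdue_activities` function are hypothetical, and the intervals are assumptions taken from the guidance above (a parallel or cutover test and a BIA review at least once per year, an expert review at least twice per year, and a monthly contact-list review).

```python
from datetime import date, timedelta

# Maximum allowed gap between activities. These intervals are assumptions
# based on the guidance above; adjust them to the risk of each process.
MAX_INTERVAL = {
    "parallel_or_cutover_test": timedelta(days=365),
    "procedure_review": timedelta(days=183),
    "contact_list_review": timedelta(days=31),
    "bia_risk_review": timedelta(days=365),
}


def overdue_activities(last_done, today):
    """Return the sorted names of activities whose last run is too old.

    An activity with no recorded date counts as overdue.
    """
    overdue = []
    for activity, limit in MAX_INTERVAL.items():
        last = last_done.get(activity)
        if last is None or today - last > limit:
            overdue.append(activity)
    return sorted(overdue)


if __name__ == "__main__":
    history = {
        "parallel_or_cutover_test": date(2023, 1, 10),
        "procedure_review": date(2024, 1, 15),
        "contact_list_review": date(2024, 2, 20),
    }
    # The BIA review was never recorded and the last parallel/cutover test
    # is more than a year old, so both are reported.
    print(overdue_activities(history, today=date(2024, 3, 1)))
```

A more critical process, such as the life-support example, would get its own table with much shorter intervals.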
Preventing Technology-Related Disasters
Software and hardware failures aren’t wholly preventable, but you should do all that’s reasonable to prevent failures while still preparing for them. The following list contains many measures you can take to prepare for hardware and software failures when they do happen:
- Perform regular data backups. Copying data from main hard drives to other hard drives or backup tape is the best insurance in cases of hard drive or related failure.
- Perform regular data restores. Just because you can perform data backups doesn’t mean you can get that data back! Test your organization’s ability to restore data at least once per month to make sure that backups are working and that you can actually recover data from backup tapes.
- Keep spare systems. In some cases, you might more easily recover an application or database onto a different system than diagnose or repair a problem on a primary server. You might be able to use development servers, test servers, and servers for less-critical applications as spare systems.
- Keep spare parts. Having spare disk drives, memory, motherboards, and power supplies gives you more choices when you experience a hardware failure.
- Have service manuals. You never know who may need to open up one of your servers or storage systems. The usual experts may not be around when you need them.
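The advice to test restores, not just backups, can be illustrated with a checksum comparison. This is a minimal sketch under stated assumptions: plain file copies stand in for the real backup and restore jobs, and the names `sha256_of` and `restore_matches_source` are hypothetical. A real test would drive your actual backup tooling and media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source, backup_dir, restore_dir):
    """Back up `source`, restore it elsewhere, and compare checksums."""
    backup_copy = Path(backup_dir) / Path(source).name
    restored_copy = Path(restore_dir) / Path(source).name
    shutil.copy2(source, backup_copy)         # stand-in for the backup job
    shutil.copy2(backup_copy, restored_copy)  # stand-in for the restore job
    return sha256_of(source) == sha256_of(restored_copy)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        root = Path(root)
        (root / "backup").mkdir()
        (root / "restore").mkdir()
        source = root / "orders.db"
        source.write_bytes(b"order-1\norder-2\n")
        print(restore_matches_source(source, root / "backup", root / "restore"))
```

Restoring to a separate location, rather than over the original, keeps the test from damaging live data if the backup turns out to be bad.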
Resilient architecture
The methods for building a resilient architecture are:
- Server clustering: By using special clustering software, you can set up two servers in an active/active configuration, in which both share the full application load. Or you can use an active/passive configuration, in which one server processes application transactions and the other is ready to take over at a moment’s notice. The servers in a cluster can be located in the same room, the same city, or thousands of miles apart.
- Data replication and mirroring: These techniques copy transaction data from one storage system to another. If one storage system fails, the other has an up-to-date copy of all recent transactions.
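To make the two ideas above concrete, here is a toy, in-memory sketch of an active/passive pair with synchronous mirroring. It is an illustration only: the `Server` and `ActivePassiveCluster` classes are hypothetical, health is a simple flag rather than a real heartbeat, and commercial clustering software works quite differently in detail.

```python
class Server:
    """A toy server that stores transactions in memory."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.transactions = []


class ActivePassiveCluster:
    """Send writes to the active server and mirror them to the standby."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def write(self, transaction):
        """Store a transaction and return the name of the server that took it."""
        if not self.active.healthy:
            self.fail_over()
        self.active.transactions.append(transaction)
        # Synchronous mirroring: copy each write to the standby if it is up.
        if self.standby.healthy:
            self.standby.transactions.append(transaction)
        return self.active.name

    def fail_over(self):
        """Promote the standby; raise if it is unavailable as well."""
        if not self.standby.healthy:
            raise RuntimeError("no healthy server available")
        self.active, self.standby = self.standby, self.active


if __name__ == "__main__":
    cluster = ActivePassiveCluster(Server("primary"), Server("standby"))
    cluster.write("txn-1")
    cluster.active.healthy = False       # simulate a primary failure
    print(cluster.write("txn-2"))        # handled by the promoted standby
    print(cluster.active.transactions)   # the mirror already held txn-1
```

Because every write is mirrored before the failure, the promoted standby already holds the earlier transaction. The sketch does not resynchronize a failed server when it returns, which a real deployment would need to handle.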
Security incidents
A security incident can reach disaster levels in a number of ways:
- Data corruption: If the incident causes data corruption, the organization may be forced to take systems offline until it can recover or rebuild the data. In large databases, this process can take several days, even on the fastest available computers.
- Denial of Service (DoS): A concentrated attack, especially when it originates from large numbers of systems, can render a server or an entire network of servers unreachable to customers and partners. Such attacks can last for hours, days, or even weeks.
- Forensics: Your organization (or law enforcement) may need to carry out forensic operations on affected systems to gather evidence for a possible prosecution. Trained personnel usually conduct forensics on quiescent systems (systems in which activity is halted).