Disaster Recovery
Disaster Recovery
Disaster Recovery Plan (DRP)
Detailed plan for resuming operations after a disaster
- Application, data center, building, campus, region, etc.
Extensive planning prior to the disaster
- Backups
- Off-site data replication
- Cloud alternatives
- Remote site
Many third-party options
- Physical locations
- Recovery services
RTO (Recovery Time Objective)
Measured as an amount of time
- Time to resume operations
- We would like this to be near-zero
Recovery time objective (RTO)
- Get up and running quickly
- Get back to a particular service level in a certain timeframe
RPO (Recovery Point Objective)
Also measured as an amount of time
- The goal is had a near-zero RPO
Recovery point objective (RPO)
- How much data loss is acceptable
- Bring the system back online; how far back in time does data go?
Define the right RPO
- Banking transactions, patient information
- Very short — Less than an hour
- Website updates, internal documents
- 1–4 hours
MTTR and MTBF
Meantime to repair (MTTR)
- Average time required to fix the issue
- The time from the point of the failure to full functionality
Mean time between failures (MTBF)
- Predict the time between outages
- Often takes many variables into account
Site Resiliency
Recovery site is prepped
- Data is synchronized
A disaster is called
- Business processes failover to the alternate processing site
Problem is addressed
- This can take hours, weeks, or longer
Revert back to the primary location
- The process must be documented for both directions
Cold Site
No hardware
- Empty building
No data
- Bring it with you
No people
- Bus in your team
Hot Site
An exact replica
- Duplicate everything
Stocked with hardware
- Constantly updated
- You buy two of everything
Applications and software are constantly updated
- Automated replication
Flip a switch and everything moves
- This may be quite a few switches
Warm Site
Somewhere between cold and hot
- Just enough to get going
Bing room with rack space
- You bring the hardware
Hardware is ready and waiting
- You bring the software and data
Tabletop Exercises
Performing a full-scale disaster drill can be costly
- And time-consuming
Many of the logistics can be determined through analysis
- You don’t physically have to go through a disaster or drill
Get key players together for a tabletop exercise
- Talk through a simulated disaster
Validation Tests
Test yourselves before an actual event
- Scheduled update sessions (annual, semi-annual, etc.)
Use well-defined rules of engagement
- Do not touch the production systems
Very specific scenario
- Limited time to run the event
Evaluate response
- Document and discuss
Network Redundancy
Active-passive
Two devices are installed and configured
- Only one operates at a time
If one device fails, the other takes over
- Constant communication between the pair
Configuration and real-time session information is constantly synchronized
- The failover might occur at any time
Active-active
You bought two devices
- Use both at the same time
More complex to design and operate
- Data can flow in many directions
- A challenge to manage the flows
- Monitoring and controlling data requires a very good understanding of the underlying infrastructure


