Disaster Recovery

Disaster Recovery Plan (DRP)

Detailed plan for resuming operations after a disaster

  • Application, data center, building, campus, region, etc.

Extensive planning prior to the disaster

  • Backups
  • Off-site data replication
  • Cloud alternatives
  • Remote site

Many third-party options

  • Physical locations
  • Recovery services

RTO (Recovery Time Objective)

Measured as an amount of time

  • Time to resume operations
  • We would like this to be near-zero

Recovery time objective (RTO)

  • Get up and running quickly
  • Get back to a particular service level in a certain timeframe
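
An RTO can be checked after a drill or a real outage by comparing the measured recovery time with the objective. A minimal sketch, assuming the outage start and restore times are logged; the function name and the sample timestamps are hypothetical:

```python
from datetime import datetime, timedelta


def meets_rto(outage_start: datetime, service_restored: datetime,
              rto: timedelta) -> bool:
    """Return True if service was restored within the RTO window."""
    return (service_restored - outage_start) <= rto


# Hypothetical drill: outage at 02:00, service restored at 05:30.
outage = datetime(2024, 1, 15, 2, 0)
restored = datetime(2024, 1, 15, 5, 30)

print(meets_rto(outage, restored, timedelta(hours=4)))  # 3.5 h against a 4 h RTO
print(meets_rto(outage, restored, timedelta(hours=3)))  # 3.5 h against a 3 h RTO
```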

RPO (Recovery Point Objective)

Also measured as an amount of time

  • The goal is to have a near-zero RPO

Recovery point objective (RPO)

  • How much data loss is acceptable
  • When the system comes back online, how far back in time does the data go?

Define the right RPO

  • Banking transactions, patient information
    • Very short: less than an hour
  • Website updates, internal documents
    • 1–4 hours
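
The data-loss window is the time between the last good copy (backup or replica) and the failure. A minimal sketch of comparing that window with an RPO; the function names and timestamps are hypothetical, and it assumes the time of the last good copy is known:

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_copy: datetime, failure_time: datetime) -> timedelta:
    """Everything written after the last good copy is lost."""
    return failure_time - last_good_copy


def meets_rpo(last_good_copy: datetime, failure_time: datetime,
              rpo: timedelta) -> bool:
    """Return True if the data-loss window fits inside the RPO."""
    return data_loss_window(last_good_copy, failure_time) <= rpo


# Hypothetical case: last replication at 13:20, failure at 14:05 (45 minutes lost).
last_copy = datetime(2024, 3, 1, 13, 20)
failure = datetime(2024, 3, 1, 14, 5)

print(meets_rpo(last_copy, failure, timedelta(hours=1)))     # 1 hour RPO
print(meets_rpo(last_copy, failure, timedelta(minutes=15)))  # 15 minute RPO
```

A shorter RPO generally means copying data more often, which is why the right value depends on the data.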

MTTR and MTBF

Mean time to repair (MTTR)

  • Average time required to fix the issue
  • The time from the point of the failure to full functionality

Mean time between failures (MTBF)

  • Predict the time between outages
  • Often takes many variables into account
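
Both values can be estimated from an incident log. A minimal sketch, assuming each incident records when the failure began and when full functionality returned. The incident data is invented, and this is a historical average, not a prediction that accounts for the many variables noted above. The availability ratio MTBF / (MTBF + MTTR) is a common companion figure.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure_time, restored_time)
INCIDENTS = [
    (datetime(2024, 1, 10, 8, 0), datetime(2024, 1, 10, 10, 0)),  # 2 h down
    (datetime(2024, 2, 20, 14, 0), datetime(2024, 2, 20, 15, 0)),  # 1 h down
    (datetime(2024, 3, 5, 22, 0), datetime(2024, 3, 6, 1, 0)),     # 3 h down
]


def total_downtime(incidents) -> timedelta:
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents) -> timedelta:
    """Average time from failure to full functionality."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, period_start: datetime, period_end: datetime) -> timedelta:
    """Average operating time per failure over the observation period."""
    uptime = (period_end - period_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(mtbf_value: timedelta, mttr_value: timedelta) -> float:
    """Fraction of time the system is expected to be up."""
    return mtbf_value / (mtbf_value + mttr_value)


repair = mttr(INCIDENTS)
between = mtbf(INCIDENTS, datetime(2024, 1, 1), datetime(2024, 4, 1))
print(repair, between, round(availability(between, repair), 5))
```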

Site Resiliency

Recovery site is prepped

  • Data is synchronized

A disaster is called

  • Business processes fail over to the alternate processing site

Problem is addressed

  • This can take hours, weeks, or longer

Revert to the primary location

  • The process must be documented for both directions
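
The sequence above can be sketched as a small state machine with a guard on each direction. This is an illustrative model only; the class, method, and state names are hypothetical, and a real runbook would have many more steps:

```python
from enum import Enum


class SiteState(Enum):
    PRIMARY_ACTIVE = "primary active"
    RECOVERY_ACTIVE = "recovery site active"


class SitePair:
    def __init__(self):
        self.state = SiteState.PRIMARY_ACTIVE
        self.recovery_synchronized = False
        self.primary_repaired = True
        self.log = []  # documented steps, in order

    def synchronize_recovery_site(self):
        self.recovery_synchronized = True
        self.log.append("recovery site prepped and synchronized")

    def declare_disaster(self):
        # Failover is only safe if the recovery site already has the data.
        if not self.recovery_synchronized:
            raise RuntimeError("recovery site is not prepped")
        self.state = SiteState.RECOVERY_ACTIVE
        self.primary_repaired = False
        self.log.append("disaster called: failed over to recovery site")

    def mark_primary_repaired(self):
        self.primary_repaired = True
        self.log.append("problem addressed at primary site")

    def fail_back(self):
        # The return trip needs its own guard: the primary must be fixed first.
        if self.state is not SiteState.RECOVERY_ACTIVE or not self.primary_repaired:
            raise RuntimeError("cannot revert to primary yet")
        self.state = SiteState.PRIMARY_ACTIVE
        self.log.append("reverted to primary location")


pair = SitePair()
pair.synchronize_recovery_site()
pair.declare_disaster()
pair.mark_primary_repaired()
pair.fail_back()
print(pair.log)
```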

Cold Site

No hardware

  • Empty building

No data

  • Bring it with you

No people

  • Bus in your team

Hot Site

An exact replica

  • Duplicate everything

Stocked with hardware

  • Constantly updated
  • You buy two of everything

Applications and software are constantly updated

  • Automated replication

Flip a switch and everything moves

  • This may be quite a few switches

Warm Site

Somewhere between cold and hot

  • Just enough to get going

Big room with rack space

  • You bring the hardware

Hardware is ready and waiting

  • You bring the software and data

Tabletop Exercises

Performing a full-scale disaster drill can be costly

  • And time-consuming

Many of the logistics can be determined through analysis

  • You don’t physically have to go through a disaster or drill

Get key players together for a tabletop exercise

  • Talk through a simulated disaster

Validation Tests

Test yourselves before an actual event

  • Scheduled update sessions (annual, semi-annual, etc.)

Use well-defined rules of engagement

  • Do not touch the production systems

Very specific scenario

  • Limited time to run the event

Evaluate response

  • Document and discuss

Network Redundancy

Active-passive

Two devices are installed and configured

  • Only one operates at a time

If one device fails, the other takes over

  • Constant communication between the pair

Configuration and real-time session information are constantly synchronized

  • The failover might occur at any time
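
A minimal sketch of the active-passive idea: the pair exchanges heartbeats, session state is copied to the standby while the active device is healthy, and the standby takes over after several missed heartbeats. The class names, device names, and the three-heartbeat threshold are assumptions for illustration, not any vendor's actual protocol:

```python
class Device:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True
        self.sessions = {}  # real-time session table


class ActivePassivePair:
    def __init__(self, active: Device, standby: Device, max_missed: int = 3):
        self.active = active
        self.standby = standby
        self.max_missed = max_missed  # assumed threshold before failover
        self.missed = 0

    def sync(self):
        # Copy session state so the standby can take over mid-session.
        self.standby.sessions = dict(self.active.sessions)

    def heartbeat(self) -> str:
        """Run one heartbeat interval; return the name of the active device."""
        if self.active.healthy:
            self.missed = 0
            self.sync()
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                # Failover: the standby becomes the active device.
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active.name


pair = ActivePassivePair(Device("fw-a"), Device("fw-b"))
pair.active.sessions["10.0.0.5:443"] = "established"
pair.heartbeat()              # healthy: sessions are synchronized
pair.active.healthy = False   # simulate a failure of fw-a
for _ in range(3):
    current = pair.heartbeat()
print(current, pair.active.sessions)
```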

Active-active

You bought two devices

  • Use both at the same time

More complex to design and operate

  • Data can flow in many directions
  • A challenge to manage the flows
  • Monitoring and controlling data requires a very good understanding of the underlying infrastructure
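
One way to keep active-active flows manageable is to make sure both directions of a conversation pass through the same device. A minimal sketch, assuming flows are assigned by hashing the endpoint pair; the notes do not specify a distribution method, and the device names and function are hypothetical:

```python
import hashlib

DEVICES = ["fw-a", "fw-b"]  # hypothetical active-active pair


def pick_device(src: str, dst: str, devices=DEVICES) -> str:
    """Choose a device for a flow so that both directions match."""
    # Sort the endpoints so A->B and B->A produce the same key.
    key = "|".join(sorted([src, dst]))
    digest = hashlib.sha256(key.encode()).digest()
    return devices[digest[0] % len(devices)]


outbound = pick_device("10.0.0.5:51000", "203.0.113.7:443")
inbound = pick_device("203.0.113.7:443", "10.0.0.5:51000")
print(outbound == inbound)  # both directions land on the same device
```

If one device fails, the remaining device has to carry every flow, so each one still needs enough capacity for the full load.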