Resiliency and Recovery

Resiliency

High Availability

Redundancy doesn’t mean always available

  • May need to be powered on manually

HA (high availability)

  • always on, always available

May include many components working together

  • Active/active can provide scalability advantages

Higher availability almost always means higher costs

  • There’s always another contingency you could add
  • Upgraded power, high-quality server components, etc.

Server Clustering

Combine two or more servers

  • Appears and operates as a single large server
  • Users only see one device

Easily increase capacity and availability

  • Add more servers to the cluster

Usually configured in the OS

  • All devices in the cluster commonly use the same OS

Load Balancing

Load is distributed across multiple servers

  • The servers are often unaware of each other

Distribute the load across multiple devices

  • Can be different OSes

The load balancer adds or removes devices

  • Add a server to increase capacity
  • Remove any servers not responding
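
The add-and-remove behavior above can be sketched as a toy round-robin balancer. This is a minimal illustration, not how any particular product works; the class name, the server names, and the `is_healthy` check are all assumptions made up for the example.

```python
class RoundRobinBalancer:
    """Toy load balancer that rotates requests across a pool of servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def add_server(self, name):
        # Adding a server increases capacity.
        self.servers.append(name)

    def remove_unresponsive(self, is_healthy):
        # Drop every server that fails the (hypothetical) health check.
        self.servers = [s for s in self.servers if is_healthy(s)]
        self._next = 0

    def route(self):
        # Pick the next server in rotation; servers never talk to each other.
        if not self.servers:
            raise RuntimeError("no servers available")
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server


lb = RoundRobinBalancer(["web1", "web2"])
first_three = [lb.route() for _ in range(3)]

lb.add_server("web3")
lb.remove_unresponsive(lambda s: s != "web2")  # pretend web2 stopped responding
after_change = [lb.route() for _ in range(4)]
```

Real load balancers usually offer other strategies as well (least connections, weighted distribution); round robin is used here only because it is the simplest to follow.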

Site resiliency

Recovery site is prepped

  • Data is synchronized

A disaster is called

  • Business processes fail over to the alternate processing site

Problem is addressed

  • This can take hours, weeks, or longer

Revert to the primary location

  • The process must be documented for both directions

Hot Site

An exact replica

  • Duplicate everything

Stocked with hardware

  • Constantly updated
  • You buy two of everything

Applications and software are constantly updated

  • Automated replication

Flip a switch and everything moves

  • This may be quite a few switches

Cold Site

No hardware

  • Empty building

No data

  • Bring it with you

No people

  • Bus in your team

Warm Site

Somewhere between cold and hot

  • Just enough to get going

Big room with rack space

  • You bring the hardware

Geographic Dispersion

These sites should be physically distant from the organization’s primary location

  • Many disruptions can affect a large area
  • Hurricane, tornado, floods, etc.

Can be a logistical challenge

  • Transporting equipment
  • Getting employees on-site
  • Getting back to the main office

Platform Diversity

Every OS contains potential security issues

  • You can’t avoid them

Many security vulnerabilities are specific to a single OS

  • Windows vulnerabilities don’t commonly affect Linux or macOS
  • And vice versa

Use many platforms

  • Different applications, clients, and OSes
  • Spread the risk around

Multi-Cloud Systems

There are many cloud providers

  • Amazon Web Services, Microsoft Azure, Google Cloud, etc.

Plan for cloud outages

  • These can sometimes happen

Data is both geographically dispersed and cloud service dispersed

  • A breach with one provider would not affect the others
  • Plan for every contingency

Continuity of Operations Planning (COOP)

Not everything goes according to plan

  • Disaster can cause a disruption to the norm

We rely on our computer systems

  • Technology is pervasive

There needs to be an alternative

  • Manual transactions
  • Paper receipts
  • Phone calls for transaction approvals

These must be documented and tested before a problem occurs

Capacity Planning

Match supply to the demand

  • This isn’t always an obvious equation

Too much demand

  • Application slowdowns and outages

Too much supply

  • You’re paying too much

Requires a balanced approach

  • Add the right amount of people
  • Apply appropriate technology
  • Build the best infrastructure
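
As a rough illustration of matching supply to demand, the arithmetic might look like the sketch below. The function name, the 20% headroom default, and the request figures are assumptions chosen for the example, not recommended values.

```python
import math


def servers_needed(peak_requests_per_sec, capacity_per_server, headroom=0.2):
    """Estimate how many servers cover peak demand with some spare capacity."""
    # Only plan to use part of each server so a spike does not cause an outage.
    usable = capacity_per_server * (1 - headroom)
    return math.ceil(peak_requests_per_sec / usable)


needed = servers_needed(peak_requests_per_sec=1000, capacity_per_server=150)
provisioned = 12                     # hypothetical current fleet size
oversupply = provisioned - needed    # positive means paying for idle capacity
```

With these made-up numbers, each server is planned at 120 requests per second, so nine servers cover the peak and three of the twelve provisioned servers are excess supply.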

People

Some services require human intervention

  • Call center support lines
  • Technology services

Too few employees

  • Recruit new staff
  • It may be time-consuming to add more staff

Too many employees

  • Redeploy to other parts of the organization
  • Downsize

Technology

Pick a technology that can scale

  • Not all services can easily grow and shrink

Web services

  • Distribute the load across multiple web servers

Database services

  • Cluster multiple SQL servers
  • Split the database to increase capacity

Cloud services

  • Services on demand
  • Seemingly unlimited resources (if you pay the money)

Infrastructure

The underlying framework

  • Application servers, network services, etc.
  • CPU, network, storage

Physical devices

  • Purchase, configure, and install

Cloud-based devices

  • Easier to deploy
  • Useful for unexpected capacity changes

Recovery Testing

Test yourselves before an actual event

  • Scheduled update sessions (annual, semi-annual, etc.)

Use well-defined rules of engagement

  • Don’t touch the production systems

Very specific scenario

  • Limited time to run the event

Evaluate response

  • Document and discuss

Tabletop Exercises

Performing a full-scale disaster drill can be costly

  • And time-consuming

Many of the logistics can be determined through analysis

  • You don’t physically have to go through a disaster or drill

Get key players together for a tabletop exercise

  • Talk through a simulated disaster

Fail Over

A failure is often inevitable

  • It’s “when”, not “if”

We may be able to keep running

  • Plan for the worst

Create a redundant infrastructure

  • Multiple routers, firewalls, switches, etc.

If one stops working, fail over to the operational unit

  • Many infrastructure devices and services can do this automatically
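
Automatic failover can be sketched as "use the first device that is still working". The device names and the status table below are hypothetical; real devices typically detect failures with heartbeats or dedicated redundancy protocols rather than a lookup table.

```python
def pick_active(devices, is_up):
    """Return the first operational device, in priority order."""
    for device in devices:
        if is_up(device):
            return device
    raise RuntimeError("all redundant devices are down")


routers = ["router-a", "router-b"]            # primary first, then the spare
status = {"router-a": True, "router-b": True}

before = pick_active(routers, status.get)     # primary is healthy

status["router-a"] = False                    # simulate a failure of the primary
after = pick_active(routers, status.get)      # traffic moves to the spare
```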

Simulation

Test with a simulated event

  • Phishing attack, password requests, data breaches

Going phishing

  • Create a phishing email attack
  • Send to your actual user community
  • See who bites

Test internal security

  • Did the phishing email get past the filter?

Test the users

  • Who clicked?
  • Additional training may be required

Parallel Processing

Split a process across multiple (parallel) CPUs

  • A single computer with multiple CPU cores or multiple physical CPUs
  • Multiple computers

Improved performance

  • Split complex transactions across multiple processors

Improved recovery

  • Quickly identify a faulty system
  • Take the faulty device out of the list of available processors
  • Continue operating with the remaining processors
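
Splitting one job across several workers might look like the sketch below. It uses Python's standard `concurrent.futures` thread pool as a stand-in for multiple processors; the chunking helper and the sum-of-squares workload are assumptions for illustration. For CPU-bound work in CPython, a process pool would usually be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    # Stand-in for one piece of a complex transaction.
    return sum(x * x for x in chunk)


def split(items, parts):
    # Deal the items out into `parts` roughly equal chunks.
    return [items[i::parts] for i in range(parts)]


data = list(range(1, 101))

# Each chunk is handed to a separate worker, then the partial results are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, split(data, 4)))

total = sum(partials)
```

The combined result is the same as processing the data in a single pass; only the way the work is divided changes.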

Backups

Incredibly important

  • Recover important and valuable data
  • Plan for disaster

Many implementations

  • Total amount of data
  • Type of backup
  • Backup media
  • Storage location
  • Backup and recovery software
  • Day of the week

Onsite vs. Offsite Backups

Onsite backups

  • No Internet link required
  • Data is immediately available
  • Generally less expensive than offsite

Offsite backups

  • Transfer data over Internet or WAN link
  • Data is available after a disaster
  • Restoration can be performed from anywhere

Organizations often use both

  • More copies of the data
  • More options when restoring

Frequency

How often to back up

  • Every week, day, hour?

This may be different between systems

  • Some systems may not change much each day

May have multiple backup sets

  • Daily, weekly, and monthly

This requires significant planning

  • Multiple backup sets across different days
  • Lots of media to manage
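
One way to reason about daily, weekly, and monthly sets is a small scheduling rule, sketched below. The choice of Sunday for the weekly set and the first of the month for the monthly set is an assumption for illustration; an organization would pick its own schedule.

```python
from datetime import date


def backup_sets_for(day):
    """Return which backup sets should run on the given date."""
    sets = ["daily"]
    if day.weekday() == 6:      # Monday is 0, so 6 is Sunday
        sets.append("weekly")
    if day.day == 1:            # first day of the month
        sets.append("monthly")
    return sets


tuesday = backup_sets_for(date(2024, 9, 3))
sunday = backup_sets_for(date(2024, 9, 8))
first_of_month = backup_sets_for(date(2024, 9, 1))   # also a Sunday
```

A date that falls on both a Sunday and the first of the month triggers all three sets, which is one reason the media count grows quickly.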

Encryption

A history of data is on backup media

  • Some of this media may be offsite

This makes it very easy for an attacker

  • All the data is in one place

Protect backup data using encryption

  • Everything on the backup media is unreadable
  • The recovery key is required to restore the data

Especially useful for cloud backups and storage

  • Prevent anyone from eavesdropping

Snapshots

Became popular on virtual machines

  • Very useful in cloud environments

Take a snapshot

  • An instant backup of an entire system
  • Save the current configuration and data

Take another snapshot after 24 hours

  • Contains only the changes between snapshots

Take a snapshot every day

  • Revert to any snapshot
  • Very fast recovery
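
The idea that later snapshots hold only the changes can be sketched with a toy in-memory store. The class, the dictionary standing in for a system's configuration and data, and the key names are all hypothetical; hypervisors and cloud platforms track changes at the disk-block level instead.

```python
_MISSING = object()


class SnapshotStore:
    """Toy incremental snapshots: each one stores only what changed."""

    def __init__(self):
        self._snapshots = []    # list of (changed_items, deleted_keys)
        self._last = {}

    def take(self, state):
        # Record only the keys that differ from the previous snapshot.
        changed = {k: v for k, v in state.items()
                   if self._last.get(k, _MISSING) != v}
        deleted = [k for k in self._last if k not in state]
        self._snapshots.append((changed, deleted))
        self._last = dict(state)
        return len(self._snapshots) - 1

    def revert(self, index):
        # Rebuild the state by replaying changes up to the chosen snapshot.
        state = {}
        for changed, deleted in self._snapshots[: index + 1]:
            state.update(changed)
            for key in deleted:
                state.pop(key, None)
        return state


store = SnapshotStore()
day0 = store.take({"config": "v1", "data": "a"})
day1 = store.take({"config": "v1", "data": "b", "log": "x"})
day2 = store.take({"config": "v2", "data": "b"})

restored_day1 = store.revert(day1)
restored_day0 = store.revert(day0)
```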

Recovery Testing

It’s not enough to perform the backup

  • You have to be able to restore

Disaster recovery testing

  • Simulate a disaster situation
  • Restore from backup

Confirm the restoration

  • Test the restored application and data

Perform periodic audits

  • Always have a good backup
  • Weekly, monthly, quarterly checks
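
Confirming a restoration can be as simple as comparing checksums of the original and restored data. The sketch below uses temporary files and made-up file names; copying a file stands in for whatever backup and restore tooling is actually in use.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_restore(original, restored):
    # The restore is good only if the restored bytes match the original.
    return sha256_of(original) == sha256_of(restored)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "orders.db"
    source.write_bytes(b"important business data")

    backup = Path(tmp) / "orders.db.bak"
    shutil.copy2(source, backup)            # "perform the backup"

    restored = Path(tmp) / "orders.restored.db"
    shutil.copy2(backup, restored)          # "restore from backup"
    restore_ok = verify_restore(source, restored)

    restored.write_bytes(b"corrupted")      # simulate a bad restore
    corruption_detected = not verify_restore(source, restored)
```

A matching checksum only shows the bytes came back intact; testing the restored application itself is still a separate step.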

Replication

An ongoing, almost real-time backup

  • Keep data synchronized in multiple locations

Data is available

  • There’s always a copy somewhere

Data can be stored locally for all users

  • Replicate data to all remote sites

Data is recoverable

  • Disasters can happen at any time

Journaling

Power goes out while writing data to storage

  • The stored data is probably corrupted

Recovery could be complicated

  • Remove corrupted files, restore from backup

Before writing to storage, make a journal entry

  • After the journal is written, write the data to storage

After the data is written to storage, update the journal

  • Clear the entry and get ready for the next
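
The journal-then-write sequence above can be sketched with a toy in-memory model. Everything here is an assumption for illustration: the class, the `lose_power` flag, and the dictionary standing in for the storage device. A real journal is itself written to persistent storage so that it survives the outage.

```python
class JournaledStore:
    """Toy write-ahead journal kept in memory (a real one lives on disk)."""

    def __init__(self):
        self.journal = []   # entries describing writes that have not finished
        self.storage = {}   # stands in for the storage device

    def write(self, key, value, lose_power=False):
        self.journal.append((key, value))   # 1. make the journal entry first
        if lose_power:
            # Hypothetical flag: simulate an outage before the data is written.
            raise RuntimeError("simulated power loss")
        self.storage[key] = value           # 2. write the data to storage
        self.journal.pop()                  # 3. clear the journal entry

    def recover(self):
        # After a restart, finish any write still recorded in the journal.
        while self.journal:
            key, value = self.journal.pop(0)
            self.storage[key] = value


store = JournaledStore()
store.write("balance", 100)

try:
    store.write("balance", 250, lose_power=True)
except RuntimeError:
    pass

pending = list(store.journal)   # the interrupted write is still journaled
store.recover()
```

Because the interrupted write is still in the journal after the simulated outage, recovery can complete it instead of removing corrupted files and restoring from backup.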

Power Resiliency

Power is the foundation of our technology

  • It’s important to properly engineer and plan for outages

We usually don’t make our own power

  • Power is likely provided by third parties
  • We can’t control power availability

There are ways to mitigate power issues

  • Short power outages
  • Long-term power issues

UPS

Uninterruptible Power Supply

  • Short-term backup power
  • Blackouts, brownouts, surges

UPS types

  • Offline/Standby UPS
  • Line-interactive UPS
  • On-line/Double-conversion UPS

Features

  • Auto shutdown, battery capacity, outlets, phone line surge suppression

Generators

Long-term power backup

  • Fuel storage required

Power an entire building

  • Some power outlets may be marked as generator-powered

It may take a few minutes to get the generator up to speed

  • Use a battery UPS while the generator is starting
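
Whether a UPS can bridge the gap while the generator starts is a matter of simple arithmetic: runtime is stored energy divided by load. The sketch below assumes a 90% conversion efficiency and made-up battery and load figures; actual runtime also depends on battery age and discharge behavior, so treat the result as a rough estimate.

```python
def ups_runtime_minutes(battery_wh, load_watts, efficiency=0.9):
    """Rough runtime estimate: usable watt-hours divided by the load."""
    return battery_wh * efficiency / load_watts * 60


def covers_generator_start(battery_wh, load_watts, generator_start_minutes):
    # The UPS must last at least as long as the generator takes to come up.
    return ups_runtime_minutes(battery_wh, load_watts) >= generator_start_minutes


runtime = ups_runtime_minutes(battery_wh=1000, load_watts=600)
bridges_gap = covers_generator_start(1000, 600, generator_start_minutes=5)
too_small = covers_generator_start(50, 600, generator_start_minutes=5)
```

With these example numbers the UPS lasts roughly an hour and a half, far longer than a five-minute generator start, while a 50 Wh battery under the same load would not last the five minutes.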