Resiliency and Recovery
Resiliency
High Availability
Redundancy doesn’t mean always available
- May need to be powered on manually
HA (high availability)
- Always on, always available
May include many components working together
- Active/active configurations can provide scalability advantages
Higher availability almost always means higher costs
- There’s always another contingency you could add
- Upgraded power, high-quality server components, etc.
Server Clustering
Combine two or more servers
- Appears and operates as a single large server
- Users only see one device
Easily increase capacity and availability
- Add more servers to the cluster
Usually configured in the OS
- All devices in the cluster commonly use the same OS
Load Balancing
Load is distributed across multiple servers
- The servers are often unaware of each other
Distribute the load across multiple devices
- Can be different OSes
The load balancer adds or removes devices
- Add a server to increase capacity
- Remove any servers not responding
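A minimal sketch of the add/remove behavior described above, assuming a simple round-robin policy. The `LoadBalancer` class, its method names, and the server names are hypothetical and chosen only for illustration; real load balancers use their own health checks and scheduling algorithms.

```python
class LoadBalancer:
    """Round-robin balancer that skips servers marked as not responding."""

    def __init__(self, servers):
        self.servers = list(servers)      # every server ever added
        self.healthy = set(self.servers)  # servers currently responding
        self._next = 0                    # position of the next candidate

    def add_server(self, name):
        # Add a server to increase capacity (or bring one back into rotation).
        if name not in self.servers:
            self.servers.append(name)
        self.healthy.add(name)

    def mark_down(self, name):
        # Remove a server that is not responding from the rotation.
        self.healthy.discard(name)

    def pick(self):
        # Walk the list at most once, returning the next healthy server.
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["web1", "web2", "web3"])
first_round = [lb.pick() for _ in range(3)]
lb.mark_down("web2")   # web2 stops responding
second_round = [lb.pick() for _ in range(2)]
```

The servers never talk to each other here; only the balancer tracks which ones are in rotation.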
Site Resiliency
Recovery site is prepped
- Data is synchronized
A disaster is called
- Business processes failover to the alternate processing site
Problem is addressed
- This can take hours, weeks, or longer
Revert to the primary location
- The process must be documented for both directions
Hot Site
An exact replica
- Duplicate everything
Stocked with hardware
- Constantly updated
- You buy two of everything
Applications and software are constantly updated
- Automated replication
Flip a switch and everything moves
- This may be quite a few switches
Cold Site
No hardware
- Empty building
No data
- Bring it with you
No people
- Bus in your team
Warm Site
Somewhere between cold and hot
- Just enough to get going
Big room with rack space
- You bring the hardware
Geographic Dispersion
These sites should be physically distant from the organization’s primary location
- Many disruptions can affect a large area
- Hurricane, tornado, floods, etc.
Can be a logistical challenge
- Transporting equipment
- Getting employees on-site
- Getting back to the main office
Platform Diversity
Every OS contains potential security issues
- You can’t avoid them
Many security vulnerabilities are specific to a single OS
- Windows vulnerabilities don’t commonly affect Linux or macOS
- And vice versa
Use many platforms
- Different applications, clients, and OSes
- Spread the risk around
Multi-Cloud Systems
There are many cloud providers
- Amazon Web Services, Microsoft Azure, Google Cloud, etc.
Plan for cloud outages
- These can sometimes happen
Data is both geographically dispersed and cloud service dispersed
- A breach with one provider would not affect the others
- Plan for every contingency
Continuity of Operations Planning (COOP)
Not everything goes according to plan
- Disaster can cause a disruption to the norm
We rely on our computer systems
- Technology is pervasive
There needs to be an alternative
- Manual transactions
- Paper receipts
- Phone calls for transaction approvals
These must be documented and tested before a problem occurs
Capacity Planning
Match supply to the demand
- This isn’t always an obvious equation
Too much demand
- Application slowdowns and outages
Too much supply
- You’re paying too much
Requires a balanced approach
- Add the right amount of people
- Apply appropriate technology
- Build the best infrastructure
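As a rough illustration of matching supply to demand, the sketch below estimates how many servers a peak load needs. The function name, the 25% headroom default, and the example figures are assumptions for illustration only; real capacity planning also has to account for people, cost, and growth.

```python
import math


def servers_needed(peak_requests_per_sec, capacity_per_server, headroom=0.25):
    """Estimate the servers required to carry the peak load plus spare headroom."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    required = peak_requests_per_sec * (1 + headroom) / capacity_per_server
    # Round up: a fraction of a server still means buying a whole one.
    return math.ceil(required)


# Hypothetical figures: 900 requests/sec at peak, 250 requests/sec per server.
estimate = servers_needed(900, 250)
```

Too little headroom risks slowdowns and outages; too much means paying for idle capacity.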
People
Some services require human intervention
- Call center support lines
- Technology services
Too few employees
- Recruit new staff
- It may be time-consuming to add more staff
Too many employees
- Redeploy to other parts of the organization
- Downsize
Technology
Pick a technology that can scale
- Not all services can easily grow and shrink
Web services
- Distribute the load across multiple web services
Database services
- Cluster multiple SQL servers
- Split the database to increase capacity
Cloud services
- Services on demand
- Seemingly unlimited resources (if you pay the money)
Infrastructure
The underlying framework
- Application servers, network services, etc.
- CPU, network, storage
Physical devices
- Purchase, configure, and install
Cloud-based devices
- Easier to deploy
- Useful for unexpected capacity changes
Recovery Testing
Test yourselves before an actual event
- Scheduled update sessions (annual, semi-annual, etc.)
Use well-defined rules of engagement
- Don’t touch the production systems
Very specific scenario
- Limited time to run the event
Evaluate response
- Document and discuss
Tabletop Exercises
Performing a full-scale disaster drill can be costly
- And time-consuming
Many of the logistics can be determined through analysis
- You don’t physically have to go through a disaster or drill
Get key players together for a tabletop exercise
- Talk through a simulated disaster
Fail Over
A failure is often inevitable
- It’s “when”, not “if”
We may be able to keep running
- Plan for the worst
Create a redundant infrastructure
- Multiple routers, firewalls, switches, etc.
If one stops working, fail over to the operational unit
- Many infrastructure devices and services can do this automatically
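The failover decision can be sketched as picking the first unit that is still working. This is only a sketch under stated assumptions: `active_unit`, the device names, and the status table are hypothetical, and real devices usually negotiate this with their own redundancy protocols.

```python
def active_unit(units, is_up):
    """Return the first unit, in priority order, that reports as operational."""
    for unit in units:
        if is_up(unit):
            return unit
    return None  # every redundant unit is down


# Hypothetical status table: the primary firewall has failed.
status = {"fw-primary": False, "fw-secondary": True}
current = active_unit(["fw-primary", "fw-secondary"], status.get)
```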
Simulation
Test with a simulated event
- Phishing attack, password requests, data breaches
Going phishing
- Create a phishing email attack
- Send to your actual user community
- See who bites
Test internal security
- Did the phishing get past the filter
Test the users
- Who clicked?
- Additional training may be required
Parallel Processing
Split a process across multiple (parallel) CPUs
- A single computer with multiple CPU cores or multiple physical CPUs
- Multiple computers
Improved performance
- Split complex transactions across multiple processors
Improved recovery
- Quickly identify a faulty system
- Take the faulty device out of the list of available processors
- Continue operating with the remaining processors
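The recovery behavior above can be sketched as follows. For clarity, the work is dispatched one task at a time rather than truly in parallel. The `run_tasks` function and the worker callables are hypothetical. Treating any exception as a faulty processor is a simplifying assumption.

```python
def run_tasks(tasks, workers):
    """Dispatch tasks round-robin; drop any worker that fails and retry the task."""
    available = list(workers)  # the list of available processors
    results = []
    turn = 0
    for task in tasks:
        while True:
            if not available:
                raise RuntimeError("no working processors left")
            worker = available[turn % len(available)]
            try:
                results.append(worker(task))
                turn += 1
                break
            except Exception:
                # Assumption: any exception means the processor itself is faulty.
                available.remove(worker)
    return results, available


def good_cpu(x):
    return x * 2


def faulty_cpu(x):
    raise RuntimeError("hardware fault")


results, remaining = run_tasks([1, 2, 3], [good_cpu, faulty_cpu])
```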
Backups
Incredibly important
- Recover important and valuable data
- Plan for disaster
Many implementations
- Total amount of data
- Type of backup
- Backup media
- Storage location
- Backup and recovery software
- Day of the week
Onsite vs. Offsite Backups
Onsite backups
- No Internet link required
- Data is immediately available
- Generally less expensive than offsite
Offsite backups
- Transfer data over Internet or WAN link
- Data is available after a disaster
- Restoration can be performed from anywhere
Organizations often use both
- More copies of the data
- More options when restoring
Frequency
How often to back up
- Every week, day, hour?
This may be different between systems
- Some systems may not change much each day
May have multiple backup sets
- Daily, weekly, and monthly
This requires significant planning
- Multiple backup sets across different days
- Lots of media to manage
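One way to plan multiple backup sets is a fixed calendar rule. The sketch below assumes a purely illustrative schedule (monthly on the 1st, weekly on Sundays, daily otherwise). The function name and the schedule are assumptions, not a recommendation.

```python
from datetime import date


def backup_set_for(day):
    """Pick the backup set for a date under the assumed schedule."""
    if day.day == 1:
        return "monthly"
    if day.weekday() == 6:  # Monday is 0, Sunday is 6
        return "weekly"
    return "daily"


# The first week of January 2024 (the 1st was a Monday, the 7th a Sunday).
plan = {d: backup_set_for(date(2024, 1, d)) for d in range(1, 8)}
```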
Encryption
A history of data is on backup media
- Some of this media may be offsite
This makes it very easy for an attacker
- All the data is in one place
Protect backup data using encryption
- Everything on the backup media is unreadable
- The recovery key is required to restore the data
Especially useful for cloud backups and storage
- Prevent anyone from eavesdropping
Snapshots
Became popular on virtual machines
- Very useful in cloud environments
Take a snapshot
- An instant backup of an entire system
- Save the current configuration and data
Take another snapshot after 24 hours
- Contains only the changes between snapshots
Take a snapshot every day
- Revert to any snapshot
- Very fast recovery
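The idea that later snapshots contain only the changes can be sketched with plain dictionaries. This is an in-memory illustration, not how any particular hypervisor or cloud provider stores snapshots. The `SnapshotStore` class and its methods are hypothetical.

```python
class SnapshotStore:
    """Keeps a full first snapshot, then only the changes since the previous one."""

    _DELETED = object()  # marker for keys removed between snapshots

    def __init__(self):
        self._deltas = []  # one dict of changes per snapshot
        self._last = {}    # state as of the most recent snapshot

    def take(self, state):
        # Record only keys that are new or whose value changed.
        delta = {k: v for k, v in state.items()
                 if self._last.get(k, self._DELETED) != v}
        # Record keys that disappeared since the previous snapshot.
        for k in self._last:
            if k not in state:
                delta[k] = self._DELETED
        self._deltas.append(delta)
        self._last = dict(state)
        return len(self._deltas) - 1  # snapshot id

    def restore(self, snapshot_id):
        # Rebuild the state by replaying changes up to the chosen snapshot.
        state = {}
        for delta in self._deltas[: snapshot_id + 1]:
            for k, v in delta.items():
                if v is self._DELETED:
                    state.pop(k, None)
                else:
                    state[k] = v
        return state


store = SnapshotStore()
day0 = store.take({"hostname": "web1", "version": "1.0"})
day1 = store.take({"hostname": "web1", "version": "1.1"})
```

Reverting to `day0` replays only the first snapshot, which returns the original version.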
Recovery Testing
It’s not enough to perform the backup
- You have to be able to restore
Disaster recovery testing
- Simulate a disaster situation
- Restore from backup
Confirm the restoration
- Test the restored application and data
Perform periodic audits
- Always have a good backup
- Weekly, monthly, quarterly checks
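Confirming a restoration can be partly automated by comparing checksums taken at backup time with checksums of the restored data. The sketch below assumes file contents are available as bytes. `build_manifest` and `verify_restore` are hypothetical helper names.

```python
import hashlib


def build_manifest(files):
    """Map each file name to the SHA-256 digest of its contents (bytes)."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def verify_restore(manifest, restored_files):
    """Return the names that are missing or differ after a restore."""
    restored = build_manifest(restored_files)
    return sorted(name for name, digest in manifest.items()
                  if restored.get(name) != digest)


original = {"a.txt": b"hello", "b.txt": b"world"}
manifest = build_manifest(original)  # recorded when the backup is taken
problems = verify_restore(manifest, {"a.txt": b"hello", "b.txt": b"w0rld"})
```

A matching checksum shows the data survived intact; the restored application still needs its own functional test.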
Replication
An ongoing, almost real-time backup
- Keep data synchronized in multiple locations
Data is available
- There’s always a copy somewhere
Data can be stored locally for all users
- Replicate data to all remote sites
Data is recoverable
- Disasters can happen at any time
Journaling
Power goes out while writing data to storage
- The stored data is probably corrupted
Recovery could be complicated
- Remove corrupted files, restore from backup
Before writing to storage, make a journal entry
- After the journal is written, write the data to storage
After the data is written to storage, update the journal
- Clear the entry and get ready for the next
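The journal-then-write sequence above can be sketched in memory. A real file system keeps the journal on persistent storage. The `JournaledStore` class and the `crash_before_data` flag, which simulates a power loss, are hypothetical.

```python
class JournaledStore:
    """Writes a journal entry before the data, and clears it afterwards."""

    def __init__(self):
        self.journal = []  # pending (key, value) entries
        self.data = {}     # stands in for the storage device

    def write(self, key, value, crash_before_data=False):
        self.journal.append((key, value))  # 1. make the journal entry
        if crash_before_data:
            return                         # simulated power loss
        self.data[key] = value             # 2. write the data to storage
        self.journal.remove((key, value))  # 3. clear the journal entry

    def recover(self):
        """After a restart, replay any entries still left in the journal."""
        for key, value in list(self.journal):
            self.data[key] = value
            self.journal.remove((key, value))


fs = JournaledStore()
fs.write("a", 1)
fs.write("b", 2, crash_before_data=True)  # power fails mid-write
pending = list(fs.journal)
fs.recover()
```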
Power Resiliency
Power is the foundation of our technology
- It’s important to properly engineer and plan for outages
We usually don’t make our own power
- Power is likely provided by third-parties
- We can’t control power availability
There are ways to mitigate power issues
- Short power outages
- Long-term power issues
UPS
Uninterruptible Power Supply
- Short-term backup power
- Blackouts, brownouts, surges
UPS types
- Offline/Standby UPS
- Line-interactive UPS
- On-line/Double-conversion UPS
Features
- Auto shutdown, battery capacity, outlets, phone line suppression
Generators
Long-term power backup
- Fuel storage required
Power an entire building
- Some power outlets may be marked as generator-powered
It may take a few minutes to get the generator up to speed
- Use a battery UPS while the generator is starting
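A back-of-the-envelope check of whether a UPS can cover the load while the generator starts. This is a rough linear estimate under stated assumptions: the function names, the 90% efficiency default, and the example figures are illustrative, and real battery runtime varies with load and battery age.

```python
def ups_runtime_minutes(battery_wh, load_watts, efficiency=0.9):
    """Rough runtime estimate: usable energy divided by the load."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return battery_wh * efficiency / load_watts * 60


def ups_bridges_generator(battery_wh, load_watts, generator_start_minutes,
                          efficiency=0.9):
    """True if the estimated UPS runtime covers the generator start-up time."""
    runtime = ups_runtime_minutes(battery_wh, load_watts, efficiency)
    return runtime >= generator_start_minutes


# Hypothetical figures: 1000 Wh battery, 600 W load, 5-minute generator start.
runtime = ups_runtime_minutes(1000, 600)
covered = ups_bridges_generator(1000, 600, 5)
```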



