Disaster Recovery

Disaster recovery is touched on in a couple of the CISSP domains. Domain 7 focuses on recovery planning and strategies.

Recovery Strategies

A recovery operation takes place after availability is hindered, whether by an outage, a security incident, or a disaster. Recovery strategies affect how long your organization will be down or otherwise hindered. Here are the strategies (design):

Backup storage – have a strategy and policy detailing where your backup data is stored and how long it is retained. Offsite storage is important as well.

Secure offsite storage: having backups stored offsite ensures that if your datacenter goes, your data doesn’t go with it. Third party organizations also offer recovery facilities to enable faster recovery.

  • If earthquake, flood, or fire destroy your data center, your backup data isn’t destroyed with it.
  • Offsite storage providers provide environmentally sensitive storage facilities with high-quality environmental characteristics around humidity, temperature, and light.
  • Offsite storage providers provide additional services that the company would otherwise have to manage itself, such as tape rotation (delivery of new tapes and pickup of old tapes), electronic vaulting (storing backup data electronically), and organization (labeling all media with dates and times).

Transport via secure courier: This can discourage or prevent theft of backup media while it is in transit to a remote location.

Long-term backup storage – Offsite storage providers can offer environmentally controlled storage facilities with high-quality environmental characteristics around humidity, temperature, and light. These facilities are well suited to long-term backup storage.

Additional services – other managed services are generally available like tape rotation, electronic vaulting, and organization of media.

Backup media encryption: This prevents an unauthorized third party from being able to recover data from backup media.

Recovery site strategies – multiple datacenters allow for basic primary, recovery, and regional strategies in house. Thanks to the cloud, it’s possible to extend your recovery efforts on the fly. Backups to the cloud can be cost effective; however, recovery can be expensive or too slow for your needs.

Multiple processing sites – multiple data centers with high-speed connectivity are more prevalent now than ever before. You can run multiple instances of an application across 3 or more data centers. This allows for backup-free designs, since the data is stored in at least 3 separate locations that are kept in sync. A processing site can include the cloud.

Data replication: Sending data to an offsite or remote data center, or cloud-based storage provider.

System resiliency – the ability to recover quickly. That is, if Site 1 goes down, Site 2 immediately becomes operational. Or if a disk drive fails, a spare disk drive is quickly added to the storage pool. System resilience includes eliminating single points of failure in the design of critical systems.

High availability – while resilience is focused on recovering in a short period of time, high availability is about having multiple redundant systems that enable zero downtime for a single failure. Clusters are often the answer for high availability, but there are many other methods available too. For example, you can use an HA pair. Both high availability and resiliency are needed in many organizations.
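The benefit of redundancy can be estimated with a simple formula. The sketch below assumes that each redundant system fails independently of the others and that any single surviving system can carry the full load; both are simplifying assumptions, and the function names are illustrative rather than taken from any standard library.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant systems when any one of them suffices.

    Assumes failures are independent, which real sites sharing power,
    network, or software often are not.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one system is required")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


for n in (1, 2, 3):
    a = parallel_availability(0.99, n)
    print(f"{n} system(s): {a:.6f} available, {downtime_hours_per_year(a):.2f} h/year down")
```

Under these assumptions, an HA pair of two systems that are each 99% available is down only when both fail together, which is why redundancy improves availability so sharply.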

Quality of service (QoS) – a technique that gives specified services higher priority than other services. For example, on a network, QoS might provide the highest quality of service to the phones and the lowest quality of service to social media. QoS is also used by ISPs to control the quality of services for specific partners or customers. For example, an ISP could decide to use QoS to make its own web properties or media offerings perform at their best while hamstringing the performance of its competitors. There is plenty of evidence to support this (it’s a short Google query away). This is why net neutrality is an important discussion.
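One simple way to picture QoS is a strict-priority queue, where higher-priority traffic is always sent first. The sketch below is only an illustration of that idea: the traffic classes and their priority values are hypothetical, and real network equipment typically uses more elaborate schemes such as weighted queuing.

```python
import heapq
from itertools import count

# Hypothetical traffic classes; a lower number means higher priority.
PRIORITY = {"voice": 0, "business": 1, "social": 2}


class PriorityScheduler:
    """Strict-priority packet scheduler (illustrative only)."""

    def __init__(self) -> None:
        self._heap = []
        self._order = count()  # tie-breaker keeps arrival order within a class

    def enqueue(self, traffic_class: str, packet: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[traffic_class], next(self._order), packet))

    def dequeue(self) -> str:
        """Return the highest-priority packet that has waited longest."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


scheduler = PriorityScheduler()
scheduler.enqueue("social", "s1")
scheduler.enqueue("voice", "v1")
scheduler.enqueue("business", "b1")
scheduler.enqueue("voice", "v2")
while len(scheduler):
    print(scheduler.dequeue())
```

Voice packets leave the queue before business and social media packets regardless of arrival order, which matches the phone example above.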

Fault tolerance – the ability of a system to suffer a fault but continue to operate. This capability comes from adding redundant components, such as additional disks within a redundant array of inexpensive disks (RAID), multiple power supplies, multiple network interface cards (NICs), or additional servers within a failover cluster configuration.

Recovery Site Strategies

An organization with 3 or more data centers can have a primary data center, a secondary data center (recovery site) and regional data centers.

Some organizations that have multiple data centers have a different strategy, which includes the following physical sites:

  • Redundant site – identical to the production site, with a mechanism to fail over and send traffic to the redundant site. A failure on the production site should be invisible to users and customers.
  • Hot site – a mirror site of the current production site. It should be up within 3 hours after an event.
  • Warm site – has everything needed in terms of hardware to run everything in production, but the data or systems are not up to date. Sometimes the telecom is not ready either. It can take up to 3 days for a warm site to be up and running.
  • Cold site – the least expensive option. It’s just an empty site with no hardware or data.
  • Reciprocal agreement – where you agree with another company to host each other’s recovery if there’s a problem. It can raise concerns regarding distance, data confidentiality, etc.
  • Mobile site – essentially datacenters on wheels. These are trucks with data and hardware ready to leave a site if a disaster is announced.

Backup

Differences between the following types of backup strategies:

  • Differential: copies only files that have had their data changed since the last full backup. This requires more space than incremental backup. Differential backup doesn’t clear the archive attribute, so the next differential backup will be larger than the previous one.
  • Incremental: copies only files that have changed since the last full or incremental backup. This kind of backup strategy takes more time to restore but is faster at backup time. Incremental backup clears the archive attribute, so the next incremental backup will not save the same file again if it has not changed.

Backup Strategy | Backup Speed | Restoration Speed | Space Required | Needed for Recovery                          | Clears Backup Bit
Full            | Slow         | Fast              | Large          | Last full backup                             | Yes
Differential    | Medium       | Medium            | Large          | Last full backup + last differential         | No
Incremental     | Fast         | Slow              | Small          | Last full backup + all incrementals since it | Yes
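The recovery requirements above can be sketched in a few lines. This is a minimal illustration, assuming a schedule that uses either differentials or incrementals after each full backup (not a mix of both); the labels and the `restore_set` function are hypothetical names, not part of any backup product.

```python
from typing import List, Tuple

# (label, kind), where kind is "full", "differential", or "incremental".
Backup = Tuple[str, str]


def restore_set(history: List[Backup]) -> List[str]:
    """Return the labels of the backups needed to restore the latest state.

    `history` is ordered from oldest to newest.
    """
    full_positions = [i for i, (_, kind) in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup to restore from")
    start = full_positions[-1]
    needed = [history[start][0]]          # always begin with the last full backup
    later = history[start + 1:]
    kinds = {kind for _, kind in later}
    if kinds == {"differential"}:
        needed.append(later[-1][0])       # only the most recent differential
    elif kinds == {"incremental"}:
        needed.extend(label for label, _ in later)  # every incremental, in order
    elif kinds:
        raise ValueError("unsupported or mixed schedule in this sketch")
    return needed


print(restore_set([("sun", "full"), ("mon", "differential"), ("tue", "differential")]))
print(restore_set([("sun", "full"), ("mon", "incremental"), ("tue", "incremental")]))
```

The differential schedule needs two backup sets no matter how many days have passed, while the incremental schedule needs one more set for every backup taken since the last full backup.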

System Availability

RAID is a set of configurations that employ the techniques of striping, mirroring, or parity to create large reliable data stores from multiple general-purpose computer hard disk drives.

RAID 0

Under a RAID 0 system, all data is divided into blocks, and the blocks are written across multiple drives. This is known as striping.
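Striping can be illustrated with a short sketch. The code below is a toy model, assuming a round-robin layout with a fixed block size; the function names and the in-memory lists standing in for drives are illustrative only.

```python
from typing import List


def stripe(data: bytes, drives: int, block_size: int) -> List[List[bytes]]:
    """Split data into blocks and distribute them round-robin across drives."""
    if drives < 1 or block_size < 1:
        raise ValueError("drives and block_size must be positive")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    layout: List[List[bytes]] = [[] for _ in range(drives)]
    for index, block in enumerate(blocks):
        layout[index % drives].append(block)
    return layout


def unstripe(layout: List[List[bytes]]) -> bytes:
    """Reassemble the original data by reading one block from each drive in turn."""
    depth = max((len(drive) for drive in layout), default=0)
    parts = []
    for row in range(depth):
        for drive in layout:
            if row < len(drive):
                parts.append(drive[row])
    return b"".join(parts)


layout = stripe(b"ABCDEFGH", drives=2, block_size=2)
print(layout)            # each inner list is one drive
print(unstripe(layout))
```

Because consecutive blocks land on different drives, several drives can work on one file at the same time. The same layout also shows why there is no redundancy: each block exists on exactly one drive.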

Pros

The advantage of striping is that both read and write speeds are greatly increased. This goal is also achieved without any duplication, so the entire storage capacity of each drive is used efficiently.

Cons

The downside of RAID 0 is that it doesn’t offer any protection against data loss. If any of the drives fails, the data on that drive cannot be recovered, and because files are striped across all drives, the whole array is typically lost.

RAID 1

All data is stored twice. First, it’s stored on a data drive or drives. Then it’s stored again on a mirror drive or drives.

Pros

RAID 1 is used to prevent data loss. If one drive fails, the data can be recovered because there’s already a copy of it. In addition, RAID 1 has the same read and write speeds as a single drive system.

Cons

RAID 1 requires that half the storage capacity be used on duplicated data. RAID 1 doesn’t offer any of the performance benefits of RAID 0. RAID 1 is only as fast as the slowest drive.

RAID 5

RAID 5 requires at least three drives. A checksum parity is created. This is a calculated value that can be used to rebuild data mathematically.

The data and the checksum parity of the data are then written across all drives. If any one of the drives fails, the missing data can then be recovered using the checksum.
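RAID 5 parity is commonly computed as the bitwise XOR of the data blocks in a stripe. The sketch below shows how a lost block can be rebuilt from the surviving blocks plus the parity. It models a single stripe only and ignores how real arrays rotate parity across drives; the function names are illustrative.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("blocks must be non-empty and of equal length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving: List[bytes], parity: bytes) -> bytes:
    """Recover the single missing block from the survivors and the parity."""
    return xor_blocks(surviving + [parity])


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # one stripe on three data drives
parity = xor_blocks(data)                        # stored on a fourth drive

# Simulate losing the second drive and rebuilding its block.
recovered = rebuild([data[0], data[2]], parity)
print(recovered == data[1])
```

XOR works here because applying it twice with the same value cancels out, so XORing the parity with every surviving block leaves exactly the missing block. It also shows why a second simultaneous failure is fatal: one parity block can solve for only one unknown.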

Pros

RAID 5 offers fast read speeds but is slower at writing. It protects against drive failure without requiring data duplication.

Cons

Repairing a failed drive is a complicated process that takes time. In addition, if more than one drive fails, data will be lost. This makes a RAID 5 system vulnerable to data loss during the time it takes to replace a failed drive.

RAID 6

RAID 6 is similar to RAID 5, except that two independent sets of parity data are written for each stripe instead of one. This requires a minimum of four drives, but the advantage is that two drives can now fail without data loss.

The idea behind RAID 6 is that if one drive fails, it’s highly unlikely that more than one additional drive will fail before the first failed drive is repaired.

This means that by accounting for a situation where two drives have failed simultaneously, data is protected in almost all cases.

Pros

RAID 6 is just as fast at reading as RAID 5 but it is much better at protecting against data loss.

Cons

RAID 6 is slower at writing than RAID 5. The process for replacing a drive is still time-intensive.

RAID 10

RAID 10 combines RAID 1 and RAID 0. Data is mirrored across multiple drives to protect against data loss, and striping is added to increase read speeds.

Pros

RAID 10 allows the data from a failed drive to be recovered faster than in a comparable RAID 5 or RAID 6 system.

Cons

As in RAID 1, half of the total storage capacity is used for mirrored copies.

  • Nested RAID levels – also known as hybrid RAID, these combine two or more of the standard RAID levels. RAID 50 (5+0), for example, combines the straight block-level striping of RAID 0 with the distributed parity of RAID 5.
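The capacity trade-offs of the RAID levels described above can be compared with a small calculator. This is a rough sketch that assumes identical drives, two-way mirroring for RAID 1 and RAID 10, and the commonly cited minimum drive counts; `usable_capacity_tb` is an illustrative name.

```python
# Commonly cited minimum drive counts per RAID level.
MIN_DRIVES = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4}


def usable_capacity_tb(level: str, drives: int, drive_tb: float) -> float:
    """Usable capacity in TB for an array of identical drives."""
    if level not in MIN_DRIVES:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < MIN_DRIVES[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DRIVES[level]} drives")
    if level in ("1", "10") and drives % 2:
        raise ValueError("mirrored levels need an even number of drives in this sketch")

    if level == "0":
        data_drives = drives          # striping only, no redundancy
    elif level in ("1", "10"):
        data_drives = drives // 2     # half the drives hold mirror copies
    elif level == "5":
        data_drives = drives - 1      # one drive's worth of parity
    else:
        data_drives = drives - 2      # RAID 6: two drives' worth of parity
    return data_drives * drive_tb


for level in ("0", "5", "6", "10"):
    print(f"RAID {level}: {usable_capacity_tb(level, 4, 2.0)} TB usable from 4 x 2 TB")
```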

Power and Interference

Electrical power is a basic requirement for operations. Here are the problems you can encounter with a commercial power supply:

  • Blackout – complete loss of commercial power.
  • Fault – momentary power outage.
  • Brownout – a prolonged period of low voltage, sometimes an intentional reduction of voltage by a power company.
  • Sag/dip – a short period of low voltage.
  • Spike – a sudden rise in voltage in the power supply that lasts a short period of time.
  • Surge – a rise in voltage in the power supply that lasts a long period of time.
  • In-rush current – the initial surge of current required by a load before it reaches normal operation.
  • Transient – line noise or a disturbance superimposed on the supply circuit that can cause fluctuations in electrical power.

You can mitigate these risks by installing an uninterruptible power supply (UPS). A UPS has limited capacity and can power connected systems only for a short period of time. To have power for days, a diesel generator is needed.

Noise can occur on a cable:

  • Transverse mode noise happens when there is a high charge difference between the hot and neutral wires.
  • EMI and RFI are caused by other electrical devices, light sources, electrical cables, etc.

Implement Disaster Recovery (DR) Processes

The general process of disaster recovery includes:

  1. Responding to the disruption
  2. Activation of the recovery team
  3. Ongoing tactical communication
  4. Assessment of the damage
  5. Recovery of critical assets and processes

DRP is focused on IT and is part of BCP. There are five methods to test a DRP:

  1. Read-through – everyone involved reads the plan. It helps find inconsistencies, errors, etc.
  2. Structured walk-through – also known as a table-top exercise. All of the involved parties role-play their part by reading the DRP and following a fictional scenario.
  3. Simulation test – the team is asked to respond to a simulated disaster. The response is then evaluated to make sure the DRP is valid.
  4. Parallel test – the DRP is tested for real. If there is a second site, it is activated, etc. The parallel test should never impact production.
  5. Full interruption test – production is shut down to test the DRP. It’s rarely done due to the heavy impact on production.

Business Continuity Planning

BCP is the process of ensuring the continuous operation of your business before, during, and after a disaster event. The focus of BCP is totally on business continuation and it ensures that all services that the business provides or critical functions that the business performs are still carried out in the wake of the disaster. We can say that business continuity is a strategy while disaster recovery is a tactic.

As you can see below, there are similar steps between business continuity design and disaster recovery:

  • Plan for an unexpected scenario: Form a team, perform a Business Impact Analysis for your technologies, identify a budget and figure out which business processes are mission-critical.
  • Review your technologies: Set the recovery time objective and recovery point objective, develop a technology plan, review vendor support contracts, and create or review disaster recovery plans.
  • Build the communication plan: Determine who needs to be contacted, figure out primary and alternative contact methods.
  • Coordinate with external entities: Communicate with external units such as the police department, government agencies, partner companies, and the community.

BCP should be reviewed each year or when a significant change occurs. BCP has multiple steps:

  1. Project initiation is the phase where the scope of the project must be defined.
    • Develop a BCP policy statement.
    • The BCP project manager must be named. They’ll be in charge of business continuity planning and must test the plan periodically.
    • The BCP team and the continuity planning project team (CPPT) should be constituted too.
    • It is also very important to have top-management approval and support.
    • Scoping determines which assets and which kinds of emergency events are included in the BCP. Every department of the company must be involved in this step to ensure no critical assets are missed.
  2. Conduct a BIA. BIA differentiates critical (urgent) and non-essential (non-urgent) organization functions or activities. A function may be considered critical if dictated by law. It also aims to quantify the possible damage that can be done to the system components by disaster.
    • The primary goal of BIA is to calculate the maximum tolerable downtime (MTD) for each IT asset. Other benefits of BIA include improvements in business processes and procedures, as it will highlight inefficiencies in these areas. The main components of BIA are as follows:
      1. Identify critical assets
        • At some point, a vital records program needs to be created. This document indicates where the business critical records are located and the procedures to backup and restore them.
      2. Conduct risk assessment
      3. Determine MTD
      4. Failure and recovery metrics
  3. Identify preventive controls
  4. Develop Recovery strategies
    • Create a high-level recovery strategy.
    • The systems and services identified in the BIA should be prioritized.
    • The recovery strategy must be agreed to by executive management.
  5. Designing and developing an IT contingency plan
    • This is where the DRP is designed. A list of detailed procedures for restoring IT must be produced at this stage.
  6. Perform DRP training and testing
  7. Perform BCP/DRP maintenance
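The MTD and the failure and recovery metrics from the BIA step can be expressed as simple formulas. The sketch below uses two relations that are common in CISSP study material but are not stated in the text above, so treat them as assumptions: MTD = RTO + WRT, where WRT is the work recovery time needed to verify systems and data after they are restored, and availability = MTBF / (MTBF + MTTR).

```python
def max_tolerable_downtime(rto_hours: float, wrt_hours: float) -> float:
    """MTD as recovery time objective plus work recovery time (assumed relation)."""
    return rto_hours + wrt_hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up, from mean time between failures
    (MTBF) and mean time to repair (MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only.
print(max_tolerable_downtime(rto_hours=4, wrt_hours=2))
print(f"{availability(mtbf_hours=999, mttr_hours=1):.3f}")
```

If the combined RTO and WRT for an asset exceed the MTD the business can accept, the recovery strategy for that asset needs to be revisited.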
