In this article, we will cover some fundamentals of disaster recovery and data backups. Managing backups is one of the most important tasks a sysadmin is responsible for, and designing, implementing, monitoring and managing them can take up a lot of valuable time. The aim is to implement the best backup solution for your organisation, but backups should be only one piece of the puzzle. We should also look at hardware redundancy, software that can restore files and folders more quickly than a typical restore from backup, and data archiving solutions that users can easily access when they need to. Where possible, implement solutions that let users retrieve their own data, to take some of the workload away from the IT department.
Let’s start with data backups. Whether your servers and data are stored on-premises or in the cloud, backups must be a priority. In the event of a disaster, you need to be sure that your systems and data can be fully restored.
A full backup captures everything required to perform a full restore. But if you were to run a full backup every day, you would accumulate a lot of data. To avoid this, you could perform a weekly full backup and a differential backup on each of the other days. A differential backup captures the data that has changed since the last full backup, so to restore we would only need to restore the full backup and then overlay the most recent differential backup. This option takes longer to restore than a single full backup but is often the better choice when disk space is an issue.
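To make the disk space trade-off concrete, here is a rough sketch in Python. The figures (a 500 GB data set with 10 GB of changes per day) and the function names are assumptions chosen purely for illustration, and the model ignores compression and deduplication.

```python
def weekly_storage_daily_full(full_gb, days=7):
    """Storage used in one week when a full backup runs every day."""
    return full_gb * days


def weekly_storage_full_plus_differential(full_gb, daily_change_gb, days=7):
    """Storage used in one week with one full backup and a differential
    on each of the remaining days.

    A differential holds everything changed since the last full backup,
    so it grows by roughly one day's worth of changes each day. This
    assumes each day changes different data, which is the worst case.
    """
    differentials = sum(daily_change_gb * day for day in range(1, days))
    return full_gb + differentials


if __name__ == "__main__":
    # Hypothetical figures: 500 GB of data, 10 GB changed per day.
    print(weekly_storage_daily_full(500))
    print(weekly_storage_full_plus_differential(500, 10))
```

Under these assumed numbers, seven daily full backups need 3500 GB, while one full backup plus six differentials needs 710 GB.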
Another option that makes these files even smaller is an incremental backup. An incremental backup captures all changes since the last successful full, differential or incremental backup. Although these backup files are a lot smaller, you have to take the good with the bad. The bad is that to restore from incremental backups you first have to restore the full backup and then every incremental backup taken since that full backup, in order. The good is that the backups will be quicker and take up less disk space.
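The difference in restore effort can be sketched as a small function that works out which backups have to be restored, and in what order. The `Backup` class, its fields and the day numbering are hypothetical names used only for this illustration, not part of any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # day the backup was taken (hypothetical numbering)
    kind: str  # "full", "differential" or "incremental"


def restore_chain(backups):
    """Return the backups to restore, in order, to reach the newest state."""
    ordered = sorted(backups, key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(ordered) if b.kind == "full"]
    if not full_positions:
        raise ValueError("a restore needs at least one full backup")
    # Start from the most recent full backup.
    last_full = full_positions[-1]
    chain = [ordered[last_full]]
    for backup in ordered[last_full + 1:]:
        if backup.kind == "differential":
            # A differential already contains every change since the full
            # backup, so it replaces anything restored after the full.
            chain = [chain[0], backup]
        elif backup.kind == "incremental":
            # An incremental only holds changes since the previous backup,
            # so each one has to be applied in sequence.
            chain.append(backup)
    return chain


if __name__ == "__main__":
    incrementals = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 5)]
    differentials = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 5)]
    print(len(restore_chain(incrementals)))   # full plus every incremental
    print(len(restore_chain(differentials)))  # full plus the latest differential
```

With a full backup on day 0 and four further daily backups, the incremental schedule needs five restores, while the differential schedule needs two.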
Most people opt for incremental backups, which produce smaller and quicker backup jobs. The incremental approach is recommended simply because you back up far more often than you need to restore, so it matters more that the backup process is small and fast than that the restore process is.
One of the biggest breakthroughs in backups was the introduction of a technology called the Volume Shadow Copy Service (VSS). VSS allows you to back up files while they are in use. Before VSS, sysadmins had to ensure that no users had files open at the time of the backup, otherwise those files were not backed up. This revolutionised how and when backups were performed.
Snapshotting is another backup option worth noting. This technology captures an entire system, including all system and program files, to a file. When a server dies, it can be restored quickly to its running state, along with all of the files and system services that were running at the time of the snapshot.
Now that we have covered backups, let’s discuss disaster recovery on a larger scale. Time and money must be spent planning disaster recovery thoroughly so that you can restore business continuity as quickly as possible.
The first thing you should do is meet with management and create a risk assessment. This will answer some important questions, such as how long the organisation can be down while systems are being recovered, how long we need to keep backups for, and which systems on our network are the most important to have up and running first.
The purpose of a disaster recovery plan is to document, step by step, the process you will follow to restore your systems, including all equipment and vendors needed for the recovery. The plan should define the different levels of disaster, the systems each level affects, and a separate plan for each level. Once a solid disaster recovery plan has been created, it should be tested at least once a year.
A vital part of disaster recovery is choosing the recovery method and the type of DR site. If a disaster strikes the place where you host your systems, whether that is your server room or a data centre, you will need to bring up a totally different site, either permanently or temporarily.
There are three types of DR site that we need to be familiar with here.
- Cold site – this is a site you have identified as the place where you will bring up your equipment. You do not actually house any equipment at this location. It’s just a rented location with sufficient room, cooling and power, ready to go when you need it.
- Warm site – this is basically a cold site, but with the required equipment already in place in the event of a disaster. You will still need to perform restores at a warm site, but no procurement of equipment is needed.
- Hot site – this is where you have live hardware in a room almost identical to your production environment. Your data is synchronised to the hot site in real time, so minimal work is needed to switch over and for the business to continue as normal.
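One way to compare the three site types is by the work left to do once a disaster has been declared. The sketch below models only the two properties described in the list above. The class, field names and step wording are hypothetical, and a real plan would contain far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrSite:
    name: str
    equipment_in_place: bool  # hardware is already installed at the site
    data_synchronised: bool   # production data is replicated in real time


COLD = DrSite("cold", equipment_in_place=False, data_synchronised=False)
WARM = DrSite("warm", equipment_in_place=True, data_synchronised=False)
HOT = DrSite("hot", equipment_in_place=True, data_synchronised=True)


def failover_steps(site):
    """List the broad steps needed to bring the business up at a DR site."""
    steps = []
    if not site.equipment_in_place:
        steps.append("procure and install equipment")
    if not site.data_synchronised:
        steps.append("restore systems and data from backup")
    steps.append("switch users over to the DR site")
    return steps


if __name__ == "__main__":
    for site in (COLD, WARM, HOT):
        print(site.name, "->", failover_steps(site))
```

The fewer steps a site type leaves for the day of the disaster, the more it tends to cost to keep ready, which is the trade-off the risk assessment should inform.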
Let’s talk briefly about trying to avoid a disaster recovery altogether. Best practices and failover should be in place wherever possible to ensure redundancy within your systems and network. Prevention is always better than cure.
Things to think about to create more redundancy in your network:
- Ensure your servers have dual power supplies.
- If you do have dual power supplies, plug each one into a different UPS for extra redundancy.
- Monitor your backups.
- Ensure everything that is required for a disaster recovery plan is being backed up.
- Test a restore from these backups regularly.
- Install fault-tolerant memory. Some servers allow you to set aside spare sticks of memory in case a stick becomes faulty.
- Set up a detailed monitoring and maintenance schedule.
- Implement a high availability stack with your firewall. This will involve either having two firewalls working in tandem or one ready as a failover.
- Implement a backup WAN link.
- Ensure your air-conditioning unit or cooling system is serviced regularly.
- Implement a change request process.
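As one way to act on the "monitor your backups" item above, a scheduled script can raise an alert when no recent backup file exists. This is a minimal sketch that assumes backups land as files in a single directory. The 26-hour threshold, the function names and the directory layout are all assumptions for illustration, and most backup products offer their own reporting that should be preferred where available.

```python
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, now=None):
    """Age in hours of the most recently modified file, or None if there is none."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600


def backup_is_stale(backup_dir, max_age_hours=26.0, now=None):
    """True when the directory is empty or its newest file is too old.

    The 26-hour default is an assumed allowance for a nightly job that
    sometimes runs a little late; adjust it to your own schedule.
    """
    age = newest_backup_age_hours(backup_dir, now=now)
    return age is None or age > max_age_hours
```

A file being present and recent only shows that the job ran, not that the backup is usable, which is why regular test restores remain on the list as well.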
Disaster recovery is an important process to have in place. The type of backup you use, and whether you invest in a remote site that is ready to go in the event of a disaster, will depend on your organisation. Every business should have a detailed disaster recovery plan, and time and money should be invested in it so that you are prepared if a disaster occurs in the future.