Google IT Cert – Week 23 – Data Recovery and Backups
This is Course IV, week 5 of the Google IT Support Professional Certificate program from coursera.org. This week is all about data recovery and backups; you won't be doing your job unless these systems are designed, implemented, tested, updated, and documented regularly. They are mission critical to almost any organization.
Planning for Data Recovery
The good people at Google this week are going to walk me through whatever data recovery is. It must have something to do with computers. Ah, yes: electronics break all the time and saving data is an important part of life.
What Is Data Recovery
Data recovery, they tell me, is “the process of trying to restore data after an unexpected event that results in data loss or corruption.” This kind of event can be the result of mechanical failure or hackers destroying data; either way, the result is unusable data.
Software can sometimes be used to extract data from failed hardware.
But if you have backed up your data on another device then you may be able to keep your job.
Businesses with data backup systems in place will be much more able to handle system failures than those without.
When an “event” occurs the objective is to resume normal operations in the shortest amount of time while minimizing the disruption.
Having a thorough disaster plan and procedures in place is necessary for effective disaster management.
Disaster plans must involve regular backups of all critical data.
If something does go wrong many organizations will generate a post-mortem documenting how the disaster plan was implemented and any problems encountered during the recovery process.
Reading: GitLab's Data Recovery
Read about an outage and data loss incident at GitLab.
Backing Up Your Data
The first consideration in your backup planning process should be what data to back up. It may be tempting to back up everything, but that may not be practical. Consider backing up all critical data that cannot be obtained somewhere else.
This will include:
- Email Databases
- Sales Databases
- Financial Spreadsheets
- Server configurations
- Other databases
The organization will need to pay for every file backed up, which means paring the backup down to only important information. It may not be necessary to back up everyone's computers if they are all operating from a fileshare: you could end up paying to back up everyone's downloads folder.
The organization should also consider future growth in determining a backup solution.
Backups can be created locally or by using a cloud storage service or setting up your own offsite backup server.
Local backups are convenient because they don’t require any bandwidth outside the local network. Accessing the data is much faster. But the system is vulnerable to local disasters, as well.
An offsite backup is more resistant to local failures, of course. And it allows you to have your data in multiple locations. Offsite backups also need bandwidth outside the local network, and all that data will need to be secured.
Encryption is necessary for all backups, as they will often contain massive amounts of (or all of) an organization's critical data. This usually means TLS encryption for backups in transit and encryption for backups at rest.
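As a minimal sketch of at-rest encryption, a backup archive could be encrypted with a symmetric passphrase before being stored. The filenames and passphrase here are made up for illustration, and `openssl enc` with AES-256 is just one common choice:

```shell
# Create a sample backup archive (hypothetical contents).
echo "critical data" > /tmp/demo-file.txt
tar -cf /tmp/demo-backup.tar -C /tmp demo-file.txt

# Encrypt the archive "at rest" with a symmetric passphrase.
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in /tmp/demo-backup.tar -out /tmp/demo-backup.tar.enc \
  -pass pass:example-passphrase

# Decrypt it again during a restore.
openssl enc -d -aes-256-cbc -pbkdf2 \
  -in /tmp/demo-backup.tar.enc -out /tmp/demo-restored.tar \
  -pass pass:example-passphrase
```

In practice the passphrase would come from a key management system, not the command line, but the shape is the same: encrypt before the archive leaves your control, decrypt only at restore time.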
Naturally, there are some trade-offs between different backup solutions. You may be able to easily buy a big enough NAS and start sending backups to it. But this may not be an adequate, long-term solution. You could then start using a cloud backup service and have both on- and offsite backups running concurrently, which is a best practice.
In addition to size, the backup time period will also be a factor. How long will backups need to be kept?
An old, inexpensive backup medium is magnetic data storage tape, which functions much like cassette audio tape but for computer data. While cheap, it is slow compared to modern hard drives and solid state drives. Tape is usually used for long-term archival data storage.
For some reason at this point in the video we stop discussing media and start discussing software. Okay, then…
rsync is one of many, many backup tools available. rsync transfers and synchronizes files between locations and computers, and supports SSH for secure connections.
There is also the Time Machine backup utility that comes with macOS, and the Backup and Restore tool in Microsoft Windows. Both offer file backups and full system image backups.
Reading: Backup Solutions
It should go without saying that backups need to be tested, not just created. Restoration procedures should be documented thoroughly and available to everyone who may be called upon to restore systems.
Verifying that the backup procedure and restoration procedure actually work, as outlined in the documentation, is critical to backing up data.
This is called disaster recovery testing, and it could be simulated once per year to test the preparedness and procedures needed to respond to different scenarios.
Types of Backup
There are different ways to perform regular backups of data that is constantly changing. A full backup duplicates all files in the source. This means that many files that rarely change, like operating system files, are backed up along with production data like financial documents or Word files. This can be an inefficient backup scheme.
A more efficient approach is to make an initial full backup, then create subsequent differential backups, which copy only the data that has changed since the last full backup. This saves storage space and bandwidth, as well as the time needed to perform the backup.
Common practice is to do infrequent full backups and frequent differential backups.
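That practice could be sketched as a cron schedule; the script names, paths, and times below are hypothetical:

```
# Hypothetical crontab: one full backup per week, differentials daily.
0 2 * * 0    /usr/local/bin/full-backup.sh          # Sundays at 2:00 AM
0 2 * * 1-6  /usr/local/bin/differential-backup.sh  # Mon-Sat at 2:00 AM
```

Restoring from this scheme needs at most two archives: the last full backup plus the most recent differential.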
There is also the incremental backup, which backs up only the data that has changed since the last backup, whether full or incremental. This is even more efficient, but a full restoration requires the last full backup plus every incremental backup taken since.
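GNU tar can do incremental backups via a snapshot file that records what has already been backed up. A minimal sketch with made-up paths (this requires GNU tar; the BSD tar shipped with macOS lacks `--listed-incremental`):

```shell
# Create some data to protect.
mkdir -p /tmp/demo-data
echo "version 1" > /tmp/demo-data/file.txt

# Level-0 (full) backup; this also creates the snapshot metadata file.
tar --listed-incremental=/tmp/demo.snar -cf /tmp/demo-full.tar -C /tmp demo-data

# Change the data, then take an incremental backup: only files
# changed since the snapshot was last updated are included.
sleep 1
echo "version 2" > /tmp/demo-data/file.txt
tar --listed-incremental=/tmp/demo.snar -cf /tmp/demo-incr.tar -C /tmp demo-data
```

To restore, you would extract the full archive first, then each incremental archive in order, which is exactly why losing any one incremental breaks the chain.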
Backups can be organized in archives, which preserve the file structure. These archives can be compressed using compression algorithms so they take up less space on disk. The trade-off is that restoring from compressed backups takes longer, since the archives must be decompressed first.
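The space-versus-restore-time trade-off is easy to see with a quick sketch; paths are hypothetical and gzip is just one common compressor:

```shell
# Make some highly compressible sample data (a block of zeros).
mkdir -p /tmp/demo-archive
dd if=/dev/zero of=/tmp/demo-archive/data.bin bs=1024 count=512 2>/dev/null

# Archive it, then compress the archive.
tar -cf /tmp/demo-archive.tar -C /tmp demo-archive
gzip -c /tmp/demo-archive.tar > /tmp/demo-archive.tar.gz

# Restoring now takes an extra step: decompress, then extract.
# gunzip -c /tmp/demo-archive.tar.gz | tar -x -C /tmp/restore-target
```

The compressed archive is much smaller, but every restore pays the decompression cost, which matters when you are trying to resume operations quickly.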
RAID arrays (Redundant Array of Independent Disks) take multiple physical disks and combine them into one large virtual disk. Different RAID configurations are called RAID levels. RAID allows the creation of large storage spaces while, on some levels, reducing the impact of individual disk failures.
RAID is not a backup solution, it is a storage solution with some hardware redundancy on some RAID levels.
Reading: RAID Levels
Here’s the Wikipedia article on RAID.
Backing up individual client devices can be challenging, as there will probably be many more of them than infrastructure devices.
A common solution is a cloud backup and sync service, like Dropbox, iCloud, and Google Drive.
Disaster Recovery Plans
What’s a Disaster Recovery Plan
It is very much what it sounds like: documented procedures on how to react and respond to an emergency or disaster from “the operational perspective.” The plan should include actions to be taken before, during, and after an unexpected event, in order to minimize disruption.
The recovery plan will include preventive measures to minimize the impact of disasters, like backup systems and hardware redundancy.
Detection measures will alert the IT teams that an event has occurred that can impact operations. Learning of a disaster and notifying anyone involved is critical to minimizing data loss and downtime.
Many systems will be connected to battery backup power, which will allow admins time to shut the systems down “gracefully” before they lose power. These systems should be sending out alerts whenever there are unexpected power loss events.
Other warning systems can include environmental sensors for server rooms, flood sensors, temp/humidity sensors, and smoke detectors, as well as having established evacuation procedures.
Corrective or recovery measures are enacted after a disaster has occurred. This includes rebuilding systems and restoring data from backups.
When one system in a redundant pair fails, the remaining system becomes a single point of failure: one more failure would take the whole system down. That is bad.
Designing a Disaster Recovery Plan
There is no universal disaster recovery plan because every organization has different needs and resources.
Begin planning by performing a risk assessment, which develops a clear understanding of the organization's priorities in the case of a disaster. Any system that is not redundant should be evaluated closely in a risk assessment.
Develop strong backup and recovery systems, along with a “good strategy.” This includes thorough documentation of the data recovery and restoration from backup process.
Redundancy doesn’t only apply to data and power systems—it is critical for hardware and communication systems.
All important systems and procedures should be thoroughly documented, and all documentation needs to be accessible to those who need it. Periodic verification of documented procedures is a good way to be sure that documentation is continually updated.
Detection measures will, ideally, quickly detect and alert IT teams when services and systems go down, or when there are abnormal environmental situations. Systems and services that rely on the internet should be configured with two connections, so that if one goes down it will fail over to the secondary connection.
Any “absolutely vital” systems should be closely monitored with detection measures.
Any monitoring systems need to be tested regularly, along with reaction times and disaster responses.
The disaster plan will need to contain documentation and/or links to documentation. It is also important that the documentation is available even if the server it is on goes down. Think about it!
A lot of these quizzes seem like they are designed to test your quiz-taking ability. There are a lot of “choose all that apply” multiple choice questions. One list included “preventative measures” (correct) and “preemptive measures” (incorrect). Just because the guy never said “preemptive” doesn't mean it's not the same thing as “preventative.” A dumb distinction.
What’s a Post-Mortem
Mistakes are a common and exciting part of life. With any luck, we can, sometimes, learn from them. A post-mortem report is an analysis of an incident, event, or project to learn how it went.
A post-mortem will document what led up to the event, the event itself and the response to it, and the aftermath, highlighting what went well and what did not.
“The intention isn’t to punish or shame,” he says. But to learn from the incident and to improve responses to future events.
Other teams may learn from your experiences, and make the organization better.
Writing a Post-Mortem
Let’s look at what goes into a typical post-mortem.
- Start with a short summary of the incident. What happened, how long it lasted, how it was fixed. They point out here to be careful about time zone usage. Good point.
- A timeline of key events. Include every action and attempted fixes. The timeline should end with the actions taken that resolved the situation.
- Detailed explanations of the root causes that led to the incident. This can include areas of operations that need improvement, like systems needing more testing before deployment.
- A full accounting of recovery efforts, similar to the timeline that precedes it. This should include more details about actions taken, including the rationales behind steps taken.
- Specific actions to be taken to avoid the same scenario. This will include any steps that may be taken to improve system monitoring that could help avoid future incidents.
Analyzing what went well is just as important as what went wrong. If a failover system worked as planned, that is worth noting. These justifications can help convince the stingy bastards in finance why you needed to spend money on redundant systems.
Pretty exciting quizzes this week. Yes. Really.