How do you do backup testing? In normal cases we backup data from SQL Server (either using SQL Server Agent, or another 3rd party backup agent) to Disk and then to Tape or Directly to tape. We setup alerts on backups to make sure the backup job completed successfully or failed? What else we can do, we can implement verification test after backups to make sure backups are valid. Is that really enough?
What other things should be considered? Every part of the backup and restore process needs to be tested to confirm you have good backups. It should be part of policy to test this as often as possible with minimum at quarterly.
** Assuming using SQL Server Agent Backups **.
1) Confirm backups (Email alerts, manual checks).
2) Confirm backups are going to Tape directly or are being picked up from Operating System backup (Email alerts, manual checks).
3) Request a restore from Tape (make sure it is from off-site location and not Virtual Tape Library (VTL)).
4) Run Verification on the backup to make sure no issues with tape drive and backup file is still valid).
5) Restore the database in a testing environment.
6) Check security and GET END user to try it out.
7) Verify ALL infrastructure items are there, configured, and working (tape backup, tape agent, sql backup, security, disk space, disk availability, etc.)
So Step 1-2 are normal; but that is not where backup confirmation ends. You should occasionally request restore, verify it and restore it in testing environment. Step 6 is seem a bit strange but it just as much part of restore process, I restore the database I am done but if no one can still connect? Then have I really recovered from disaster? Step 7 is explained later….
So why am I writing about this? This is basic DBA duties that should be hard-coded into our brains. Well I thought it was hard coded in my brain but sometimes because of other project constraints, time, money, etc. Process is skipped well in one such case this happened to me.
I configured SQL Server, all standard jobs; disk backups and email alerts. Everything works nice I did not get any errors. Backups were going off server that has been production for a while setup before my time on the team. So I assume everything was correct as we have been using this dump location for a while. This was new server so I didn’t get chance to run verification steps listed above after the server was up, assuming things are going well, I got busy on other tasks. Well that was bad decision on my part, because I ASSUMED! things were configured when they were not. This is were #7 comes into play, I assumed and was left in an awe when I was told this dump location doesn’t have any tape agent installed on it.
I got contacted by client, because of a bug in the application it allowed the client to delete their ENTIRE WEBSITE (Content Managed System) with SQL Server Backend. I found out 1 day after the fact so I did not have backups on disk any more. I contact the Storage guys to ask for restore and found out they DON’T HAVE ANY tape backups for my dump location. So I quickly made copy of what Transaction Log Backups, Data backups I hand on hand to see if I can get some kind of restores. But found the transaction log backups needed the full backup that has been over written already.
Since this was a complete data lose I had to turn to 3rd party vendor tools for data-recovery in non-standard manner. I used the ApexSQL Log tool to read the transaction log to recovery engineer all the records that were deleted. However using this tool found few annoyances that you’ll have to watch out for:
1) The tool will no generate any recovery script for binary type objects. In file format or other wise.
2) If you have large varchar columns or such the system will try to be creative and change the Enter/Carriage Return to Char(10) + Char(13) string concatenation. Problem with that is for very large varchar columns SQL Server could not prase the SQL String successfully any more, for more please read here.
3) In default settings the code does not generate the GUIDS, so if your database heavily relies on this this setting has be turned off before reverse engineering.
4) While the tool is reading transaction log it chugs through alot of memory, for example reverse engineering, transaction logs with over 1M transactions with 1.5GB ram program crashed on me every time. So I processed the transaction logs in small chucks.
5) I found the tool to be extremely slow when selecting/unselecting and the filter functionality to be limiting.
So aside from these limitations this tool still saved me, it couldn’t recovery enough to get site working. But it recovered enough that CMS product vendor was able to recreate the pages because of the redundant nature of the data.