Brent Ozar is hosting a webcast today about disaster recovery horror stories, and it got me thinking back to my “favorite” horror story and how much I have learned since then. I have been working with SQL Server since 1997, but I was not a real dba until I took a job at a large insurance company in 2000. At that time, a lot of large companies only trusted SQL Server to be the back end of their websites, but did not trust it to run their actual business.
This company was like many others: their website had kind of grown up without any real structure or plan. Eventually the company realized that they needed to add structure to the site, and I was the dba on site to help roll out the changes to the database. The updates were expected to take 3 hours to complete, and we got started at 10 AM. I was not given these updates beforehand, so I did not know that they used cursors instead of set-based queries to update hundreds of tables with millions of rows.
Let me stop here and give you some background. At this time I had two very large dogs: a super friendly German Shepherd mix, and a pit bull / mastiff mix named Truck who weighed about 100 pounds and was very protective of me and our house. I was obsessed with these dogs; they were my life. Truck passed away 2 years ago and I still talk about him constantly.
The 3-hour update took closer to 9 hours, and then the problems started. The website no longer served pages; every page was giving timeout errors. Of course, today I would immediately know what the problem was and what to do, but back then was a different story. A conference call was set up with every VP at the company on the line, plus web developers, hardware guys, network guys, me, and the lead dba. The lead dba had less experience than I did. Everyone was off checking their own thing and no one could figure out what was happening. By about 10 hours in, I was in a panic; my dogs needed to go out and be fed. I was new to the city and did not have any friends or family to go check on my boys. The VP of my division would not let me leave, but because I was so upset he said that he would go let my dogs out. I told him that he did not understand that Truck was a very scary dog. I continued to work while the VP went to let my dogs out.
I finally got the bright idea to run a trace on the database, and I noticed that while most queries took 10 ms to execute, one took 10 minutes to run and seemed to be called all the time. Duh! I was young and stupid. This was the early days, before change control made its way to SQL Server, so I just added a couple of indexes to that table and poof, the site loaded. The site came up, but that was about it; it was a performance nightmare. About this time the VP called from outside my house to say he was not going in, but had found a neighbor kid who did yard work for me to brave it. Truck knew the neighbor kid and let him in, but hated that VP.
So I worked for 16 hours running traces and adding indexes until the site worked. After that I was the human performance tuning wizard.
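For what it is worth, here is roughly what that night's detective work looks like on a modern version of SQL Server: pull the most expensive statements out of the plan cache instead of wading through a trace, then index accordingly. This is just a sketch; the table and column names are made up for illustration, not the actual schema from that site.

    -- Find the statements with the highest total elapsed time in the plan cache.
    SELECT TOP (10)
        qs.total_elapsed_time / qs.execution_count / 1000 AS avg_elapsed_ms,
        qs.execution_count,
        SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
            ((CASE qs.statement_end_offset
                  WHEN -1 THEN DATALENGTH(st.text)
                  ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_elapsed_time DESC;

    -- That night the fix was as simple as indexing the hot table,
    -- something along these lines (hypothetical table and columns).
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID)
        INCLUDE (OrderDate, Status);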
The postscript to this story is the actual worst disaster I had, but it is far from my favorite story. I had scheduled to apply SP2 on one of our large SQL Server 2005 instances one Friday night. We had tested SP2 and applied it on about 15 servers prior to this install, so I did not anticipate any issues. Because we have international customers, our maintenance window is Saturday morning at 2 AM, a time I hate. Sleepy DBAs make bad decisions, but it is part of the job. Before I started I took a full system backup, and then ran the SP2 installer. Some of you may have encountered the issue with SP2 where it failed due to a previous installation. So I went out, found the Windows installation cleanup tool, and removed anything I thought might be the problem. I reran the SP2 installation and it installed without issue. I thought I was finished and just rebooted the server. But the server would not come back up; the master database was corrupt. How? Why? I am not sure. So I tried to restore from my backup, but I was unable to because I was getting an error that the backup was not readable. I knew then that I was in for a long night.
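The lesson I have beaten into my head since then: a backup you have not verified is just a file. These days, before any risky change, I would take the backup with checksums and prove the file is readable before touching the server. A minimal sketch, with a hypothetical path:

    -- Take the pre-change backup with checksums so corruption is detectable.
    BACKUP DATABASE master
        TO DISK = N'\\backupserver\sql\master_pre_sp2.bak'
        WITH CHECKSUM, INIT;

    -- Prove the file is actually readable before starting the service pack.
    RESTORE VERIFYONLY
        FROM DISK = N'\\backupserver\sql\master_pre_sp2.bak'
        WITH CHECKSUM;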
Another bit of back story: Truck, my beloved dog, had cancer. He was undergoing chemo and I had hoped that he was doing better. I knew there would be no cure, but still I hoped. The day of the upgrade I noticed that he was a little lethargic, but that sometimes happened after chemo.
Truthfully, I don’t remember exactly what I did, but I was able to get the server back up. I think I restored the master database backup on another server as a regular database and then made a backup of the restored database. This new backup worked, but then I found out there were problems with msdb. At this point, I had been up for 24 hours and was exhausted, so I made the call to go home, get some sleep, and then deal with the msdb problem the next day.
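If you are ever stuck in the same spot, the trick I believe I used looks something like this: restore the old master backup on a scratch server under a different name, then take a fresh backup of that copy. The paths and the new database name here are hypothetical.

    -- On a scratch server, restore the master backup as an ordinary user
    -- database so it does not collide with that server's own master files.
    RESTORE DATABASE master_recovered
        FROM DISK = N'\\backupserver\sql\master_old.bak'
        WITH MOVE 'master'  TO N'D:\Data\master_recovered.mdf',
             MOVE 'mastlog' TO N'D:\Data\master_recovered.ldf';

    -- Then back up the restored copy; a backup like this is what finally
    -- restored cleanly on the broken server.
    BACKUP DATABASE master_recovered
        TO DISK = N'\\backupserver\sql\master_recovered.bak';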
The next day, Truck was still lethargic, but truthfully I was so consumed with my broken server that I did not pay much attention. I continued troubleshooting the msdb issue, but made the decision to scrap the entire install and reinstall SQL Server on new hardware; it was just screwed up. I scheduled this to happen Sunday night at 10 PM.
Thankfully, the rebuild went off without incident and the server was back up and running. The next morning, a new DBA (Ryan) was starting, so I would have help and not have to deal with this kind of thing alone in the future. I went home and discovered that Truck had taken a turn for the worse. I rushed him to the emergency vet and stayed up most of the night with him. I called our CTO and asked him to welcome Ryan and get him settled, because I had to take Truck to the cancer vet first thing in the morning.
The vet told me that he would need to run some tests on Truck and that I should go in to work. So I went to work, welcomed Ryan, and started to deal with the fallout of the weekend. A few hours later the vet called and told me that Truck was bleeding internally and that it was time. I left work to say goodbye to Truck and then went home, where I planned to stay for the next few days mourning the world's greatest dog. However, the life of a dba is not always convenient.
On the day my beloved Truck died, I had to go back into work to help a customer who had encrypted all their data using the server master key, which we do not support and do not back up. When the server was rebuilt, a new master key was generated. I am no encryption expert and I was in no mood to deal with this situation, but Ryan had been on the job less than 6 hours and I was the only dba, so I called up Microsoft to help me try to recover the master key from the old server, which was not easy. On the worst day of my relatively young life, I was on the phone with PSS.
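One footnote for anyone who lands here with the same problem: what I am calling the server master key is, I believe, the service master key, which SQL Server generates at install time, so a rebuilt instance gets a brand new one. Backing it up ahead of time is a single statement; the path and password below are hypothetical.

    -- Back up the service master key somewhere off the server.
    BACKUP SERVICE MASTER KEY
        TO FILE = N'\\backupserver\keys\smk_prod01.key'
        ENCRYPTION BY PASSWORD = N'SomeStrongPassword!2007';

    -- On a rebuilt server, restore it so existing encrypted data
    -- can still be decrypted.
    RESTORE SERVICE MASTER KEY
        FROM FILE = N'\\backupserver\keys\smk_prod01.key'
        DECRYPTION BY PASSWORD = N'SomeStrongPassword!2007';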
When things go wrong, it is rarely convenient.