It’s Difficult To Be A System Administrator

[WARNING: Geeky Content Ahead]

Sometimes, things go well – most of the time, actually. But then you get things like a massive power outage hitting while you’re in the middle of doing something meant to prevent catastrophic data loss when the power goes out. And the thing you were doing when the power went out actually CAUSES catastrophic data loss. 😦

Two weeks ago, one of the hard drives in my personal file server failed. It’s part of a 1.5TB hardware RAID1 (mirror) array. For those who don’t know what that is: I have a special card in my server that lets me build redundant arrays, so that if a single disk fails, no data is lost. That’s what I have: two 1.5TB drives mirrored to provide full redundancy. One of them failed and the server started to beep… telling me “fix my drive!” So I did. I got a new drive, and the array spent the next 20 hours or so rebuilding the 1TB of data that lives on it. Problem solved!

Then, I did something dangerous. I started to think.

I thought, “Hmmm… that hard drive failed way too early in its lifecycle. What if my other ones fail too?” You see, the drive I just replaced held only my personal data [which is very important], but it wasn’t part of my infrastructure. I could use that data on any machine, but if I had many more failures I’d have to rebuild a ton of stuff just to get to it.

Just to give you an overview of my “home server farm”: I have [for purposes of this discussion] 2 host servers that serve up virtual machines (VMs). I have somewhere around 30 VMs spread across these hosts. Most are work-related, helping me design and build solutions for my customers and do research, and some are for my personal use, such as a desktop, web server, email, and the file server mentioned above. There are also a few machines to maintain the farm – Domain Controllers, DNS, Certificate Authorities, etc.

Each one of these computers is actually a file on one of the two host boxes – and those files are rather large. If something were to happen, say a power outage, there’s no guarantee that my battery power would last long enough for me to shut down the machines cleanly.

With all that in mind, and the idea that in this instance a “server” is actually a big file on the order of 40GB to 200GB in size, I thought that making the disk upon which these files sit a mirrored array would be a smart thing to do. Which it is.
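To make that concrete: if you ever want to see just how big those “servers” are, a quick bit of PowerShell along these lines will show you. The D:\VMs path and the .vhd/.vhdx extensions are only examples – your virtual disk folder and file types may be different.

    # List virtual disk files under a VM folder, largest first (path and extensions are examples)
    Get-ChildItem -Path 'D:\VMs' -Recurse -Include *.vhd,*.vhdx |
        Sort-Object -Property Length -Descending |
        Select-Object -Property Name, @{Name='SizeGB'; Expression={ [math]::Round($_.Length / 1GB, 1) }}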

So: on Friday afternoon, I began the process of creating mirrors on two of my servers. One server, with most of my work VMs on it, has no RAID card. On that one, I used the Windows OS to mirror two 1TB drives after shutting down the machines and moving the files off of one. Once I had that started, I moved to the other server, my personal one, and did the same thing – shut down the VMs and moved them to another disk. Then, I repurposed the vacant disk and joined it to the one which now held the VMs and began building a mirrored volume.
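For the curious, the Windows-native mirror on the box without a RAID card is just dynamic-disk mirroring. A rough diskpart sketch of the idea looks like the lines below – the disk and volume numbers are placeholders, so check “list disk” and “list volume” on your own machine before trying anything like it.

    rem inside diskpart (run as administrator); disk and volume numbers are placeholders
    list disk
    select disk 1
    convert dynamic
    select disk 2
    convert dynamic
    list volume
    rem pick the simple volume that holds the VM files
    select volume 3
    add disk=2
    rem "add disk=2" mirrors the selected volume onto disk 2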

With the RAID card in the other server, I could do this on the fly while the disk stayed available. Before I turned the VMs back on, the utility said the rebuild would take about 2.5 hours. I turned the machines on. Now it said it would take about 20 hours.

I should have left them off.

Well past the original 2.5-hour estimate, at about 10:30 that night, the power went off. 14 or so hours later, it came back. I went down to power things on. It all “looked” okay for a while – I was getting email again, but the web server was wonky and slow and some other things were just kind of weird.

Looking further, it appeared that the rebuild had to be restarted since it had lost power, so I restarted it. [Note: the work server using the Windows OS mirror simply came back on automatically, began rebuilding its mirror, and completed with zero errors.] About 2 hours later, there was a loud, obnoxious beeping from the server closet. The rebuild had failed and the drive had simply dropped offline. Gah!
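(As an aside: if you ever want to check on a Windows software mirror after an event like that, diskpart will tell you how it’s doing. The volume number below is a placeholder.)

    rem inside diskpart: the Status column shows whether the mirrored volume is Healthy or still rebuilding
    list volume
    select volume 3
    detail volume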

All my VMs disappeared for a moment. Rescanning the array with the utility made them come back, but now I was very worried. Since the VMs were all off, I copied all the files to a second drive and built the array from scratch [after several attempts to find and fix whatever bad sectors or corrupt tables were on the drives]. I moved the files back after the rebuild completed. I turned on the VMs only to have half of the machines not come back – the half that mattered, of course. One domain controller, my web server, my desktop, and the email server were the biggest losses. I had the old disks, but I had to actually reinstall the OS on all of them and begin the slow, painful process of restoration.
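If you’re ever doing a bulk copy like that, robocopy is a reasonable tool since it retries and logs as it goes. Something along these lines works – the source, destination, and log paths are just examples:

    rem copy the VM files to a second drive before rebuilding the array (example paths)
    robocopy D:\VMs E:\VM-Backup /E /COPYALL /R:1 /W:1 /LOG:C:\Temp\vm-copy.log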

Which is where I am today. I have the web server working [obviously], and we now have email with empty mailboxes. I have a recovery database ready to go, but there are issues with the old database, so I need to finish patching the Exchange server to the same version it was on before so that the recovery tools will work properly. That’s what I’m doing now.
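For anyone following along at home, the recovery-database dance in the Exchange Management Shell looks roughly like the sketch below. This assumes Exchange 2010 SP1 or later; the server name, paths, and mailbox names are placeholders, and the old database has to be in a clean-shutdown state (or repaired) before it will mount.

    # check the state of the old database file first (Clean Shutdown vs. Dirty Shutdown)
    eseutil /mh "D:\Recovery\OldMailboxDatabase.edb"

    # create and mount a recovery database pointed at the old .edb (names and paths are examples)
    New-MailboxDatabase -Recovery -Name RDB1 -Server MYEXCH01 `
        -EdbFilePath "D:\Recovery\OldMailboxDatabase.edb" -LogFolderPath "D:\Recovery\Logs"
    Mount-Database RDB1

    # pull one mailbox out of the recovery database into the live (empty) mailbox
    New-MailboxRestoreRequest -SourceDatabase RDB1 -SourceStoreMailbox "Some User" -TargetMailbox someuser@example.com

    # watch the restore progress
    Get-MailboxRestoreRequest | Get-MailboxRestoreRequestStatistics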

All of this work has taken 5 days or so to get things back up. I now have most of the critical VMs housed on RAID drives. I just need one more to complete the process.

At least I learned more about doing Exchange server mailbox recovery.

Power Outage

Many of you may have noticed that the site was down for several days. That was directly due to the power outage. Not that we’ve been without power, mind you – power came back within 12 hours of the loss. (Yay! Air Conditioning!) What happened was that I was in the middle of moving critical files from one disk to another when the power failed. The failure damaged several of my servers, including the web server and the email server. I have no email for now, but I hope to have it up soon – even if I have to do without the old stuff.

But, as you can now see: the web site is up and running.

Thank you for your patience!