crash! 0H N03Z!

on monday morning i checked my email then left for work. a couple of hours later, i went to log in to check my mail via ssh…connection refused. eh? tried again, same thing. i tried going to my website. could not connect to the server. uh oh. i logged into the 1and1 admin site and made a serial console connection to my server…it was spewing mysql write errors. uh oh! i went back to the admin site and told it to reboot my server. i watched it through the serial console, and when it got to the hard drive it barfed, said it couldn’t mount the drive due to disk errors, and told me to run e2fsck. UH OH. i went back into the admin console, told it to reboot into recovery mode, then used the serial console to log in and ran e2fsck on the partition with my linux installation (i.e., the os, data, etc…basically everything). it started complaining about missing inodes, moving stuff into lost+found, sizes not matching… UH OH! i went through the e2fsck run hitting “y” every time and cringing a little bit. fortunately, it made it through without going crazy and losing an insane number of inodes and paths. i got to the end and mounted the partition. it mounted. i did an “ls” and i could see the directories.
at this point i was pretty sure my hard drive had just vomited and was in the process of failing, but i wanted to take the opportunity to copy everything i could think of that i needed before calling support. i quickly went through the system and copied everything important i could think of to another server (web directories, web config files, mail directories, home directories, mysql data files, plesk files, some files in /etc, and so on). i managed to get everything without seeing errors. yea!
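(for the curious, the copying step amounted to something like the python sketch below. it’s a rough sketch rather than my exact script, and the rescue host and paths are made up for illustration; it just shows the idea of rsyncing each important directory off the dying disk.)
```python
# rough sketch of pulling important directories off the failing disk
# to another box. the destination host and the paths are placeholders,
# not my actual setup.
import subprocess

RESCUE = "me@rescue.example.com:/backup/crash/"   # hypothetical rescue server

IMPORTANT = [
    "/var/www/",        # web directories and web config
    "/var/qmail/",      # mail directories (plesk-ish layout; may differ)
    "/home/",           # home directories
    "/var/lib/mysql/",  # raw mysql data files
    "/etc/",            # system config worth keeping
]

for path in IMPORTANT:
    # -a preserves permissions/timestamps, -z compresses over the wire,
    # -R recreates the full source path under the destination directory
    result = subprocess.run(["rsync", "-azR", path, RESCUE], check=False)
    if result.returncode != 0:
        print(f"rsync reported a problem with {path} (exit {result.returncode})")
```
going directory by directory like this also means you find out right away if one of them hits an unreadable block, instead of one giant copy dying at the very end.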
you see, i’m one of those i.t. professionals who takes care to behave like a professional at work, but in my personal computer life i live life on the edge: i don’t do backups. so i was facing the possibility of losing all of my blog stuff (like i did in the great disk crash of sept 2004), personal files, genealogical data i’d entered into a database, etc. even after i managed to copy all of the files i did, it was possible i might still lose a lot of blog data because i hadn’t made any real backups of it; all i had was an old backup from mid to late 2006, plus the static pages. i could recreate the blog with that, but i’d lose comment info from the database, entries that were saved but unpublished, etc. but i did have a copy of the data files themselves, and as long as the database had been in a stable state when the system went down, those files might not be corrupted.
after pulling all the files i could think of, i went ahead and rebooted the server into normal mode, just to see if the fsck might have fixed things. i watched it boot through the serial console, but as it booted the startup scripts started complaining about not finding “ls” and other basic commands. it looked like some important inodes had gotten borked. it got to the login prompt and i was able to log in, but i couldn’t run most commands other than those built into the shell itself (like “echo”). i also started seeing seek and read/write errors. something was definitely fubar.
i called support, got a guy on the line, and explained what had happened and what i’d done, concluding with the statement that my hard drive appeared to be failing. he responded with “do you have any proof for believing that?” i told him i could send him an email with some of the fsck inode issues and other stuff, and he said okay. i emailed it to him. i also asked whether they could save the hard drive so i could try to get data off of it if i wanted, and he actually said they could. (back in 2004, they weren’t too interested in helping me with that failing hard drive…they just put in a new one and told me “tough luck”.) he said it’d be a while before the techs could look at it, so i left for supper with jack (who’d come over to eat with me).
when i got back, the support guy had sent an email. he said he had forwarded my email to the techs and they saw no reason to believe it was a hard drive problem, so i should run badblocks or re-image the server. i immediately got annoyed, because there was no way i was going to rebuild on a hard drive that definitely appeared to be failing. but then i saw he’d sent another email about 30 minutes later, and it said to ignore the first one and they’d be replacing my hard drive. awesome!
i decided at this point to make a shift: i don’t really use armyoftexas.com, diablostejanos.com, and republicoftexas.org much; they don’t have much of anything in the way of web files or email accounts; and there’s pretty much no special stuff on them (other than the blog i’d installed on diablostejanos.com), so i moved them all to google apps. that’s right: those three domains are now being run off of google apps. that may change at some point, but i figured now was as good a time as any to make the shift and see how google apps works out. i’ll write more about that sometime.
i didn’t see an email before i went to bed, so i figured i’d just work on the server tuesday. tuesday morning i got up and still didn’t see an email, but when i got into work i found out they’d emailed my work address around 1am to say they had re-imaged my system per my request. uh…what? “re-imaged?” i logged in through the serial console and the system was spewing hard drive errors. further, the partitions were different (and laid out in what appeared to be a weird partitioning scheme). so it appeared they had wiped my data and re-imaged my original drive rather than installing a new one. *sigh*
i called support again and explained my situation to a new guy. i guess he decided it was beyond him, because within a few minutes he said he was going to transfer me to the server support group. (yea! someone who could better understand the situation!) a guy answered and i talked to him. he said the ticket showed they’d pulled the hard drive and replaced it, but he logged in and saw the disk errors himself. he said it seemed surprising that it’d be another bad drive, but he’d look into it. i asked him about slaving my old drive, but he said the servers only have room for one drive; they could either trade out drives, or give me access to a test server with the old drive for a day or two so i could try to pull data off of it. he said it’d probably be 1 to 4 hours, so i thanked him and started the wait.
about 5 hours later, i got an email that my server was being re-imaged per my request. at this point i didn’t care what the email said, as long as i didn’t get on and see drive errors. i logged in and everything looked stable. and so i started the process of reconfiguring and restoring web, mail, etc. i spent tuesday evening and into early wednesday morning getting bohemianphotography.com, leifeste.net, and failure.net (including this blog) back up and functioning (minus database stuff).
tonight i finally started playing with the database files, and i’m happy to report that none of them were corrupted and (as far as i can tell) all of my stuff is back up and functional.
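(i won’t claim this is exactly how i verified it, but a reasonable way to sanity-check restored mysql data files is to drop them back into the data directory, start the server, and run CHECK TABLE against everything. the python sketch below does roughly that; pymysql and the root credentials are assumptions for illustration, not necessarily what i actually used.)
```python
# rough sketch: after restoring the saved files into the mysql data
# directory and starting mysqld, run CHECK TABLE on every table and
# report anything that doesn't come back "OK".
# pymysql and the root password here are assumptions for illustration.
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="changeme")
with conn.cursor() as cur:
    cur.execute("SHOW DATABASES")
    databases = [db for (db,) in cur.fetchall()
                 if db not in ("information_schema", "mysql")]
    for db in databases:
        cur.execute(f"SHOW TABLES IN `{db}`")
        for (table,) in cur.fetchall():
            cur.execute(f"CHECK TABLE `{db}`.`{table}`")
            # each result row is (table, op, msg_type, msg_text)
            for _, op, msg_type, msg_text in cur.fetchall():
                if msg_text != "OK":
                    print(f"{db}.{table}: {op} {msg_type} {msg_text}")
conn.close()
```
(mysqlcheck from the command line does basically the same thing; scripting it just makes it easy to filter for anything that isn’t “OK”.)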
so that’s what it’s like living life on the i.t. edge. rock and roll.
