Thursday, June 14, 2007

Ubuntu Edgy Server Troubles

One thing led to another, but finally it all worked out (for now).

First, I found that my VIA C3 Linux Software RAID server (scroll down in this link for details) couldn't take a simple cp backup without taking a dive. Unreal.

I got a reasonable deal on a 500GB external drive, USB 2 of course. That was my first problem: the junk board didn't have USB 2 ports, so I had to upgrade it with a two-port USB 2 PCI card. Great.

My first attempt at backing up my /data & /home partitions with a simple tar command (no compression) ended in a kernel crash. Next I tried an arcane cp | tar type command (to maintain file permissions). I will have to see if I can find it again. Anyway, it ended the same way. What's going on here?
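
If memory serves, it was something along the lines of the classic tar pipe, which preserves permissions and ownership across the copy (reconstructed here from memory; /data and /mnt/backup are example paths):

# Copy /data to the external drive, preserving permissions & ownership
(cd /data && tar cf - .) | (cd /mnt/backup/data && tar xpf -)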

Time for some custom kernel action. I built a couple on my VMware build server (basically the same OS sans the RAID setup, plus the kernel source & build tools installed). I have the VMware build box on the AMD64 box--much faster kernel builds. It's very easy to do: customize your configuration and use these instructions to build your packages. Nothing to it. I got rid of extraneous fluff that wasn't needed for my C3 box, configured it so that ReiserFS and RAID 1 & 5 were compiled into the kernel (as opposed to modules), and hoped that would do the trick. No dice.
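
For anyone playing along at home, the Debian/Ubuntu kernel-package recipe I was following goes roughly like this (a sketch; the -c3 version suffix is just an example, and the resulting .deb name can vary with your kernel-package version):

cd /usr/src/linux
make menuconfig        # compile ReiserFS & RAID 1/5 into the kernel, not as modules
make-kpkg clean
fakeroot make-kpkg --initrd --append-to-version=-c3 kernel_image kernel_headers
dpkg -i ../linux-image-2.6*-c3*.deb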

I was even reduced to running simple cp -av commands for the backup. I mean really, what OS is going to choke on a copy command? Still no luck. It still broke.

Ok, one more shot with software and then it was time for something drastic. I grabbed the latest stable Linux source from kernel.org and compiled yet another kernel. Surely the bugs would be gone and I would see success. Nope. It was apparent (and I had a hunch all along that this was the case) that the processor just could not deal with the load. Sure, there was enough memory on the board, but it just wasn't going to happen. The C3 is just too weak.

Drastic Measures


I pulled my Windows 2000 drive out of the AMD64 box (with all my neato PCI cards in it: RME Hammerfall, NTSC/HDTV tuner card, etc.) and put my RAID disks into the case. I used a standard (non-C3-specific) kernel to boot and everything was running smoothly--except no ethernet. Oh brother...

The kernel module for the NIC was loaded, but ifconfig showed nothing but the loopback. I was forced to track down the configs to make it work. You would think it would all be in /etc/network, but not quite. I will spare you the details and get to the point: you make changes in two places--or perhaps just one (if you have the right NIC driver loaded). In my case I made changes to /etc/network/interfaces & /etc/iftab.

Here is what interfaces looks like (I have a static IP address for it; it makes life easier on my network, & it is a server--believe it or not):

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo eth1
iface lo inet loopback
# This is a list of hotpluggable network interfaces.
# They will be activated automatically by the hotplug subsystem.
# The primary network interface
iface eth1 inet static
address 192.168.0.5
netmask 255.255.255.0
broadcast 192.168.0.255
network 192.168.0.0
gateway 192.168.0.1


The only real change I made here was replacing eth0 with eth1. Even without modifying the next file, the interface would come up after a lengthy wait while the system autoconfigured the NIC. I believe the next file, /etc/iftab, could have remedied this issue without even touching the previous file--you'll see why shortly:


# This file assigns persistent names to network interfaces. See iftab(5).
# eth0 mac 00:00:0d:33:50:83
eth1 mac 02:R2:D2:C3:P0:07


I could have just changed the MAC address for eth0 and I'll bet everything would have come up just fine. Instead, in my case, eth0 disappears and eth1 takes over. No big deal. It comes up just as it should. I may go back and "fix" it for semantics I suppose, but I doubt it: if it ain't broke, don't fix it.
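
In other words, the fix-in-place would presumably have been a one-line edit to /etc/iftab, keeping the eth0 name and just binding it to the new card's MAC (same anonymized address as above):

# Keep the eth0 name; point it at the new NIC's MAC
eth0 mac 02:R2:D2:C3:P0:07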

Ok, so the network is now working, but I forgot to mention that during all of the crashes, when the drives were in the C3 box, I was running reiserfsck checks on the md devices all the time--even on the external drive. I have to say, ReiserFS is pretty dang solid. When it did come up with errors, reiserfsck --fix-fixable /dev/mdx would generally do the job. Unfortunately, since you cannot unmount your root filesystem, that one was a bit of a trick. A couple of times, though, --fix-fixable didn't work and I had to use the more drastic --rebuild-tree (absolutely back up your partition before running this command). Since I was now able to actually get a good backup onto the external HDD without the OS dying, it was a good time for --rebuild-tree where needed. It was nice and fast, and it worked without any issues on the partition that needed it.
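
For reference, the basic escalation goes something like this (/dev/md1 is an example device here; the filesystem has to be unmounted first):

# Read-only check first
umount /dev/md1
reiserfsck --check /dev/md1

# Repair the correctable corruptions in place
reiserfsck --fix-fixable /dev/md1

# The nuclear option -- back up the partition before trying this
reiserfsck --rebuild-tree /dev/md1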

So now you would think all was right with the world. The server running smoothly on nice fast hardware--life is good. However, I had this nagging feeling that / needed to be checked--since it had been crashed several times. You can't unmount root when it's running of course, so an extensive test required outside intervention. Enter everyone's favorite LiveCD, Knoppix.

I wanted to get Knoppix to assemble /dev/md0 (my RAID1 root device) with mdadm, but I was never able to get it to work for me. Instead I ran a simple reiserfsck check on each unmounted partition that was part of the array. It's RAID1 (mirroring), so each disk is simply a copy of the others. The first two disks came out fine, no corruption. The third, hdg1, was not so fortunate: reiserfsck wanted a --rebuild-tree on that one. Uh, no thanks... since I had no idea what that would do to the other two drives when it was rebooted without Knoppix.

Enter the aforementioned mdadm, an extremely useful & powerful tool for RAID administration, even if it is not well documented (or at least, I didn't find everything on it that I was looking for, even though "google is [my] friend"). I decided that the simplest fix would be to "fail" the drive with the corrupted partition, reformat it, and then put it back into the array. I don't know about you kids, but this is what I came up with (via this article) and it worked perfectly well for me:

# Fail the corrupted partition
mdadm /dev/md0 --fail /dev/hdg1 --remove /dev/hdg1

# Reformat the partition
mkreiserfs /dev/hdg1

# Add the partition back into the RAID device
mdadm /dev/md0 --add /dev/hdg1

The OS knew right what to do and added hdg1 right back in as the spare drive for the RAID1 device (which was the original setup of this particular array--two drives mirroring with one spare).
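
If you want to verify that the array actually took the disk back (and watch any resync in progress), something like this should do it:

# Watch array state and resync progress
cat /proc/mdstat

# More detail on a specific md device
mdadm --detail /dev/md0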

Ok, so now I am confident that all my partitions are clean, even the external drive, and my RAID1 & 5 devices are working properly. What else could possibly go wrong now?

I really cannot say what change(s) I made that caused the following disaster, but the recovery was quite simple in the end--even if it did take almost all day to finally stumble upon it. Here's the list of things that went terribly wrong, in no particular order:
  • Samba ceased to start automatically
  • Webmin ceased to start automatically--or sometimes wouldn't start at all.
  • /etc/mtab & mount would only display the root filesystem as being mounted (even though nothing had been changed in /etc/fstab). mount -a would not resolve the issue either, but even though they were not listed, /home & /data were still available. Very, very strange. Totally disconcerting.
  • Swap would not automatically start.
I manually started Webmin and proceeded to change various boot-time options. One option that caused me terrible grief was the checkfs.sh and/or checkroot.sh options at boot time. Do yourself a favor and don't enable these in Webmin. The box would lock up hard and keep the root filesystem in a read-only state. This causes a world of grief because the OS needs to be able to write. In this state you couldn't even manually start some of the services you wanted. Fortunately I found a command that would undo this particular disaster:

mount -w -n -o remount /

I've seen this before, but in the midst of my sorrows I couldn't remember it (-w remounts read-write, and -n skips writing to /etc/mtab--handy while the filesystem is still read-only). Once / was back in rw mode, I could start Webmin again, play around with boot-time settings, and still have problems. Something was wrong with the init scripts. Something terribly wrong, and I hadn't a clue how to fix it.

I decided that I would try to find a package through aptitude search that would allow me to reset the init scripts to "factory default," since I had obviously blown mine up somehow. Wonderful tool, aptitude--highly recommended. This is what I did:

sudo -s
apt-get update
apt-get upgrade

# After finding what I was looking for
# via aptitude search I issued the
# following command

aptitude remove initscripts

# Ugly messages follow about breaking
# things and uninstalling other important
# packages but I went ahead with the removal
# anyway (making note of the other packages
# that were getting whacked as well).
#
# Time to put everything back based on my
# notes of the packages removed

aptitude install initscripts ubuntu-minimal system-services upstart-compat-sysv
reboot

Reboot... The moment of truth... And there it was! Everything was back to normal. All the filesystems mounted--showing up in /etc/mtab & when the mount command was executed. Samba & Webmin started automatically. "No runs, no drips, no errors." "All was right with the world."

And there you have it. Many disasters and many (recovery) lessons learned. Using Knoppix to run reiserfsck on the root-partition RAID1 drives was nice. I wish I could have figured out how to make it assemble the RAID device under Knoppix, though. That would have been even better, and a necessity for checking a RAID5 device if your OS blows up completely.
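
For the record, the assembly I was attempting was presumably along these lines (the member device names are examples from my setup; I never got it working under Knoppix, so treat this as a sketch, not a recipe):

# Assemble the mirror explicitly from its member partitions
mdadm --assemble /dev/md0 /dev/hde1 /dev/hdf1 /dev/hdg1

# Or let mdadm scan for RAID superblocks and assemble what it finds
mdadm --assemble --scan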

Now I have to find a permanent home for the drives (and the OS that lives on them): a box with good airflow & a CPU/MoBo/RAM with enough capacity to meet my seemingly simple needs. I'm not leaving them in the AMD64 box. I need to put that back together so we can use it as the big workstation with all the toys in it.

Well, when all is said and done, at least when I move the drives again I should be better prepared for some of the weirdness that may occur...
