Saturday, June 21, 2008

Where's GRUB?! Ubuntu 8.04 LTS Server & RAID1

Where to begin? (I guess check out my previous post?)

First, my release to release upgrade to Ubuntu 8.04 LTS server was going along fairly well. The upgrade from Edgy (6.10) to Feisty (7.04) went fairly well, except that it dropped one of my drives from the RAID arrays—easily remedied. Just added the partitions back into their respective RAID devices.

Next up, time to move from Feisty (7.04) to Gutsy (7.10) and if all went well, the final move to Hardy Heron (8.04 LTS).

All did not go well in the upgrade from 7.04 to 7.10—although, I must admit here that most of it was my fault. This time I did the upgrade the "official" way.

Network upgrade for Ubuntu servers (recommended)

If you run an Ubuntu server, you should use the new server upgrade system.
  1. enable the "dapper-updates" repository
  2. install the new "update-manager-core" package - dependencies include python-apt, python-gnupginterface and python2.4-apt.
  3. run "sudo do-release-upgrade" in a terminal window
  4. follow the steps on the terminal window
This approach seemed to work just fine, and since my box is headless I even ran it over SSH without any issue (even with the warning that doing the upgrade over SSH is probably not ideal).

When I rebooted, however, the box had no network connectivity. ifconfig revealed only the lo interface; eth0 was gone. sudo lshw showed that the NIC was disabled for some reason. I finally tracked the problem down to /etc/udev/rules.d/70-persistent-net.rules, which had "updated" eth0 to eth1 for whatever bizarre reason. I simply changed eth0 to eth1 in /etc/network/interfaces, ran sudo /etc/init.d/networking restart, and all was well again.
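For anyone chasing the same symptom, the mismatch can be spotted by listing the names udev hands out next to the names the interfaces file expects. This is only a rough sketch: the helper names list_nic_names and configured_ifaces are made up for illustration, the sample rule and MAC address are invented, and the rule syntax is assumed to match the ATTRS{address} style (it differs between releases). The demo works on scratch files, not the live system.

```shell
#!/bin/sh
# Hypothetical helpers for comparing udev's NIC names with the names
# that /etc/network/interfaces expects. Names and sample data are made up.

# Print "MAC NAME" for every rule in a persistent-net rules file.
list_nic_names() {
    sed -n 's/.*ATTRS\{0,1\}{address}=="\([^"]*\)".*NAME="\([^"]*\)".*/\1 \2/p' \
        "${1:-/etc/udev/rules.d/70-persistent-net.rules}"
}

# Print every interface name that has an "iface" stanza.
configured_ifaces() {
    awk '$1 == "iface" { print $2 }' "${1:-/etc/network/interfaces}"
}

# Demo on scratch copies so nothing on the real system is touched.
rules=$(mktemp)
ifaces=$(mktemp)
cat > "$rules" <<'EOF'
# illustrative rule, MAC address invented
SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="00:11:22:33:44:55", NAME="eth1"
EOF
printf 'auto lo\niface lo inet loopback\n\nauto eth0\niface eth0 inet dhcp\n' > "$ifaces"

echo "udev assigns:"
list_nic_names "$rules"
echo "interfaces file expects:"
configured_ifaces "$ifaces"
rm -f "$rules" "$ifaces"
```

If the two lists disagree (eth1 on one side, eth0 on the other), either file can be edited so they match; called without arguments, the helpers read the real files.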

Next (again), one of my drives was missing from the RAID array devices. It should have been a simple matter of adding the partitions back in via the Webmin Linux RAID module. However, I wasn't paying enough attention, and it appears that I tried to add a partition already in use to its own array (why Webmin would even list a partition to add when it is already in use is questionable). It is possible that I am totally confused on this point, but when I did a cat /proc/mdstat it showed the array "rebuilding" so slowly that I was sure something was definitely wrong.
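A quick way to see which arrays are actually short a member is to look for an underscore in the [UU_] status field of /proc/mdstat. Here is a minimal sketch of that check; the function name degraded_arrays is hypothetical and the sample mdstat text is invented to resemble a three-drive setup, not copied from my box.

```shell
#!/bin/sh
# degraded_arrays [FILE]: print the name of every md array whose status
# field contains "_" (a missing member). Defaults to /proc/mdstat.
# Hypothetical helper; the sample below is made-up data.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        # The status field looks like [UU_]; "_" marks a missing member.
        /\[[U_]*_[U_]*\]/ { print name }
    ' "${1:-/proc/mdstat}"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
Personalities : [raid1] [raid5]
md1 : active raid5 sdb2[1] sda2[0]
      312496128 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

md0 : active raid1 hde1[2] sdb1[1] sda1[0]
      9767424 blocks [3/3] [UUU]

unused devices: <none>
EOF

degraded_arrays "$sample"   # prints: md1
rm -f "$sample"
```

Run with no argument on a live system, it reads /proc/mdstat directly and prints nothing when every array is complete.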

Here my brilliance really shines through. Since the other partitions were delaying sync until the first one finished, I thought I would stave off as much damage as I could by shutting down the box. I can't recall if I tried this via telinit 0 or if I simply powered off the box in my haste. At any rate, I really wreaked havoc on my /home & /data RAID5 partitions. / on md0 (RAID1) was fine. The important partitions did not fare so well. reiserfsck --rebuild-tree did its best to salvage the carnage, but a lot of damage was done. I quickly determined that a restore of /home & /data from my external backup drive would be necessary. [big sad sigh]

Well, if I was going to go through that grief, I figured I might as well just rebuild the whole box with Hardy Heron from a fresh CD install, using the ext3 file system instead of reiserfs, since any further development of reiserfs is almost certainly at an end.

And thus it began.

Installing from the 8.04 LTS CD is really quick and fairly painless, aside from manually setting up the RAID partitions--even that goes pretty fast though (once you've gone through it about a dozen times). This is where most of my troubles began. I would set up the RAID arrays & partitions during the install but it would go crazy. The arrays would start rebuilding before the process was completed. RAID devices would show up that weren't even added during the partitioning process. On & on the troubles went.

I can't tell you how many times I tried getting things to work and how many different approaches I took to the problem. I will save you the gory details. The fix is rather arcane and it took forever to figure out. Google was not my friend on this matter. Am I the only one to have these issues? Lucky me...

Here is the problem: Even though I would completely delete the partitions and format them with ext3 instead of reiserfs it didn't fix anything. What was happening is that the install program was seeing the old RAID superblocks from my original setup and using them to rebuild arrays during the install process. This had dreadful effects. I had to get rid of those old RAID superblocks and start fresh. Enter Knoppix.

Initially, I thought I would completely wipe the three drives with the following command from the Knoppix CLI:

dd if=/dev/zero of=/dev/hda (hde & hdg)

As each drive is 320GB, this would have taken FOR-EV-ER. Forget it. Next please...

Note: If you ever need to use the following procedure, don't delete your partitions before running this command (because you won't be able to).

You can probably do the same thing from the install CD by exiting to a shell, but Knoppix booted to init 2 was fine for my purposes...

If you are doing this from the install CD use the following first:
make sure the RAID devices are not mounted (i.e. umount /dev/md0 etc.)
sudo mdadm --stop /dev/md0 (repeat until all RAID arrays are stopped, i.e. md1, md2, etc.)

Using either Knoppix or install CD, kill the RAID super-blocks:
mdadm --misc --zero-superblock /dev/hda1 (or sda1 if the distro installed shows your IDE drives as SCSI.)

Repeat for each RAID partition on each of the drives! For example:

mdadm --misc --zero-superblock /dev/sda1
mdadm --misc --zero-superblock /dev/sda2
mdadm --misc --zero-superblock /dev/sdb1
mdadm --misc --zero-superblock /dev/sdb2

You get the idea...
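Typing that out for every partition gets old fast, so the whole procedure can be wrapped in a loop. The following is a sketch only: the array and partition lists are placeholders you would have to replace with your own, and the run and wipe_superblocks names are made up. By default it just prints the commands; nothing is executed unless RUN=1 is set.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) the mdadm commands that stop the arrays
# and wipe the old RAID superblocks. Device names are placeholders.
ARRAYS="/dev/md0 /dev/md1"
PARTS="/dev/sda1 /dev/sda2 /dev/sdb1 /dev/sdb2"

# Dry-run wrapper: echo the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

wipe_superblocks() {
    # Arrays must be stopped before their member superblocks can be zeroed.
    for md in $ARRAYS; do
        run mdadm --stop "$md"
    done
    for part in $PARTS; do
        run mdadm --misc --zero-superblock "$part"
    done
}

wipe_superblocks
```

Reading the printed list before re-running with RUN=1 (as root) is the whole point of the dry-run default, since zeroing the wrong superblock is not something you can undo.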

OK. The partitioning problem is solved. You're back to installing via CD and partitioning is working just as you want it to. The rest of the installation process runs smooth as glass.

Enter problem two...

Upon completion of the installation, the system WILL NOT BOOT?!?!

And when I say it won't boot, I mean not at all. Grub doesn't even try to load. I was stuck at "Booting CD" and it just hung there!

Unbelievable. I tried reinstalling Grub from the install CD in Rescue Mode to no avail. I tried nuking the MBR on each drive via dd if=/dev/zero of=/dev/hda bs=512 count=1 (same command for the other two drives, hde & hdg) and then attempted to install Grub again... Nothing I did mattered. It would not boot!
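One thing worth knowing about that dd command: the first 512-byte sector holds the 446 bytes of boot code plus the 64-byte partition table and a 2-byte signature, so bs=512 clears the partition table too. The sketch below demonstrates the difference on a scratch file standing in for a disk (no real device is touched). The helper names are invented for this example, and conv=notrunc is only there because the target is a regular file.

```shell
#!/bin/sh
# Demonstrate what "dd bs=512 count=1" clears versus "bs=446 count=1",
# using a scratch file as a stand-in for a disk. Helper names are made up.

make_scratch_disk() {      # $1 = path; fill 1 MiB with 0xFF bytes
    dd if=/dev/zero bs=1024 count=1024 2>/dev/null | LC_ALL=C tr '\000' '\377' > "$1"
}

wipe_boot_code() {         # zero bytes 0-445 only (partition table survives)
    dd if=/dev/zero of="$1" bs=446 count=1 conv=notrunc 2>/dev/null
}

wipe_first_sector() {      # zero the whole first sector, as in the post
    dd if=/dev/zero of="$1" bs=512 count=1 conv=notrunc 2>/dev/null
}

zeros_in_first_sector() {  # count the zero bytes in the first 512 bytes
    n=$(dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C tr -d '\377' | wc -c)
    echo $((n))
}

img=$(mktemp)
make_scratch_disk "$img"
wipe_boot_code "$img"
echo "zero bytes after boot-code wipe: $(zeros_in_first_sector "$img")"
wipe_first_sector "$img"
echo "zero bytes after full-sector wipe: $(zeros_in_first_sector "$img")"
rm -f "$img"
```

The first count comes out as 446 and the second as 512. In my case the drives were about to be repartitioned anyway, so losing the table did not matter, but on a disk with data you want to keep the narrower wipe is the safer experiment.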

At this point I gave up on installing Ubuntu 8.04 LTS Hardy Heron server edition.

I was beaten.

Since Dapper Drake 6.06 LTS is still supported (until June 2011) and I had the install disc, I decided to try it. What the heck, after all the time I'd wasted already, why not give it a go?

It installed perfectly. It booted perfectly. It updated via apt-get perfectly.

After I determined I wasn't dreaming and I had not as yet spent the hours it would take to restore the backup files to the server drives, I figured I would try a network upgrade from 6.06 to 8.04 LTS (since you can skip all the intermediate releases when going from one LTS version to the next).

It worked!

Everything appears to be in order. It booted up just fine. The network started properly. The RAID arrays are running. cat /proc/mdstat indicates no problems with them. All is well so far.

I just had to make a slight change in the (official) upgrade process:

sudo apt-get install update-manager-core

sudo do-release-upgrade --mode=server

Without --mode=server it didn't think there was an upgrade available.

Time to install Webmin (it just makes life easier). Don't use apt-get to install Webmin; get it from the main site. This is what I did to get it rolling:

sudo nano /etc/apt/sources.list

Add the following lines:

deb hardy universe
deb-src hardy universe
deb hardy-security universe
deb-src hardy-security universe
# deb hardy-backports main restricted universe multiverse
# deb-src hardy-backports main restricted universe multiverse

Next run the following:

wget -v http://some-mirror/sourceforge/webadmin/webmin_some-version_all.deb

md5sum webmin_some-version_all.deb and check it against the hash listed at
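Comparing 32 hex digits by eye is error-prone, so the check can be left to the shell. This is a small sketch: verify_md5 is a hypothetical helper name, and the demo uses an empty scratch file (whose MD5 is the well-known d41d8cd98f00b204e9800998ecf8427e) in place of the real .deb and its published hash.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED: succeed only if FILE's MD5 equals EXPECTED.
# Hypothetical helper; substitute the real package name and published hash.
verify_md5() {
    actual=$(md5sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Demo with an empty scratch file standing in for the downloaded package.
pkg=$(mktemp)
: > "$pkg"
if verify_md5 "$pkg" d41d8cd98f00b204e9800998ecf8427e; then
    echo "checksum OK"
else
    echo "checksum MISMATCH, do not install" >&2
fi
rm -f "$pkg"
```

Because the function returns a proper exit status, it can gate the install step directly, e.g. verify_md5 FILE HASH && sudo dpkg --install FILE.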

Follow these instructions when ready:

sudo apt-get update

sudo apt-get install perl libnet-ssleay-perl openssl libauthen-pam-perl libpam-runtime libio-pty-perl libmd5-perl

sudo dpkg --install webmin_some-version_all.deb

And that's it. I restored my files to the /data & /home partitions, configured my Samba shares and all is right with the world.

For now...

Wednesday, June 18, 2008

Ubuntu Upgrade & Software RAID

You never know what you are going to get when you do a distribution upgrade.

Somewhere along the way I "upgraded" my little fileserver from Ubuntu 6.06 LTS (Long Term Support) to 6.10 (NOT Long Term Support). Why? I do not know. Support for 6.06 LTS will end June 2011 but support for 6.10 ended April 2008!! Unfortunately, I only realized this yesterday. No wonder I hadn't seen any package updates for some time. :-(

Time to upgrade, NOW!

Ubuntu's latest is another LTS release, 8.04. So my big plan is to somehow get from 6.10 to this latest LTS release. Supposedly you can do it if you upgrade a release at a time. Away we went using Method 2 here...

Changed my sources in /etc/apt/sources.list, replacing all instances of edgy with feisty.

sudo apt-get update

sudo apt-get dist-upgrade

sudo apt-get -f install

sudo dpkg --configure -a

sudo telinit 6
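The sources.list edit at the start of that sequence can be done with sed instead of by hand. A rough sketch, assuming the codename appears nowhere else in the file; switch_release is a made-up helper name, and the demo runs on a scratch file with typical-looking entries rather than on the real /etc/apt/sources.list.

```shell
#!/bin/sh
# switch_release OLD NEW FILE: replace every occurrence of codename OLD
# with NEW in FILE, keeping a backup as FILE.OLD.bak.
# Hypothetical helper; plain substring replacement, so check the result.
switch_release() {
    cp "$3" "$3.$1.bak" || return 1
    sed "s/$1/$2/g" "$3.$1.bak" > "$3"
}

# Demo on a scratch file with sample entries.
src=$(mktemp)
cat > "$src" <<'EOF'
deb http://archive.ubuntu.com/ubuntu edgy main restricted
deb http://archive.ubuntu.com/ubuntu edgy-updates main restricted
deb http://security.ubuntu.com/ubuntu edgy-security main restricted
EOF

switch_release edgy feisty "$src"
cat "$src"
rm -f "$src" "$src.edgy.bak"
```

On the real file this would need sudo, and the backup copy makes it easy to diff the before and after prior to running apt-get update.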

It rebooted, so I thought I was probably safe.
sudo lsb_release -a showed me that I was indeed upgraded to 7.04. All was right with the world...

Not so.

My RAID5 & RAID1 devices were missing a drive! What happened?!

It turns out that upgrading turned two of my IDE drives into SCSI devices and left one as a regular IDE device. Bizarre. The missing RAID drives were the partitions from the one IDE device that was left. I used Webmin to simply add the appropriate partitions from the IDE drive to the SCSI RAID devices and that was it. It worked without sending my data to /dev/null.

The RAID5 devices rebuilt and the RAID1 was back to mirroring with a spare. All was right with the world.
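I did the re-adding through Webmin, but the same thing can presumably be done from the command line with mdadm's --add option. The sketch below only prints the commands so they can be reviewed first; readd_members is a hypothetical helper, and the array/partition pairs are examples rather than my actual layout.

```shell
#!/bin/sh
# readd_members ARRAY:PARTITION ...: print one "mdadm ARRAY --add PARTITION"
# command per argument. Nothing is executed; pairs shown are examples only.
readd_members() {
    for pair in "$@"; do
        array=${pair%%:*}   # text before the first ":"
        part=${pair#*:}     # text after the first ":"
        echo "mdadm $array --add $part"
    done
}

readd_members /dev/md0:/dev/hda1 /dev/md1:/dev/hda2
```

Once the printed commands look right they can be run as root, and cat /proc/mdstat will show the rebuild progress.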

My question is, what's going to happen during my upgrade from 7.04 to 7.10 to 8.04? Will I be so lucky? I was pretty freaked out when my IDE drives "turned" into SCSI devices. Just weird, and why the heck did it leave one of the IDE drives as such instead of making it a SCSI device? I do not know...