Lion and hard drive failure

Posted by Pierre Igot in: Macintosh
April 18th, 2012 • 8:54 am

Yesterday morning, when I went back to my computer after a good night’s sleep, I immediately noticed that something was wrong. My first scheduled SuperDuper! backup had not completed (the backup was stuck at the last stage, ejecting the disk image). Things were a bit slow. Several applications were frozen solid. It is not the first time this has happened to me. As I indicated in this post a couple of months ago, there is an ongoing issue with Lion and SuperDuper! backups. The difference this time was that, while the applications were frozen, they were not saturating my cores with excessive CPU usage. They were simply “quietly” frozen.

I force-quit them and decided to reboot my machine, just to make sure everything was back to normal. And that’s when things started to go really wrong. The first attempt to reboot resulted in a solid gray screen that would not go away. I then did a hard reset. After that, the login screen took a long while to appear, but it did eventually appear. Unfortunately, while the mouse pointer was moving, none of the login buttons were responding to mouse clicks, and the screen was not responding to keyboard input either.

After a couple more hard resets that didn’t yield any improvement, I tried booting with the Option key down. Here again, things took an abnormally long time, but eventually the screen with the various boot volume options did appear. The expected volumes were on the screen, but when I clicked on my Recovery HD partition, nothing happened. The button turned dark gray, but the booting process didn’t start.

I rebooted again, also holding the Eject key down to open the disk tray, so that I could insert my Lion DVD. Once again, things took a very long time, but eventually the tray did open. I inserted the DVD, and eventually got to the boot volume options screen again, but here again nothing worked.

I then zapped the PRAM, by holding the Option, Command, P and R keys down during startup. It worked, but here again, I found that the length of time between chimes was abnormally long. And unfortunately zapping the PRAM didn’t help. The machine still wouldn’t boot.

I was starting to get really worried. What if I had a general hardware failure and not just a disk failure? At that point, I was remembering my experience of a couple of years ago, when I had to fix a MacBook that wouldn’t even boot from disk and discovered that a simple hard drive failure was sufficient to make the whole machine unusable, even when attempting to boot from an optical disk. So I was not ruling out the possibility that an internal disk failure was making the whole machine unusable, but I was still quite worried.

I wasn’t worried about loss of data. I had nightly backups of all my important stuff and surely I didn’t have several different drives failing on me at the same time. My worry was about the Mac Pro itself and having to replace the entire machine, which is no longer under warranty.

Then I started unplugging external stuff in an attempt to bring the machine back down to a simpler state. And one of the first things that I unplugged was an old external FireWire hard drive that I have been using for years to copy some less important files.

As soon as I did that, the Mac Pro started booting, not from my normal startup volume, which is an SSD drive, but from the internal backup of that SSD drive, which is a regular hard drive. The entire booting process was very long (more than 10 minutes), for several reasons. I failed to intervene in time to stop Lion from reopening all the applications and windows that I had left open at the time of the last save. A hard drive is also slower than an SSD. And the very first boot from a backup system volume, even one that is “blessed” as a booting volume, probably requires the building of all kinds of hard drive caches that help speed up subsequent reboots.

Once the booting was complete, I noticed that the backup was from the day before last, which confirmed that the scheduled SuperDuper! backup of my startup volume had also failed during the night. But I was then able to go to System Preferences, reselect my SSD volume as the startup volume, and restart from it without any further difficulty.

And sure enough, while the external FireWire drive was still on but not connected by FireWire to the Mac Pro, it started to make the typical, ominous noises that a failing hard drive makes, including those little “beeps” that a normal hard drive never makes.

Phew! I was mightily relieved, not just because I didn’t have a hardware failure in my Mac Pro or in any of its internal drives, but also because the SSD volume was intact and I didn’t lose my last 24 hours of work either. (And I had a very recent backup of that external hard drive too, so I wasn’t worried about any data loss.)

Still, the questions remain: Why did the simple failure of an external drive that was only connected to the Mac Pro by FireWire cause all this grief? Why is Mac OS X unable to deal gracefully with such disk failures? How come the failure of an external drive is enough to prevent a Mac Pro from booting from its internal volumes, and to slow everything down to an abysmal crawl, even something so basic as zapping the PRAM?

In light of my similar experience with the MacBook two years ago, I can only conclude that this is a flaw in recent Macintosh machines, possibly because of the switch to an Intel-based architecture. (In all my troubleshooting years, I don’t remember ever having to deal with such interference. If a hard drive failed, I was always able to boot from CD in no time and isolate the problem.)

I also cannot help but wonder about what the future holds. With regular hard drives, you can still actually hear the failures and hear when things are back to normal and the hard drive is churning away. As soon as I unplugged the defective FireWire drive, I started hearing all kinds of internal hard disk activity, and I also started hearing the beeps from the hard drive inside the FireWire enclosure, indicating a serious hard drive failure.

But with SSD drives, we no longer have such auditory clues. As far as I know, a defective SSD drive is just as silent as a functional one. I guess I will see (not hear) what happens when I experience my first serious SSD drive failure. It hasn’t happened yet. I was really worried that this was what was going on yesterday morning, even though my SSD is less than a year old, simply because it is newer technology and I am not entirely sure how reliable it really is.

It turns out that my internal SSD drive was not the problem. But I really wish that Apple would put more effort into ensuring that Lion degrades gracefully when a disk failure starts happening, and provide better clues about the source of the problem. Right now, even an experienced troubleshooter such as myself has to go through many minutes of deep anxiety before being able to circumscribe the problem and rule out more serious hardware failures. It is simply impossible to get a proper sense of how long any process is going to take. A computer that only opens the tray of its DVD player several minutes after the Eject key was pressed looks very much like a computer that will never open the tray of its DVD player. A computer that fails to respond to a click on the Recovery HD icon on the startup screen for several minutes looks very much like a computer that will never boot from the Recovery HD. And so on. There is no way for troubleshooters to differentiate between extremely slow and dead, unless they are willing to wait for several minutes every time they try a simple action such as a key press or mouse click, even in the absence of any noise coming from the machine indicating some semblance of activity. This is simply not right.
