Panther freezes (continued)
Posted by Pierre Igot in: Macintosh • May 10th, 2004 • 6:47 am
It seems that the troubleshooting process regarding my periodic Panther freezes is not yet over…
I thought that my stock internal hard drive (an IBM 120 GB hard drive) was the problem. I installed a new Seagate hard drive, did a fresh install of Panther, and transferred all my stuff from the old drive to the new drive.
I left the old hard drive in there, hoping to be able to reformat it and use it as a secondary drive for less important stuff.
Yesterday, however, I got a system freeze again. Since the old hard drive was still in there and its partitions mounted on the desktop, it was still possible that it was involved in the freeze. So I opened up my G4 and physically removed the hard drive for good. I rebooted and repaired the new hard drive with Disk Warrior, and got back to work.
Last night, as usual I left the computer on and went to bed. Normally, the screen saver kicks in after 15 minutes and the displays go to sleep after 20 minutes. The system is set so that the CPUs and the hard drives never go to sleep.
This morning when I got back to the machine, I pressed the space bar. The displays came on and displayed the last frame of the screen saver — but then the login dialog box asking for my password never came on. No matter what I tried, the system appeared to be frozen again, displaying the screen saver image and not responding to anything. I pressed on one of the displays’ Power buttons for a second, and about 20 seconds later the displays went to sleep. But then I wasn’t able to do anything else. Pressing the space bar again didn’t bring the displays back to life.
I ended up doing a hard reset.
Now, it could very well be that this is a separate problem. It’s not the first time I have experienced problems waking from sleep with Panther (or previous versions of Mac OS X). I certainly hope that it is just a coincidence that the problem occurred within 24 hours of my removing the old hard drive.
I guess the next test will be to see if the freezes occur again while I am working. If they do, then it means that the old IBM hard drive was probably not at fault. If they do not, then it means that the freeze when waking from sleep is a separate issue.
We shall see.
But I must admit that I am not enjoying this untimely return to the old days of regular system-wide freezes. Mac OS X was supposed to get rid of such problems for good.
May 10th, 2004 at 5:21 pm
Pierre, check your machine’s memory. My iBook started randomly freezing, and after a few searches on Google I started to suspect that something had gone wrong with the Crucial 512 MB RAM stick that I had added to the machine (7-8 months earlier).
Crucial were great. I just gave them a phone call, said I was getting random freezes and that I thought there was something wrong with my Crucial memory stick. I sent it back to them and they sent me a replacement by return of post.
May 10th, 2004 at 5:23 pm
I should have added that the switch appeared to stop the freezes I was previously getting.
May 10th, 2004 at 9:49 pm
Paul: I did think of this, and ran the full-blown hardware tests from the Apple Hardware CD that came with the G4. The tests didn’t detect any problems with the memory. I suppose it’s not a 100% guarantee that nothing is wrong with my memory, but I am waiting for the next freeze before I start removing my RAM…
May 11th, 2004 at 7:57 am
Pierre, my RAM also passed the hardware tests!
May 11th, 2004 at 10:52 pm
OK… Well, we’ll see if the freezes start occurring again now that I’ve physically removed the old hard drive. (None so far.)
May 13th, 2004 at 1:51 am
Pierre:
Bummer!
Memory? Yeah, it can go bad in strange ways. Here’s a story about a prototype portable PC way back in about 1982. The DOS (sorry!) command
C> dir
listed every file, as it should. However, what should have been an entirely equivalent command
C> dir *.*
listed nothing. (The “*” is a wildcard.) Long story short: A memory failure transformed the second command into
C> dir *.~
where “~” was some character prohibited to filenames, maybe –I forget– a nonprintable. As far as I recall, the memory tested out perfectly — this was a very specific pattern failure that wasn’t anticipated by the tests. Now, this problem occurred a long time ago, with a different memory technology, but …
Sorry, do you have third-party memory installed? If so, that is the next suspect. You might want to replace it exactly with some memory from a very reputable vendor and not the cheapest you can get. If this test disproves memory is a factor, and you’ve planned well, you can resume operation with 2 times the previous memory — a luxury you may not need, but it can’t hurt. If you have nothing but official Apple memory, I think a failure is much less likely, but you might want to try this test anyway.
By the way, I’ve got three drives, one for data, one for backup, and the third hosts OS X. As far as I can determine, the backup drive, when unused, plays no part in daily operations — it really can’t do much harm or good when it is spun down.
Henry
May 14th, 2004 at 5:45 am
Good story :). Yes, I do have extra RAM installed, and no, it’s not from Apple (although these days this is far from being a 100% guarantee, based on what I have been reading lately). However, it comes from a very reputable vendor (OWC) that I have been using for the past several years, and when a module from them isn’t defective from the get-go, it usually works for years without problems. (I have had only one module for a client’s iBook failing after a couple of years, and they replaced it without problems.)
The bottom line, however, is that, since I removed the defective HD entirely, I haven’t had a single freeze, except for that screen saver freeze the day after. (That’s 5 days.) I guess that’s a pretty good sign that the RAM is not at fault. But I’ll remain vigilant. (I’ve just bought another Seagate 160 GB hard drive that I’ll install in the G4 over the week-end.)
May 14th, 2004 at 5:47 am
The thing to note with the defective hard drive is also that in my Energy Saver control panel I’ve disabled the option to spin down hard drives, because I find it rather irritating each time I have to wait for a drive to spin up. So that means that my defective HD was always spinning all that time.
May 14th, 2004 at 6:25 am
It is simply very hard to believe a hard disk could be at fault — but maybe you’ve found the exception. The data on a hard disk is very well protected. You’re almost certain (what’s completely certain?) of getting a hard failure indicator before you’ll get scrambled data from a disk — provided, of course, that the data recorded was OK.
Let me see: the worst thing you can do to an operating drive is to jolt it physically, although recent drives may have such efficient anti-jolt measures (based on g-force sensors) that this may not be much of a threat any more. Operating it out of the specified temperature range isn’t very good for it. Remember, the important point is the drive temperature, not the room’s. Spinning eventually wears down the drive, so you want to let it rest if you aren’t using it. On the other hand, I think that the spin-up and spin-down processes also impose some stresses on the drive, so you don’t want to do these too often. My compromise is to let the drive spin down. If I encounter too many of those annoying spin-up delays, I increase the wait-before-spin-down time. It’s a balancing act, to be sure.
I’m beginning to suspect secondary effects, e.g. you might be operating near the limit of a power supply. Is your box lightly versus fully loaded with respect to drives, PCI cards, host-powered USB peripherals, etc.?
May 14th, 2004 at 6:55 am
I don’t think my box is particularly loaded. It has no extra PCI card at the moment, and the majority of my USB devices are powered either by my external USB hub or by the USB hubs that are built into my two displays.
May 14th, 2004 at 8:36 am
Oh, well, it was worth a try.
Well, your power supply could be marginal anyway, even without a heavy load. I don’t mean to make you paranoid; I’m just looking at all the angles.
May 14th, 2004 at 11:12 am
No problem :). I appreciate your suggestions. Best to keep all options open in such situations. It wouldn’t be the first time I would have missed a rather obvious potential source of the problem. Such is the life of the troubleshooter… (Can’t really add the “professional” qualifier here, since there’s no such thing as a troubleshooting profession as far as I can tell…)
May 15th, 2004 at 2:14 am
With respect to these boxes, they are so incredibly complex… With the help of a colleague, a piece of test equipment (a chip emulator), and a can of “freeze-it” spray, I found the memory pattern sensitivity I mentioned above in about four hours. I doubt this would be possible 20 years later with much more complex computers. About the only option is to swap out components and hope something clear results.
May 25th, 2004 at 2:46 am
Wow, sorry to hear you have so many problems!
Brilliant idea to check the drives in your office machine.
Have you checked with the folks who make CCC? Keep in mind that it may use low-level tricks to accomplish its task, and this may quite reasonably conflict with some normal modes of usage — or a diagnostic tool.
So, why are you cloning so often, anyway?
I’m not sure what the sound means either, but based on the phrasing, I expect that it is a recalibrate followed by a re-seek, or something like that. The disk heads are somewhere arbitrary; a command sends them straight back to track 0, where the drive verifies that it is, indeed, getting to track 0, then a command sends them directly out to the maximum track. Maybe the fourth sound is a final seek to some data track. This is done when the driver (or the disk) gets lost and needs to make sure it knows where the heads really are.
The decision good sector/bad sector is made inside the drives. But you could get data transfer errors caused by faulty connections.
I’m wondering if you have something mechanical wrong, say, a bad connector or cable on your ATA bus. I would definitely do some contact cleaning, which generally means using a spray can of ozone-depleting chemical on all the contacts. If that fails, a new cable.
One somewhat expensive way of proceeding is to put in a standard SCSI card and drive into your machine. (I run MacOS X on a drive connected to such a card and let ATA handle data drives. I’ve almost never had unexplained crashes and none that seem to be connected to hard disk issues.) This would give you fairly definitive information about the reliability of your ATA subsystem.
May 25th, 2004 at 3:46 am
Volodya: As you said, it sounds like a hard disk problem, but then you are having it with several drives, and only on that machine. Did you try removing the suspect drive entirely? In my experience, the freezes have completely disappeared since I physically removed the faulty drive from the Mac. (I have the exact same model as you do.)
I am not particularly adept at English sound transcription either :-). I remember hearing that noise very soon after I got the brand new G4, and I always thought it was “normal” hard disk noise for this particular model, until recently when the freezes started occurring and I was able to make the connection. Please note that it was not a 100% link. Sometimes I would get the noise and no freeze, sometimes I would get a freeze with no noise. But there was some link between the two. My guess is that the noise was a sign of unusual hard disk activity, and that type of unusual activity might make it more likely to hit whatever bad thing there was in the hard drive that would cause a freeze. But other (silent) types of HD activity were also able to trigger a freeze. In any case, I have not heard the noise since removing the faulty IBM drive altogether. The Seagate replacement just makes regular hard disk noises, which is, quite frankly, more reassuring.
As for the processes going down to 0% in top in the Terminal, that would be consistent with a hard disk problem, wouldn’t it? I never ran top while the freezes were occurring, and I would only notice them when the system actually became frozen, by which time there was no way I could switch to Terminal or Activity Viewer. So it’s quite possible that we are talking about the same kind of freezes just the same.
As for CCC and permissions repair, I suspect that they were associated with the freezes because they are HD-intensive activities.
Henry: The problem with SCSI is that it’s not always well-supported by Apple (in system updates, etc.).
May 25th, 2004 at 2:16 am
I have a Power Mac G4 (MDD 2×1.25, Summer 2002 model) that exhibits symptoms similar to some of those Pierre describes. I only have one monitor and have not had any wake-up issues. But I do have a problem seemingly associated with hard disk access, which has by now induced me to buy two extra internal hard drives.
I have had random freezes since day one, about twice a week. Shortly before the warranty ran out, I contacted Apple and had my computer looked at. The repair outlet replaced a RAM stick (Apple-installed) even though my Apple Hardware Test and Tech Tool Pro 4 used to give memory a clean bill of health (many times over). This has not eliminated freezes. One sure-fire way to induce a freeze was to attempt a clone of a largish partition. (I mostly use Carbon Copy Cloner because it allows you to more or less start where you left off crashing, although other cloning methods also lead to a freeze.) Also, repairing privileges used to bring the Mac down fairly often.
Next time the repair outlet said they found bad blocks on one of the disks and offered to replace it. They used Norton, and said on one run it reported bad sectors, on another run it just crashed. My Tech Tool Pro used to crash (and freeze the Mac) reliably when tasked with surface scan. Reformatting a disk (w/o zeroing) made little difference. So far this sounds like a disk problem (to which, mind you, all three of my disks were subject), but hear this: When I put any of the same disks into my computer at the office (Power Mac G4 QS 0.867, Summer 2001 model) they perform like champions and clone flawlessly.
Anyway, since then I’ve read up on bad sectors, and found out that this is something that should be handled and eliminated by the hard disk itself except for the cases when a bad sector occurs in the header. If you’ve got bad sectors in the header, you reformat the hard disk, zeroing all data – this should take care of bad sectors in the header. So I did that on one of my disks and put my working partition on that one. Well, that was three days ago, and nothing has frozen so far, but it is early days yet. This disk is currently on the ATA66 bus.
Some data that may be relevant: As I said, running Carbon Copy Cloner is a reliable way to bring down my Mac. I used to run top alongside it to keep track of what is happening. Here is what I noticed: Whenever Tech Tool Pro kicks in with its routine inspection and CCC is running, this brings down the Mac within ten seconds. I’ve switched off any automatic periodic tests by TTP since then. CCC still crashed (taking the Mac with it), but less often.
Since I started to live on zeroed land three days ago, I have not started up CCC yet – I figured I’d first see what happens to random crashes.
Freezes come in two shapes – one where all processes except WindowServer go down to zero CPU usage (at least for the period while the Terminal still works), and one where the process kernel_tas goes up to 100% CPU usage and other processes to 0. I don’t have a PowerBook or anything else to ping the MDD.
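Without a second machine to ping from, one rough substitute is a heartbeat file: a small script appends a timestamp every few seconds and forces it to disk, so that after a hard reset the last line shows roughly when the machine stopped responding. This is only a sketch. The function names, the log path, and the 10-second interval are all made up for illustration, and if the disk itself is the culprit the last write may of course never land.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, when=None):
    """Append one ISO-8601 timestamp line and force it out to disk."""
    when = when or datetime.now(timezone.utc)
    with open(path, "a", encoding="utf-8") as f:
        f.write(when.isoformat() + "\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the drive


def last_heartbeat(path):
    """Return the most recent recorded timestamp, or None if there is none."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = [ln.strip() for ln in f if ln.strip()]
    except FileNotFoundError:
        return None
    return datetime.fromisoformat(lines[-1]) if lines else None


def run(path, interval_s=10):
    """Write a heartbeat every interval_s seconds until interrupted."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_s)


if __name__ == "__main__":
    run("heartbeat.log")  # hypothetical log file name
```

After a freeze and reset, comparing `last_heartbeat("heartbeat.log")` with the time the freeze was noticed gives an upper bound on how long the machine had been hung.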
My current guess as to what’s wrong is that the ATA100 bus (or all of them) is either faulty or unusually sensitive to the quality of sectors – what appears to be a good sector to my office computer is a bad sector for the MDD, but what do I know.
I do not recognize the “blip shling plonk gzzz” sound, but then I am not at all adept at Anglophone sound transcription. I’ve had one new sound since I installed the second hard drive (Western Digital), or maybe since I replaced the power supply by Apple’s quieter version in April ’03 – I don’t really remember. I can only describe it as what you hear from a compressed air container when it discharges surplus air through a safety valve, but much quieter of course. It is also not dissimilar to a bee or large fly trapped in a matchbox and not happy about it. In my experience, though, it is not connected with imminent freezes.
Any comment?
May 26th, 2004 at 6:08 am
Pierre and Henry, many thanks for your comments.
Henry writes:
<blockquote>Have you checked with the folks who make CCC? Keep in mind that it may use low-level tricks to accomplish its task, and this may quite reasonably conflict with some normal modes of usage – or a diagnostic tool.</blockquote>
CCC typically freezes while running ditto, an old Unix utility which I guess is low-level. I may indeed write to Mike Bombich and ask how CCC coexists with hard disk diagnostics.
<blockquote>So, why are you cloning so often, anyway?</blockquote>
When you take your computer to the repairman, you feel safer if you leave a disk with a copy of all your stuff safely at home. Also, when you are troubleshooting (in my case this often involves questions like “Do I get a random crash in the space of a week”) you are often tempted to try different drives.
<blockquote>The decision good sector/bad sector is made inside the drives.</blockquote>
Thanks, I did not know that.
<blockquote>
I’m wondering if you have something mechanical wrong, say, a bad connector or cable on your ATA bus. I would definitely do some contact cleaning, which generally means using a spray can of ozone-depleting chemical on all the contacts. If that fails, a new cable.
</blockquote>
I thought I did that every once in a while. Cables being the prime suspects is a fresh idea for me. I thought they were relatively simple devices. Isn’t there other stuff in or around ATA busses that is at least equally likely to misbehave?
<blockquote>One somewhat expensive way of proceeding is to put in a standard SCSI card and drive into your machine. (I run MacOS X on a drive connected to such a card and let ATA handle data drives. I’ve almost never had unexplained crashes and none that seem to be connected to hard disk issues.) This would give you fairly definitive information about the reliability of your ATA subsystem.</blockquote>
I would expect this to work, but I am still looking for a solution rather than a workaround. (Perhaps I’m hopelessly optimistic.) I would prefer to do without SCSI for the reasons that Pierre cites, and also because, ironically, SCSI drives are supposed to be more sensitive to bad sectors than ATA. As for using this as a diagnostic manoeuvre, I do not feel I have exhausted the cheaper options yet.
At the moment I am trying to determine whether my ATA66 is any more stable than the ATA100 bus – I used to have my system on ATA100 exclusively.
–continued in next posting
May 26th, 2004 at 6:10 am
continued from previous posting–
Pierre:
<blockquote>Did you try removing the suspect drive entirely? In my experience, the freezes have completely disappeared since I physically removed the faulty drive from the Mac.</blockquote>
My problem is that all three of my drives are “suspect”. But some, indeed, are more suspect than others.
It is still not clear to me whether the noises you and I hear are the same. Does your noise fit my description? Re the relation of noise to freezes, I guess I should have put it stronger: I observe no correlation between the two. Neither is it clear to me that the source of the noise is a hard drive rather than the power supply or one of the fans.
Pierre, I understand you have not had random crashes for a while now that you have removed the faulty disk. Have you tried zeroing it? It may still be usable after that.
I would tentatively conclude now that although Pierre’s freezes and mine feel pretty much the same to the user, their reasons are different – Pierre had a faulty disk while my MDD has problems with ATA.
I came across something on MacBidouille that sounds as though it may have something in common with our problems:
<blockquote>Cela fait 6 mois que je cherche ce qui plante ma machine depuis la mi-décembre. A cette date, j’ai changé pas mal de chose dans ma machine : Jagar -> Panther, carte PCI, second disque dur, quelques softs… ce n’était donc pas simple de trouver et je ne suis peut être pas rapide :-)<br/>
J’ai retrouvé une news qui décrivait des problèmes de stabilité de machines car les disques durs sont livrés par Apple en “cable select”, et bien moi c’est le contraîre : je possède un G4 1GHz FW800 avec 2 disques dur SEAGATE et un combo PHILIPS CD5301 et je prends 3 kp par jour si je ne mets pas tout le monde en “cable select”…</blockquote>
Unfortunately my poor French does not allow me to understand the key last phrase.
Used to have lots of random crushes in my youth.
cheers,
Volodya
May 26th, 2004 at 6:42 am
Volodya: I haven’t tried reinstalling the faulty drive and reformatting it with zeroing yet. I might do that in the future, but I have many more pressing things to do first :).
The key thing in my noise is that it used to happen only from time to time. It was not a continuous noise! I have spent more than enough time listening to the various noises made by my G4 (it is the infamous “Wind Tunnel” model after all), but this particular noise was definitely not coming from a part that is always on in the same state, such as the power supply. It was a sudden burst of noise that would just last a couple of seconds, and my best guess was that it was coming from the hard drive. I haven’t heard it since removing the drive, so that confirms my guess.
Interestingly, I now also remember another intriguing noise that I used to hear with that faulty drive. It was a more low-level, grating, continuous noise that would often last several minutes and go away as mysteriously as it had come on. I was never sure where this noise was coming from, but I haven’t heard this noise since replacing the drive either, so it probably came from the same drive. (This noise too was there from the very beginning.)
As for the MacBidouille post, the last sentence just means that he’s getting 3 kernel panics per day if he doesn’t use the “cable select” configuration with his 2 hard drives and combo CD recorder. I don’t really see why he doesn’t want to use cable select. That’s what I use, and I haven’t had any problems with it.
May 26th, 2004 at 6:52 am
Yes, I’m also talking sudden bursts of noise, about one second long. Not all power supplies are always in the same state, some of them also have a fan of their own. But if removing the hard drive removes the noise, that sounds like solid evidence.
Yes, I think Apple tells us to always use cable select on MDDs. I just wonder if changing that configuration could somehow miraculously change our fortunes.
May 26th, 2004 at 7:41 am
Volodya:
Good reason to clone! I guess I’ve simply been lucky. I do keep backups, but of my data only — everything but MacOS X.
Cables do break when stressed –this does happen– but connectors are the most suspect. I have fixed a number of computer problems by simply removing and then reconnecting. Sometimes, it’s a matter of a slightly bent “finger”, or a bit of detritus, or even some metal corrosion. (I’ve even seen cases where corrosion creates a kind of diode, meaning that electricity will flow in one direction, and not in the other! I found this by testing a cable wire in both directions.)
As for other sources of trouble, in general, the electronics are provided on the motherboard or a PCI card. The interfaces are big integrated circuits that rarely fail a little bit — it’s usually all or nothing. Since your drives are working most of the time, I think the interfaces are probably OK.
Unlikely but possible causes: some kind of unusual environmental problem (a radio transmitter or stamping mill next door?) or a problem with your power supply involving only a voltage sent to disks — maybe the amount of current available is just barely enough, and unusual activity causes the demand to transiently exceed the supply. Similar drives or even different copies of the same model may have slightly different demands.
The basic physical drives are the same, SCSI or not. It’s just the electronics that differ. There are “more” electronics and a bit more intelligence built in to SCSI drives, and they cost a bit more. (That’s one reason Apple stopped using SCSI as a standard.)
I’d still recommend testing with every utility you can afford. Try to establish a stable configuration with one drive that works and figure out what’s wrong with secondary drives. (Sorry, you’re probably already frustrated about doing that!)
Briefly, about noise: Pierre knows I had trouble tracking down a problem with noise in my machine. Eventually, I got out an old Apple microphone (any with the right connector will do, as long as it is all plastic), downloaded a “live recording” utility, and turned my Mac into an audio self-diagnostic machine. Eventually it was clear: the video card fan was making an awful racket.
May 26th, 2004 at 7:24 pm
Henry:
thanks for your posting. I think I was very lucky to google on to such knowledgeable posters as yourself and Pierre.
Come to think of it, the frequency of freezes used to (temporarily) change either for better or for worse after transportation, and for the better after long periods without use. This I think lends credence to your mechanical fault theory.
Do you know if the ATA66 and the ATA100 busses use the same make of cable? I might try swapping the two sometime down the line. Do you know if Apple’s ATA cables are a standard item, something you can buy in a shop, or are they custom-made? I’ve noticed that a cable is made of twenty or so threads, and a segment on one of the threads is deliberately cut out.
I would not suspect my environment: the location is not that exotic, and the machine performed similarly at three different locations.
It does sound plausible that SCSI and ATA drives use the same platters. I also understand that SCSI tends to be snappier. I think I read in TTP4’s manual, though, that SCSI drives won’t automatically map out bad sectors.
In a few days I’ll update to 10.3.4 unless it is reported to be seriously problematic, restart the computer and indeed disconnect all the drives on ATA100, not just unmount them. As Pierre suggests, this should provide for a cleaner experiment.
You also suggest there may be an issue with power supply to the drives – this would be particularly likely if you have more than one drive on the same bus, non? (The power to both drives on any of the buses goes through the same cable on MDD.) Are there any other likely issues associated with multiple drives on a single bus?
Keeping data-only backups is not always sufficient. I think most users have lots of custom stuff in the System’s library – for example I’ve got tetex that takes much quality time to configure properly, and probably other stuff I am not even aware of. There are also oodles of custom items in places like /usr. All in all, I believe the mantra “if you’ve got your home directory, you are set to go” is nothing but a beautiful myth.
May 27th, 2004 at 2:30 am
It’s possible something mechanical is at fault. Do you have enough freezes in total to interpret the apparent change after transporting your system? (For debugging, I hope so; in general, I hope not!)
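One rough way to put numbers on that question is sketched below. It assumes freezes arrive independently at a steady average rate (a Poisson model, which is an assumption made here for illustration and not something established in this thread), and it uses Volodya’s reported figure of about two freezes a week. The function names are hypothetical.

```python
import math


def prob_no_freeze(rate_per_day, days):
    """Chance of seeing zero freezes in `days`, if freezes are Poisson
    events occurring at `rate_per_day` on average."""
    return math.exp(-rate_per_day * days)


def days_needed(rate_per_day, alpha=0.05):
    """Freeze-free days required before chance alone explains the quiet
    stretch with probability below `alpha`."""
    return math.log(1.0 / alpha) / rate_per_day


if __name__ == "__main__":
    rate = 2 / 7  # "about twice a week", expressed per day
    print(f"P(no freeze in 3 days) = {prob_no_freeze(rate, 3):.2f}")
    print(f"Days needed for 5% level = {days_needed(rate):.1f}")
```

Under that assumption, three quiet days would happen by chance a bit over 40% of the time, so a short freeze-free stretch says little either way. It takes roughly ten and a half freeze-free days before chance drops below 5%.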
Unfortunately, I have no knowledge of the ATA66 and ATA100 buses on your system; I’m not up to date on hardware. I think you are talking about ribbon cable with IDC (insulation-displacement connectors) on both ends. The cut has to do with drive select; it has been common to use cuts and/or twists on PCs for a long time. All these are standard components and I can make most cables like this, but I think twice before trying to do so in cases like this for fear that the manufacturer has some hidden issues that can’t be easily met by off-the-shelf parts.
Historically, SCSI drives have always been more intelligent, and I think mapping out bad sectors has been included in SCSI drives for a long time. ATA may have improved in the meantime, so the actual capabilities are now similar. Also, there may be more than one level of mapping, one inside the drive and invisible, and another outside. There are remarkably many layers involved in getting data to and from spinning disks!
As far as the power supplies: historically, at least, drives have used both +12V and +5V taken from the main power supply. There isn’t much use for +12V except for mechanical beasts like disk drives and fans, so if there’s a problem with that voltage, you might only see it on disk operations. As I’ve said before, it would have to be a subtle, marginal problem — not a gross failure. The regulation of the +12V — the amount of smoothing done by the power supply– is generally more relaxed because most drives don’t care. But maybe yours do. This is an unlikely cause of your problems but worth pursuing.
Hmmm, just thinking out loud. Do your drives have shorting plugs on them? (You may or may not know the following: A shorting plug is a little piece of plastic containing two metal sleeves spaced at a standard distance apart. The sleeves are connected electrically. The standard spacing matches that of small pins sticking out of circuit boards in various places on your disk drive. Operational options are set by electrically connecting pairs of pins, or not.) Sometimes I’ve been able to fix electronic problems like this by simply removing shorting plugs and replacing them exactly as they were, or sometimes even moving them around without removing them completely. WITH THE POWER OFF! These days, all this hardware is virtually microscopic, so don’t be afraid to use a magnifier, and be very, very careful: the plugs are remarkably easy to drop (and some have small jet thrusters that fly them forcefully out of your grip!). If you have any doubts, don’t attempt this.
If you don’t have access to electronic test equipment, the most you can do is reduce the load on the supply, which often is enough to change operating parameters for the better. Use fewer drives, try to arrange for fans to be on “low” speed or off — if these are possible, which I’m not sure they are.
I don’t have any illusions — I’m sure recovering from loss of my system files will be a pain, but it can be done. I’ve chosen to use my limited resources on my data files.
May 28th, 2004 at 6:06 am
Henry,
The “shorting plugs” are also known as jumpers, aren’t they? I typically have no problems changing their position. Still, if jumpers were the problem, I’d not expect it to affect _all_ of my drives.
So +12V is what a drive wants. (My physics courses took place a long time ago – does -12V make sense? Or does + mean direct, not alternating current?) How much deviation is acceptable? Is, say, +11.75V still OK? Then I guess one would really be interested in whether the voltage stays within reasonable bounds under load – I have little idea as to how one would measure that.
All my drives are different makes, even different manufacturers (I seem to prefer the devils I don’t know), so if “most” drives don’t care too much about the quality of current then at least one of mine should not, either. Under sufficient desperation I suppose I could put in the old really loud power supply to see if that changes anything.
Here is a quote from TTP4 manual: “SCSI drives do not automatically lock out bad blocks”. Do you suggest one should qualify this by saying “at higher levels”?
May 28th, 2004 at May 28, 04 | 7:21 am
Volodya:
“Shorting plug” is the specific term for the little pieces of plastic. “Jumper” is much more generic and may indicate a permanent modification (E.g. “I fixed that board with 2 cuts and 3 jumper wires”) or simply a gap between two pieces of copper on a printed circuit board (“I set the operational mode of the board by soldering jumper J3 and leaving J1 and J2 alone”) or multiple pins that can be connected variously with solder (horrors!), shorting plugs, or wire-wrap wires. Sorry you asked?
What I was getting at is that sometimes the shorting plugs don’t make a good connection, so one setting on one drive might operate intermittently. Who knows, maybe it will work better when the ambient humidity is unusually high — or low. I typically “re-seat” (remove and replace) each jumper on equipment that I’m trying to troubleshoot. I _do_ drop these plugs, but I have a large collection of them in my prototype stores.
I just downloaded the product manual for a “typical” 20 GB ATA hard disk. It requires +5VDC +/- 5% and +12VDC +/- 10%. (I think much older drives did also need -12V, but that requirement has gradually disappeared.) If you measure +11.75V under load (with the disk drive running) with a DC meter, that’s about 2% low, well within tolerance, and probably OK. But there could be excess ripple on that or the 5V supply you’re not seeing, or you could be on the edge of “current limiting”. That’s where the supply says, “I can’t give you any more without overheating, so I’m going to limit what I’m giving”. The electronics being supplied often continue to work, or _almost_ work. I’ve seen this problem more than once in my career.
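The tolerance arithmetic above can be checked with a few lines of Python. This is only a sketch of the percentage calculation; the function name `within_tolerance` is made up for illustration, and the tolerance figures are the ones quoted from the drive manual.

```python
def within_tolerance(measured_v, nominal_v, tolerance_pct):
    """Return (deviation in percent, True if inside the +/- tolerance band)."""
    deviation_pct = (measured_v - nominal_v) / nominal_v * 100.0
    return deviation_pct, abs(deviation_pct) <= tolerance_pct


# The +12V rail is allowed +/- 10%, i.e. roughly 10.8V to 13.2V.
dev, ok = within_tolerance(11.75, 12.0, 10.0)
print(f"+12V rail at 11.75V: {dev:+.2f}% -> {'OK' if ok else 'out of spec'}")
```

A steady meter reading says nothing about ripple or current limiting, so a passing result here only rules out a gross undervoltage.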
This is all guesswork on my part. Ideally, you’d have a friend with a similar computer or an obliging dealer with whom you could swap supplies — or drives. Or a friend who could test each of your devil-you-don’t-know drives on any kind of computer. I could certainly test your drives very easily. I’m in the San Francisco Bay area, which might not be so convenient.
“Loud” power supply — I suppose you mean the fan is loud. Who knows, it might be electrically quieter.
As for the quote from TTP4, I don’t claim to be an expert, and I certainly wouldn’t want to contradict such an authority. In general, it simply makes sense for some bad-sector processing to happen invisibly on drives, and I’ve heard generally that it does, so I don’t worry about the distinction. For your troubleshooting purposes, I think it is sufficient to assume that the drive will either deliver “perfect” data, or tell the computer it can’t read the sector(s) requested. Unrecoverable read errors reaching me as the operator occur with extreme rarity on hard drives –unlike floppies– so I think I would automatically replace any drive which generated even just one such error.
May 30th, 2004 at May 30, 04 | 12:10 am
Henry:
Thanks for clarifying the relations between shorting plugs and jumpers – I value order in one’s concept hierarchy.
I am still under the impression that I am looking for a problem that affects anything on the ATA100 bus rather than any individual drive(s).
“It requires +5VDC +/- 5% and +12VDC +/- 10%.” That’s interesting – the power cable leading to the drive has only two connectors. Does it mean that the five volts are delivered to the drive via the ribbon cable?
I still can’t get my head around voltage as a signed number. Suppose you get a pair of connectors on which you’ve got +12V. If you succeed in swapping them then that should give you -12V – this is the only interpretation I can come up with. This makes the sign kind of a matter of perspective. What am I missing?
Do you claim there are ways by which a mortal can measure voltage ripple and “current limiting”?
How does one distinguish unrecoverable read errors from situations where the drive tells the computer “sorry”? Do we know for a fact that Macs and OS X can handle either of these occurrences gracefully?
I’ve updated to 10.3.4, disconnected the drives on ATA100 and had a look inside the case while at it. The ribbon cables on 100 and 66 are identical and bear the same Apple part number sticker. The ATA100 cable does not look physically damaged or dangerously bent, all connectors look sturdy, and the connection of the cable to the motherboard appeared properly seated. Still I disconnected it, blew canned air on everything and put it back.
Since then I’ve heard the short buzzing noise (that may or may not be the same as what Pierre hears) again, although it was perhaps quieter. I used to think it was produced by one of the presently disconnected drives. Now one would have to think it’s any drive at all or the power supply. As far as I can tell, it is not inconsistent with an emergency spindown of a less than properly balanced fan.
June 3rd, 2004 at Jun 03, 04 | 12:56 am
Volodya:
(Sorry, I’ve been overwhelmed by a press of end-of-school-year events.)
It takes one wire for each voltage and one wire for “return” or “ground” or “earth” (term varies by region) so it would appear your drives are being supplied with only one voltage, most likely +5. Strange, but –I guess– possible with very modern drives, as the technology advances. I would expect the Mac wiring to be backward compatible, though.
No, power would not be delivered by the ribbon cable. That would place the data at risk and impose limitations on the amount of power that could be supplied — those wires are tiny.
Signed number? Think of the electricity as a flow of particles: The +5 wire supplies ‘x’ electrons per second at a “pressure” of 5 volts. The -5 wire (if there were one) well, sucks ‘x’ electrons per second at a “pressure” of 5 volts. That’s oversimplified, but sufficient.
(I’m willing to go into much more detail, but we might want to go offline to spare Pierre and his other readers.)
Electronics types would simply “hang a scope” (an oscilloscope) on the power wire and look for anything other than a flat line on the screen. This is a non-analytic method– specialized equipment is needed for “official” measurements of ripple.
As for “current limiting”, here’s how I’ve experienced it: Doing some circuitry experimentally, on a test bench, with a special “bench” power supply. Most of these supplies have adjustable output voltages and many have adjustable limits on the amount of current they will supply. The experimental circuitry starts behaving strangely– do I have the current limit adjustment set too low? If setting it higher fixes the problem, there’s the answer. Since most of these supplies have current meters, you can sometimes see the current go up as the circuitry uses more power, and abruptly fall as the limit is reached.
Since computer power supplies have current limiting built in, there’s always the possibility the same thing is happening, but it is difficult to prove, as measuring requires inserting the meter in the power supply circuit, which is inconvenient.
I believe MacOS has numerous levels of internal re-tries, which means if it signals an error to you, the data it is trying to get is probably irretrievable. (Exception: some floppy disk/floppy disk drive combinations simply won’t work, find another drive, and there’s no problem.)
Good job with the cables. I would still recommend spraying some contact cleaner on all the connector points.
Please consider recording the sound and sharing it. You’ve got a fairly nice recorder right there…
June 4th, 2004 at Jun 04, 04 | 6:04 am
Henry:
Unfortunately, in view of one of the recent entries in the Betalogue, Pierre may still be interested in this correspondence.
What you tell me about the flow of electrons is consistent with my understanding. Let me try to state this to see if I got it right: The direct current power supply involves two wires, of which one is charged and the other is neutral. The charge on the charged wire can be positive or negative. Hence the sign in the description of the voltage.
If this adequately describes the situation, there is another thing I don’t understand: in the case of +5V, would the above scenario not make the drive positively charged as a whole? I thought that would not be such a good thing. A symmetric negative charge on the other wire sounds better balanced to me. I know that with an alternating-current supply, one of the wires is neutral and the other oscillates, but this kind of keeps the appliance that is being fed neutral “on average”.
(I am embarrassed bothering you with questions I feel I should have asked back at school. But you have been so obliging with your previous explanations.)
Your explanations suggest that I should start investigating the fortitude of the power supply if I find that it is the total number of the drives attached to the system that causes problems. I suppose you would know what a low-end oscilloscope costs?
What I’ve been getting at re read errors is this: in my (and perhaps in Pierre’s) case, the machine’s way of telling me there are problems with disk access is a freeze. I have never heard of a Mac articulately telling the user that it encountered a hard disk read problem in normal operation, as opposed to when running a diagnostic utility.
As for recording the sound – at present I don’t own a microphone. I may yet get one for a different purpose. I hope a microphone intended for voice should still be able to handle the somewhat quieter computer grunts.
June 4th, 2004 at Jun 04, 04 | 7:24 am
Volodya:
About electricity: Talking about DC (i.e. constant) voltage, there are in fact two wires. One is the supply voltage, say, +5V. It supplies electrons with “lots” of energy. These electrons go through a piece of equipment, say, a disk drive, which extracts some of that energy. The relatively weak electrons left over must go somewhere –there has to be a complete circuit– so these are returned to the power supply via the ground wire.
(Note: I’m talking about negatively charged electrons coming from a “+” supply. This is a physicist’s view. An electrical engineer just talks about “electrical current”, which flows from “+” to ground. These are equivalent, but the difference is sometimes confusing. If this paragraph confuses you, please forget you saw it!)
The drive doesn’t accumulate charge because all the electrons are eventually returned to the power supply, albeit with a lower amount of energy.
Alternating current (A.C.) is a different story, because the “+” wire supplies higher-energy electrons at a voltage that varies regularly with time, positive to negative. The “ground” wire still collects lower-energy electrons, as before.
I should be embarrassed: I have a graduate degree in related technology, and I can’t do much more than wave my hands about this stuff. At least my explanations will be practical, if not entirely correct.
Don’t buy an oscilloscope unless you have another use for it. If you can correlate the load on the power supply –in this case, the number of disk drives in operation– with the frequency of errors, you may then suspect that the power supply is running up against a limit or having problems smoothing the voltage at higher loads. Time to do a supply swap.
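One way to keep track of that correlation is to log, for each stretch of use, how many drives were connected, how many days it lasted, and how many freezes occurred, and then compare freezes per day. The sketch below assumes such a log exists; the function name and the sample numbers are placeholders for illustration, not measurements from anyone in this thread.

```python
from collections import defaultdict


def freeze_rate_by_load(observations):
    """Map drive count -> freezes per day.

    observations: iterable of (drives_connected, days_observed, freezes_seen).
    """
    days = defaultdict(float)
    freezes = defaultdict(int)
    for drives, observed_days, seen in observations:
        days[drives] += observed_days
        freezes[drives] += seen
    return {n: freezes[n] / days[n] for n in sorted(days) if days[n] > 0}


# Made-up log entries, purely to show the shape of the input.
sample_log = [(1, 10, 0), (2, 10, 1), (3, 5, 2)]
for drives, rate in freeze_rate_by_load(sample_log).items():
    print(f"{drives} drive(s): {rate:.2f} freezes/day")
```

A rate that climbs with the number of drives would point toward the supply; a flat rate would point elsewhere. With only a handful of freezes, any such trend is suggestive at best.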
It has been a very long time since I’ve seen a Mac give a “hard” (unrecoverable) error message for a hard disk drive read. (I’ve seen it regularly when reading floppy disks, however!) There are undoubtedly some number of “soft” recoverable errors occurring routinely, but we never see them. I think it would actually be a good idea to make the soft error statistics available to users, but it might also unduly alarm some people, so I can see the reason for not doing so.
A diagnostic utility wants to tell you about all kinds of errors. It is probably using the disk in an atypical way, so the error information isn’t exactly comparable.
No, there’s no excuse for a freeze. Unix should be able to at least give you a panic message if it encounters an unrecoverable error getting some essential data. (You know that a Unix “panic” is now spelled “You need to restart your computer…” right? )
Maybe we ought to talk about the basics of what a “freeze” really is. Are you clear on that?
You don’t need a fancy microphone. I’ve got several from multiple generations of Macs I’ve owned, some of which arrived with cute little plastic-shrouded microphones. (Don’t want to put any metal in among the electronics!) Just make sure you have the right plug — and almost anything with the right plug ought to work. (Well, don’t be reckless about it.) Alternatively, I could go to the neighborhood Radio Shack and pick up a cheap microphone in a minute. This may not be practical for you, depending on where you are.
June 4th, 2004 at Jun 04, 04 | 9:52 am
Henry:
“Negatively charged electrons coming from a + supply”? I thought electrons carried a negative charge, so a – wire had a surplus of electrons, and a + wire, a shortage of electrons. Accordingly, the + wire should suck up electrons. I have not been familiar with the engineer’s viewpoint, but I appreciate that their “direction of the current” can be a matter of terminology.
I don’t think talking about high- versus low energy electrons is didactically helpful. I suppose one could ascribe a potential-like energy to individual electrons, but I understand that when things are that small they cease to be quite Newton.
I thought a decent analogy might be two vessels with different levels of water. When you connect those by say a pipe, you can just as well put a wheel inside that the passing water would rotate, facilitating storage and retrieval of data. Now, even though all the water that comes into the pipe eventually gets out, you still get water in the pipe at a higher level than in the vessel with the lower level. That’s why I feel uneasy not about the drive accumulating the charge, the more charge the longer it operates, but rather about the charge being there at least while the drive is in operation. But perhaps this effect is too small to matter?
About which one of us should be more embarrassed – it is a curious coincidence that I have a PhD in math, but physics, let alone practical applications thereof, has always been a difficult subject for me.
Ok, I now also remember a Mac presenting a message telling it had problems reading a DVD.
I suppose hard drive manufacturers may have a view on collecting low-level error statistics that is different from yours.
To panics and freezes. The “please restart your computer” window is the civilized face of the kernel panic since 10.2. I’ve had that once, but that dated back to the time that I had that RAM module that is now replaced by a hopefully healthy one. I understand kernel panics are caused by hardware malfunction, bugs in the Mach layer, out-of-spec operating conditions, or misalignment of celestial bodies.
Twice during the lifetime of the patient I also had text splattering across the screen that started with something about corrupt stack and something else about one of the CPUs not being happy. Also before the RAM module was replaced.
All the rest of the problems I had I would call freezes. (I have to confess I am not clear about distinguishing freezes from crashes.) These are pretty much as described by Pierre elsewhere: You get the beachball on the frontmost application, you try and fail to force-quit it, you cannot launch any new applications, the windows of the applications running stop updating, and everything gradually becomes unresponsive, including the Terminal. (In these cases I don’t know any better than to do a hard restart.)
I have no idea about the mechanics behind such a freeze, but after much observation I am fairly certain it strongly correlates with disk activity.
June 4th, 2004 at Jun 04, 04 | 11:04 am
In my experience, a freeze is exactly that: a sudden interruption, where everything on the screen, with the exclusion of the mouse pointer, becomes completely frozen. I am no longer experiencing freezes that come on gradually, first with an application that locks up and then spreading to all other applications and eventually the system as a whole. The freezes I am experiencing now (the last half dozen anyway) are definitely sudden and complete.
I agree that the terminology needs to be clear, and is not totally clear at this point. A “crash” can be (I think) an application unexpectedly quitting (that would be an application crash) or a system-wide freeze or a kernel panic. You can also have an application freeze (the application becomes completely unresponsive, but not the rest of the system, and you get the spinning pizza only in that application) or a system-wide freeze (as described above).
I haven’t experienced many kernel panics in my life with OS X. The last one occurred a couple of months ago and was clearly due to my fiddling around with FireWire cables. :)
June 5th, 2004 at Jun 05, 04 | 1:47 am
Volodya:
It’s a challenge to explain practical electrical concepts without circuit drawings!
The issue with negatively charged electrons coming from the + terminal is always troublesome, but there are simply two opposite conventions used in electrical engineering and in physics. It happens this way: if in teaching you mention this issue, people get confused. If you do not mention it, it is often true that someone brings it up, and you must recover from not mentioning it… and people get confused.
As far as high-energy versus low-energy, it is simply true that the current passing through the disk drive results in a transfer of energy (power) to the disk drive. Conceptually, at least, this energy is available on the + side wire from the power supply, and effectively exhausted on the – side.
That seems to be the most useful model, but a number of others are possible. Yes, the analogy to water is very common but I decided (perhaps unwisely) to do without it this time.
I’m surrounded by mathematicians! Last night our dinner guests were a visiting professor of math from Norway and his family. His specialty is topology.
I’ve long been interested in understanding the problem of theoretical versus practical knowledge. My favorite example is engineering control (systems with feedback loops), where there is some considerable and difficult theory to master and quite a bit of practical knowledge necessary.
Disk errors: Again, I think there are so many levels of reporting and retries below what is user-perceptible, it is difficult to know what’s going on. I have written low-level disk drivers, but that was many years ago, and my knowledge is undoubtedly out of date.
…
June 5th, 2004 at Jun 05, 04 | 2:46 am
Volodya:
Yes, consumers are electrical engineers, not physicists.
That said, it ultimately doesn’t matter. You can understand electronic circuits in a number of equivalent ways, including imagining that electricity is based on positively-charged current flow. As long as you are consistent, the result is the same.
(Want to bend your mind some more? All computer technology is based on semiconductors, right? Current flow in semiconductors doesn’t occur unless there are “holes” –unoccupied places– for the electrons to occupy. So far that’s intuitive… but in fact, semiconductors are understood by accounting for the fact that both electrons and holes move. That’s right, something that isn’t there … moves. Shiver! )
June 5th, 2004 at Jun 05, 04 | 2:16 am
Henry:
If engineers’ and physicists’ conventions are indeed at loggerheads, I suppose consumer products follow the engineers’ view? So when you buy a, say, AA battery its contact labelled + carries a surplus of electrons, and the – contact is understaffed by electrons? One of the more bizarre things I’ve learnt this week.
June 5th, 2004 at Jun 05, 04 | 2:37 am
Volodya and Pierre:
Freezes, etc:
I’m no expert, but here’s my strong impression: Inside Unix kernels –the most fundamental system operating instructions– there are a number of places in the code that execute the call:
panic();
This is done when there is simply nothing more useful to do than save some information, inform the user that Unix cannot continue, and enter an exit-less loop so a system reset is necessary. If things are really bad, the kernel may simply execute a processor “halt” instruction. This is done if the capability to do panic processing is in question.
(Yes, Apple started out with more-or-less traditional panic processing, but I think quite reasonably decided that most users simply need to know their machine needs to be reset.)
What else can go wrong? Where do I start? How about some hopefully useful observations:
o I’m not sure about the PowerPC, but in general it is possible for a processor to be halted and the mouse pointer to still move. Moving the mouse causes a data interrupt, which in many cases can temporarily wake up a processor and keep it awake long enough to move the pointer on the screen. When this is finished the processor probably returns to the halted state.
o The spinning pizza is displayed only when an application (or the Finder?) starts something that may take an indeterminate time to finish. Since it takes CPU attention to spin the pizza, an indefinitely long pizza spin is probably due to some resource that should be available not becoming so — a resource completely out of the control of the requesting program.
o It’s possible for Unix to be operating but the Finder to be broken. This is a case where an rlogin/ssh connection from another machine is often useful.
o Changing some basic hardware configuration is definitely confusing to the Unix kernel, though this is changing. You should be able to attach or remove USB devices at will. I guess this is not true for FireWire, according to Pierre’s experience. It is definitely not true for SCSI; detaching an external SCSI device _will_ crash (panic) the system.
o When you get a notice “Application ‘x’ unexpectedly exited” or the equivalent, this means that the application did its own version of a panic. It didn’t crash; it simply couldn’t continue and gave up.
o In a “pre-emptive multitasking system with hardware memory management” the system is protected in several ways against what can go wrong in an application. If the application enters an endless loop, the system can and does always get control of the processor. If an application starts executing nonsense code, it can only mangle the contents of a particular region of memory assigned to it.
o Memory “leaks” can cause degradation of performance until the system comes to an apparent standstill. There are grey areas between memory belonging to the system versus specific applications. If applications request the use of more and more memory, the memory isn’t ever released for use, and it falls into a grey area, the system may simply run out of operating memory. (It takes memory to get memory; if there’s not enough …)
o Unix depends on a very busy schedule of perhaps hundreds of tasks at any one time, all of which are competing for resources. This scheme works very, very well, but it isn’t invulnerable.
<sigh of exhaustion>
June 5th, 2004 at Jun 05, 04 | 3:56 am
Henry,
I called my father (physics PhD, engineering experience). He concurs that for engineers, “electricity flows from + to -” and physicists don’t really mind that, but flatly contradicts you that physicists and engineers understand + and – in opposite ways. The flow of electrons has the direction opposite to the flow of “electricity”. According to him, when a direct current power supply powers a device, the electrons travel from its – contact via the device back to the + contact. Even on a commercially available battery.
I wonder if you can point to some authoritative source supporting your claims. And, yes, I agree that it does not matter.
As for the electrons sharing their energy while passing through the device being powered – I wonder how you would be explaining alternating current with this analogy. Would you say something like, the electrons passed through the device giving off most of their energy, and now they pass back giving off yet more of their energy because, you see, their energy is high again?
On the other hand, the movement of holes so far sounds clear and intuitive.
From your explanation of freezes combined with my (and perhaps Pierre’s) experience, it would appear that most of our freezes are due to unexpected unavailability of reading/writing resources. Does that make sense?
June 5th, 2004 at Jun 05, 04 | 5:15 am
Volodya:
I’m happy to be contradicted. I have a working knowledge of this stuff, not any authoritative training! I don’t claim to be consistent, just (mostly) practical. Also, I think it takes a bit more practice than I have to be an effective teacher of this material.
AC? That’s simply the more efficient way of moving power long distances through wires. For electronic devices, it’s nearly always converted to DC before use. Of course, there are electromechanical devices such as motors and bells that operate directly from AC.
I think the picture for the transfer of energy via AC can be understood by piecewise consideration through the cycle: consider “small” intervals where the voltage is close enough to DC.
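The piecewise idea can be tried numerically. The sketch below chops one cycle of a sine-wave voltage into small slices, treats each slice as DC across a plain resistor, and averages the power. It assumes the standard resistor relation P = V**2 / R and a made-up function name, `average_ac_power`; a real device is more complicated than a resistor.

```python
import math


def average_ac_power(v_peak, resistance, steps=10_000):
    """Average power over one cycle of v(t) = v_peak * sin(2*pi*t), 0 <= t < 1.

    Each small slice of the cycle is treated as DC, using P = V**2 / R.
    """
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps               # midpoint of the k-th slice
        v = v_peak * math.sin(2 * math.pi * t)
        total += v * v / resistance         # DC power for this slice
    return total / steps


print(average_ac_power(10.0, 5.0))
```

Because the power depends on the voltage squared, the negative half of the cycle delivers energy just as the positive half does, and the average comes out close to v_peak**2 / (2 * R).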
<Sound effect: head scratching>. About the sharing of energy: I’m desperately trying to avoid using Ohm’s law, which describes the relationship between voltage, current, and resistance, and the associated power law. Probably a mistake, especially with a mathematician.
Before I rectify that mistake, let me try this: It requires energy to raise the potential of an electron in a circuit to a higher voltage or a molecule of water in a pipe to a higher elevation. I think the potential in both cases is considered to be an extrinsic property: the electron or water molecule itself exhibits no specific changes in the process. When the electron works its way through a disk drive, or the water falls over a wheel, energy is extracted, and the electron/water molecule that emerges has lower potential.
Ohm’s law states:
V = IR
where V = the supply voltage, I = the current, and R = resistance.
To a reasonable first approximation, a disk drive can be considered as simply a resistor. (A resistor is a circuit element that presents a fixed, known obstacle to the flow of current. The water analog would be a pipe filled with a porous solid.)
By putting the resistor in a circuit with the power supply, you get current flow, according to Ohm’s law rearranged:
I = V/R
The power transfer is given by
P = (I**2)R
where (I**2) denotes “I squared”. In a real resistor ‘P’ watts (U.S. unit) of heat will be emitted by the resistor. In the case of the real disk drive, ‘P’ watts of power are used to do disk drive functions, less inevitable losses to heat. (Risky intuitive interpretation: ‘V’ disappears in the equation because that’s what’s “used up” in the process — the voltage/potential.)
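The formulas above fit together as follows: the power delivered is P = I*V, and substituting V = I*R from Ohm’s law gives P = (I**2)*R, while substituting I = V/R gives P = (V**2)/R. A small sketch makes the equivalence concrete; the function name and the 24-ohm figure are assumptions chosen for round numbers, not the resistance of any real drive.

```python
def resistor_power(voltage, resistance):
    """Return (current, P = I*V, P = I**2 * R, P = V**2 / R) for a resistor."""
    current = voltage / resistance         # Ohm's law rearranged: I = V / R
    p_iv = current * voltage               # P = I * V
    p_i2r = current ** 2 * resistance      # substitute V = I * R
    p_v2r = voltage ** 2 / resistance      # substitute I = V / R
    return current, p_iv, p_i2r, p_v2r


# Hypothetical load: a 24-ohm resistor on the +12V supply.
current, p_iv, p_i2r, p_v2r = resistor_power(12.0, 24.0)
print(f"I = {current} A, P = {p_iv} W = {p_i2r} W = {p_v2r} W")
```

All three expressions give the same number, so neither the voltage nor the current is singled out by the algebra; which form is handy depends on which two quantities are known.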
Changing subjects: Memory leaks _may_ be the problem. I’m not very knowledgeable about this issue, so I’ll turn you over to Google. The first useful link I found was
http://www.macwrite.com/criticalmass/mac-os-x-memory-leaks.php
but I’m sure there are a lot more.
June 6th, 2004 at Jun 06, 04 | 5:24 am
Henry:
Thank you for your explanations. Now I understand the difference between +5V and -5V. Ohm’s law is indeed something I once knew. Still I don’t remember enough of the material to deduce your formula for power from Ohm’s law. As for your risky intuitive interpretation – I don’t see how voltage is “used up” in any sense stronger than the sense in which the current is also “used up”. The equation P=IV follows from the ones you cite and involves both quantities.
Memory leaks appear to be a very interesting subject – now that I think of it, certain applications like Mozilla exhibit suspicious behaviour in this respect. However I do not see any evidence that they are connected to freezes in our cases. I have 1GB of memory and it does indeed happen every once in a while that only 13MB or so are left free – I run X Resource Graph which gives me a good overview of what happens to memory at each moment. Still, with 15 or so Safari windows open, many of them with multiple tabs, and six or more applications open, I tend not to blame OS X too harshly for that. This does not correlate with freezes – they are as likely to occur with memory half-full.
Can I ask you why you suggest memory leaks can lead to freezes? From what I gathered, their negative effect mainly manifests itself in performance degradation.
June 6th, 2004 at Jun 06, 04 | 6:22 am
Volodya:
I was afraid of that — I have no idea of the derivation of the formula for power, but it makes intuitive sense, especially if you express it as P=IV.
I mention memory leaks as a possibility that’s worth investigating. Probably most applications are fairly well tested and don’t have this problem, but there’s always the rare coincidence for a high-quality application, or the expected failures of a lower-quality product. (Not that any of us would accept anything but the best quality, but there is a huge amount of code running routinely on our systems, some of which we might not endorse if we knew more about it.)
My idea for the mechanism is this: It requires some memory to get more memory. It may be possible for the system to become crippled in unexpected ways as the amount of primary memory available decreases.
Stepping back: Debugging even a processor with a program of only several hundred bytes, as is required sometimes in my work, can be dauntingly difficult. We’re talking about many orders of magnitude more complexity. What I try to do with difficult bugs is a combination of intense concentration on whatever direct evidence I’ve got, while continuing to search for new explanations, and also looking for patterns. In this case, I would like to hear more about what is going on when you have these crashes. And what isn’t going on. More often in the morning? Less often in the summer?
June 6th, 2004 at Jun 06, 04 | 7:40 am
Henry:
From anecdotal evidence it appears that 12 or so MB is still believed by OS X to be sufficient for memory management and perhaps other essential administration. I can see that a miscalculation here could indeed lead to harm. My impression, though, is that this is not what causes my freezes. Also, I have a hunch that if memory leaks were the reason then this would be a more widespread problem than just Pierre and me.
The freezes I experienced were all connected to either intensive disk activity (cloning, permissions repair etc.) or were “random”. “Random” could be switching Finder windows, asking Word to zoom, or just freezing while not doing anything useful other than listening to a web stream. I have been asking myself whether I see any pattern to this, and the answer is unfortunately no.
I have noticed that OS X occasionally accesses the hard disk for no reason apparent to the user, and this is not always explained by the need to dump part of the memory contents to disk. This leads me to believe that disk access problems may still offer a consistent explanation for “random” freezes as well.
There is one other bit of evidence I can report – sometimes freezes would follow in rapid succession – within two or so minutes of restarting. I would then run fsck (it would never report anything wrong), reset PRAM, or reset-nvram and reset-all in open firmware, repair permissions (nothing seriously wrong ever found there either), and as likely as not the machine would be pacified for a while.
In the meantime, while using a single zeroed disk on ATA66, I have not had any freezes or other reasons to restart in the past eight days since updating to 10.3.4, which beats any previous records by the ATA100 bus (just under one week). I’ll give it a week or so more, and then start actively looking for trouble.
June 7th, 2004 at Jun 07, 04 | 3:05 am
Volodya:
All you need is one application or system program with a memory leak to see your system go more or less rapidly downhill. Such leaks may also be tied to specific configurations — maybe you and Pierre share one of those configurations, and I’m doing something sufficiently different.
Intense disk activity in and of itself shouldn’t have any particular effect. With Unix, you really can’t correlate disk accesses with application activity, as you could in OS 9 and before. Unix accesses disk drives according to its own schedule and needs. As far as I’ve observed, you can
I can see how freezes during something as innocuous as permissions repair would lead you to be suspicious of your disks. I think you are on the right track with your single zeroed disk on ATA66. I believe you said that no diagnostic had ever given you an error message about any disk, right? When you are ready to try your ATA100 bus, maybe you’ll need some better/different tools to look for errors.
On the other hand, some of your evidence still suggests some kind of electromechanical problem, such as dirty contacts on disk cables. (Or even problems on seemingly unrelated connections. I’d be tempted to turn off my machine and clean and/or reseat every connection I could find.)
(I’ve noticed iTunes seems to continue playing very reliably even when the rest of the system is struggling — it seems very robust. )
I’m sorry you’re struggling with this — it must be very aggravating. I’ve lost count of the number of Macs I’ve owned, but I do recall never having such difficulties.