Comment 67 for bug 574910

Rod (rod-vagg) wrote :

I've been putting up with the high load averages on our production system for a few months now. I've also been experiencing what I thought was an unrelated problem, but which I've come to suspect is tied up with this bug: every now and again the system appears to lock up and become unresponsive. Because I often keep an SSH session open to the server, I can see that it's still running, but load averages have spiked to over 50, nothing can be killed, and only simple processes (such as ps) can be started. In the past I have been able to just wait it out and it fixes itself, but because this is an important production system my best option is to force a restart (it usually responds to a 'reboot').
This used to happen every couple of weeks, but recently it seems to have been happening more often. As far as I can recall this is new since Lucid, so I suspect that it's related to this load reporting problem.
It's happened 3 times in the last week and is becoming increasingly frustrating, so I've restarted this system with one of the test kernels posted here (aki-84b75ded). I can confirm that this has fixed the original load average bug, and the system has been running for 24 hours with no appreciable problems. I can report back here if the same load spike problem happens again. If it does, then I guess it's a new bug, but I wouldn't be confident pinning it on Lucid in particular. If it doesn't show up again, I guess we can assume that (a) the originally reported problem caused wider problems and (b) the new kernels have fixed those problems.