Hightrees had a call to say that there was no network or internet access and that a flashing “Red” light had appeared on the server. Sure enough, the machine would get as far as the BIOS and then shut down. The flashing light indicated an issue with the PSU or cooling.
Popping the case open, there was nothing immediately obvious: all fans spun freely by hand and there were no nasty burning smells indicating a burnt-out PSU. Power up the server and everything works okay?!
Reboot, machine won’t start.
Swap UPS feeds.
Reboot, machine won’t start.
Swap power leads.
Reboot, machine starts, then shuts down.
An hour later, a message is finally displayed on the console that a system fan is “missing” and the system is shutting down.
After stripping every fan fitted in the server, it turned out that one fan (actually two fans combined into one unit) had a loose wire. It looked like a dry joint which had finally given up the ghost and was randomly making and dropping the connection, causing the randomness of the server failures.
After a quick internet surf and a call round, there were plenty of spares available. Unfortunately, none were locally based, so the question was:
- leave the site down and have lots of unhappy users until the new part arrived – not really a viable option
- reconfigure the network and use an alternative server – doable, but would probably take 2-3 hours
- attempt a “bodge” ?
Due to the design of the fan, there was about a 4mm gap where it was just possible to squeeze in a modified soldering iron (modified as in using a bench grinder to butcher the soldering iron tip and then hand-filing it to the right profile!). The connecting wire was about 1.5mm in diameter and there was an IC chip carefully placed right next to the solder joint. This was going to be a one-shot attempt.
40 minutes later, after much moving of the worklight, waiting for the hand shaking to stop, the smell of burning flesh and plastic and a modicum of profanity, there was a connection that was solid and no visible signs of accidental damage to any of the nearby circuit components.
Back to site, carefully re-assemble the fans in the holder, re-assemble the server and ….. SUCCESS!
The server has now been “up” for 18 hours and will be taken down outside office hours when the spare part arrives.
With an unlimited IT budget, this would obviously never have occurred, as there would have been a redundant server available, a shelf full of spare parts and an engineer available within a 4 to 8 hour window, but we are working in the real world here….. sometimes you just have to do the best with the options you have available.