Multiple Power Supply Unit (PSU) Fan Failures on Sun Fire 3800-6800 Servers may Result in Platform Outage
Category: Availability
Release Phase: Resolved
Product: Sun Fire 3800 Server, Sun Fire 4800 Server, Sun Fire 4810 Server, Sun Fire 6800 Server
Bug Id: 6405762
Date of Workaround Release: 11-DEC-2006
Date of Resolved Release: 22-MAR-2007
Impact
Power supply fan failures on Sun Fire 3800-6800 servers can go undetected and may lead to a platform outage if a platform suffers multiple PSU fan failures.
Contributing Factors
This issue can occur on the following platforms:
- Sun Fire 3800 with PSU p/n 300-1441 (A145) and 300-1529 (A145E)
- Sun Fire 4800 with PSU p/n 300-1460 (A153)
- Sun Fire 4810 with PSU p/n 300-1459 (A152)
- Sun Fire 6800 with PSU p/n 300-1459 (A152)
Products not affected:
- Sun Fire 4800 without the affected PSU (listed above)
- Sun Fire 6800 without the affected PSU (listed above)
- Sun Fire 4900/6900 Servers
PSU model numbers can be determined through use of one of the following methods:
Utilizing the "showboards" command on the platform:
sc> showboards
Slot Pwr Component Type State Status Domain
---- --- -------------- ----- ------ ------
SSC0 On System Controller Main Passed -
SSC1 On Present Spare - -
ID0 On Sun Fire 4800 Centerplane - OK -
PS0 On A153 Power Supply - OK -
PS1 On A153 Power Supply - OK -
PS2 On A153 Power Supply - OK -
.
.
.
Reviewing the sc-extended Explorer data file showenvironment_-tv.out:
$ grep "Power Supply" /<explorer_base_directory>/sc/<SC_Name>/showenvironment_-tv.out
A152 Power Supply 0
A152 Power Supply 1
A152 Power Supply 2
Reviewing the prtfru data in Explorer data file prtfru_-x.out:
$ grep "Power Supply" /<explorer_base_directory>/sc/<SC_Name>/prtfru_-x.out
<Fru_Description value="Power Supply (A152)"/>
<Fru_Description value="Power Supply (A152)"/>
<Fru_Description value="Power Supply (A152)"/>
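All three methods above surface the PSU model code (for example A152 or A153) on lines containing "Power Supply". As a rough sketch only, the following Python helper scans such text and reports which of the model codes listed under Contributing Factors appear for a given platform. The function names and the regular expression are illustrative assumptions, not part of any Sun tool, and real output may be formatted differently from the samples shown in this alert.

```python
import re

# Model codes this alert lists as affected, keyed by platform.
AFFECTED_PSU_MODELS = {
    "3800": {"A145", "A145E"},
    "4800": {"A153"},
    "4810": {"A152"},
    "6800": {"A152"},
}

# Assumption: a model code is "A", three digits, and an optional "E".
_MODEL_RE = re.compile(r"\bA\d{3}E?\b")


def find_psu_models(text):
    """Return the set of model codes on lines that mention 'Power Supply'."""
    models = set()
    for line in text.splitlines():
        if "Power Supply" in line:
            models.update(_MODEL_RE.findall(line))
    return models


def affected_models(text, platform):
    """Return the model codes in text that this alert lists for platform."""
    return find_psu_models(text) & AFFECTED_PSU_MODELS.get(platform, set())
```

The text passed in can be saved showboards output or the contents of either Explorer file. An empty result means no listed model code was found in that text; it does not by itself prove the platform is unaffected.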
Symptoms
The Power Supply Unit (PSU) does not report a fan failure directly through the System Controller Application (ScApp).
If a PSU fan stops or slows down, the normal forced flow of air through the PSU from the front of the chassis to the back is reduced and reversed, but not halted completely:
Platform   Air flow, normal PSU   Air flow, fan-failed PSU
--------   --------------------   ----------------------------
3800       blows air out          draws air in (reduced flow)
4800       draws air in           blows air out (reduced flow)
6800       draws air in           blows air out (reduced flow)
To verify a failed fan on the 3800, 4810 and 6800, use a flashlight and look through the PSU vents to see whether the fan blades are turning. This is not possible on the 4800 because of the location of the PSU fan.
An alternative is to examine the air flow at the PSU vent by holding a piece of paper in front of the vent to determine air flow and its direction.
On the 4800 and 6800 PSU, the paper should be drawn into the air intake and held there when the PSU fan is operating normally.
For a 3800 PSU, the paper will be blown away from the air vent when the PSU fan is operating normally.
A failed fan is indicated when the air flow is significantly weaker than on known-good power supplies in the same type of platform, or when the air flow has reversed its normal direction.
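The paper test above reduces to a simple decision rule. The sketch below encodes it in Python using the normal air flow directions from the table in this section; the function name and the "in"/"out" labels are hypothetical, and judging whether flow is "significantly" weaker remains a manual comparison against known-good PSUs.

```python
# Normal PSU air flow per platform, from the table in this section.
# "out" = air is blown out of the vent, "in" = air is drawn into the vent.
NORMAL_AIRFLOW = {"3800": "out", "4800": "in", "6800": "in"}


def fan_suspect(platform, observed_direction, flow_much_weaker=False):
    """Return True if the paper test suggests a failed PSU fan.

    observed_direction: "in" or "out", as seen with the paper test.
    flow_much_weaker: True if flow is clearly weaker than on known-good
    PSUs in the same type of platform (a manual judgment).
    """
    normal = NORMAL_AIRFLOW[platform]  # KeyError if platform is not in the table
    return observed_direction != normal or flow_much_weaker
```

For example, fan_suspect("3800", "in") flags the PSU because a healthy 3800 PSU blows air out, while fan_suspect("4800", "in") does not.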
When a fan in a PSU fails, the PSU continues to operate normally but runs at an elevated temperature compared to other PSUs because of the reduced and reversed air flow.
ScApp monitors PSU temperatures and reports warnings only if a temperature exceeds the warning or maximum threshold:
- Warning threshold is 65 Degrees C.
- Maximum threshold is 78 Degrees C.
In the case of a PSU fan failure, and depending upon many variables, the PSU temperature may not rise high enough to trigger a ScApp warning, so the fan failure goes undetected.
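A minimal sketch of the threshold logic described above, assuming that "exceed" means strictly greater than the threshold (the alert does not state how ScApp treats a reading exactly at a threshold). The function name is hypothetical and this is not ScApp's actual implementation.

```python
WARNING_THRESHOLD_C = 65  # warning threshold given in this alert
MAXIMUM_THRESHOLD_C = 78  # maximum threshold given in this alert


def classify_psu_temp(temp_c):
    """Classify a PSU temperature reading against the two thresholds.

    Assumption: a threshold is crossed only when the reading is strictly
    greater than it.
    """
    if temp_c > MAXIMUM_THRESHOLD_C:
        return "maximum"
    if temp_c > WARNING_THRESHOLD_C:
        return "warning"
    return "ok"
```

Under this rule, a PSU with a failed fan reading 42 Degrees C (the value shown in the sample firmware warning in this alert) classifies as "ok", which illustrates how a fan failure can go unreported by threshold checks alone.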
If undetected PSU fan failures are allowed to accumulate within a platform over many months*, a platform power loss can occur with very little advance warning.
*Fan failure is largely due to bearing failure as the fans reach the end of bearing life, and the time between PSU fan failures within a single platform is likely to be months or years.
Important Note:
Patches 114526-08 and 114527-03 (or later) provide firmware which monitors power supply temperature and may provide a warning similar to the following:
WARNING: PS2 temperature is elevated indicating it may have a failed cooling fan.
PS2 48 VDC 0 Temp. 0 value: 42 Degrees C
Contact Sun Support Services to check for PSU fan failure.
In some cases on the 4800 platform, the elevated temperature reported in this message can be normal.
On the 6800 platform with these firmware patches, the warning message may not identify the power supply with failed fans when the first fan failure occurs.
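Saved System Controller log text can be scanned for the warning shown above. The sketch below assumes the message wording matches the sample in this alert exactly; the alert only says the warning is "similar to" that text, so the pattern may need adjusting. The function name is hypothetical.

```python
import re

# Assumption: the wording matches the sample warning in this alert.
_ELEVATED_RE = re.compile(
    r"WARNING:\s+(PS\d+)\s+temperature is elevated "
    r"indicating it may have a failed cooling fan"
)


def psus_with_elevated_warning(log_text):
    """Return PSU IDs named in elevated-temperature warnings, in first-seen order."""
    seen = []
    for match in _ELEVATED_RE.finditer(log_text):
        psu = match.group(1)
        if psu not in seen:
            seen.append(psu)
    return seen
```

Given the caveats above for the 4800 and 6800 platforms, a match is a prompt for physical inspection rather than proof of a failed fan, and the absence of a match does not rule one out.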
Workaround
Power supplies with failed fans should be replaced. To detect failed PSU fans before implementing the resolution below, physically inspect the power supplies for the symptoms described above.
Resolution
This issue is addressed in the following platforms:
- Sun Fire 6800/4800/4810/3800 with firmware 5.19.7 (as delivered in patch 114526-08) or 5.20.2 (as delivered in patch 114527-03) or later
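As an illustration, the patch revisions named above can be checked with a small helper. This sketch assumes patch identifiers have the form base-revision (for example 114526-08) and that a higher revision number of the same base patch is "later"; the function name is hypothetical.

```python
# Minimum patch revisions named in this alert.
MIN_PATCH_REVISIONS = {"114526": 8, "114527": 3}


def patch_has_fix(patch_id):
    """Return True if patch_id is one of the listed patches at or above
    the revision named in this alert, e.g. '114526-08' or '114527-05'."""
    base, sep, rev = patch_id.strip().partition("-")
    if not sep or base not in MIN_PATCH_REVISIONS or not rev.isdigit():
        return False  # not a recognized base-revision patch identifier
    return int(rev) >= MIN_PATCH_REVISIONS[base]
```

For instance, patch_has_fix("114526-07") returns False, while patch_has_fix("114527-03") returns True.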
Modification History
Date: 22-MAR-2007
- Updated Resolution section
- State: Resolved
Date: 09-APR-2007
- Updated CR list and Resolution section
Attachments
This solution has no attachment