Using the Reset Button on a Main System Controller May Cause Domain Outage

| Category : | Availability |
| Release Phase : | Resolved |
| Product : | Sun Fire 3800 Server, Sun Fire 4800 Server, Sun Fire 4810 Server, Sun Fire 6800 Server, Sun Fire E6900 Server, Sun Fire E4900 Server |
| Bug Id : | 4378797 |
| Date of Workaround Release : | 21-APR-2005 |
| Date of Resolved Release : | 12-NOV-2007 |
Impact
If the main System Controller (SC) on a Sun Fire 3800, 4800, 4810, 4900, 6800 or 6900 system with running domains is reset with the reset button, the hardware configuration may change in a way that causes the domains to perform a "fatal" reset. The domains then reset and act according to the "error-reset-recovery" OBP property, which may result in unexpected system outages while the domains are recovered.
Contributing Factors
This issue can occur on the following platforms when the SC reset button is used while domains are running:
- Sun Fire 3800, 4800, 4810, 4900, 6800, 6900 (without recommended SunFire SCApp firmware update 5.12.6)
The Sun Fire System Controller (SC) periodically queries system ASICs (Application Specific Integrated Circuits) via JTAG buses to read configuration, monitor environmental states, and change domain configuration. If the hardware reset button is used during one of these operations, the JTAG bus may be left in an undefined state. The resulting change in configuration can trigger a fatal reset on the affected active domains.
To determine the firmware version of the SCApp, use the "showsc" command from the platform shell as follows:
SC> showsc
SC: SSC0
Main System Controller
SC Failover: disabled
Clock failover enabled.
SC date: Thu Jun 01 12:59:45 CDT 2006
SC uptime: 25 minutes 58 seconds
ScApp version: 5.19.6 Build_01
RTOS version: 45
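The version check can also be scripted against captured "showsc" output. The following is a minimal sketch, not a supported tool: it assumes the output has been saved as text, that the "ScApp version:" line has the form shown above, and that version numbers can be compared component by component. The function names and the FIXED_SCAPP constant are illustrative; the 5.12.6 level is the one named in this document.

```python
import re

# Firmware level named in this document as carrying the fix.
FIXED_SCAPP = (5, 12, 6)

def scapp_version(showsc_output):
    """Return the ScApp version as a tuple of ints, or None if not found."""
    match = re.search(
        r"^\s*ScApp version:\s*(\d+(?:\.\d+)*)", showsc_output, re.MULTILINE
    )
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

def has_fix(showsc_output, fixed=FIXED_SCAPP):
    """True if the reported ScApp version is at or above the fixed level.

    Assumption: versions order numerically, component by component.
    """
    version = scapp_version(showsc_output)
    return version is not None and version >= fixed

# Sample text copied from the showsc output shown above.
sample = """SC: SSC0
Main System Controller
ScApp version: 5.19.6 Build_01
RTOS version: 45
"""
print(scapp_version(sample), has_fix(sample))
```

A result of None from scapp_version means the line was not found, in which case the version should be confirmed manually from the platform shell.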
Symptoms
Shortly after the main System Controller has been reset using the reset button, the domains within the system reboot with an error message similar to the following:
ErrorMonitor: Domain A has a SYSTEM ERROR
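Where SC console output is captured to a file or loghost, the message above can be searched for after an SC reset. This is a hedged sketch only: the capture mechanism is an assumption, and the pattern matches just the message wording quoted in this document, which the text describes as "similar to" what may actually appear.

```python
import re

# Matches the message wording quoted in this document; real messages may vary.
SYSTEM_ERROR = re.compile(r"ErrorMonitor: Domain (\w+) has a SYSTEM ERROR")

def domains_with_system_error(log_text):
    """Return the sorted, de-duplicated domain names reported in log_text."""
    return sorted(set(SYSTEM_ERROR.findall(log_text)))

# Hypothetical captured console text for illustration.
captured = (
    "ErrorMonitor: Domain A has a SYSTEM ERROR\n"
    "unrelated console line\n"
    "ErrorMonitor: Domain B has a SYSTEM ERROR\n"
    "ErrorMonitor: Domain A has a SYSTEM ERROR\n"
)
print(domains_with_system_error(captured))
```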
Workaround
In the case of an SC becoming unresponsive, attempts should be made to confirm connectivity via the serial port and network prior to using the reset button. If the SC appears to be hung:
- Confirm that the SC is actually hung by connecting to the serial port of the SC (with a known good cable)
- Hit "enter" a few times - if no prompt is returned, the SC is hung
- If this is the case, halt all domains, using the Solaris "init 0" command (or "shutdown")
- Reset the SC using the reset button, or power-cycle the whole chassis.
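The order of the steps above can be summarized as a small decision helper. This is an illustrative sketch for runbook use, not part of ScApp: the function name and its two boolean inputs are hypothetical, and the operator is assumed to have already checked the serial console and the state of the domains by hand.

```python
def recommended_action(prompt_on_serial, domains_running):
    """Map the operator's observations to the next step of the workaround.

    prompt_on_serial: True if pressing Enter on the SC serial port returns a prompt.
    domains_running:  True if any domain is still up.
    """
    if prompt_on_serial:
        # The SC answered, so it is not hung and the reset button is not needed.
        return "SC is responsive: do not use the reset button"
    if domains_running:
        # The SC is hung; halt domains first so a reset cannot take them down.
        return "Halt all domains (init 0 or shutdown) before resetting the SC"
    return "Reset the SC with the reset button, or power-cycle the chassis"

print(recommended_action(prompt_on_serial=False, domains_running=True))
```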
Note: The use of the reset button on running domains should be avoided whenever possible, and the SC should be reset either by the above steps or via ScApp.
Resolution
This issue is addressed on the following platforms:
- Sun Fire 3800, 4800, 4810, 4900, 6800, 6900 with SunFire SCApp firmware 5.12.6 (as delivered in patch 112127-02 or later)
Note: The patch above addresses the software issue for BugID 4378797. The use of the reset button on running domains should be avoided whenever possible.
Modification History
Date: 28-SEP-2005
29-Sep-2005:
- Updated Relief/Workaround section
Date: 12-NOV-2007
- Updated Contributing Factors and Resolutions sections
- State: Resolved
Attachments
This solution has no attachment