Solaris Reboot Triggers Spurious SYSTEM Error in Adjacent Domain

Category: Availability
Release Phase: Resolved
Product: Sun Fire 3800 Server, Sun Fire 4800 Server, Sun Fire 4810 Server, Sun Fire 6800 Server, Sun Fire E6900 Server, Sun Fire E4900 Server
Bug Id: 6300392
Date of Workaround Release: 27-JUL-2005
Date of Resolved Release: 13-FEB-2006
Impact
A hardware error pause for an AR L2CheckError may be asserted, abruptly halting processing within a domain. Hardware replacement will not resolve the issue.
Note: Internal testing has shown that L2CheckErrors of the type described in this alert can be reproduced with any firmware version lower than 5.19.7 or 5.20.3 by simulating an I/O board DC-DC converter failure.
Contributing Factors
This issue can occur on the following platforms:
- Sun Fire 3800, 4800, 4810, E4900, 6800, and E6900 systems without ScApp firmware 5.19.8 or 5.20.3 (as delivered in patches 114526-09 and 114527-04)
Notes:
- This error is rare and occurs only in configurations with more than one domain per partition: those running Domain B with Domain A, or Domain D with Domain C.
- Systems running only Domain A, or those running only Domains A and C are not affected by this issue.
- In some cases, this type of error has been preceded by failure of a DC-DC converter on an I/O board in one of the affected domains. (This was reproduced in the lab by simulating an I/O board DC-DC converter failure.)
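The configuration rule in the notes above can be sketched as a small check. This is an illustrative sketch only: the function name `exposed_pairs` and the way active domains are passed in are assumptions, not part of any Sun tool.

```python
# Domain pairs that share a partition, as described in this alert.
ADJACENT_PAIRS = [("A", "B"), ("C", "D")]


def exposed_pairs(active_domains):
    """Return the pairs of active domains that share a partition.

    An empty result means the configuration is not exposed to this issue
    (for example, Domain A alone, or Domains A and C only).
    """
    active = {d.upper() for d in active_domains}
    return [pair for pair in ADJACENT_PAIRS if active.issuperset(pair)]


if __name__ == "__main__":
    for config in (["A"], ["A", "C"], ["A", "B"], ["A", "C", "D"]):
        pairs = exposed_pairs(config)
        status = "exposed" if pairs else "not exposed"
        print(config, status, pairs)
```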
To determine the version of ScApp on a system, the following command can be run (from the platform shell):
sc0:SC> showsc
...
ScApp version: 5.19.4 Build_01
RTOS version: 45
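If the showsc output is captured to text, the version comparison might be automated along the following lines. This is a minimal sketch: the function names are hypothetical, and the treatment of release lines other than 5.19 and 5.20 is an assumption, since this alert names only 5.19.8 and 5.20.3 as fixed releases.

```python
import re

# Fixed releases named in this alert: 5.19.8 on the 5.19 line, 5.20.3 on the 5.20 line.
FIXED_ON_LINE = {(5, 19): 8, (5, 20): 3}


def parse_scapp_version(showsc_output):
    """Extract (major, minor, patch) from a 'ScApp version:' line, or None."""
    m = re.search(r"ScApp version:\s*(\d+)\.(\d+)\.(\d+)", showsc_output)
    return tuple(int(g) for g in m.groups()) if m else None


def has_fix(version):
    """Return True if the given ScApp version is at or beyond a fixed release."""
    major, minor, patch = version
    if (major, minor) in FIXED_ON_LINE:
        return patch >= FIXED_ON_LINE[(major, minor)]
    # Assumption: release lines newer than 5.20 carry the fix, older lines do not.
    return (major, minor) > (5, 20)


if __name__ == "__main__":
    sample = "ScApp version: 5.19.4 Build_01\nRTOS version: 45\n"
    version = parse_scapp_version(sample)
    print(version, "fixed" if has_fix(version) else "affected")
```

For the sample output shown above, 5.19.4 is below 5.19.8, so the sketch reports the system as affected.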
Symptoms
A Solaris reboot will cause the adjacent domain to fail with an error pause. (Adjacent domains are those running within the same partition: either A and B, or C and D.)
For a case where a Solaris reboot of Domain A causes a failure in Domain B, messages similar to the following may be seen on the SC Platform shell:
Domain Reboot A: Initiating keyswitch: on, domain A.
ErrorMonitor: Domain B has a SYSTEM ERROR
[AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf
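A saved platform shell log could be scanned for this pattern roughly as follows. This is a sketch under stated assumptions: the regular expressions are derived only from the sample messages above, the function name is hypothetical, and no time window is applied between the reboot and the error.

```python
import re

# Adjacent domains share a partition: A with B, C with D.
ADJACENT = {"A": "B", "B": "A", "C": "D", "D": "C"}

# Patterns modeled on the sample messages in this alert.
REBOOT_RE = re.compile(r"Domain Reboot ([A-D]):")
ERROR_RE = re.compile(r"ErrorMonitor: Domain ([A-D]) has a SYSTEM ERROR")


def find_adjacent_failures(lines):
    """Return (rebooted, failed) pairs where a SYSTEM ERROR follows
    a reboot of the adjacent domain."""
    last_rebooted = None
    hits = []
    for line in lines:
        m = REBOOT_RE.search(line)
        if m:
            last_rebooted = m.group(1)
            continue
        m = ERROR_RE.search(line)
        if m and last_rebooted and ADJACENT[last_rebooted] == m.group(1):
            hits.append((last_rebooted, m.group(1)))
    return hits


if __name__ == "__main__":
    log = [
        "Domain Reboot A: Initiating keyswitch: on, domain A.",
        "ErrorMonitor: Domain B has a SYSTEM ERROR",
        "[AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf",
    ]
    print(find_adjacent_failures(log))
```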
These messages may be seen on the SC Domain B shell:
ErrorMonitor: Domain B has a SYSTEM ERROR
/N0/SB3 encountered the first error
/N0/IB8 encountered the first error
ArAsic reported first error on /N0/SB3
/partition0/domain1/SB3/ar0:
>>> L2CheckError[0x6150] : 0x01808100
AccIncSyncErr [24:21] : 0xc accumulated incoming mismatch
FE [15:15] : 0x1
INCSyncErr [08:05] : 0x8 Ports [9:6] incoming mismatched against internal expected incoming
ArAsic reported first error on /N0/IB8
/partition0/domain1/IB8/ar0:
>>> L2CheckError[0x6150] : 0x18189010
CMDVSyncErr [12:09] : 0x8 Ports [9:6] command valid mismatched against internal expected command valid
PreqSyncErr [04:01] : 0x8 Ports [9:6] prereq mismatched against internal expected prereq
AccCMDVSyncErr [28:25] : 0xc accumulated valid command mismatch
FE [15:15] : 0x1
AccPreqSyncErr [20:17] : 0xc accumulated prerequisite mismatch
[AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf
In this case, each SB and IB in the failing domain will report an AR L2CheckError with either INCSyncErr or CMDVSyncErr. The adjacent domain that was being rebooted may reboot normally.
Note: ArAsic indicates that this error was detected by the Address Repeater (AR) ASIC (Application-Specific Integrated Circuit) within the Sun Fireplane Switch. The AR L2CheckError indicates unexpected behavior of the switch's distributed arbitration protocol.
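The field values in the sample output can be checked against the raw L2CheckError register values with a small bit-field decoder. This is a sketch only: the field positions are copied from the sample output in this alert, the register may define further fields not shown there, and the function name is hypothetical.

```python
# Field positions copied from the sample output: name -> (high bit, low bit).
L2_CHECK_ERROR_FIELDS = {
    "AccCMDVSyncErr": (28, 25),
    "AccIncSyncErr": (24, 21),
    "AccPreqSyncErr": (20, 17),
    "FE": (15, 15),
    "CMDVSyncErr": (12, 9),
    "INCSyncErr": (8, 5),
    "PreqSyncErr": (4, 1),
}


def decode_l2_check_error(value):
    """Return the non-zero fields of an L2CheckError register value."""
    decoded = {}
    for name, (high, low) in L2_CHECK_ERROR_FIELDS.items():
        width = high - low + 1
        field = (value >> low) & ((1 << width) - 1)
        if field:
            decoded[name] = field
    return decoded


if __name__ == "__main__":
    # Raw values reported for /N0/SB3 and /N0/IB8 in the sample output.
    for raw in (0x01808100, 0x18189010):
        fields = decode_l2_check_error(raw)
        print(hex(raw), {name: hex(v) for name, v in fields.items()})
```

Decoding 0x01808100 yields AccIncSyncErr 0xc, FE 0x1, and INCSyncErr 0x8, matching the SB3 output. Decoding 0x18189010 yields AccCMDVSyncErr 0xc, AccPreqSyncErr 0xc, FE 0x1, CMDVSyncErr 0x8, and PreqSyncErr 0x8, matching the IB8 output.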
The error is repeatable: a reboot of one domain causes the adjacent domain to fail until the master system controller (SC) has been rebooted. A failover to the spare SC has the same effect as an SC reboot. Replacing the various FRUs that contain the Sun Fireplane Switch has no effect.
Workaround
To temporarily work around the described issue, reboot the primary SC with the "reboot" command.
Resolution
This issue is addressed on the following platforms:
- Sun Fire 3800, 4800, 4810, E4900, 6800, and E6900 systems with ScApp firmware 5.19.8 or 5.20.3 or later (as delivered in patches 114526-09 and 114527-04)
Modification History
Date: 13-FEB-2006
13-Feb-2006:
- Updated Impact, Contributing Factors and Resolution sections
- State: Resolved
Date: 05-DEC-2006
05-Dec-2006:
- Updated Contributing Factors and Resolution sections
Attachments
This solution has no attachment