Solaris Reboot Triggers Spurious SYSTEM Error in Adjacent Domain



Category : Availability
Release Phase : Resolved
Product : Sun Fire 3800 Server
          Sun Fire 4800 Server
          Sun Fire 4810 Server
          Sun Fire 6800 Server
          Sun Fire E6900 Server
          Sun Fire E4900 Server
Bug Id : 6300392
Date of Workaround Release : 27-JUL-2005
Date of Resolved Release : 13-FEB-2006


Impact

A hardware error pause for an AR L2CheckError may be asserted, abruptly halting processing within a domain. Hardware replacement will not resolve the issue.

Note: Internal testing has shown that L2CheckErrors of the type described in this alert can be reproduced with any firmware version lower than 5.19.7 or 5.20.3 by simulating an IO board DC-DC converter failure.


Contributing Factors

This issue can occur on the following platforms:

  • Sun Fire 3800, 4800, 4810, E4900, 6800, and E6900 systems without ScApp firmware 5.19.8 or 5.20.3 (as delivered in patches 114526-09 and 114527-04)

Notes:

  1. This error is rare and occurs only in configurations with more than one domain per partition, that is, those running Domain B with Domain A, or Domain D with Domain C.
  2. Systems running only Domain A, or those running only Domains A and C are not affected by this issue.
  3. In some cases, this type of error has been preceded by failure of a DC-DC converter on an I/O board in one of the affected domains. (This was reproduced in the lab by simulating an I/O board DC-DC converter failure.)
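
The configuration rule in the notes above can be expressed as a small check. This is a minimal sketch in Python, not a Sun-provided tool: the helper name `exposed_pairs` and the idea of passing in a list of active domain letters are assumptions made for illustration. Only the pairing (A with B, C with D) comes from this alert.

```python
# Hypothetical helper: the pairing of domains per partition (A+B, C+D)
# is taken from the notes in this alert; everything else is illustrative.
PARTITION_PAIRS = (("A", "B"), ("C", "D"))


def exposed_pairs(active_domains):
    """Return the domain pairs that share a partition and are both active."""
    active = {d.upper() for d in active_domains}
    return [pair for pair in PARTITION_PAIRS if set(pair) <= active]


print(exposed_pairs(["A"]))            # []
print(exposed_pairs(["A", "C"]))       # []
print(exposed_pairs(["A", "B", "C"]))  # [('A', 'B')]
```

An empty result corresponds to the unaffected configurations in note 2. A non-empty result only means the configuration matches note 1, not that the error will occur.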

To determine the version of ScApp on a system, the following command can be run (from the platform shell):

    sc0:SC> showsc
    ...
    ScApp version: 5.19.4 Build_01
    RTOS version: 45
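
For sites that collect `showsc` output from many systems, the version comparison could be scripted. The following Python sketch is an assumption-laden illustration: the function names are hypothetical, and the rule that 5.19.x releases are compared against 5.19.8 while all other releases are compared against 5.20.3 is an interpretation of the versions named in this alert, not a documented Sun algorithm.

```python
import re

# Fixed releases named in this alert. Comparing each release only against
# the fix on its own branch is an assumption made for this sketch.
FIXED_5_19 = (5, 19, 8)
FIXED_5_20 = (5, 20, 3)


def parse_scapp_version(showsc_output):
    """Extract the ScApp version from showsc output as a tuple of ints."""
    match = re.search(r"ScApp version:\s*(\d+)\.(\d+)\.(\d+)", showsc_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_fix(version):
    """Return True if the version is at or above the fixed release."""
    if version[:2] == (5, 19):
        return version >= FIXED_5_19
    return version >= FIXED_5_20


sample = """ScApp version: 5.19.4 Build_01
RTOS version: 45"""
version = parse_scapp_version(sample)
print(version, has_fix(version))  # (5, 19, 4) False
```

The sample text is the `showsc` excerpt shown above, so the sketch reports that release as lacking the fix.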

Symptoms

A Solaris reboot of one domain will cause the adjacent domain to fail with an error pause. (Adjacent domains are those running within the same partition: either A and B, or C and D.)

For a case where a Solaris reboot of Domain A causes a failure in Domain B, messages similar to the following may be seen on the SC Platform shell:

    Domain Reboot A: Initiating keyswitch: on, domain A.
    ErrorMonitor: Domain B has a SYSTEM ERROR
    [AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf
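
The message sequence above (a domain reboot followed by a SYSTEM ERROR in the adjacent domain) could be searched for in saved platform shell logs. The Python sketch below is only an illustration: the function name and the matching rule (the error must name the domain adjacent to the most recently rebooted one) are assumptions, and real logs may carry timestamps or interleaved messages that this simple scan does not account for.

```python
import re

# Adjacent domains share a partition, as described in this alert.
ADJACENT = {"A": "B", "B": "A", "C": "D", "D": "C"}

# Patterns modeled on the sample messages in this alert.
REBOOT_RE = re.compile(r"Domain Reboot ([A-D]):")
ERROR_RE = re.compile(r"ErrorMonitor: Domain ([A-D]) has a SYSTEM ERROR")


def find_adjacent_failures(lines):
    """Return (rebooted, failed) pairs where the failed domain is adjacent
    to the most recently rebooted domain."""
    hits = []
    last_reboot = None
    for line in lines:
        reboot = REBOOT_RE.search(line)
        if reboot:
            last_reboot = reboot.group(1)
            continue
        error = ERROR_RE.search(line)
        if error and last_reboot and ADJACENT[last_reboot] == error.group(1):
            hits.append((last_reboot, error.group(1)))
    return hits


log = [
    "Domain Reboot A: Initiating keyswitch: on, domain A.",
    "ErrorMonitor: Domain B has a SYSTEM ERROR",
    "[AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf",
]
print(find_adjacent_failures(log))  # [('A', 'B')]
```

A match is a hint to compare the domain shell output against the L2CheckError details in this alert, not a diagnosis by itself.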

These messages may be seen on the SC Domain B shell:

    ErrorMonitor: Domain B has a SYSTEM ERROR
    /N0/SB3 encountered the first error
    /N0/IB8 encountered the first error
    ArAsic reported first error on /N0/SB3
    /partition0/domain1/SB3/ar0: 
    >>> L2CheckError[0x6150] : 0x01808100
           AccIncSyncErr [24:21] : 0xc accumulated incoming mismatch
                      FE [15:15] : 0x1 
              INCSyncErr [08:05] : 0x8 Ports [9:6] incoming mismatched against internal expected incoming
    ArAsic reported first error on /N0/IB8
    /partition0/domain1/IB8/ar0: 
    >>> L2CheckError[0x6150] : 0x18189010
             CMDVSyncErr [12:09] : 0x8 Ports [9:6] command valid mismatched against internal expected command valid
             PreqSyncErr [04:01] : 0x8 Ports [9:6] prereq mismatched against internal expected prereq
          AccCMDVSyncErr [28:25] : 0xc accumulated valid command mismatch
                      FE [15:15] : 0x1 
          AccPreqSyncErr [20:17] : 0xc accumulated prerequisite mismatch
    [AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf

In this case, each SB and IB in the failing domain will report an AR L2CheckError with either INCSyncErr or CMDVSyncErr. The adjacent domain that was being rebooted may complete its reboot normally.
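
The field breakdown in the sample output can be reproduced from the raw register values. The Python sketch below uses only the bit ranges printed in the messages above; the table and function names are hypothetical, and any register bits not shown in this alert are left undecoded rather than guessed.

```python
# Bit ranges (high, low) copied from the sample L2CheckError output in this
# alert. Fields not shown in the alert are deliberately omitted.
L2_CHECK_ERROR_FIELDS = {
    "AccCMDVSyncErr": (28, 25),
    "AccIncSyncErr": (24, 21),
    "AccPreqSyncErr": (20, 17),
    "FE": (15, 15),
    "CMDVSyncErr": (12, 9),
    "INCSyncErr": (8, 5),
    "PreqSyncErr": (4, 1),
}


def decode_l2_check_error(value):
    """Return the non-zero known fields of an L2CheckError register value."""
    decoded = {}
    for name, (high, low) in L2_CHECK_ERROR_FIELDS.items():
        width = high - low + 1
        field = (value >> low) & ((1 << width) - 1)
        if field:
            decoded[name] = field
    return decoded


for raw in (0x01808100, 0x18189010):
    fields = decode_l2_check_error(raw)
    print(hex(raw), {name: hex(v) for name, v in fields.items()})
```

For the two values shown in the sample output, this yields the same non-zero fields the SC printed: AccIncSyncErr 0xc, FE 0x1 and INCSyncErr 0x8 for the SB3 value, and AccCMDVSyncErr 0xc, AccPreqSyncErr 0xc, FE 0x1, CMDVSyncErr 0x8 and PreqSyncErr 0x8 for the IB8 value.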

Note: ArAsic indicates that this error was detected by the Address Repeater (AR) ASIC (Application-Specific Integrated Circuit) within the Sun Fireplane Switch. The AR L2CheckError indicates unexpected behavior of the switch's distributed arbitration protocol.

The error will be repeatable, with each reboot of one domain causing the adjacent domain to fail, until the master system controller (SC) has been rebooted. A failover to the spare SC will have the same effect as rebooting the master SC. Hardware replacement of the various FRUs that contain the Sun Fireplane Switch has no effect.


Workaround

To temporarily work around the described issue, reboot the primary SC with the "reboot" command.


Resolution

This issue is addressed on the following platforms:

  • Sun Fire 3800, 4800, 4810, E4900, 6800, and E6900 systems with ScApp firmware 5.19.8 or 5.20.3 or later (as delivered in patches 114526-09 and 114527-04)



Modification History


Date: 13-FEB-2006

  • Updated Impact, Contributing Factors and Resolution sections
  • State: Resolved

Date: 05-DEC-2006

  • Updated Contributing Factors and Resolution sections



Attachments
This solution has no attachment

 
 
Article Details
Article ID : 200010
Article Type : Sun Alert
Last reviewed : 2006-12-05
Audience : PUBLIC