System May Hang or Panic Accompanied by "lpost" Messages



Category :Availability
Release Phase :Resolved
Product :Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
Sun Fire 6800 Server
Sun Fire E6900 Server
Sun Fire E2900 Server
Sun Fire V1280 Server
Sun Fire E4900 Server  
Bug Id :4978865, 5054736  
Date of Workaround Release :25-APR-2005 
Date of Resolved Release :01-AUG-2005 


Impact

The system may present false indications of hardware failure, which may be diagnosed incorrectly and lead to unnecessary hardware replacement. Loss of application availability may occur due to a system panic or hang, which may require a "setkeyswitch" cycle to recover.


Contributing Factors

This issue can occur on the following platforms:

  • Sun Fire E2900, 3800, 4800, 4810, E4900, 6800, E6900 and V1280 servers with System Controller (ScApp) firmware version 5.18.x or earlier, without ScApp firmware patch 114526-01, and with domains running Solaris 9 without Kernel update patch 117171-14

Notes:

  1. Solaris 7, Solaris 8 and Solaris 10 are not affected by this issue, nor is the Solaris x86 platform.
  2. In some cases, use of prtdiag(1M) has been shown to trigger false indications of system hardware failure.
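The configuration described above can be screened with a small helper. This is a minimal sketch, not a Sun-supplied tool: the function names, the platform strings and the idea of passing in the platform, ScApp version and Solaris release are all hypothetical. It checks only the platform, firmware level (5.18.x or earlier, i.e. below 5.19.0) and Solaris release; it does not check the kernel patch level.

```python
# Hypothetical screen for the "Contributing Factors" configuration.
# Platform strings are illustrative; all are SPARC servers, so the
# Solaris x86 platform is out of scope by construction.
AFFECTED_PLATFORMS = {
    "E2900", "3800", "4800", "4810", "E4900", "6800", "E6900", "V1280",
}


def parse_version(text: str) -> tuple:
    """Turn a dotted ScApp version such as "5.18.3" into (5, 18, 3).

    Missing components are padded with zeros, so "5.19" becomes (5, 19, 0).
    """
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def in_scope(platform: str, scapp_version: str, solaris_release: str) -> bool:
    """Return True if platform, firmware and OS match Contributing Factors.

    Only Solaris 9 domains are in scope (Solaris 7, 8 and 10 are not
    affected), and only firmware below 5.19.0 predates the fix.
    """
    if platform not in AFFECTED_PLATFORMS:
        return False
    if solaris_release != "9":
        return False
    return parse_version(scapp_version) < (5, 19, 0)
```

Tuple comparison keeps the version check simple; a configuration that passes this screen still needs its kernel patch level checked separately.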

Symptoms

Should the described issue occur, the system may present false indications of hardware failure. In most cases, showlogs, showerrorbuffer and the domain messages contain little or no information indicating an error. The "WARNING: Asynchronous Event" console message, coupled with a system hang or panic, is the only indicator. Messages reporting a Time-out (TO) from the system bus and/or Privileged (PRIV) code access error(s) may also be displayed.

Asynchronous event "lpost" messages or panic messages similar to the following examples (from the platform loghost output) may appear during routine shutdown or reboot:

    {/N0/SB1/P2} WARNING: Asynchronous Event.
    {/N0/SB1/P2} Component under test: /N0/SB1/P2 CPU
    {/N0/SB1/P2}     Unexpected event occurred
    {/N0/SB1/P2} Ino = 00000000.00000000
    {/N0/SB1/P2}  tl  tt         tstate                 tpc               tnpc
    {/N0/SB1/P2} 01  60  00000044.80000604  000007ff.f000bd3c  000007ff.f000bd40
    {/N0/SB1/P2} AFSR = 00000000.00000000
    {/N0/SB1/P2} AFAR = 00000028.04001800
    {/N0/SB1/P2} IMMU SFSR = 00000000.00000000
    {/N0/SB1/P2} DMMU SFSR = 00000000.00000000
    {/N0/SB1/P2} DMMU SFAR = 00000300.14821480
    {/N0/SB1/P2} PState = 00000000.00000814
    {/N0/SB1/P2} Dispatch Control =00000000.00000000
    {/N0/SB1/P2} Data Cache Unit Control =00000000.00000000
    {/N0/SB1/P2} Safari Config. = 0aaa0028.200c0006
    {/N0/SB1/P2} EState = 00000000.0000000b
    {/N0/SB1/P2}  tl  tt         tstate                 tpc               tnpc
    {/N0/SB1/P2} 02  32  00000099.80081402  000007ff.f0006cc0  000007ff.f0006cc4
    {/N0/SB1/P2} 01  60  00000044.80000604  000007ff.f000bd3c  000007ff.f000bd40
    {/N0/SB1/P2}    (TO) Time-out from system bus
    {/N0/SB1/P2}    (PRIV) Privileged code access error(s)

This second example displays another variation of an "Asynchronous Event" message from lpost:

    {/N0/SB4/P1} @(#) lpost 5.17.2 2004/08/13 11:53
    {/N0/SB4/P1} Copyright 2001-2004 Sun Microsystems, Inc. All rights reserved.
    {/N0/SB4/P1} Use is subject to license terms.
    {/N0/SB4/P1} test case reset reason = 00000000.0404ff07
    {/N0/SB4/P1} test case ecache_size=00000000.00800000, tag_size=00000000.00004000
    {/N0/SB4/P1} test case Ecache Mode: 0:3:3
    {/N0/SB4/P1} test case E$ control register = 00000000.00094400
    {/N0/SB4/P1} @(#) lpost 5.17.2 2004/08/13 11:53
    {/N0/SB4/P1} Copyright 2001-2004 Sun Microsystems, Inc. All rights reserved.
    {/N0/SB4/P1} Use is subject to license terms.
    {/N0/SB4/P1} test case reset reason = 00000004.04ff0707
    {/N0/SB4/P1} test case ecache_size=00000000.00800000, tag_size=00000000.00004000
    {/N0/SB4/P1} test case Ecache Mode: 0:3:3
    {/N0/SB4/P1} test case E$ control register = 00000000.00094400
    {/N0/SB4/P1} test case IoSram Add : 0000041c.00900000
    {/N0/SB4/P1} WARNING: Asynchronous Event.
    {/N0/SB4/P1} Component under test: /N0/SB4/P1 CPU
    {/N0/SB4/P1} Task 00000000.00037144 does not exist

This third example shows an "ERROR" message from lpost, as reported by firmware that predates the changes made for bug 4988128 (firmware revisions 5.15.5, 5.16.1, 5.17.1, 5.18.0 and higher report the WARNING message format instead):

    {/N0/SB1/P2} Use is subject to license terms.
    {/N0/SB1/P2} test case reset reason = 00000001.04ff0707
    {/N0/SB1/P2} test case ecache_size=00000000.00800000, tag_size=00000000.00004000
    {/N0/SB1/P2} test case E$ control register = 00000000.07c55400
    {/N0/SB1/P2} test case IoSram Add : 00000420.00900000
    {/N0/SB1/P2} ERROR: TEST=Dummy,SUBTEST=Slave Test ID=0.0
    {/N0/SB1/P2} Component under test: /N0/SB1/P2 CPU
    {/N0/SB1/P2} Task 00000000.000374a8 does not exist
    {/N0/SB1/P2} @(#) lpost 5.15.3 2003/09/30 23:01
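The firmware revisions listed for bug 4988128 act as a per-release-train minimum. The following is a minimal sketch, assuming three-part version strings such as "5.17.2" and assuming that release trains older than 5.15 keep the ERROR format; the function and table names are hypothetical.

```python
# Minimum revision, per release train, at which lpost reports the
# "WARNING: Asynchronous Event" format instead of "ERROR:" (bug 4988128).
WARNING_FORMAT_MINIMUMS = {
    (5, 15): 5,
    (5, 16): 1,
    (5, 17): 1,
}


def lpost_uses_warning_format(version: str) -> bool:
    """Return True if this lpost/ScApp version uses the WARNING format."""
    major, minor, patch = (int(part) for part in version.split("."))
    if (major, minor) >= (5, 18):
        return True  # 5.18.0 and higher
    minimum = WARNING_FORMAT_MINIMUMS.get((major, minor))
    if minimum is None:
        return False  # assumption: trains before 5.15 keep the ERROR format
    return patch >= minimum
```

This agrees with the banners in the examples above: the "lpost 5.17.2" output shows the WARNING format, while the "lpost 5.15.3" output shows the ERROR format.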

In a second scenario, the system panics, again accompanied by one of the above types of error messages in the console logs. The system recovers by means of the panic reboot. Panic messages vary; some examples follow:

Example 1:

    panic: failed to stop cpu5

    panic[cpu6]/thread=30005c537c0: bad kernel MMU trap at TL 2

    %tl %tpc              %tnpc             %tstate           %tt
     1  000000000101819c  00000000010181a0  9900001601        068
        %ccr: 99  %asi: 00  %cwp: 1  %pstate: 16<PEF,PRIV,IE>
     2  0000000001008c44  0000000001008c48  4400041401        034
        %ccr: 44  %asi: 00  %cwp: 1  %pstate: 414<MG,PEF,PRIV>

Example 2:

    panic: failed to stop cpu6

    panic[cpu5]/thread=2a100c97d40: bad kernel MMU miss at TL 2

    %tl %tpc              %tnpc             %tstate           %tt
     1  000000000104cf68  000000000104cf6c  4400001603        060
        %ccr: 44  %asi: 00  %cwp: 3  %pstate: 16<PEF,PRIV,IE>
     2  00000000010cf884  00000000010cf888  9900081404        068
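Because panic messages vary, any matcher can only cover known strings. The following hypothetical sketch recognizes the two forms shown in the examples above ("failed to stop cpuN" and "bad kernel MMU trap/miss at TL N") and returns None for anything else; the function name and result fields are illustrative assumptions, not part of any Solaris tool.

```python
import re

# Patterns taken from the two panic examples above; other panic strings exist.
FAILED_STOP_RE = re.compile(r"panic: failed to stop cpu(\d+)")
MMU_PANIC_RE = re.compile(
    r"panic\[cpu(\d+)\]/thread=([0-9a-f]+): "
    r"bad kernel MMU (trap|miss) at TL (\d+)"
)


def match_panic(line: str):
    """Return a dict describing a recognized panic line, or None."""
    m = FAILED_STOP_RE.search(line)
    if m:
        return {"kind": "failed_to_stop", "cpu": int(m.group(1))}
    m = MMU_PANIC_RE.search(line)
    if m:
        return {
            "kind": "bad_kernel_mmu_" + m.group(3),
            "cpu": int(m.group(1)),
            "thread": m.group(2),
            "tl": int(m.group(4)),
        }
    return None
```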

Notes:

  1. The Asynchronous Event warning messages may or may not include the "test case reset reason =" line. A test case reset reason code ending in "7" indicates a "red_state" condition.
  2. If other errors are observed in the system logs, these should be investigated as well.
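Note 1 can be applied mechanically to a captured console line. This is a minimal sketch with a hypothetical function name; it returns None when the line carries no "test case reset reason =" field (which Note 1 says may be absent) and otherwise tests only whether the code ends in "7", exactly as the note states.

```python
MARKER = "test case reset reason ="


def red_state_from_line(line: str):
    """Return True/False for a reset-reason line, or None if absent.

    A reset reason code ending in "7" indicates a red_state condition.
    """
    if MARKER not in line:
        return None
    code = line.split(MARKER, 1)[1].strip()
    return code.endswith("7")
```

All three reset reason codes in the lpost examples above (00000000.0404ff07, 00000004.04ff0707 and 00000001.04ff0707) end in "7".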

Workaround

There is no workaround for this issue. Please see the Resolution section.


Resolution

This issue is addressed on the following platforms:

  • Sun Fire E2900, 3800, 4800, 4810, E4900, 6800, E6900 and V1280 servers with System Controller (ScApp) firmware version 5.19.0 (for Solaris 9), as delivered in ScApp firmware patch 114526-01 or later, together with Kernel update patch 117171-14 or later

Note: Kernel update patch 117171-14 or later is required to resolve BugID 4978865. Both patches must be installed to fully resolve this issue.
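Because both patches must be present at or above their minimum revisions, a patch inventory can be checked against the Resolution section. This is a hypothetical sketch: it assumes you have already gathered a combined list of patch IDs in the usual "NNNNNN-RR" form from the system controller and the Solaris 9 domain; how that list is collected is outside its scope, and the function names are illustrative.

```python
# Patch IDs and minimum revisions taken from the Resolution section.
REQUIRED_PATCHES = {"114526": 1, "117171": 14}


def parse_patch(patch_id: str):
    """Split a patch ID such as "117171-14" into ("117171", 14)."""
    base, rev = patch_id.split("-")
    return base, int(rev)


def fully_resolved(installed) -> bool:
    """True if every required patch is at or above its minimum revision."""
    best = {}
    for patch_id in installed:
        base, rev = parse_patch(patch_id)
        best[base] = max(rev, best.get(base, -1))
    return all(best.get(base, -1) >= rev for base, rev in REQUIRED_PATCHES.items())
```

Keeping only the highest revision seen per base ID means an older revision left in the list does not mask a newer one.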




Modification History


Date: 01-AUG-2005

  • Updated Contributing Factors and Resolution sections

Date: 05-DEC-2005

  • Updated Contributing Factors section

Date: 22-MAR-2006

  • Updated Contributing Factors and Resolution sections

Date: 23-AUG-2006

  • Updated Contributing Factors and Resolution sections

Article Details
Article ID : 201301
Article Type : Sun Alert
Last reviewed : 2006-08-23
Audience : PUBLIC