System May Hang or Panic Accompanied by "lpost" Messages
Category: Availability
Release Phase: Resolved
Product: Sun Fire 3800 Server, Sun Fire 4800 Server, Sun Fire 4810 Server, Sun Fire 6800 Server, Sun Fire E6900 Server, Sun Fire E2900 Server, Sun Fire V1280 Server, Sun Fire E4900 Server
Bug Id: 4978865, 5054736
Date of Workaround Release: 25-APR-2005
Date of Resolved Release: 01-AUG-2005
Impact
False indications of hardware failure may be mistaken for genuine faults, and the resulting remedial action may lead to unnecessary hardware replacement. Application availability may be lost due to a system panic or a system hang; recovery may require a "setkeyswitch" cycle.
Contributing Factors
This issue can occur on the following platforms:
- Sun Fire 2900, 3800, 4800, 4810, 4900, 6800, 6900 and V1280 servers with System Controller (ScApp) firmware versions 5.18.x and earlier without ScApp firmware patch 114526-01, and domains running Solaris 9 without Kernel update patch 117171-14
Notes:
- Solaris 7, Solaris 8 and Solaris 10 are not affected by this issue. The Solaris x86 platform is not affected by this issue.
- In some cases, use of prtdiag(1M) has been shown to trigger false indications of system hardware failure.
Symptoms
If the described issue occurs, the system may present false indications of system hardware failure. In most cases there is little or no information in showlogs, showerrorbuffer or domain messages to indicate an error. The "WARNING: Asynchronous Event" message on the console, coupled with a system hang or panic, is the only indicator. "Time-out from system bus" (TO) and/or "Privileged code access error(s)" (PRIV) messages may also be displayed.
Asynchronous event "lpost" messages or panic messages similar to the following examples (from the platform loghost output) may appear during routine shutdown or reboot:
{/N0/SB1/P2} WARNING: Asynchronous Event.
{/N0/SB1/P2} Component under test: /N0/SB1/P2 CPU
{/N0/SB1/P2} Unexpected event occurred
{/N0/SB1/P2} Ino = 00000000.00000000
{/N0/SB1/P2} tl tt tstate tpc tnpc
{/N0/SB1/P2} 01 60 00000044.80000604 000007ff.f000bd3c 000007ff.f000bd40
{/N0/SB1/P2} AFSR = 00000000.00000000
{/N0/SB1/P2} AFAR = 00000028.04001800
{/N0/SB1/P2} IMMU SFSR = 00000000.00000000
{/N0/SB1/P2} DMMU SFSR = 00000000.00000000
{/N0/SB1/P2} DMMU SFAR = 00000300.14821480
{/N0/SB1/P2} PState = 00000000.00000814
{/N0/SB1/P2} Dispatch Control =00000000.00000000
{/N0/SB1/P2} Data Cache Unit Control =00000000.00000000
{/N0/SB1/P2} Safari Config. = 0aaa0028.200c0006
{/N0/SB1/P2} EState = 00000000.0000000b
{/N0/SB1/P2} tl tt tstate tpc tnpc
{/N0/SB1/P2} 02 32 00000099.80081402 000007ff.f0006cc0 000007ff.f0006cc4
{/N0/SB1/P2} 01 60 00000044.80000604 000007ff.f000bd3c 000007ff.f000bd40
{/N0/SB1/P2} (TO) Time-out from system bus
{/N0/SB1/P2} (PRIV) Privileged code access error(s)
This second example displays another variation of an "Asynchronous Event" message from lpost:
{/N0/SB4/P1} @(#) lpost 5.17.2 2004/08/13 11:53
{/N0/SB4/P1} Copyright 2001-2004 Sun Microsystems, Inc. All rights reserved.
{/N0/SB4/P1} Use is subject to license terms.
{/N0/SB4/P1} test case reset reason = 00000000.0404ff07
{/N0/SB4/P1} test case ecache_size=00000000.00800000, tag_size=00000000.00004000
{/N0/SB4/P1} test case Ecache Mode: 0:3:3
{/N0/SB4/P1} test case E$ control register = 00000000.00094400
{/N0/SB4/P1} @(#) lpost 5.17.2 2004/08/13 11:53
{/N0/SB4/P1} Copyright 2001-2004 Sun Microsystems, Inc. All rights reserved.
{/N0/SB4/P1} Use is subject to license terms.
{/N0/SB4/P1} test case reset reason = 00000004.04ff0707
{/N0/SB4/P1} test case ecache_size=00000000.00800000, tag_size=00000000.00004000
{/N0/SB4/P1} test case Ecache Mode: 0:3:3
{/N0/SB4/P1} test case E$ control register = 00000000.00094400
{/N0/SB4/P1} test case IoSram Add : 0000041c.00900000
{/N0/SB4/P1} WARNING: Asynchronous Event.
{/N0/SB4/P1} Component under test: /N0/SB4/P1 CPU
{/N0/SB4/P1} Task 00000000.00037144 does not exist
This third example displays an "ERROR" message from lpost. (The ERROR message format was replaced by the WARNING message format through changes made for bug 4988128, in firmware revisions 5.15.5, 5.16.1, 5.17.1, 5.18.0 and higher.)
{/N0/SB1/P2} Use is subject to license terms.
{/N0/SB1/P2} test case reset reason = 00000001.04ff0707
{/N0/SB1/P2} test case ecache_size=00000000.00800000, tag_size=00000000.00004000
{/N0/SB1/P2} test case E$ control register = 00000000.07c55400
{/N0/SB1/P2} test case IoSram Add : 00000420.00900000
{/N0/SB1/P2} ERROR: TEST=Dummy,SUBTEST=Slave Test ID=0.0
{/N0/SB1/P2} Component under test: /N0/SB1/P2 CPU
{/N0/SB1/P2} Task 00000000.000374a8 does not exist
{/N0/SB1/P2} @(#) lpost 5.15.3 2003/09/30 23:01
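Because these messages are often the only indicator, it can help to scan saved loghost output for them per component. The following Python sketch is illustrative only: the function name scan_lpost and the marker names are hypothetical, and the marker strings are simply copied from the three examples above, so other lpost variations would not be matched.

```python
import re
from collections import defaultdict

# Prefix such as "{/N0/SB1/P2}" that appears on every lpost line in the examples.
PREFIX = re.compile(r"^\{(?P<component>/[^}]+)\}\s*(?P<text>.*)$")

# Marker strings copied from the example output; names are hypothetical labels.
MARKERS = {
    "async_event": "WARNING: Asynchronous Event",
    "error": "ERROR: TEST=",
    "bus_timeout": "(TO) Time-out from system bus",
    "priv_error": "(PRIV) Privileged code access error(s)",
}


def scan_lpost(lines):
    """Return {component: set of marker names seen for that component}."""
    found = defaultdict(set)
    for line in lines:
        match = PREFIX.match(line.strip())
        if not match:
            continue  # not an lpost line (for example, a panic message)
        for name, marker in MARKERS.items():
            if marker in match.group("text"):
                found[match.group("component")].add(name)
    return dict(found)
```

A component that reports async_event together with bus_timeout or priv_error matches the first example; any such hit should still be weighed against other errors in the system logs before hardware is replaced.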
A second scenario is a system panic, also accompanied by one of the above types of error messages reported in the console logs. System recovery is via panic reboot. Panic messages vary; some examples are:
Example 1:
panic: failed to stop cpu5
panic[cpu6]/thread=30005c537c0: bad kernel MMU trap at TL 2
%tl %tpc %tnpc %tstate %tt
1 000000000101819c 00000000010181a0 9900001601 068
%ccr: 99 %asi: 00 %cwp: 1 %pstate: 16<PEF,PRIV,IE>
2 0000000001008c44 0000000001008c48 4400041401 034
%ccr: 44 %asi: 00 %cwp: 1 %pstate: 414<MG,PEF,PRIV>
Example 2:
panic: failed to stop cpu6
panic[cpu5]/thread=2a100c97d40: bad kernel MMU miss at TL 2
%tl %tpc %tnpc %tstate %tt
1 000000000104cf68 000000000104cf6c 4400001603 060
%ccr: 44 %asi: 00 %cwp: 3 %pstate: 16<PEF,PRIV,IE>
2 00000000010cf884 00000000010cf888 9900081404 068
Notes:
- The Asynchronous Event warning messages may or may not include the "test case reset reason =" line. A test case reset reason code ending in "7" indicates a "red_state" condition.
- If other errors are observed in the system logs, these should be investigated as well.
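The reset reason rule in the first note can be checked mechanically. The sketch below is a minimal illustration, assuming the "test case reset reason = XXXXXXXX.XXXXXXXX" layout seen in the examples; the function name is_red_state is hypothetical, and the check looks only at the final hex digit as the note describes.

```python
import re

# Matches e.g. "test case reset reason = 00000000.0404ff07" as in the examples.
RESET_REASON = re.compile(
    r"test case reset reason\s*=\s*([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})"
)


def is_red_state(line):
    """Return True if the reset reason code ends in 7, False if it does not,
    and None if the line carries no reset reason at all."""
    match = RESET_REASON.search(line)
    if match is None:
        return None
    # Only the last hex digit of the code is examined.
    return match.group(2).endswith("7")
```

Returning None rather than False for lines without a reset reason keeps the "may or may not include" case distinct from a code that is present but does not end in 7.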
Workaround
There is no workaround for this issue. Please see the Resolution section.
Resolution
This issue is addressed on the following platforms:
- Sun Fire 2900, 3800, 4800, 4810, 4900, 6800, 6900 and V1280 servers with System Controller (ScApp) firmware version 5.19.0 (for Solaris 9) as delivered in ScApp firmware patch 114526-01 or later and Kernel update patch 117171-14 or later
Note: Kernel update patch version 117171-14 or higher is necessary to resolve BugID 4978865. Both patches must be installed to fully resolve this issue.
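Since both patches are required, a simple revision comparison can confirm whether a given system meets the minimum levels. The following Python sketch assumes the administrator has already collected the installed patch IDs as strings in the usual "NNNNNN-RR" form; how that list is gathered (for example, from "showrev -p" in the domain for the kernel patch, and from the System Controller for the firmware patch) is outside the sketch. The names parse_patch and missing_patches are hypothetical.

```python
# Minimum patch levels named in the Resolution section.
REQUIRED = ("114526-01", "117171-14")


def parse_patch(patch_id):
    """Split "117171-14" into ("117171", 14)."""
    base, _, rev = patch_id.strip().partition("-")
    if not (base.isdigit() and rev.isdigit()):
        raise ValueError("unexpected patch ID format: %r" % patch_id)
    return base, int(rev)


def missing_patches(installed, required=REQUIRED):
    """Return the required patch IDs not satisfied by the installed list."""
    best = {}  # highest installed revision seen for each base patch number
    for patch_id in installed:
        base, rev = parse_patch(patch_id)
        best[base] = max(rev, best.get(base, -1))
    missing = []
    for patch_id in required:
        base, rev = parse_patch(patch_id)
        if best.get(base, -1) < rev:
            missing.append(patch_id)
    return missing
```

An empty result means both minimum revisions are present; for instance, an installed list of "114526-04" and "117171-12" would still report "117171-14" as missing.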
Modification History
01-Aug-2005:
- Updated Contributing Factors and Resolution sections
05-Dec-2005:
- Updated Contributing Factors section
22-Mar-2006:
- Updated Contributing Factors and Resolution sections
23-Aug-2006:
- Updated Contributing Factors and Resolution sections
Attachments
This solution has no attachment.