Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K, Sun Fire V1280, and Netra 1280 Server Domains with 900MHz CPUs May Panic or Hang Due to Incorrect L2 SRAM Parameter Settings
Category: Availability
Release Phase: Resolved
Product: Sun Fire 12K Server, Sun Fire 3800 Server, Sun Fire 4800 Server, Sun Fire 4810 Server, Sun Fire 6800 Server, Sun Fire 15K Server, Sun Fire V1280 Server, Netra 1280 Server
Bug Id: 4808603, 4807422, 4809236
Date of Workaround Release: 31-JAN-2003
Date of Resolved Release: 17-MAR-2003
Impact
Sun has identified an issue with L2 SRAM parameter settings on Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K, Sun Fire V1280, and Netra 1280 systems. This issue may produce L2 SRAM errors, which can lead to domain panics or hangs.
Contributing Factors
This issue can occur on Sun Fire 3800/4800/4810/6800 systems that have 900MHz processors and are running the following firmware releases:
- Sun Fire 3800/4800/4810/6800 without firmware patch 112883-05 (firmware 5.14.4)
- Sun Fire 3800/4800/4810/6800 without firmware patch 112494-08 (firmware 5.13.5)
- Sun Fire 3800/4800/4810/6800 with any version of firmware patch 112127 (firmware 5.12.x)
This issue can occur on Sun Fire V1280 and Netra 1280 systems that have 900MHz processors and are running the following firmware releases:
- Sun Fire V1280 and Netra 1280 systems without firmware patch 113751-02 (firmware 5.13.0012)
This issue can occur on Sun Fire 12K/15K systems that have 900MHz processors and are running the following versions of HPOST:
- Sun Fire 12K/15K with SMS 1.1
- Sun Fire 12K/15K with SMS 1.2 without patch 112488-11
- Sun Fire 12K/15K with SMS 1.3 without patch 114608-01
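The patch levels listed above can be checked mechanically once the installed patch ID is known. The following is a minimal sketch, not a Sun-supplied tool. It assumes patch IDs have the form NNNNNN-RR and that a later revision of the same patch ID also contains the fix. The function names are hypothetical, and how the installed patch ID is obtained from the system is outside this sketch.

```python
import re

# Minimum patch revisions taken from the Contributing Factors lists.
MINIMUM_REVISION = {
    "112883": 5,   # Sun Fire 3800/4800/4810/6800, firmware 5.14.4
    "112494": 8,   # Sun Fire 3800/4800/4810/6800, firmware 5.13.5
    "113751": 2,   # Sun Fire V1280 and Netra 1280, firmware 5.13.0012
    "112488": 11,  # Sun Fire 12K/15K, SMS 1.2
    "114608": 1,   # Sun Fire 12K/15K, SMS 1.3
}


def parse_patch_id(patch_id):
    """Split a patch ID such as '112883-05' into (base, revision)."""
    match = re.fullmatch(r"(\d{6})-(\d{2})", patch_id.strip())
    if match is None:
        raise ValueError(f"not a patch ID of the form NNNNNN-RR: {patch_id!r}")
    return match.group(1), int(match.group(2))


def has_fix(patch_id):
    """Return True if the installed revision is at or above the listed minimum.

    Assumption: later revisions of the same patch ID retain the fix.
    """
    base, revision = parse_patch_id(patch_id)
    if base not in MINIMUM_REVISION:
        raise KeyError(f"patch {base} is not one of the patches in this alert")
    return revision >= MINIMUM_REVISION[base]


print(has_fix("112883-05"))  # revision equals the listed minimum
print(has_fix("112494-07"))  # revision is below the listed minimum
```

Firmware 5.12.x (patch 112127) and SMS 1.1 are deliberately absent from the table, because no revision of those releases is listed as containing the fix.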
Symptoms
When this issue is encountered, error messages containing one or more of the following character strings may appear:
UCC, UCU, EDC, EDU, WDC, WDU, CPC, CPU (not to be confused with references to a processor), Esynd 0x0071.
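As an illustration, a message file could be searched for these strings along the following lines. This is a rough sketch only: the exact layout of the messages is not specified here, so the code simply looks for the strings as whole, upper-case words, and the sample line in the usage is invented for demonstration. Because "CPU" also appears in ordinary references to a processor, any hit on that string should be reviewed by hand.

```python
import re

# Character strings listed in the Symptoms section.
ERROR_CODES = ("UCC", "UCU", "EDC", "EDU", "WDC", "WDU", "CPC", "CPU")
SYNDROME = "Esynd 0x0071"

# Match the codes only as whole, upper-case words.
_CODE_RE = re.compile(r"\b(" + "|".join(ERROR_CODES) + r")\b")


def find_indicators(line):
    """Return the listed strings found in one log line, codes first, sorted."""
    hits = sorted(set(_CODE_RE.findall(line)))
    if SYNDROME in line:
        hits.append(SYNDROME)
    return hits


# Hypothetical sample line, not taken from a real message file.
sample = "errors: UCC EDC Esynd 0x0071"
print(find_indicators(sample))
```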
Below are descriptions of the resulting system behavior and how the error would be reflected in the message files.
A. Sun Fire 3800/4800/4810/6800, Sun Fire V1280, Netra 1280:
i. System reports many single-bit errors from the same location, with one or a combination of the character strings listed above, and then panics or freezes.
ii. System reports the detection of a double-bit error, with one or a combination of the character strings listed above, and in most cases automatically reboots.
iii. No messages appear from Solaris; only the System Controller reports messages, containing the character string "ECC Syndrome: 0x071".
iv. Failures detected in POST:
(Seen during the test)
{/N0/SB0/P2} Component under test: /N0/SB0/P2 E-Cache
{/N0/SB0/P2} E-Cache RAM Compare Error J6400
{/N0/SB0/P2} address 00000000.00000028
{/N0/SB0/P2} expected 55555555.55555555
{/N0/SB0/P2} observed 55455555.55555555
(Seen at the end of the basic CPU tests)
{/N0/SB0/P2} E-Cache DIMM J6400 failed
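The expected and observed values in a RAM compare error can be compared bit by bit to see how many bits differ. The sketch below assumes the dotted notation represents one 64-bit value written as two 32-bit hexadecimal halves; the helper name is hypothetical.

```python
def flipped_bits(expected, observed):
    """Return the bit positions (0 = least significant) that differ."""
    exp = int(expected.replace(".", ""), 16)
    obs = int(observed.replace(".", ""), 16)
    diff = exp ^ obs  # a 1 bit marks every position where the values disagree
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# Values from the POST output above.
print(flipped_bits("55555555.55555555", "55455555.55555555"))  # [52]
```

Under that assumption, the sample above shows a single differing bit.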
B. Sun Fire 12K/15K:
i. System reports the detection of a double-bit error, with one or a combination of the character strings listed above, and in most cases automatically reboots.
ii. Failures detected in POST:
{SB03/P0} Component under test: /SB3/P0: E-Cache
{SB03/P0} E-Cache RAM Compare Error J4400
{SB03/P0} address 00000000.00010808
{SB03/P0} expected 00000000.00010808
{SB03/P0} observed 00000000.00010c08
FAIL E$Dimm SB3/P0/E0: Failure indicated in CPU MBox Primary service FRU is Slot SB3.
Proc SB3/P0: EpiBecacheR1_sc_tfunc(): Test FAILED
or
RECORDSTOP Detected for Slot SB17
SDI EX17/S0 Master_Stop_Status0[31:0] = 50040108 MStop0[3]: SDI is Recordstopped
SDI EX17/S0 Recordstop0[31:0] = 04018400
Rstop0[16]: R DARB texp request Recordstop (M)
Rstop0[26]: R 1E Slot0 asserted EccErr, enabled to cause Rstop (M)
EPLD SB17 Ecc_Err: Mask= F7 Err= 08 SDC reports EccErr
SDC SB17 EccStatus[31:0] = 0000C041
EccSt[15]: Safari port 0/1 Ecc error logged.
Received by DXs from local Safari port 0, read operation.
DX SB17/DX2 Ecc_Syndrome[31:0] = 00000071
Syndr[ 8: 0]: P01 Data: 071: Probable Double-bit UE within a nibble
Syndr[ 15]: P01 Direction: 0: Safari port to DX (Incoming)
NOTE: Error 071 is a "signal" of an Ecache Uncorrectable Error.
ECC uncorrectable errors detected from Processor Port SB17/P0, no corresponding parity
error in DXs or DCDSs. For multibit errors, the lack of parity error is not
sufficient to infer that the error originated in memory, it could be from the processor
or DCDS/DX link. The syndrome is a "signaling" UE that likely indicates an Ecache error.
FAIL All Ecache on Port SB17/P0: Rstop detected by DXs/SDC.
Primary service FRU is Slot SB17.
iii. A DSMD rstop dump is created on the SC during system operation; when examined with redx/wfail, it exhibits the same signature as the POST RECORDSTOP shown above.
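The Ecc_Syndrome value in the RECORDSTOP output above can be split into the two fields the dump itself names, Syndr[8:0] and Syndr[15]. The following sketch extracts only those two fields; the meaning of the remaining bits is not given in this document and is not decoded. The function and key names are hypothetical.

```python
SIGNALING_UE_SYNDROME = 0x071  # value the dump describes as a "signaling" UE


def decode_ecc_syndrome(register):
    """Extract the fields shown in the dump from a 32-bit Ecc_Syndrome value."""
    return {
        "p01_data": register & 0x1FF,            # Syndr[8:0]
        "p01_direction": (register >> 15) & 1,   # Syndr[15], 0 = Safari port to DX
    }


fields = decode_ecc_syndrome(0x00000071)
print(fields["p01_data"] == SIGNALING_UE_SYNDROME)  # True
print(fields["p01_direction"])                      # 0
```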
Workaround
The impact of this issue may be reduced by installing Solaris Kernel patches for the following releases:
Resolution
This issue is addressed in the following releases:
Sun Fire 3800/4800/4810/6800:
- Firmware 5.14.4 (or later) with patch 112883-05
- Firmware 5.13.5 (or later) with patch 112494-08
Sun Fire 3800/4800/4810/6800 platforms with firmware 5.12.x should be upgraded to a later version (5.13.x, 5.14.x) with the appropriate patch.
Sun Fire V1280 and Netra 1280:
- Firmware 5.13.0012 (or later) with patch 113751-02
Sun Fire 12K/15K:
Note: All domains must undergo a setkeyswitch standby/on operation after the patch is applied. This will run HPOST at the default level and apply the fix.
Sun Fire 12K/15K platforms with SMS 1.1 should be upgraded to SMS 1.2 (or later) and have the appropriate patch applied.
Modification History
Date: 20-FEB-2003
- Updated Synopsis
- Updated Contributing Factors
- Updated Resolution
Date: 17-MAR-2003
- Updated Contributing Factors
- Updated Resolution
- Changed State to Resolved
Date: 20-MAR-2003
Date: 04-APR-2003
Attachments
This solution has no attachment.