Sun Fire Systems Equipped With UltraSPARC IV+ Processor Modules Running Solaris 9 or Solaris 10 May Exhibit Unnecessary CPU Offlining and Solaris Panics
| Category : | Availability |
| Release Phase : | Resolved |
| Bug Id : | 6342696, 6361055, 6369869, 6380337, 6449180, 6587622, 6380531, 6589208 |
| Product : | UltraSPARC IV+ Processor, UltraSPARC IV Processor |
| Date of Workaround Release : | 07-DEC-2007 |
| Date of Resolved Release : | 24-Jun-2008 |
Systems equipped with UltraSPARC IV+ processor modules running Solaris 9 or Solaris 10 (see below for details)
1. Impact
Systems equipped with UltraSPARC IV+ processor modules running Solaris 9 or Solaris 10 may exhibit CPU offlining and Solaris panics. This may result in unnecessary part replacements.
2. Contributing Factors
These issues can occur in the following releases:
SPARC Platform
- Solaris 9 with UltraSPARC IV+ processors without patch (see Resolution section)
- Solaris 10 with UltraSPARC IV+ processors without patches (see Resolution section)
- Sun Fire 12K/15K/20K/25K with SMS 1.6 (for Solaris 9) without patch (see Resolution section)
- Sun Fire E6900/E4900/E2900/6800/4800/4810/3800 and V1280/V1290 with SCApp prior to version 5.20.0 (see Resolution section)
3. Symptoms
There are two issues described in this Sun Alert:
- Processor offlining
- System panics
The symptoms that may be seen for each issue are described below:
Offlined processors on Solaris 9 Systems:
If the described issue occurs, messages similar to the following, reporting incorrectly faulted CPU modules, will appear in the "/var/adm/messages" file:
Dec 26 04:20:28 domnam SUNW,UltraSPARC-IV+: [ID 610039 kern.info]
NOTICE: [AFT0] WDC Event detected by CPU166 at TL=0, errID 0x00566baa.7f6f4b1a
Dec 26 04:20:28 domnam AFSR 0x00000040<WDC>.000001d1 AFSR_EXT 0x00000000
AFAR 0x00000083.daa60990
Dec 26 04:20:28 domnam Fault_PC 0x100260980 Esynd 0x01d1
Dec 26 04:20:28 domnam SUNW,UltraSPARC-IV+: [ID 695758 kern.info]
[AFT0] errID 0x00566baa.7f6f4b1a Data Bit 115 was in error and corrected
Dec 26 04:20:28 domnam SUNW,UltraSPARC-IV+: [ID 765675 kern.notice]
NOTICE: [AFT1] CPU166 offlined due to more than 2 xxC Events in 24:00:00 (hh:mm:ss)
Note 1: The message states that the CPU was offlined for "more than 2 xxC Events in 24:00:00 (hh:mm:ss)"
Note 2: These are all UltraSPARC IV+ CPUs.
Note 3: With the original parameters in place, the SERD (Soft Error Rate Discriminator) will fire and the CPU will be faulted if the diagnostic engine receives as few as three errors against the processor memory in 24 hours.
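As an illustration only, the offlining notice shown above can be picked out of a saved copy of "/var/adm/messages" with a short script. This is a minimal sketch: the regular expression is inferred from the single sample line in this Sun Alert and is not a documented message format, and the function name is hypothetical.

```python
import re

# Pattern inferred from the one [AFT1] sample line in this alert (assumption);
# other Solaris 9 releases may word the notice differently.
OFFLINE_RE = re.compile(
    r"\[AFT1\] CPU(?P<cpu>\d+) offlined due to more than "
    r"(?P<limit>\d+) xxC Events in (?P<window>\d+:\d+:\d+)"
)


def find_offlined_cpus(lines):
    """Return (cpu_id, event_limit, window) for each offlining notice found."""
    found = []
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            found.append(
                (int(match["cpu"]), int(match["limit"]), match["window"])
            )
    return found


# Two of the message lines quoted above, used as sample input.
sample = [
    "NOTICE: [AFT0] WDC Event detected by CPU166 at TL=0, errID 0x00566baa.7f6f4b1a",
    "NOTICE: [AFT1] CPU166 offlined due to more than 2 xxC Events in 24:00:00 (hh:mm:ss)",
]
offlined = find_offlined_cpus(sample)
print(offlined)
```

A CPU reported this way on an unpatched Solaris 9 system is a candidate for the issue described here, not proof of it; the patch levels in the Resolution section remain the deciding factor.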
Offlined processors on Solaris 10 systems:
If the described issue occurs, fault messages similar to the following, reporting incorrectly faulted CPU modules, will appear in the "/var/adm/messages" file:
Jul 18 17:33:17 2006 SUNW-MSG-ID: SUN4U-8001-1E, TYPE: Fault,
VER: 1, SEVERITY: Major
Jul 18 17:33:17 2006 EVENT-TIME: Tue Jul 18 16:33:29 CDT 2006
Any of the following fault codes may appear in syslog indicating that a CPU failure has occurred:
SUN4U-8000-XJ
SUN4U-8000-YE
SUN4U-8001-0J
SUN4U-8001-1E
Systems experience a CPU retirement associated with one of the above Predictive Self Healing (PSH) fault codes.
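A saved syslog can be checked for the four fault codes listed above with a few lines of script. The sketch below assumes that each fault message carries its code after a "SUNW-MSG-ID:" label, as in the sample message; the second sample line and its code are hypothetical and exist only to show a non-matching entry.

```python
import re

# The four PSH fault codes named in this alert.
SUSPECT_CODES = {
    "SUN4U-8000-XJ",
    "SUN4U-8000-YE",
    "SUN4U-8001-0J",
    "SUN4U-8001-1E",
}

# Assumes the "SUNW-MSG-ID: <code>," layout seen in the sample message.
MSG_ID_RE = re.compile(r"SUNW-MSG-ID:\s*([A-Z0-9]+-[A-Z0-9]+-[A-Z0-9]+)")


def suspect_fault_codes(lines):
    """Return the listed fault codes found in the given syslog lines, in order."""
    hits = []
    for line in lines:
        match = MSG_ID_RE.search(line)
        if match and match.group(1) in SUSPECT_CODES:
            hits.append(match.group(1))
    return hits


sample = [
    "Jul 18 17:33:17 2006 SUNW-MSG-ID: SUN4U-8001-1E, TYPE: Fault,",
    # Hypothetical unrelated message ID, not taken from this alert.
    "Jul 18 17:40:00 2006 SUNW-MSG-ID: XYZ-0000-AA, TYPE: Fault,",
]
codes = suspect_fault_codes(sample)
print(codes)
```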
Examine the "ereports" by either running "fmdump -e" on the system or looking at the "fmdump -e" output file, "fmdump-e.out", which is located in the "fma" directory of the explorer output (Explorer 5.4 and later). The "ereports" would be similar to the following:
Jun 29 06:16:07.3395 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 09 07:36:02.1984 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 09 14:56:40.0253 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 10 07:55:44.0387 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 10 13:16:00.3768 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 10 17:11:02.9757 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 10 23:36:03.4403 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 10 23:36:04.1830 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 11 08:41:46.0181 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 18 15:43:18.0820 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 18 15:43:26.6932 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 18 15:43:52.1065 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 18 16:33:29.2605 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 19 12:53:16.5410 ereport.cpu.ultraSPARC-IVplus.l3-thce
Jul 20 13:46:29.9825 ereport.cpu.ultraSPARC-IVplus.l3-thce
Note 1: These are all correctable errors as noted by the "ce" suffix. The "ce" in this example is one of a number of possible "ce" codes.
Note 2: These are all UltraSPARC IV+ CPUs.
Note 3: With the original parameters in place, the SERD (Soft Error Rate Discriminator) will fire and the CPU will be faulted if the diagnostic engine receives more than three "ereports" against the processor memory in 12 hours. This is incorrect.
If an "ereport" is more than 12 hours old, the SERD is allowed to expire and there is no indictment.
In the above case, the "ereports" would cause a SERD to start and expire several times. In particular, the listing contains a set of only four "ereports" within the twelve hours ending at Jul 18 16:33:29.2605.
This would then result in the above CPU fault message in "/var/adm/messages".
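The count of four "ereports" in the twelve hours ending at Jul 18 16:33:29.2605 can be checked with a simple window count over the listing above. This is a sketch of the arithmetic only, not the diagnostic engine's own algorithm. The listing does not show a year, so 2006 is assumed from the fault message; the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Timestamps copied from the "fmdump -e" listing in this alert.
EREPORT_TIMES = [
    "Jun 29 06:16:07.3395", "Jul 09 07:36:02.1984", "Jul 09 14:56:40.0253",
    "Jul 10 07:55:44.0387", "Jul 10 13:16:00.3768", "Jul 10 17:11:02.9757",
    "Jul 10 23:36:03.4403", "Jul 10 23:36:04.1830", "Jul 11 08:41:46.0181",
    "Jul 18 15:43:18.0820", "Jul 18 15:43:26.6932", "Jul 18 15:43:52.1065",
    "Jul 18 16:33:29.2605", "Jul 19 12:53:16.5410", "Jul 20 13:46:29.9825",
]


def parse(stamp, year=2006):
    """Parse a listing timestamp; the year is an assumption (not in the listing)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


def count_in_window(stamps, end, window=timedelta(hours=12)):
    """Count ereports whose time falls in the window ending at 'end' (inclusive)."""
    end_time = parse(end)
    start_time = end_time - window
    return sum(1 for s in stamps if start_time <= parse(s) <= end_time)


in_window = count_in_window(EREPORT_TIMES, "Jul 18 16:33:29.2605")
print(in_window)
```

Changing the "window" argument shows how the count responds to a different interval; the values actually used by the patched diagnostic engine are not stated in this alert.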
System panic issue:
If the system panic issue occurs, messages similar to the following may be seen in "/var/adm/messages" for repeating CEs, possibly in association with a panic:
Jul 20 18:25:38 domain1 SUNW,UltraSPARC-IV+: [ID 272585 kern.info]
NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID
0x000f0529.803f8104
Jul 20 18:26:10 domain1 SUNW,UltraSPARC-IV+: [ID 637063 kern.info]
NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID
0x000f0530.fd1f326c
Jul 20 18:26:37 domain1 SUNW,UltraSPARC-IV+: [ID 720762 kern.info]
NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID
0x000f0537.52a21d84
The particular correctable error reported may be something other than L3_THCE.
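To see whether one CPU is reporting the same event repeatedly, the [AFT0] notices can be tallied per CPU and event type. The sketch below assumes the "<event> Event detected by CPU<n>" wording of the sample messages; it counts every [AFT0] notice it matches and does not itself decide whether an event is correctable.

```python
import re
from collections import Counter

# Wording inferred from the sample [AFT0] notices in this alert (assumption).
AFT0_EVENT_RE = re.compile(
    r"\[AFT0\] (?P<event>\w+) Event detected by CPU(?P<cpu>\d+)"
)


def count_aft0_events(lines):
    """Tally [AFT0] notices keyed by (cpu_id, event_name)."""
    counts = Counter()
    for line in lines:
        match = AFT0_EVENT_RE.search(line)
        if match:
            counts[(int(match["cpu"]), match["event"])] += 1
    return counts


# The wrapped message lines quoted above; errID continuation lines do not match.
sample = [
    "NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID",
    "0x000f0529.803f8104",
    "NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID",
    "0x000f0530.fd1f326c",
    "NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID",
    "0x000f0537.52a21d84",
]
counts = count_aft0_events(sample)
for (cpu, event), total in counts.most_common():
    print(f"CPU{cpu} {event}: {total}")
```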
On Solaris 10 systems, correctable error floods are handled in a different manner, making them less likely to cause a panic. There may be no messages in "/var/adm/messages" indicating that correctable errors are occurring; however, there may be a large number of "ereports" indicating CPU correctable errors prior to the time of the panic.
4. Workaround
There is no workaround for these issues. Please see the "Resolution" section below.
5. Resolution
These issues are addressed as follows:
For CRs 6449180, 6361055, 6342696, 6380337, 6380531, 6369869:
- Solaris 9 with UltraSPARC IV+ processors with patch 122300-04 or later
- Solaris 10 with UltraSPARC IV+ processors with patches 119578-29 and 125369-03 or later
- Sun Fire 12K/15K/20K/25K with SMS 1.6 (for Solaris 9) with patch 123300-06 or later
- Sun Fire E6900/E4900/E2900/6800/4800/4810/3800 and V1280/V1290 with SCApp 5.20.0 patch 114527-01 or later
For CR 6587622:
- Solaris 9 with UltraSPARC IV+ processors with patch 122300-22 or later
For CR 6589208:
- Solaris 9 with UltraSPARC IV+ processors with patch 122300-28 or later
- Solaris 10 with UltraSPARC IV+ processors with patch 137111-02 or later
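The "or later" wording in the patch requirements above can be read as a revision comparison on the same base patch number. The following is a minimal sketch of that comparison under the assumption that a patch ID has the form "<base>-<revision>"; the installed patch IDs used in the example are hypothetical, and the sketch does not account for a patch being superseded by one with a different base number.

```python
def parse_patch_id(patch_id):
    """Split a patch ID such as '122300-28' into (base, revision)."""
    base, _, revision = patch_id.strip().partition("-")
    return base, int(revision)


def meets_minimum(installed, required):
    """True if 'installed' has the same base as 'required' and an equal or later revision."""
    installed_base, installed_rev = parse_patch_id(installed)
    required_base, required_rev = parse_patch_id(required)
    return installed_base == required_base and installed_rev >= required_rev


# Required levels come from this alert; installed levels are hypothetical.
print(meets_minimum("122300-28", "122300-22"))
print(meets_minimum("122300-03", "122300-04"))
```

On a real system the installed patch list would come from the operating system's own patch tooling; this sketch only covers the comparison step.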
Modification History
24-Jun-2008: Updated CR list, Contributing Factors and Resolution sections. Resolved.
Attachments
This solution has no attachment