Sun Fire Systems Equipped With UltraSPARC IV+ Processor Modules Running Solaris 9 or Solaris 10 May Exhibit Unnecessary CPU Offlining and Solaris Panics



Category: Availability
Release Phase: Resolved
Bug ID: 6342696, 6361055, 6369869, 6380337, 6449180, 6587622, 6380531, 6589208
Product: UltraSPARC IV+ Processor, UltraSPARC IV Processor
Date of Workaround Release: 07-Dec-2007
Date of Resolved Release: 24-Jun-2008

Systems equipped with UltraSPARC IV+ processor modules running Solaris 9 or Solaris 10 (see below for details)


1. Impact

Systems equipped with UltraSPARC IV+ processor modules running Solaris 9 or Solaris 10 may exhibit CPU offlining and Solaris panics. This may result in unnecessary part replacements.


2. Contributing Factors

These issues can occur in the following releases:

SPARC Platform

  • Solaris 9 with UltraSPARC IV+ processors without patch (see Resolution section)
  • Solaris 10 with UltraSPARC IV+ processors without patches (see Resolution section)
  • Sun Fire 12K/15K/20K/25K with SMS 1.6 (for Solaris 9) without patch (see Resolution section)
  • Sun Fire E6900/E4900/E2900/6800/4800/4810/3800 and V1280/V1290 with SCApp prior to version 5.20.0 (see Resolution section)

3. Symptoms

There are two issues described in this Sun Alert:

  1. Processor offlining
  2. System panics

The following symptoms may be seen for each issue:

Offlined processors on Solaris 9 Systems:

If the described issue occurs, messages similar to the following, reporting incorrectly faulted CPU modules, will appear in the "/var/adm/messages" file:

    Dec 26 04:20:28 domnam SUNW,UltraSPARC-IV+: [ID 610039 kern.info]
    NOTICE: [AFT0] WDC Event detected by CPU166 at TL=0, errID 0x00566baa.7f6f4b1a

    Dec 26 04:20:28 domnam     AFSR 0x00000040<WDC>.000001d1 AFSR_EXT 0x00000000
    AFAR 0x00000083.daa60990
    Dec 26 04:20:28 domnam     Fault_PC 0x100260980 Esynd 0x01d1

    Dec 26 04:20:28 domnam SUNW,UltraSPARC-IV+: [ID 695758 kern.info]
    [AFT0] errID 0x00566baa.7f6f4b1a Data Bit 115 was in error and corrected

    Dec 26 04:20:28 domnam SUNW,UltraSPARC-IV+: [ID 765675 kern.notice]
    NOTICE: [AFT1] CPU166 offlined due to more than 2 xxC Events in 24:00:00 (hh:mm:ss)

Note 1: The message states that the CPU was offlined for "more than 2 xxC Events in 24:00:00 (hh:mm:ss)"

Note 2: These are all UltraSPARC IV+ CPUs.

Note 3: With the original parameters in place, the SERD (Soft Error Rate Discriminator) will fire and the CPU will be faulted if the diagnostic engine receives as few as three errors against the processor memory in 24 hours.
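As a rough aid for spotting this symptom, the following Python sketch scans message lines for the [AFT1] offlining notice shown above and extracts the CPU number, the event limit and the time window. The regular expression is derived only from the sample message in this Sun Alert, and the function name `find_offlined_cpus` is hypothetical; other wording variants of the notice would not be matched.

```python
import re

# Pattern derived from the sample [AFT1] notice in this Sun Alert (assumption:
# other systems log the notice with the same wording).
OFFLINE_RE = re.compile(
    r"\[AFT1\] CPU(?P<cpu>\d+) offlined due to more than "
    r"(?P<limit>\d+) xxC Events in (?P<window>\d+:\d{2}:\d{2})"
)


def find_offlined_cpus(lines):
    """Return (cpu_id, event_limit, window) for each offlining notice found."""
    hits = []
    for line in lines:
        m = OFFLINE_RE.search(line)
        if m:
            hits.append((int(m.group("cpu")), int(m.group("limit")), m.group("window")))
    return hits


sample = [
    "NOTICE: [AFT0] WDC Event detected by CPU166 at TL=0, errID 0x00566baa.7f6f4b1a",
    "NOTICE: [AFT1] CPU166 offlined due to more than 2 xxC Events in 24:00:00 (hh:mm:ss)",
]
print(find_offlined_cpus(sample))  # [(166, 2, '24:00:00')]
```

In practice the `lines` argument could be the lines of a copy of "/var/adm/messages"; a match only indicates that a CPU was offlined, not whether the offlining was unnecessary.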

Offlined processors on Solaris 10 systems:

If the described issue occurs, messages similar to the following, reporting incorrectly faulted CPU modules, will appear in the "/var/adm/messages" file:

    Jul 18 17:33:17 2006 SUNW-MSG-ID: SUN4U-8001-1E, TYPE: Fault,
    VER: 1, SEVERITY: Major
    Jul 18 17:33:17 2006 EVENT-TIME: Tue Jul 18 16:33:29 CDT 2006

Any of the following fault codes may appear in syslog indicating that a CPU failure has occurred:

    SUN4U-8000-XJ
    SUN4U-8000-YE
    SUN4U-8001-0J
    SUN4U-8001-1E

Systems experience a CPU retirement associated with one of the above Predictive Self Healing (PSH) fault codes.
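A minimal Python sketch for checking syslog text against the four fault codes listed above follows. It assumes the "SUNW-MSG-ID:" field looks like the sample message in this Sun Alert; the function name `matching_fault_codes` is hypothetical.

```python
import re

# The four PSH fault codes named in this Sun Alert.
FAULT_CODES = {"SUN4U-8000-XJ", "SUN4U-8000-YE", "SUN4U-8001-0J", "SUN4U-8001-1E"}

# Assumption: the message ID always follows "SUNW-MSG-ID:" as in the sample above.
MSG_ID_RE = re.compile(r"SUNW-MSG-ID:\s*([A-Z0-9]+-[A-Z0-9]+-[A-Z0-9]+)")


def matching_fault_codes(lines):
    """Return, in order of appearance, the listed fault codes found in the lines."""
    found = []
    for line in lines:
        m = MSG_ID_RE.search(line)
        if m and m.group(1) in FAULT_CODES:
            found.append(m.group(1))
    return found


sample = [
    "Jul 18 17:33:17 2006 SUNW-MSG-ID: SUN4U-8001-1E, TYPE: Fault,",
    "VER: 1, SEVERITY: Major",
]
print(matching_fault_codes(sample))  # ['SUN4U-8001-1E']
```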

Examine the "ereports" by either running "fmdump -e" on the system or looking at the "fmdump -e" output file, "fmdump-e.out", which is located in the "fma" directory of the explorer output (Explorer 5.4 and later). The "ereports" would be similar to the following:

    Jun 29 06:16:07.3395 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 09 07:36:02.1984 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 09 14:56:40.0253 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 10 07:55:44.0387 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 10 13:16:00.3768 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 10 17:11:02.9757 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 10 23:36:03.4403 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 10 23:36:04.1830 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 11 08:41:46.0181 ereport.cpu.ultraSPARC-IVplus.l3-thce

    Jul 18 15:43:18.0820 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 18 15:43:26.6932 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 18 15:43:52.1065 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 18 16:33:29.2605 ereport.cpu.ultraSPARC-IVplus.l3-thce

    Jul 19 12:53:16.5410 ereport.cpu.ultraSPARC-IVplus.l3-thce
    Jul 20 13:46:29.9825 ereport.cpu.ultraSPARC-IVplus.l3-thce

Note 1: These are all correctable errors as noted by the "ce" suffix. The "ce" in this example is one of a number of possible "ce" codes.

Note 2: These are all UltraSPARC IV+ CPUs.

Note 3: With the original parameters in place, the SERD (Soft Error Rate Discriminator) will fire and the CPU will be faulted if the diagnostic engine receives more than three "ereports" against the processor memory in 12 hours. Faulting the CPU on this basis is incorrect.

If an "ereport" is more than 12 hours old, the SERD is allowed to expire and there is no indictment.

In the above case, the "ereports" would cause a SERD to start and expire several times. In particular, the listing contains a set of only four "ereports" within twelve hours, ending at Jul 18 16:33:29.2605.

This would then result in the above CPU fault message in "/var/adm/messages".
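The windowing described in Note 3 can be illustrated with a small Python sketch. This is not the diagnostic engine's actual implementation: it is a simple sliding-window count, assuming a threshold of "more than three" events in 12 hours as stated above, and assuming the window is cleared after firing. The names `parse_ereport_time` and `serd_fire_times` are hypothetical, the year 2006 is taken from the fault message above, and the input is limited to the four Jul 18 "ereports" that the text singles out (assumed here to refer to the same CPU).

```python
from collections import deque
from datetime import datetime, timedelta


def parse_ereport_time(stamp, year=2006):
    """Parse a timestamp such as 'Jul 18 15:43:18.0820' (year is an assumption)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


def serd_fire_times(times, threshold=3, window=timedelta(hours=12)):
    """Return the times at which more than `threshold` events fall inside `window`."""
    recent = deque()
    fired = []
    for t in sorted(times):
        recent.append(t)
        # Drop events that have aged out of the window.
        while t - recent[0] > window:
            recent.popleft()
        if len(recent) > threshold:
            fired.append(t)
            recent.clear()  # assumption: counting starts over after firing
    return fired


stamps = [
    "Jul 18 15:43:18.0820",
    "Jul 18 15:43:26.6932",
    "Jul 18 15:43:52.1065",
    "Jul 18 16:33:29.2605",
]
fired = serd_fire_times([parse_ereport_time(s) for s in stamps])
print(fired)  # one entry, at the 16:33:29.2605 ereport
```

With these inputs the sketch fires once, at the fourth "ereport", which matches the EVENT-TIME of the fault message shown earlier.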

System panic issue:

If the system panic issue occurs, messages similar to the following may be seen in "/var/adm/messages" for repeating CEs, possibly in association with a panic:

    Jul 20 18:25:38 domain1 SUNW,UltraSPARC-IV+: [ID 272585   kern.info]
    NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID
    0x000f0529.803f8104
    Jul 20 18:26:10 domain1 SUNW,UltraSPARC-IV+: [ID 637063 kern.info]
    NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID
    0x000f0530.fd1f326c
    Jul 20 18:26:37 domain1 SUNW,UltraSPARC-IV+: [ID 720762 kern.info]
    NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID
    0x000f0537.52a21d84

The particular correctable error reported may be something other than L3_THCE.
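To judge whether CEs are repeating against one processor, the [AFT0] notices can be tallied per CPU and event type. The Python sketch below assumes the notice wording shown in the sample messages; the function name `count_aft0_events` is hypothetical, and the pattern does not itself distinguish correctable from other [AFT0] events.

```python
import re
from collections import Counter

# Pattern derived from the sample [AFT0] notices in this Sun Alert.
AFT0_EVENT_RE = re.compile(
    r"\[AFT0\] (?P<event>\w+) Event detected by CPU(?P<cpu>\d+)"
)


def count_aft0_events(lines):
    """Count [AFT0] event notices, keyed by (cpu_id, event_type)."""
    counts = Counter()
    for line in lines:
        m = AFT0_EVENT_RE.search(line)
        if m:
            counts[(int(m.group("cpu")), m.group("event"))] += 1
    return counts


sample = [
    "NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID",
    "0x000f0529.803f8104",
    "NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID",
    "0x000f0530.fd1f326c",
    "NOTICE: [AFT0] L3_THCE Event detected by CPU3 at TL=0, errID",
    "0x000f0537.52a21d84",
]
print(count_aft0_events(sample))  # Counter({(3, 'L3_THCE'): 3})
```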

On Solaris 10 systems, the system deals with correctable error floods in a different manner, making them less likely to cause a panic. There may be no messages in "/var/adm/messages" indicating that correctable errors are occurring; however, there may be a large number of "ereports" indicating CPU correctable errors prior to the time of the panic.


4. Workaround

There is no workaround for these issues. Please see the "Resolution" section below.


5. Resolution

These issues are addressed as follows:

For CRs 6449180, 6361055, 6342696, 6380337, 6380531, 6369869:

  • Solaris 9 with UltraSPARC IV+ processors with patch 122300-04 or later
  • Solaris 10 with UltraSPARC IV+ processors with patches 119578-29 and 125369-03 or later
  • Sun Fire 12K/15K/20K/25K with SMS 1.6 (for Solaris 9) with patch 123300-06 or later
  • Sun Fire E6900/E4900/E2900/6800/4800/4810/3800 and V1280/V1290 with SCApp 5.20.0 patch 114527-01 or later

For CR 6587622:

  • Solaris 9 with UltraSPARC IV+ processors with patch 122300-22 or later

For CR 6589208:

  • Solaris 9 with UltraSPARC IV+ processors with patch 122300-28 or later
  • Solaris 10 with UltraSPARC IV+ processors with patch 137111-02 or later
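The "or later" wording above amounts to comparing patch revisions for the same base patch ID. A minimal Python sketch of that comparison follows. The function names are hypothetical and the installed patch lists are made-up inputs; how the list is gathered (for example from `showrev -p` output) is left to the reader. The sketch also ignores the case where a patch has been obsoleted by a patch with a different base ID.

```python
def parse_patch(patch):
    """Split a patch ID such as '122300-28' into ('122300', 28)."""
    base, rev = patch.strip().split("-")
    return base, int(rev)


def meets_minimum(installed, minimum):
    """True if an installed patch has the same base ID and revision >= minimum."""
    min_base, min_rev = parse_patch(minimum)
    return any(
        base == min_base and rev >= min_rev
        for base, rev in map(parse_patch, installed)
    )


# Hypothetical installed patch lists, for illustration only.
print(meets_minimum(["122300-28", "119578-29"], "122300-04"))  # True
print(meets_minimum(["122300-03"], "122300-04"))               # False
```

For the Solaris 10 case covering the first group of CRs, both 119578-29 and 125369-03 would each need to pass this check.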


Modification History

24-Jun-2008: Updated CR list, Contributing Factors and Resolution sections. Resolved.





Article Details
Article ID : 200634
Article Type : Sun Alert
Last reviewed : 2008-06-24
Audience : PUBLIC