Use of cfgadm(1M) on Certain Systems May Cause Domain Outage, Reporting "L2CheckError"

Category: Availability
Release Phase: Resolved
Product: Sun Fire 3800 Server, Sun Fire 4800 Server, Sun Fire 4810 Server, Sun Fire 6800 Server, Sun Fire E6900 Server, Sun Fire E2900 Server, Sun Fire V1280 Server, Sun Fire E4900 Server
Bug Id: 6300392
Date of Workaround Release: 05-AUG-2005
Date of Resolved Release: 10-FEB-2006

Impact
Use of the cfgadm(1M) command can trigger a domain outage that is reported as an "L2CheckError". The resulting domain pause causes a loss of application availability and may be misdiagnosed as a hardware fault, leading to unnecessary hardware replacement.
Contributing Factors
This issue can occur on the following platforms:
- Sun Fire 3800, 4800, 4810, E2900, E4900, 6800, E6900 and V1280 systems without ScApp firmware 5.19.8 or 5.20.3 (as delivered in patches 114526-09 and 114527-04).
Notes:
- This issue may occur on the systems listed above running Solaris 8, 9 or 10. Solaris 7 does not support the x800/x900 series of Sun Fire Systems.
- This issue will only occur on systems configured for Dynamic Reconfiguration (DR).
An example of cfgadm(1M) use that can cause this condition is the configuration of a system board:
# cfgadm -c configure N0.SB2
(see error messages generated in "Symptoms" section)
To determine the version of ScApp on a system, the following command can be run (from the platform shell):
sc0:SC> showsc
...
ScApp version: 5.19.4 Build_01
RTOS version: 45
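The version check above can be sketched in Python. This is a minimal illustration, not a Sun-supplied tool: the function names are hypothetical, the input is assumed to be "showsc" output captured as text, and the treatment of release lines other than 5.19 and 5.20 is an assumption that this Sun Alert does not confirm.

```python
import re

# Fixed ScApp releases named in this Sun Alert, keyed by (major, minor) line.
FIXED = {(5, 19): 8, (5, 20): 3}

def parse_scapp_version(showsc_output):
    """Return (major, minor, micro) from a 'ScApp version:' line, or None."""
    m = re.search(r"ScApp version:\s*(\d+)\.(\d+)\.(\d+)", showsc_output)
    return tuple(int(p) for p in m.groups()) if m else None

def has_fix(version):
    """True if the version is at or above the fixed release for its line.

    Assumption (not stated in the alert): release lines later than 5.20
    include the fix, and lines earlier than 5.19 do not.
    """
    major, minor, micro = version
    line = (major, minor)
    if line in FIXED:
        return micro >= FIXED[line]
    return line > max(FIXED)

sample = """ScApp version: 5.19.4 Build_01
RTOS version: 45"""
version = parse_scapp_version(sample)
print(version, has_fix(version))
```

For the sample output shown above (ScApp 5.19.4), the check reports that the fix is not present.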
Symptoms
Output from the "showerrorbuffer" command will display captured error messages similar to the following:
ErrorData[19]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /SSC0/sbbc0/systemepld
Register: FirstError[0x10] : 0x0800
SB2 encountered the first error
ErrorData[20]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/bbcGroup0/repeaterepld
Register: FirstError[0x10]: 0x0001
ar0 encountered the first error
ErrorData[21]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/ar0
ErrorID: 0x10221fff
Register: L2CheckError[0x6150] : 0x00001e00
CMDVSyncErr [12:09] : 0xf Ports [9:6] command valid mismatched
against internal expected command valid
ErrorData[22]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/ar0
ErrorID: 0x10221fff
Register: L2CheckError[0x6150] : 0x0000001e
PreqSyncErr [04:01] : 0xf Ports [9:6] prereq mismatched
against internal expected prereq
ErrorData[23]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/ar0
ErrorID: 0x10221fff
Register: L2CheckError[0x6150] : 0x1e000000
AccCMDVSyncErr [28:25] : 0xf accumulated valid command mismatch
ErrorData[24]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/ar0
ErrorID: 0x10221fff
Register: L2CheckError[0x6150] : 0x001e0000
AccPreqSyncErr [20:17] : 0xf accumulated prerequisite mismatch
Output from the "showlogs -d <domain name>" command for the same error will be similar to the following:
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 427805 local0.crit] ErrorMonitor:
Domain A has a SYSTEM ERROR
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 924577 local0.error] /N0/SB2
encountered the first error
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 175522 local0.error] ArAsic
reported first error on /N0/SB2
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 653352 local0.error]
/partition0/domain0/SB2/ar0:
>>>>>> L2CheckError[0x6150] : 0x1e1e9e1e
CMDVSyncErr [12:09] : 0xf Ports [9:6] command valid mismatched against
internal expected command valid
PreqSyncErr [04:01] : 0xf Ports [9:6] prereq mismatched against
internal expected prereq
AccCMDVSyncErr [28:25] : 0xf accumulated valid command mismatch
FE [15:15] : 0x1
AccPreqSyncErr [20:17] : 0xf accumulated prerequisite mismatch
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 250001 local0.error]
[AD] Event: SF4800
CSN: 229H2199 DomainID: A ADInfo: 1.SCAPP.15.4
Time: Mon Jun 13 20:55:01 GMT-07:00 2005
FRU-List-Count: 0; FRU-PN: ; FRU-SN: ; FRU-LOC: UNRESOLVED
Recommended-Action: Service action required
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 253130 local0.crit] Domain A is
currently paused due to an error.
This domain must be turned off via "setkeyswitch off" to recover
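To help recognize this signature, the combined L2CheckError value in the log excerpt can be decoded into the same fields the system controller prints. The following Python sketch uses only the bit ranges that appear in the messages above. The function name is hypothetical, and any other bits in the register are not described in this Sun Alert and are ignored here.

```python
# Bit fields of L2CheckError[0x6150] as they appear in the log excerpts:
# name -> (high bit, low bit), both inclusive.
FIELDS = {
    "AccCMDVSyncErr": (28, 25),
    "AccPreqSyncErr": (20, 17),
    "FE": (15, 15),
    "CMDVSyncErr": (12, 9),
    "PreqSyncErr": (4, 1),
}

def decode_l2_check_error(value):
    """Return {field name: field value} for every nonzero field."""
    decoded = {}
    for name, (high, low) in FIELDS.items():
        width = high - low + 1
        field = (value >> low) & ((1 << width) - 1)
        if field:
            decoded[name] = field
    return decoded

# Combined register value reported by "showlogs" in the example.
for name, field in decode_l2_check_error(0x1E1E9E1E).items():
    print(f"{name}: {field:#x}")
```

Decoding 0x1e1e9e1e yields 0xf for each of the four sync-error fields and 0x1 for FE, matching the "showlogs" output. The individual "showerrorbuffer" entries (for example 0x00001e00) each decode to a single field.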
Workaround
To work around the described issue, use one of the following two options:
a) Reboot the main system controller
or:
b) Manually fail over the main system controller
Details on failing over a system controller are beyond the scope of this Sun Alert; see the "Sun Fire Midrange Systems Platform Administration Manual" (#817-2971-10) at http://docs.sun.com/app/docs?q=817-2971-10.
Resolution
This issue is addressed on the following platforms:
- Sun Fire 3800, 4800, 4810, E2900, E4900, 6800, E6900 and V1280 systems with ScApp firmware 5.19.8 or 5.20.3 (as delivered in patches 114526-09 or later and 114527-04 or later)
Modification History
25-Oct-2005:
- Updated Contributing Factors
10-Feb-2006:
- Updated Impact, Contributing Factors and Resolution sections; re-released as Resolved
05-Dec-2006:
- Updated Contributing Factors and Resolution sections
Attachments
This solution has no attachment