Power Cycling an Enterprise 10000 Server Domain May Cause hpost(1M) to Erroneously Fail Some Resources
Category: Availability
Release Phase: Resolved
Product: Sun Enterprise 10000 Server
Bug Id: 4310528
Date of Resolved Release: 26-AUG-2004
Impact
On an Enterprise 10000 Server running System Service Processor (SSP) 3.3, 3.4, or 3.5, hpost(1M) may erroneously fail I/O controllers (IOCs) or report processor timeouts during xcall testing when a domain is brought up after being powered off.
Contributing Factors
This issue can occur in the following releases:
SPARC Platform
- SSP 3.3 (for Solaris 2.6, 7, 8) without patch 108885-11
- SSP 3.4 (for Solaris 2.6, 7, 8) without patch 110304-07
- SSP 3.5 (for Solaris 7 and 8) without patch 110498-02
Symptoms
If this issue occurs, the failure that hpost(1M) reports depends on the hpost(1M) level that is run. The hpost(1M) phases run in a fixed order (xcall, ... io, ... final_config), so the level determines which phase encounters the failure first:
hpost(1M) levels and corresponding symptoms
-------------------------------------------------------------------------
hpost phase          level7 to 15     level16 to 23     level24 or higher
-------------------------------------------------------------------------
phase xcall          Not run          Not run           Proc time outs
phase io             Not run          FAIL IOCs         *
phase final_config   FAIL IOCs        *                 *
-------------------------------------------------------------------------
* = Subsequent FAILURE mode is indeterminate.
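The mapping in the table above can be sketched as a small lookup. This is only an illustration of the table, not part of hpost(1M) or the SSP software; the function name first_failing_phase is hypothetical, and levels below 7 are treated as "not covered" because the table does not list them.

```python
from typing import Optional

# (lowest level, highest level or None for "or higher", phase that fails first).
# The ranges and phase names are copied from the symptom table.
_FIRST_FAILING_PHASE = [
    (7, 15, "final_config"),
    (16, 23, "io"),
    (24, None, "xcall"),
]


def first_failing_phase(level: int) -> Optional[str]:
    """Return the phase expected to show the failure first at this hpost level.

    Hypothetical helper: returns None for levels the table does not cover.
    """
    for low, high, phase in _FIRST_FAILING_PHASE:
        if level >= low and (high is None or level <= high):
            return phase
    return None


if __name__ == "__main__":
    for lvl in (7, 16, 24):
        print(lvl, first_failing_phase(lvl))
```

For example, at the default level of 16 the lookup returns the io phase, which matches the "phase io" example output in this document.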
The actual hpost(1M) console output (or hpost(1M) log entries) for the above failures is as follows:
Example failure for "phase xcall" (hpost level24 or higher):
phase xcall: Interprocessor interrupt tests...
Proc 3.0 timed out on test xcall interrupt vs. proc 3.2 id=0x2C. Test Failed.
Proc 3.2 timed out on test xcall interrupt vs. proc 3.0 id=0x2C. Test Failed.
Arbstop/Recordstop/Timeout recovery (1); rerun starting at:
phase xcall: Interprocessor interrupt tests...
Example failure for "phase io" (hpost level16 [default]):
phase io: I/O controller tests...
{0.0} ERROR: SUBTEST=SYSIO <-> IOPC Synchronization ID=31.2
{0.0} Expected interrupt did not occur,
{0.0} Interrupt Register Address 00000108.00003808
{0.0} testing interrupt number(INO) 21
{0.0} SYSIO Master Sync 2 error:
FAIL IOC 0.0 in all configs: SYSIO test failed
{0.0} ERROR: SUBTEST=SYSIO <-> IOPC Synchronization ID=31.2
{0.0} Expected interrupt did not occur,
{0.0} Interrupt Register Address 0000010a.00003808
{0.0} testing interrupt number(INO) 21
{0.0} SYSIO Interrupt MID Error
{0.0} Expected: 0x1
{0.0} Received: 0x0
{0.0} XOR: 0x1
{0.0} SYSIO Master Sync 2 error:
FAIL IOC 0.1 in all configs: SYSIO test failed
Example failure for "phase final_config" (hpost level7):
phase final_config: Final configuration...
Configuring in 3F, FOM = 67584.00: 12 procs, 8 Scards, 5632 MBytes.
{0.0} Expected interrupt did not occur,
{0.0} Interrupt Register Address 0000011a.00003808
{0.0} testing interrupt number(INO) 21
{0.0} SYSIO Interrupt MID Error
{0.0} Expected: 0x3
{0.0} Received: 0x0
{0.0} XOR: 0x3
{0.0} SYSIO Master Sync 2 error:
{0.0} SYSIO Interrupt MID Error
{0.0} Expected: 0x3
{0.0} Received: 0x0
{0.0} XOR: 0x3
{0.0} SYSIO Master Sync 2 error:
{0.0} *** Error in SYSIO 0xd master sync (2 retries)
FAIL IOC 1.1 in config 3F: Initialization failure.
Note: An immediate rerun of hpost(1M) will most likely not encounter the above failures again. However, the problem will recur intermittently in future hpost(1M) runs until the hpost(1M) patch is applied to the SSP.
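As a rough aid for checking saved hpost(1M) output against the examples above, the two signature lines ("Proc ... timed out on test xcall interrupt" and "FAIL IOC ...") can be picked out with a short script. This is a minimal sketch: the patterns are derived only from the sample output in this document, the function name scan_hpost_output is hypothetical, and a match does not by itself prove that this bug is the cause, since genuine hardware faults may produce the same lines.

```python
import re

# Patterns based solely on the example output shown in this document.
XCALL_TIMEOUT = re.compile(r"Proc (\d+\.\d+) timed out on test xcall interrupt")
IOC_FAIL = re.compile(r"FAIL IOC (\d+\.\d+) in (?:all configs|config \w+):")


def scan_hpost_output(text):
    """Collect the processors and IOCs named in matching failure lines."""
    procs, iocs = [], []
    for line in text.splitlines():
        match = XCALL_TIMEOUT.search(line)
        if match:
            procs.append(match.group(1))
            continue
        match = IOC_FAIL.search(line)
        if match:
            iocs.append(match.group(1))
    return {"xcall_timeouts": procs, "failed_iocs": iocs}


if __name__ == "__main__":
    sample = (
        "phase io: I/O controller tests...\n"
        "FAIL IOC 0.0 in all configs: SYSIO test failed\n"
    )
    print(scan_hpost_output(sample))
```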
Workaround
There is no workaround. Please see the "Resolution" section below.
Resolution
This issue is addressed in the following releases:
SPARC Platform
- SSP 3.3 (for Solaris 2.6, 7, 8) with patch 108885-11 or later
- SSP 3.4 (for Solaris 2.6, 7, 8) with patch 110304-07 or later
- SSP 3.5 (for Solaris 7 and 8) with patch 110498-02 or later
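The "or later" condition in the list above amounts to comparing the revision suffix of the matching patch ID. The following is a minimal sketch of that comparison, assuming patch IDs are written as base-revision (for example 110304-07) and that the list of installed patch IDs has already been gathered by some other means; the function name has_fix and its inputs are hypothetical.

```python
# Minimum fixed patch per SSP release, taken from the Resolution list.
FIXED_PATCHES = {
    "3.3": ("108885", 11),
    "3.4": ("110304", 7),
    "3.5": ("110498", 2),
}


def has_fix(ssp_version, installed_patches):
    """Return True if any installed patch meets the minimum revision.

    installed_patches is an iterable of strings such as "110304-07"
    (an assumed input format, supplied by the caller).
    """
    if ssp_version not in FIXED_PATCHES:
        raise ValueError("SSP release not listed in this document: %s" % ssp_version)
    base, min_rev = FIXED_PATCHES[ssp_version]
    for patch in installed_patches:
        patch_base, sep, rev = patch.strip().partition("-")
        if sep and patch_base == base and rev.isdigit() and int(rev) >= min_rev:
            return True
    return False


if __name__ == "__main__":
    print(has_fix("3.4", ["110304-06"]))
    print(has_fix("3.4", ["110304-07"]))
```

Comparing the revision as an integer rather than as a string keeps the check correct if a revision is ever written without a leading zero.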