Generic x86 MCA Error Philosophy
Published Revision History
| Version | Date | Description |
|---|---|---|
| 1.11 | 07/09/12 | |
| 1.7 | 07/07/30 | First push to OpenSolaris FM Community Website |
Introduction
We describe the Solaris error-handling philosophy for generic (or
"architectural") machine-check architecture (MCA) errors and for
vendor-common machine-specific (non-architectural) extensions.
This will be implemented in an upcoming putback to Nevada, and backported
aiming for Solaris 10 Update 5.
In [Intel_vol3A] and
[AMD_vol2] a generic machine-check
architecture is described. This provides for a generic means of
discovering and enabling the error-detector banks of a processor, and
for collecting error telemetry from these banks and interpreting the
impact of the observed error - whether processor context is corrupt etc.
We will treat [Intel_vol3A] as the
specification for the MCA since it
nicely separates what is architectural (defined and
available as part of the MCA) and what is model-specific.
Between [AMD_vol2] and various model-specific
AMD BKDG (BIOS & Kernel
Developer's Guide) volumes one can see that the AMD MCA implementations do
adhere to the Intel architectural aspects of the MCA, perhaps with some
small deviations but remaining compatible for our purposes.
The generic architecture
also provides a limited error classification mechanism that allows
individual observations to be interpreted in a generic fashion for
aspects such as error source - within chip or off chip, within chip
cache hierarchy and which cache, etc. Model-specific documentation
such as
[AMD_BKDG_K8F] may provide detail that
can classify
an error with much more resolution, such as refining "corrected
error from bus/interconnect for a memory read access" to
"corrected multiple-bit ChipKill-correctable ECC from main memory,
data bits 4 and 6 were bad and corrected"; AMD provides this detail
in the public releases of its BKDG, but Intel does not publicly
document error types in such detail.
Capturing all model-specific error interpretation is a code-intensive
task. Until such support exists in Solaris for a particular cpu model
the baseline MCA and FMA support for that model is provided by the
generic x86 MCA cpu module cpu.generic. The present document
serves as a specification for the required behaviour of cpu.generic
for it to implement a respectable baseline level of MCA support for all
x86 cpu models supporting MCA.
The cpu.generic cpu module must limit itself solely to MCA
architectural aspects, so that it applies to any x86 cpu model that
conforms to the MCA (defined as indicating both MCE and MCA features
in CPUID information). It must also provide a means to call into
model-specific support (if present) to augment its capabilities such that
model-specific support can be achieved without having to write whole new
cpu modules - simply layering a new model-specific plugin on top
of the canonical cpu.generic should be adequate for most cases.
| CPU Module | Description |
|---|---|
| cpu.generic | Default cpu module which applies if no more model-specific cpu module exists or chooses to initialize for a given cpu model. The expectation is that, with model-specific support layering on top of it, cpu.generic will be the only cpu module delivered in Solaris for the foreseeable future. |
| cpu.AuthenticAMD.15 | CPU module for AMD family 15. This module was introduced with the original FMA/x64 project putback in which it had full FMA support while cpu.generic had very little FMA support. It is expected that this project to improve cpu.generic will eliminate this deliverable and recast the model-specific support it provides as a model-specific plugin. |
CPU Module Implementations
| Model-specific plugin | Description |
|---|---|
| cpu_ms.AuthenticAMD.15 | Detailed model-specific support for AMD family 15 (K8 revisions A through G) |
| cpu_ms.AuthenticAMD.15.65.2 | Detailed model-specific support for a particular AMD family 15 cpu with model 65 and stepping 2. It is generally expected that no or few such fine-grained modules will be delivered - the family 15 module can special-case particular models where need be. |
| cpu_ms.AuthenticAMD | "Generic AMD" model-specific support. Such a module may make use of AMD-specific MCA aspects, typically picking on higher-level features that are common to all existing or current AMD processor families. For example, K8 revision F introduced error thresholding registers which are continued in the next family so such support could live here. |
| cpu_ms.GenuineIntel.6 | Detailed model-specific support for Intel family 6 (all models and steppings, unless a more-specific module exists for a given cpu in which case that is tried first). |
| cpu_ms.GenuineIntel | "Generic Intel" model-specific support. |
Some possible CPU Model-Specific Module
Implementations
At most one model-specific plugin should ever augment cpu.generic
for a particular chip instance - the most-specific module that exists
and matches the vendor/family/model/stepping combination
and which chooses to initialize for that cpu. The
cpu.generic module should be able to
stand alone with no model-specific support, but the intention is that where no
fine-grained model-specific support yet exists for a chip at least
cpu_ms.AuthenticAMD or cpu_ms.GenuineIntel should assist with
some broad vendor-common model-specific aspects. A vendor-common
module such as cpu_ms.GenuineIntel may choose to provide
full model-specific support for some cpu models; it will require some
internal model detection code to tune its behaviour to the particular
model it finds itself running on.
Scope
This document will detail the required error-handling aspects of the
cpu.generic cpu module together with those of the
vendor-common model-specific
modules cpu_ms.AuthenticAMD and cpu_ms.GenuineIntel.
It will not document error-handling aspects of more-detailed
model-specific support such as that for cpu_ms.AuthenticAMD.15.
We will begin by describing MCA initialization - discovering detector
banks, enabling them, and choosing which error types detected by that
bank are enabled for machine-check exception (#MC).
Next we detail our error event response - what action, if any,
Solaris should take in addition to logging the error for diagnosis.
Examples are panic, reboot, contract kill, cache flush etc.
Finally we will detail all error types that can be detected, and what error
report classes will be used for these
errors. We will also detail the ereport payload included for each
error report.
We will not document diagnosis algorithms here - they will be covered
in a separate document. We are concerned only with raising the
telemetry upon which the diagnosis engines will operate.
We will also not be documenting the cpu module interface API, nor
the model-specific cpu module interface which layers on top of it.
These will be documented elsewhere, with the implementation of the
requirements laid out in the present document. For our purposes all we
need know is that the overall model is that the cpu module implementation that
has initialized for a given cpu instance (usually cpu.generic)
performs the bulk of the work and for each cpu module API member there
is typically a corresponding model-specific API member which is called
from within the former to perform additional, model-specific actions.
A Word On Virtualization
In addition to running natively on x86 hardware, Solaris may find itself
as a dom0 to a Solaris xVM hypervisor or a domU for some
potentially unknown hypervisor.
For the Solaris xVM dom0 case, a generic MCA implementation must be
designed for the case in which it is a privileged domain above a hypervisor -
i.e., the design must easily extend to reusing most code between native
and dom0 contexts.
For the domU case we may be paravirtualised or running under full
hardware virtualisation - Solaris may be unaware that it does not
own the hardware. An implementation must be sure to fail safely in those
cases in which the hardware does not appear to behave as it would natively.
For example we should not assume the presence of MCA/MCE support from
cpu model information alone (i.e., "we know AMD Opteron supports MCA/MCE")
but must use CPUID information since the hypervisor may choose to mask
some features that would normally be visible via CPUID. Similarly we may
discover apparent MCA/MCE support via CPUID, but should be prepared for
behaviour such as MCG_CAP indicating zero MCA
banks or all MCi_{CTL,STATUS,ADDR,MISC}
appearing to be read-only or read-as-zero.
A particular design pitfall to avoid is the assumption that every
Solaris logical cpu (as listed by psrinfo) corresponds to a unique
set of bank control/status/address/miscellaneous registers. This is not
the case even when running natively, since in stranded/hyperthreaded
designs the individual strands of a single core will typically share
the MCA banks of that core. In the virtualised case the virtual cpus
presented to a domain may have little or no correspondence to the real
cpus, and any correlation need not be fixed (e.g., a single virtual
cpu may be mapped onto different real cpus at different times according
to the scheduling choices of the hypervisor).
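The CPUID-first rule above can be sketched as follows. This is a minimal illustration, not the cpu.generic implementation: the helper names mca_cpuid_ok and mca_usable_banks are hypothetical, and the caller is assumed to have already executed CPUID leaf 1 and read IA32_MCG_CAP. The bit positions used (MCE is EDX bit 7, MCA is EDX bit 14, the bank count is MCG_CAP bits 7:0) are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, EDX feature bits (architectural). */
#define CPUID_EDX_MCE       (1u << 7)   /* machine-check exception */
#define CPUID_EDX_MCA       (1u << 14)  /* machine-check architecture */

/* IA32_MCG_CAP: bits 7:0 hold the number of MCA banks. */
#define MCG_CAP_COUNT_MASK  0xffull

/*
 * Hypothetical helper: trust only CPUID, never the cpu model name,
 * since a hypervisor may mask features a native cpu would report.
 */
static bool
mca_cpuid_ok(uint32_t cpuid1_edx)
{
	uint32_t need = CPUID_EDX_MCE | CPUID_EDX_MCA;

	return ((cpuid1_edx & need) == need);
}

/*
 * Hypothetical helper: number of banks we may touch. Returns 0 both
 * when CPUID hides MCA/MCE and when MCG_CAP reports zero banks, so
 * the caller fails safely in either case.
 */
static unsigned
mca_usable_banks(uint32_t cpuid1_edx, uint64_t mcg_cap)
{
	if (!mca_cpuid_ok(cpuid1_edx))
		return (0);
	return ((unsigned)(mcg_cap & MCG_CAP_COUNT_MASK));
}
```

A caller that gets 0 back would simply skip MCA initialization for that cpu instance.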
MCA Initialization
The implementation should comply with 14.6 of
[Intel_vol3A] and 9.4
of [AMD_vol2]: we should enable all detector banks in
IA32_MCG_CTL if that register is present,
and in IA32_MCi_CTL enable machine-check exception
for all error types detected by each MCA bank, allowing for a couple
of special cases.
The cpu module interface must only initiate MCA initialization for
cpu instances for which CPUID information indicates both MCA
(machine-check architecture) and MCE (machine-check exception)
support. We therefore exclude some old cpu models from the start
(e.g., AMD K6 and Intel Pentium),
but all recent AMD and Intel processors support these features.
We must initialize via the following steps:
-
Read the IA32_MCG_CAP (MCA global
capabilities register) and note whether it indicates
that the IA32_MCG_CTL is present,
and how many MCA banks exist for this processor.
If IA32_MCG_CTL is present,
initialize it in a later step to enable MCA features -
this
is a deviation from [Intel_vol3A],
however it seems desirable to
initialize the individual banks before we enable all detectors.
It is not an error for IA32_MCG_CTL to be
absent - other initialization below must be performed even when it
is absent. Terminate without further initialization if the number of
MCA banks indicated is zero.
-
If IA32_MCG_CTL is present, write 0 to
it to disable all MCA features during configuration below.
This is a deviation from
[Intel_vol3A].
-
Initialize MCA banks from bank 0 onwards (total number as per
IA32_MCG_CAP.Count). For each
bank that we choose to initialize (bank 0 may be skipped - see
the next paragraph) we are inclined to write all 1's to
IA32_MCi_CTL, however
any model-specific support should be allowed to provide another
value. Write 0 to IA32_MCi_STATUS
unless model-specific support asks that we omit clearing the
status register for this bank; before clearing bank state
the inherited bank state left by BIOS/POST should be read and
any valid errors logged (conditional upon model and whether this
is a power-on or warm reset - see [Intel_vol3A]
and [AMD_BKDG_K8F]).
If no model-specific support is present (not even the
vendor-generic support such as cpu_ms.GenuineIntel)
then skip writing to IA32_MC0_CTL
(bank 0) for two special cases: Intel family 0x6, in which
that register controls platform-specific features
(see [Intel_vol3A]
14.3.2.1); and AMD family 0x6 in which bank 0 corresponds to the
Data Cache unit but folklore has it that this bank can produce
spurious machine-checks, so we leave the register just as the
BIOS left it.
IA32_MCi_CTL controls which errors
detected by bank i may produce a machine-check exception.
A set control bit is necessary for the #MC, but is not sufficient to
guarantee a machine-check for that error type; for example
correctable errors will usually not produce a #MC even if
their control bit is set in this register. The actual
behaviour, and which bits control what error types, are
model-specific hence the default initialization value of all 1's.
-
Call into any model-specific support so that it may perform
additional non-architectural MCA initialization.
-
Write all 1's to IA32_MCG_CTL if present,
or whatever value model-specific support cares to change that to.
This enables all detectors for those processors that require
this enabling (have this register).
-
Associate a handler with vector 0x18 in the IDT.
Write to CR4 to enable the machine-check
exception.
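The MSR portion of the steps above can be sketched as a pure "plan" of register writes, which would let the same logic be reused natively and in a dom0 context. This is only a sketch under stated assumptions: mca_init_plan, plan_emit and msr_write_t are hypothetical names, the model-specific hooks, the logging of inherited bank state, the IDT setup and the CR4 write are all omitted, and it assumes that only the IA32_MC0_CTL write (not the status clear) is skipped for the bank 0 special cases. The MSR numbers are the architectural ones.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_MCG_CTL         0x17b
#define MSR_MC0_CTL         0x400       /* bank i CTL is 0x400 + 4 * i */
#define MSR_MC0_STATUS      0x401       /* bank i STATUS is 0x401 + 4 * i */
#define MCG_CAP_COUNT_MASK  0xffull     /* IA32_MCG_CAP bits 7:0 */
#define MCG_CAP_CTL_P       (1ull << 8) /* IA32_MCG_CTL is present */

typedef struct msr_write {
	uint32_t msr;
	uint64_t val;
} msr_write_t;

/* Record one write if there is room; always return the new count. */
static size_t
plan_emit(msr_write_t *out, size_t max, size_t n, uint32_t msr, uint64_t val)
{
	if (n < max) {
		out[n].msr = msr;
		out[n].val = val;
	}
	return (n + 1);
}

/*
 * Build the ordered list of MSR writes for one cpu. Returns the number
 * of writes required; if that exceeds max the plan was truncated.
 */
static size_t
mca_init_plan(uint64_t mcg_cap, bool skip_bank0_ctl, msr_write_t *out,
    size_t max)
{
	unsigned nbanks = (unsigned)(mcg_cap & MCG_CAP_COUNT_MASK);
	bool have_ctl = (mcg_cap & MCG_CAP_CTL_P) != 0;
	size_t n = 0;
	unsigned i;

	if (nbanks == 0)
		return (0);		/* nothing further to initialize */

	if (have_ctl)			/* quiesce detectors first */
		n = plan_emit(out, max, n, MSR_MCG_CTL, 0);

	for (i = 0; i < nbanks; i++) {
		if (i != 0 || !skip_bank0_ctl)
			n = plan_emit(out, max, n, MSR_MC0_CTL + 4 * i, ~0ull);
		/* Inherited state should be logged before this clear. */
		n = plan_emit(out, max, n, MSR_MC0_STATUS + 4 * i, 0);
	}

	if (have_ctl)			/* finally enable all detectors */
		n = plan_emit(out, max, n, MSR_MCG_CTL, ~0ull);

	return (n);
}
```

Separating the plan from the act of writing also makes the ordering (disable, configure banks, enable) easy to check without hardware.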
Error Polling
For error types that the implementation defines to produce a machine-check
exception, such an exception will be generated when the error is detected
if the corresponding error bit in IA32_MCi_CTL
is set. Which bit controls which error type is model-specific, and
the initialization steps above write all 1's to this register.
For errors that do not produce an #MC at detection, cpu.generic
must arrange to poll all MCA banks of every cpu at a fixed interval and
raise error reports for any valid errors it finds.
Error types that will be observed in poll are notionally those not enabled
for machine-check in IA32_MCi_CTL; these will
have IA32_MCi_STATUS.EN clear when we observe them.
Some error types do not produce a machine-check even if their
enabling IA32_MCi_CTL bit is set. Such
errors, typically hardware-corrected, will be discovered via poll and
may have IA32_MCi_STATUS.EN set; we must not
treat such observations as "should have machine-checked".
It may also be the case that the error polling discovers errors that look
like they should have produced a machine check exception -
say errors whose status indicates uncorrected data, processor context
corrupt, etc. Since by default we have enabled all detectors and written
all 1's to the bank control registers (allowing model-specific code to
modify those defaults) we assume that if such an error did not produce
a machine check that it is to be treated as non-fatal, despite apparent
indications to the contrary. This might be the case, for example,
where machine check for a known spurious case (say arising from a hardware
erratum) is suppressed by clearing the relevant control bit.
So in all cases we do not panic for a polled error, no matter how
severe it appears. We will allow model-specific support to override
this behaviour by insisting that all observations of some error type
are terminal, however they are encountered.
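A poller along these lines might scan a snapshot of the bank status registers as sketched below. The names mca_poll_scan and polled_err_t are hypothetical, and the sketch assumes the caller has already read each IA32_MCi_STATUS into an array; reading the MSRs, raising the ereports and clearing the banks are left out. Note that the looks_severe flag is recorded for telemetry only and never drives a panic, matching the default policy described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural IA32_MCi_STATUS bits. */
#define MC_STATUS_VAL   (1ull << 63)
#define MC_STATUS_UC    (1ull << 61)
#define MC_STATUS_EN    (1ull << 60)
#define MC_STATUS_PCC   (1ull << 57)

typedef struct polled_err {
	unsigned bank;
	uint64_t status;
	bool enabled;		/* EN set; not proof it should have trapped */
	bool looks_severe;	/* UC or PCC set; logged, never terminal */
} polled_err_t;

/*
 * Collect every bank holding a valid error. Returns the number of
 * entries stored in out (at most max).
 */
static size_t
mca_poll_scan(const uint64_t *status, unsigned nbanks, polled_err_t *out,
    size_t max)
{
	size_t n = 0;
	unsigned i;

	for (i = 0; i < nbanks && n < max; i++) {
		if ((status[i] & MC_STATUS_VAL) == 0)
			continue;	/* no valid error in this bank */
		out[n].bank = i;
		out[n].status = status[i];
		out[n].enabled = (status[i] & MC_STATUS_EN) != 0;
		out[n].looks_severe =
		    (status[i] & (MC_STATUS_UC | MC_STATUS_PCC)) != 0;
		n++;
	}
	return (n);
}
```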
Error Disposition
The error handler - machine-check exception handler or poller - needs
to decide whether execution may continue or not as a result of the
observed error. This decision must be made in cpu.generic
without knowledge of the
model-specific error details, but should allow any model-specific support
that is present to contribute. This differs from past non-generic
error philosophy documents which have usually spelled out distinct
descriptions of each error type and their required handling.
There are a number of factors to consider:
-
Whether the error was discovered by machine-check or via poll.
We take the view that anything discovered by poll does not require
a panic or any similar action - we assume that it did not produce
a machine check because of some configuration decision made in
the BIOS or model-specific code. Telemetry preserved
over a warm reset and found at MCA initialization time should
be treated as if discovered in poll - never terminal.
-
Whether the current processor context is corrupt as a result of this
error, as indicated by PCC in the bank status register.
-
Whether uncorrected data is present as a result of the error
(UC). In most circumstances we are required to panic.
Some cpu models
and platforms may, however, implement some means of signalling bad
data such as through poisoning the ECC of the known-bad data ("data
poisoning"). We therefore must give model-specific code the
opportunity to further characterize a UC event: to
indicate whether the uncorrected data has been poisoned in some
way that will prevent it being mistaken for valid data at future use;
and to indicate whether the current (interrupted) context is
unaffected by the uncorrected data, as may be the case for something
like a writeback to memory of bad data displaced from a cpu cache.
Where uncorrected data has not been signalled via some form of
poisoning, we view the corruption as unconstrained.
It is possible for an error to be uncorrected (UC) without
corrupting the current processor context (PCC clear). For
example, a load that displaces a bad line from cache may supply good
data to the processor, but if the bad line is modified then the data
written back to memory is bad.
The error handler must determine three characteristics of an error for
which we machine-checked, or which we found in a poll but for which
EN indicates we should have machine-checked:
-
(#MC only) Is the return
instruction pointer valid? This is indicated by
IA32_MCG_STATUS.RIPV. If invalid, execution is
unlikely to be able to continue, but if only a userland process
is affected then it may be possible to perform a contract-kill
for the affected process contract.
-
Whether there is unconstrained bad data present in the system.
This is primarily indicated by
IA32_MCi_STATUS.UC, but model-specific
support should be given the chance to indicate that the
uncorrected data has either been eliminated from the system
(e.g., by cache flush of unmodified data) or has been marked
via some poison indicator to prevent mistaken use.
If unconstrained bad data is present in the system then we should
panic in an attempt to avoid silent data corruption.
-
(#MC only) Whether the current (interrupted) context is corrupted
because of the error. This is primarily indicated by
IA32_MCi_STATUS.PCC, but model-specific
support should be given the chance to indicate otherwise.
This does not apply for errors discovered via polling.
The following code snippet specifies how error disposition is
to be determined:
    ismc = (took machine check trap?) ? 1 : 0;    /* #MC or poll */
    nerr = ignore = unconstrained = curctxbad = forcefatal = 0;
    retval = 0;

    for i in each MCA bank on the poll/#MC processor {
        if (MCi_STATUS.VAL == 0)
            continue;    /* skip banks with no valid err */
        else
            nerr++;

        pcc = MCi_STATUS.PCC;
        uc = MCi_STATUS.UC;

        /*
         * Allow model-specific plugin to perform additional error
         * handling and to indicate error status.
         */
        ms_disp = model_specific_error_handler();

        /*
         * Model-specific code may override or eliminate corrupt
         * processor context.
         */
        if (pcc && (ms_disp says current context is ok))
            pcc = 0;

        /*
         * Model-specific code may override or eliminate uncorrected
         * data.
         */
        if (uc && (ms_disp says model-specific handler cleared data))
            uc = 0;

        /*
         * Model-specific poisoning of uncorrected data.
         */
        if (uc)
            poisoned = (ms_disp says UC data was poisoned);

        if (ms_disp says always to ignore this error regardless) {
            ignore++;
        } else {
            /*
             * Our default disposition calculation. If we took
             * a machine check (ismc) or this is a poll but this
             * error is configured to machine check (en) then
             * determine whether there is uncorrected and unpoisoned
             * data present, and whether the current context is
             * intact or not.
             */
            if (uc && !poisoned)
                unconstrained++;

            if (pcc && ismc)
                curctxbad++;

            /*
             * Allow model-specific support to force an error
             * to be fatal.
             */
            if (ms_disp says always fatal)
                forcefatal++;
        }
    }

    /*
     * A machine-check must have RIPV valid if we are to resume. This
     * applies even if we somehow counted no valid errors.
     */
    if (ismc && MCG_STATUS.RIPV == 0)
        retval |= RIPV_INVALID;

    if (nerr > 0) {
        if (unconstrained > 0)
            retval |= UC_UNCONSTRAINED;
        if (curctxbad > 0)
            retval |= CURCTXBAD;
        if (forcefatal > 0)
            retval |= FORCEFATAL;
    }

    return (retval);
The suggested responses by the caller to these disposition values are:
-
If RIPV_INVALID we have machine-checked and do not
have a valid instruction pointer pushed onto the stack to resume
at, even if other error information may indicate the
error to be resumable. If we interrupted the kernel then we must
panic; if we interrupted userland and can be certain that the
error can only have affected userland (not trivial) then
it is possible that we could sever the userland contract.
-
If UC_UNCONSTRAINED then we have uncorrected data present
in the system which is not recognisable as bad data. If this affects
the kernel we must panic. If it can be isolated to
userland we may perform a contract kill.
-
If CURCTXBAD then the current context must be terminated -
a panic if the kernel is affected, otherwise a contract kill.
-
If FORCEFATAL then panic.
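The suggested responses above could be folded into a single decision helper, sketched here. The numeric flag values, the mca_action_t enum and the mca_choose_action name are all assumptions made for illustration; only the flag names come from the text. The user_only argument stands for the (non-trivial) determination that the error can only have affected userland, which this sketch does not attempt to make.

```c
#include <stdbool.h>

/* Disposition flags; the numeric values are illustrative only. */
#define RIPV_INVALID      0x1u
#define UC_UNCONSTRAINED  0x2u
#define CURCTXBAD         0x4u
#define FORCEFATAL        0x8u

typedef enum mca_action {
	MCA_ACT_CONTINUE,
	MCA_ACT_CONTRACT_KILL,
	MCA_ACT_PANIC
} mca_action_t;

/*
 * Map a disposition to a response. user_only must be true only when
 * the caller is certain the error is confined to userland.
 */
static mca_action_t
mca_choose_action(unsigned disp, bool user_only)
{
	if (disp & FORCEFATAL)
		return (MCA_ACT_PANIC);

	if (disp & (RIPV_INVALID | UC_UNCONSTRAINED | CURCTXBAD))
		return (user_only ? MCA_ACT_CONTRACT_KILL : MCA_ACT_PANIC);

	return (MCA_ACT_CONTINUE);	/* log the error and resume */
}
```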
Generic Error Classification
The generic module should only classify errors according to architectural
components, as per 14.7 "Interpreting The MCA Error Codes" of
[Intel_vol3A]
(familiarity with which is necessary in understanding the remainder of this
section). It should also hook into model-specific support to permit more
detailed error classification.
The generic module should create ereports in the subclass
ereport.cpu.generic-x86. In deciding an ereport class
it should also permit model-specific support to specify a new
subclass and/or leaf ereport class.
Simple Error Codes
Simple error codes are documented in 14.7.1 of
[Intel_vol3A].
The AMD K7 (Athlon and Duron) processors implement the same set
of simple error codes (with the exception of "Internal Timer").
AMD K8 and later do not appear to implement these simple error types,
or they are not documented if they do. We will treat these simple
error codes as architectural and report the same error class regardless
of cpu type.
The following
table indicates what ereport classes are used for each simple error code:
| Simple Error Code | MCA Error Code | Ereport Class |
|---|---|---|
| No error | 0000 0000 0000 0000 | (No ereport raised) |
| Unclassified | 0000 0000 0000 0001 | ereport.cpu.generic-x86.unclassified |
| Microcode ROM Parity | 0000 0000 0000 0010 | ereport.cpu.generic-x86.microcode_rom_parity |
| External | 0000 0000 0000 0011 | ereport.cpu.generic-x86.external |
| Functional Redundancy Check (FRC) | 0000 0000 0000 0100 | ereport.cpu.generic-x86.frc |
| Internal Timer | 0000 0100 0000 0000 | ereport.cpu.generic-x86.internal_timer |
| Internal Unclassified | 0000 01xx xxxx xxxx | ereport.cpu.generic-x86.internal_unclassified |
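The table above can be read as a small lookup, sketched below. The function name mca_simple_leaf is hypothetical; it returns only the leaf component of the ereport class and returns NULL both for "no error" and for codes that are not simple error codes. The exact Internal Timer code must be tested before the wider Internal Unclassified pattern, since the former also matches the latter's mask.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the ereport leaf class for a simple MCA error code, or NULL
 * if the code is "no error" or is not one of the simple error codes.
 */
static const char *
mca_simple_leaf(uint16_t code)
{
	switch (code) {
	case 0x0001:
		return ("unclassified");
	case 0x0002:
		return ("microcode_rom_parity");
	case 0x0003:
		return ("external");
	case 0x0004:
		return ("frc");
	case 0x0400:
		return ("internal_timer");
	default:
		break;
	}

	/* 0000 01xx xxxx xxxx, with the all-zero x case handled above. */
	if ((code & 0xfc00) == 0x0400)
		return ("internal_unclassified");

	return (NULL);
}
```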
Ereport Payload For Simple Error Types
The simple error types all have the same ereport payload specification
(in addition to the FMA protocol-required members such as "detector").
This is the common payload data, included by all ereports for
generic x86 MCA:
| Payload Member Name | Data Type |
|---|---|
| IA32_MCG_STATUS | UINT64 |
| machine_check_in_progress | BOOLEAN_VALUE |
| ip | UINT64 |
| privileged | BOOLEAN_VALUE |
| bank_number | UINT8 |
| bank_msr_offset | UINT64 |
| IA32_MCi_STATUS | UINT64 |
| overflow | BOOLEAN_VALUE |
| error_uncorrected | BOOLEAN_VALUE |
| error_enabled | BOOLEAN_VALUE |
| processor_context_corrupt | BOOLEAN_VALUE |
| threshold_based_error_status | STRING |
| error_code | UINT16 |
| model_specific_error_code | UINT16 |
| IA32_MCi_ADDR | UINT64 |
| IA32_MCi_MISC | UINT64 |
Payload Members Common To All Ereports
(Definitions)
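Several of the payload members above are simply decoded fields of the two status registers. The sketch below shows one plausible derivation; mca_payload_t and mca_decode are hypothetical names, and members that need other inputs (ip, privileged, bank_number, bank_msr_offset, threshold_based_error_status and the raw ADDR/MISC values) are deliberately left out. The bit positions are the architectural ones: MCIP is IA32_MCG_STATUS bit 2, and OVER, UC, EN and PCC are IA32_MCi_STATUS bits 62, 61, 60 and 57.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct mca_payload {
	bool machine_check_in_progress;	/* IA32_MCG_STATUS.MCIP */
	bool overflow;			/* IA32_MCi_STATUS.OVER */
	bool error_uncorrected;		/* IA32_MCi_STATUS.UC */
	bool error_enabled;		/* IA32_MCi_STATUS.EN */
	bool processor_context_corrupt;	/* IA32_MCi_STATUS.PCC */
	uint16_t error_code;		/* status bits 15:0 */
	uint16_t model_specific_error_code; /* status bits 31:16 */
} mca_payload_t;

/* Decode the register-derived payload members for one bank. */
static mca_payload_t
mca_decode(uint64_t mcg_status, uint64_t mci_status)
{
	mca_payload_t p;

	p.machine_check_in_progress = ((mcg_status >> 2) & 1) != 0;
	p.overflow = ((mci_status >> 62) & 1) != 0;
	p.error_uncorrected = ((mci_status >> 61) & 1) != 0;
	p.error_enabled = ((mci_status >> 60) & 1) != 0;
	p.processor_context_corrupt = ((mci_status >> 57) & 1) != 0;
	p.error_code = (uint16_t)(mci_status & 0xffff);
	p.model_specific_error_code = (uint16_t)((mci_status >> 16) & 0xffff);
	return (p);
}
```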
Compound Error Codes
Compound errors are described in
[Intel_vol3A] 14.7.2 and in the
various AMD model-specific BKDG; the AMD documentation does not
include the "Generic Memory Hierarchy" compound errors but if an
AMD implementation reports an error that fits into that classification
it should be processed as such and not treated as unknown.
Compound error types are recognised as follows:
| Error Type | MCA Error Code Form | Error Nature | Interpretation |
|---|---|---|---|
| Generic Memory Hierarchy | 000F 0000 0000 11LL | Generic cpu cache memory error | - |
| TLB | 000F 0000 0001 TTLL | TLB tag/data array errors | {TT}TLB{LL}_ERR |
| Memory Hierarchy Errors | 000F 0001 RRRR TTLL | CPU cache hierarchy errors | {TT}CACHE{LL}_{RRRR}_ERR |
| Bus and Interconnect | 000F 1PPT RRRR IILL | External bus/interconnect errors | BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR |
Compound Error Encoding
The F in the MCA error code form is usually 0, but on some Intel
processor models a 1
indicates that filtering is active for this correctable error type - that
some or all subsequent corrections for this error type in this bank
will not be reported. This has no bearing on error classification.
The F sub-field is not used on AMD so will always match 0.
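Because F has no bearing on classification, the four code forms above can be matched with simple mask-and-compare tests once bit 12 is cleared. The following is a sketch only; the enum and the name mca_compound_type are hypothetical, and the masks are derived directly from the bit patterns in the table above.

```c
#include <stdint.h>

typedef enum mca_ctype {
	MCA_C_NONE,		/* not a compound error code */
	MCA_C_GENERIC_MEMHIER,	/* 000F 0000 0000 11LL */
	MCA_C_TLB,		/* 000F 0000 0001 TTLL */
	MCA_C_MEMHIER,		/* 000F 0001 RRRR TTLL */
	MCA_C_BUS		/* 000F 1PPT RRRR IILL */
} mca_ctype_t;

#define MCA_CODE_F	0x1000u	/* filtering indicator, ignored here */

/* Identify which compound form, if any, an MCA error code matches. */
static mca_ctype_t
mca_compound_type(uint16_t code)
{
	unsigned c = code & ~MCA_CODE_F;

	if ((c & 0xfffc) == 0x000c)
		return (MCA_C_GENERIC_MEMHIER);
	if ((c & 0xfff0) == 0x0010)
		return (MCA_C_TLB);
	if ((c & 0xff00) == 0x0100)
		return (MCA_C_MEMHIER);
	if ((c & 0xf800) == 0x0800)
		return (MCA_C_BUS);
	return (MCA_C_NONE);
}
```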
The error code for each type also includes an indication of one or more of the
cache level (LL), transaction type (TT), request type (RRRR),
participating processor (PP), timeout or not (T), and whether this
was a memory or IO access (II) as in the table that follows.
TT Sub-Field Encoding

| Transaction Type | Mnemonic | Binary Encoding | Ereport Class Component |
|---|---|---|---|
| Instruction | I | 00 | "i" |
| Data | D | 01 | "d" |
| Generic | G | 10 | "" |

LL Sub-Field Encoding

| Cache Level | Mnemonic | Binary Encoding | Ereport Class Component |
|---|---|---|---|
| Level 0 | L0 | 00 | "l0" |
| Level 1 | L1 | 01 | "l1" |
| Level 2 | L2 | 10 | "l2" |
| Generic | LG | 11 | "" |

RRRR Sub-Field Encoding

| Request Type | Mnemonic | Binary Encoding | Ereport Class Component |
|---|---|---|---|
| Generic Error | ERR | 0000 | "" |
| Generic Read | RD | 0001 | "" |
| Generic Write | WR | 0010 | "" |
| Data Read | DRD | 0011 | "" |
| Data Write | DWR | 0100 | "" |
| Instruction Fetch | IRD | 0101 | "" |
| Prefetch | PREFETCH | 0110 | "" |
| Eviction | EVICT | 0111 | "" |
| Snoop | SNOOP | 1000 | "" |

PP Sub-Field Encoding

| Origin | Mnemonic | Binary Encoding | Ereport Class Component |
|---|---|---|---|
| Local processor originated request | SRC | 00 | "" |
| Local processor responded to request | RES | 01 | "" |
| Local processor observed as 3rd party | OBS | 10 | "" |
| Generic | - | 11 | "" |

T Sub-Field Encoding

| Timeout Status | Mnemonic | Binary Encoding | Ereport Class Component |
|---|---|---|---|
| Request timed out | TIMEOUT | 1 | "" |
| Request did not time out | NOTIMEOUT | 0 | "" |

II Sub-Field Encoding

| Access Type | Mnemonic | Binary Encoding | Ereport Class Component |
|---|---|---|---|
| Memory Access | M | 00 | "memory" |
| Reserved | - | 01 | "" |
| I/O Access | IO | 10 | "io" |
| Other transaction | - | 11 | "" |
Error Code Subfield Encoding for Compound Errors
The error "interpretation", per [Intel_vol3A], is formed by
substituting the sub-field mnemonics into the interpretation string
for the particular compound error type. For example an L1 cache error
encountered during an instruction fetch would be interpreted as an
ICACHEL1_IRD_ERR. We will term this the "expanded interpretation
string".
Using the expanded interpretation strings in forming ereport classes
would lead to a large number of ugly ereport classes, none of which
would match existing conventions for other FMA portfolios. Instead
we will compress the number of possible errors into fewer ereport
subclasses using a format string and the ereport class components
in the above table:
| Compound Error Type | Interpretation String | Ereport Leaf Class Format String |
|---|---|---|
| Generic Memory Hierarchy | - | "<l0,l1,l2,>cache[_uc]" |
| TLB | {TT}TLB{LL}_ERR | "<l0,l1,l2,><i,d,>tlb[_uc]" |
| Memory Hierarchy | {TT}CACHE{LL}_{RRRR}_ERR | "<l0,l1,l2,><i,d,>cache[_uc]" |
| Bus and Interconnect | BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR | "bus_interconnect_<memory,io>[_uc]" |
Compound Error Ereport Leaf Class Formats
For example, a memory hierarchy error with expanded interpretation string of
ICACHEL1_IRD_ERR and for which MCi_STATUS does not indicate UC would
have an ereport leaf class of "l1icache", which is more readable and
still captures all we need for diagnosis purposes; if the error were
uncorrected the leaf class would be "l1icache_uc".
Note that LG in error interpretations contributes an empty string to
the ereport class, since things like lgcache are not very descriptive;
for models in which LG is also used to represent level-3 cache errors
the model-specific plugin should provide more descriptive ereport
classes. The ereport classes have been chosen to facilitate diagnosis
of the major functional units.
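The leaf class format strings can be applied mechanically, as in the sketch below. The function name mca_compound_leaf is hypothetical, and treating the undefined TT encoding 11 as an empty component is an assumption of this sketch rather than something the encoding tables specify. The function writes only the leaf component (for example "l1icache_uc"), not the full ereport class.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build the ereport leaf class for a compound MCA error code.
 * Returns false if the code matches no compound form or if buf is
 * too small; otherwise buf holds the NUL-terminated leaf class.
 */
static bool
mca_compound_leaf(uint16_t code, bool uc, char *buf, size_t len)
{
	static const char *const ll_tab[4] = { "l0", "l1", "l2", "" };
	/* TT encoding 11 is undefined; assumed to contribute nothing. */
	static const char *const tt_tab[4] = { "i", "d", "", "" };
	static const char *const ii_tab[4] = { "_memory", "", "_io", "" };
	unsigned c = code & ~0x1000u;		/* ignore the F bit */
	const char *ll = ll_tab[c & 3];
	const char *tt = tt_tab[(c >> 2) & 3];
	const char *suffix = uc ? "_uc" : "";
	int n;

	if ((c & 0xfffc) == 0x000c)		/* generic memory hierarchy */
		n = snprintf(buf, len, "%scache%s", ll, suffix);
	else if ((c & 0xfff0) == 0x0010)	/* TLB */
		n = snprintf(buf, len, "%s%stlb%s", ll, tt, suffix);
	else if ((c & 0xff00) == 0x0100)	/* memory hierarchy */
		n = snprintf(buf, len, "%s%scache%s", ll, tt, suffix);
	else if ((c & 0xf800) == 0x0800)	/* bus and interconnect */
		n = snprintf(buf, len, "bus_interconnect%s%s",
		    ii_tab[(c >> 2) & 3], suffix);
	else
		return (false);

	return (n >= 0 && (size_t)n < len);
}
```

For the ICACHEL1_IRD_ERR example, the error code is 0x0151 (RRRR of 0101, TT of 00, LL of 01), which this sketch maps to "l1icache".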
For completeness, the following table
lists all compound ereport classes that can be formed:
|
Generic Memory Hierarchy
|
| Expanded Compound Error Interpretation |
Ereport Classes |
| - |
ereport.cpu.generic-x86.l0cache[_uc]
ereport.cpu.generic-x86.l1cache[_uc]
ereport.cpu.generic-x86.l2cache[_uc]
ereport.cpu.generic-x86.cache[_uc]
|
|
TLB Errors
|
| Expanded Compound Error Interpretation |
Ereport Classes |
| DTLBL0_ERR |
ereport.cpu.generic-x86.l0dtlb[_uc] |
| DTLBL1_ERR |
ereport.cpu.generic-x86.l1dtlb[_uc] |
| DTLBL2_ERR |
ereport.cpu.generic-x86.l2dtlb[_uc] |
| DTLBLG_ERR |
ereport.cpu.generic-x86.dtlb[_uc] |
| ITLBL0_ERR |
ereport.cpu.generic-x86.l0itlb[_uc] |
| ITLBL1_ERR |
ereport.cpu.generic-x86.l1itlb[_uc] |
| ITLBL2_ERR |
ereport.cpu.generic-x86.l2itlb[_uc] |
| ITLBLG_ERR |
ereport.cpu.generic-x86.itlb[_uc] |
| GTLBL0_ERR |
ereport.cpu.generic-x86.l0tlb[_uc] |
| GTLBL1_ERR |
ereport.cpu.generic-x86.l1tlb[_uc] |
| GTLBL2_ERR |
ereport.cpu.generic-x86.l2tlb[_uc] |
| GTLBLG_ERR |
ereport.cpu.generic-x86.tlb[_uc] |
|
Memory Hierarchy Errors
|
| Expanded Compound Error Interpretation |
Ereport Classes |
|
DCACHEL0_{RRRR}_ERR
|
ereport.cpu.generic-x86.l0dcache[_uc] |
|
DCACHEL1_{RRRR}_ERR
|
ereport.cpu.generic-x86.l1dcache[_uc] |
|
DCACHEL2_{RRRR}_ERR
|
ereport.cpu.generic-x86.l2dcache[_uc] |
|
DCACHELG_{RRRR}_ERR
|
ereport.cpu.generic-x86.dcache[_uc] |
|
ICACHEL0_{RRRR}_ERR
|
ereport.cpu.generic-x86.l0icache[_uc] |
|
ICACHEL1_{RRRR}_ERR
|
ereport.cpu.generic-x86.l1icache[_uc] |
|
ICACHEL2_{RRRR}_ERR
|
ereport.cpu.generic-x86.l2icache[_uc] |
|
ICACHELG_{RRRR}_ERR
|
ereport.cpu.generic-x86.icache[_uc] |
|
GCACHEL0_{RRRR}_ERR
|
ereport.cpu.generic-x86.l0cache[_uc] |
|
GCACHEL1_{RRRR}_ERR
|
ereport.cpu.generic-x86.l1cache[_uc] |
|
GCACHEL2_{RRRR}_ERR
|
ereport.cpu.generic-x86.l2cache[_uc] |
|
GCACHELG_{RRRR}_ERR
|
ereport.cpu.generic-x86.cache[_uc] |
|
Bus and Interconnect Errors
|
| Expanded Compound Error Interpretation |
Ereport Classes |
| BUS_L{0,1,2,G}_{SRC,RES,OBS}_{RRRR}_-_{TIMEOUT,NOTIMEOUT}_ERR |
ereport.cpu.generic-x86.bus_interconnect[_uc] |
| BUS_L{0,1,2,G}_{SRC,RES,OBS}_{RRRR}_M_{TIMEOUT,NOTIMEOUT}_ERR |
ereport.cpu.generic-x86.bus_interconnect_memory[_uc] |
| BUS_L{0,1,2,G}_{SRC,RES,OBS}_{RRRR}_IO_{TIMEOUT,NOTIMEOUT}_ERR |
ereport.cpu.generic-x86.bus_interconnect_io[_uc] |
All Generic CPU Compound Ereport Classes
Ereport Payload For Compound Error Types
All compound error ereports include the
common ereport payload information that the
simple error classes do. For TLB, Memory Hierarchy and Bus/Interconnect
errors the following should also be included:
| Payload Member Name | Data Type | Description |
|---|---|---|
| compound_errorname | STRING | Expanded interpretation string for this compound error |
Additional Payload Members For Compound Errors
(Definitions)
Unknown Error Codes
If the error code does not exactly match one of the
documented simple error codes and does not match any of the four
compound error code forms, then it falls outside of the
MCA architectural classifications and should be logged with
ereport subclass unknown and include the common
ereport payload information.
AMD Vendor-Common Error Classification
A "vendor-common" (or "vendor-generic") model-specific plugin should
seek to implement model-specific support that is common to a large
number of models from that vendor. It will form the baseline for
those models where no more-specific model support is available.
A vendor generic model-specific plugin may also choose to implement
full model-specific details for particular cpu models, in addition
to its vendor-generic duties. This is an alternative to delivering
such model-specific support via further modules whose pathnames
specify one or more of family, model and stepping. In the AMD
case it is suggested that the vendor-generic module remain truly
vendor-generic and that the existing AMD family 0xf cpu module support
be recast as a model-specific plugin that loads instead of the
generic AMD plugin.
Main Memory ECC Errors
For all of AMD family 0xf (K8, revisions B through G) and for
at least the initial family 0x10 revisions (and very likely
thereafter) we can recognise main memory ECC errors as:
- detected by the on-chip NorthBridge MCA bank (bank 4),
- having Bus/Interconnect compound error type,
- having LL of LG (generic),
- having II of M (memory),
- with the CECC or UECC bit set in the
bank status register.
Furthermore, we can distinguish ChipKill vs non-ChipKill ECC code
events by the "extended error code" which is a sub-field of the
model-specific error code on these models: an extended error code
of 0 indicates regular 64/8 ECC, while a nonzero extended error
code indicates a 128/16 ChipKill ECC (possibly still a single-bit
event, however). The following table indicates the leaf ereport class
to use for each possible error event type:
| Error Type | Ereport Class |
| Correctable (single-bit) 64/8 ECC error | ereport.cpu.generic-x86.mem_ce |
| Uncorrectable 64/8 ECC error | ereport.cpu.generic-x86.mem_ue |
| Correctable 128/16 ChipKill ECC error | ereport.cpu.generic-x86.mem_ce |
| Uncorrectable (multi symbol) 128/16 ECC error | ereport.cpu.generic-x86.mem_ue |
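The recognition rules and the table above can be combined into a short Python sketch. The bit positions used for CECC (bit 46), UECC (bit 45) and the extended error code (bits 19:16 of the bank status) are assumptions drawn from the family 0xf NorthBridge status register layout and should be checked against the BKDG for the model in question. The function name is hypothetical, and treating a status with both bits set as uncorrectable is a choice made for this sketch.

```python
NB_BANK = 4              # on-chip NorthBridge MCA bank
STATUS_CECC = 1 << 46    # assumed position of the CECC bit
STATUS_UECC = 1 << 45    # assumed position of the UECC bit


def amd_mem_ecc(bank, status):
    """Return (ereport class, syndrome-type), or None if not memory ECC."""
    code = status & 0xFFFF
    is_bus = (code & 0xF800) == 0x0800   # bus/interconnect compound form
    ll = code & 0b11                     # 0b11 means LG (generic)
    ii = (code >> 2) & 0b11              # 0b00 means M (memory)
    if bank != NB_BANK or not is_bus or ll != 0b11 or ii != 0b00:
        return None
    if not status & (STATUS_CECC | STATUS_UECC):
        return None
    ext = (status >> 16) & 0xF           # assumed extended error code field
    syndrome_type = "E" if ext == 0 else "C"
    leaf = "mem_ue" if status & STATUS_UECC else "mem_ce"
    return ("ereport.cpu.generic-x86." + leaf, syndrome_type)
```

The second element of the result is "E" for 64/8 ECC and "C" for 128/16 ChipKill ECC.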
Ereport Payload For AMD Memory ECC Errors
| Payload Member Name | Data Type | Description |
| syndrome | UINT16 | The ECC error syndrome |
| syndrome-type | STRING | "E" for 64/8 ECC, "C" for 128/16 ChipKill ECC |
| resource | NVLIST_ARRAY | An array of FMRIs identifying the node, dimm, rank (or perhaps node, channel and chip-select) and perhaps even more fine-grained resolution such as row, column and internal bank numbers. This member can only be included if some generic address-to-resource mechanism is available, or if a full-featured memory-controller driver is present. Ideally this is a one-member array, but where the error address and syndrome do not isolate the error to a single rank or where a full-featured memory-controller driver is unavailable there may be multiple entries indicating possible locations. |
| resource_counts | UINT8_ARRAY | If the resource FMRIs are filled using a generic means such as the NorthBridge ECC channel/chip-select error counters in the Online Spare Control Register available in K8 revision F and later and K9 (see below) then these counts reflect the ECC count for the corresponding resource array entry. |
Additional Payload Members of AMD Memory ECC Ereports
(Definitions)
Address-To-Resource Resolution
Resolving a memory error address to a (node, chip-select)
involves understanding the structure of various memory-controller
registers for the particular cpu model and the details of memory
interleaving etc; these can vary within a cpu family. In Solaris
such knowledge is captured in the
model-specific memory-controller driver mc-amd whose main
purpose is memory topology discovery, address-to-resource translation,
and resource-to-address translation.
A full-featured memory-controller driver is part of any full model-specific
FMA implementation. Thus for as long as a particular model is being
supported by the generic cpu module with AMD vendor-common plugin,
there is likely no full-featured memory-controller driver present.
For AMD family 0xf revision F and later the Online Spare Control
Register exposes 4-bit counts of the ECC errors experienced by each
(channel, chip-select) combination. When an ECC error
is observed, usually during a poll, we can check this register to
see which combination(s) has/have experienced ECC errors during the
poll interval just passed; if we zero the count we can see at the
next memory error event which combinations have ticked on again.
This therefore provides a generic, if slightly coarse, mechanism for
deciding which channel and chip-select contributed an error if
we do not have full translation facilities available.
The resource FMRI array and corresponding resource_counts
reflect the above information. Diagnosis software must decide how it will
handle the case where multiple combinations are identified because
each has counted one or more ECC errors during the poll interval.
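How the per-combination counters might populate the two payload arrays can be sketched in Python as below. The input mapping is hypothetical: it stands for counts already extracted from the Online Spare Control Register, whose bit layout is model-specific and deliberately not modelled here. The (channel, chip-select) tuples are placeholders for the FMRIs that a real ereport would carry.

```python
def resources_from_counts(counts):
    """Build parallel resource / resource_counts lists.

    counts is a hypothetical mapping {(channel, chip_select): count}
    read since the counters were last zeroed. Only combinations that
    counted at least one ECC error are reported.
    """
    resource = []
    resource_counts = []
    for key in sorted(counts):
        n = counts[key] & 0xF        # the hardware counters are 4 bits wide
        if n:
            resource.append(key)     # placeholder for an FMRI
            resource_counts.append(n)
    return resource, resource_counts
```

When more than one combination is returned, the ereport carries several candidate locations and the diagnosis software has to decide how to weigh them.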
Intel Vendor-Common Error Classification
The Intel model-specific support will layer on top of cpu.generic
and will be augmented by one or more memory-controller drivers for
the off-chip memory-controller hub (MCH). A machine-check exception can be
raised by the MCH, but the associated telemetry is not in an MCA bank
that can be read by cpu.generic but is instead available via
PCI accesses that will be made by the memory-controller driver.
The generic #MC handler must, therefore, call out to the memory-controller
driver to allow it to reap error telemetry.
For those cpu-experienced errors that do fall within the MCA banks
all handling and classification will be performed by cpu.generic -
Intel does not publish more-detailed error classification information.
The sole contribution from the Intel-generic module in terms of
classification will be to provide a new ereport subclass such that
ereports are logged in subclass ereport.cpu.intel. The
memory-controller driver will be responsible for raising ereports for
telemetry it reads.
Payload Member Definitions
- IA32_MCG_STATUS (UINT64)
-
The IA32_MCG_STATUS register value at the
time the error event telemetry was captured (during #MC trap
handling or during poll).
- machine_check_in_progress (BOOLEAN_VALUE)
-
Present whenever IA32_MCG_STATUS is present,
with the value of IA32_MCG_STATUS.MCIP. That
bit indicates that a machine check is in-progress, so this
member should have value 1 for those errors that are handled via
a machine check exception and is expected to be 0 for errors
discovered via poll.
- ip (UINT64)
-
Only included if a machine-check is in-progress and
IA32_MCG_STATUS.EIPV is set, indicating
that the instruction pointer pushed onto the stack when the #MC
occurred is directly associated with the error.
- privileged (BOOLEAN_VALUE)
-
Only included if a machine-check is in-progress, and indicates whether
the interrupted code was privileged kernel code (value 1) or userland
code (value 0). This only indicates the nature of the code that
was interrupted by the #MC and does not necessarily indicate who
suffered the error.
- bank_number (UINT8)
-
The MCA bank number from which the error telemetry was read.
This is the Nth error reporting register bank, defined by a
group of four control/status/addr/misc registers as follows:
[IA32_]MC?_{CTL,STATUS,ADDR,MISC} naming, by vendor and processor family:
| Bank Number | Control/Status/Address/Misc MSR | AMD K7 | AMD K8 | AMD K9 | Intel Core | Intel Pentium 4 & Xeon | Intel Core Solo/Duo | Intel Pentium M | Intel P6 |
| 0 | MSRs 0x400, 0x401, 0x402, 0x403 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | MSRs 0x404, 0x405, 0x406, 0x407 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | MSRs 0x408, 0x409, 0x40a, 0x40b | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | MSRs 0x40c, 0x40d, 0x40e, 0x40f | 3 | 3 | 3 | 4 | 3 | 4 | 4 | 4 |
| 4 | MSRs 0x410, 0x411, 0x412, 0x413 | - | 4 | 4 | 3 | 4 | 3 | 3 | 3 |
| 5 | MSRs 0x414, 0x415, 0x416, 0x417 | - | - | 5 | 5 | - | 5 | - | - |
Not all of the banks will include a MISC register, but that does not
affect our bank numbering. The
IA32_MCi_STATUS.MISCV bit should be checked to
see whether MISC should be read. Yes, newer Intel microarchitectures
name consecutive error detector banks as
IA32_MC{0,1,2,4,3} which is not at all
confusing!
Since we classify errors without knowledge of which bank is which
(e.g., Dcache vs Icache is unspecified in the MCA) this does not
affect us.
- bank_msr_offset (UINT64)
-
This is the MSR offset of the
IA32_MCi_CTL register for the bank
(0x400, 0x404, 0x408, ...). This is included in an
attempt to disambiguate the bank number, given the
out-of-order naming from some Intel models.
- IA32_MCi_STATUS (UINT64)
-
The bank status register raw value. We decode some pertinent
components in other payload members:
- overflow (BOOLEAN_VALUE)
-
Indicates the value of
IA32_MCi_STATUS.OVER. If this is 1 then
a machine-check error occurred while the valid bit of the
status register was already set. In general, enabled errors
(i.e., those enabled for #MC in the control register for the
bank) overwrite disabled errors, and uncorrectable errors
overwrite correctable errors. The bank telemetry is always
that of the higher-priority error - there is never any
mixing of the two errors.
- error_uncorrected (BOOLEAN_VALUE)
-
Indicates the value of
IA32_MCi_STATUS.UC. A value of 1
means that the processor was unable to correct the observed error.
- error_enabled (BOOLEAN_VALUE)
-
The value of
IA32_MCi_STATUS.EN, which reflects
whether this error type was enabled for #MC in the
bank control register.
- processor_context_corrupt (BOOLEAN_VALUE)
-
The value of
IA32_MCi_STATUS.PCC; a value of 1
indicates that the current processor context may have been
corrupted as a result of this error.
- threshold_based_error_status (STRING)
-
Included if
IA32_MCG_CAP.MCG_TES_P indicates that
bits 56:53 are to be considered architectural, and that 54:53
indicate the threshold-based error status. This will be one
of the following four strings: "No tracking",
"Green - Below threshold", "Yellow - Above threshold",
"Reserved". AMD does not implement this thresholding feature,
and IA32_MCG_CAP.MCG_TES_P will always
read as 0 on AMD implementations so bits 56:53 remain
model-specific on AMD.
- error_code (UINT16)
-
The MCA error code, bits 15:0 of
IA32_MCi_STATUS.
- model_specific_error_code (UINT16)
-
The model-specific error code, bits 31:16 of
IA32_MCi_STATUS.
- IA32_MCi_ADDR (UINT64)
-
The value read from
IA32_MCi_ADDR. Only included if
IA32_MCi_STATUS.ADDRV is set (address
valid).
- IA32_MCi_MISC (UINT64)
-
The value read from
IA32_MCi_MISC. Only included if
IA32_MCi_STATUS.MISCV is set (MISC register valid).
- compound_errorname (STRING)
-
The expanded compound error interpretation string for this
error; only included for TLB, Memory Hierarchy and Bus/Interconnect
compound error ereports.
- syndrome (UINT16)
-
The ECC error syndrome.
- syndrome-type (STRING)
-
"E" for regular 64/8 ECC, "C" for 128/16 ChipKill ECC.
- resource (NVLIST_ARRAY)
-
An array of FMRIs indicating the resource or resources that
are the source of the memory error. This should be provided
in "hc" scheme and the structure should match the diagnosis
topology as per fmtopo. This member is only present
if the error address can be resolved to a resource.
- resource_counts (UINT8_ARRAY)
-
The number of error observations associated with each entry of
the resource array.
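To show how the register-derived members above relate to the raw values, here is a minimal Python sketch that builds part of the payload from a bank number and the captured register contents. The bit positions (MCG_STATUS: EIPV 1, MCIP 2; MCi_STATUS: OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, threshold status 54:53) follow the architectural register layout in [Intel_vol3A] and are not restated elsewhere in this document, so treat them as assumptions to verify. The function name, its parameters and the use of a dictionary for the payload are hypothetical.

```python
TES_STRINGS = ("No tracking", "Green - Below threshold",
               "Yellow - Above threshold", "Reserved")


def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def build_payload(bank_number, mcg_status, status,
                  addr=None, misc=None, ip=None, tes_p=False):
    """Assemble the register-derived payload members for one bank."""
    mcip = bit(mcg_status, 2)
    payload = {
        "IA32_MCG_STATUS": mcg_status,
        "machine_check_in_progress": mcip,
        "bank_number": bank_number,
        # Each bank owns four consecutive MSRs starting at 0x400.
        "bank_msr_offset": 0x400 + 4 * bank_number,
        "IA32_MCi_STATUS": status,
        "overflow": bit(status, 62),
        "error_uncorrected": bit(status, 61),
        "error_enabled": bit(status, 60),
        "processor_context_corrupt": bit(status, 57),
        "error_code": status & 0xFFFF,
        "model_specific_error_code": (status >> 16) & 0xFFFF,
    }
    # ip: only during a machine check, and only when EIPV is set.
    if mcip and bit(mcg_status, 1) and ip is not None:
        payload["ip"] = ip
    # Threshold status is architectural only when MCG_TES_P is set.
    if tes_p:
        payload["threshold_based_error_status"] = \
            TES_STRINGS[(status >> 53) & 0b11]
    if bit(status, 58) and addr is not None:    # ADDRV
        payload["IA32_MCi_ADDR"] = addr
    if bit(status, 59) and misc is not None:    # MISCV
        payload["IA32_MCi_MISC"] = misc
    return payload
```

The privileged, compound_errorname and memory ECC members are left out because they depend on trap state and on the classification steps described earlier rather than on these registers alone.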
References
-
[Intel_vol3A]
Intel 64 and IA-32 Architectures
Software Developer's Manual, Volume 3A, May 2007;
Order Number: 253668-023US
-
[AMD_vol2]
AMD64 Architecture Programmer's Manual, Volume 2:
System Programming, July 2007; Publication Number: 24593
-
[AMD_BKDG_K8F]
BIOS and Kernel Developer's Guide for AMD
NPT Family 0Fh Processors, December 2006; Publication
Number: 32559