OpenSolaris

Generic x86 MCA Error Philosophy

Published Revision History

Version Date Description
1.11 07/09/12
1.7 07/07/30 First push to OpenSolaris FM Community Website

Introduction

We describe the Solaris error-handling philosophy for generic (or "architectural") machine-check architecture (MCA) errors and for vendor-common machine-specific (non-architectural) extensions. This will be implemented in an upcoming putback to Nevada, and backported aiming for Solaris 10 Update 5.

In [Intel_vol3A] and [AMD_vol2] a generic machine-check architecture is described. This provides for a generic means of discovering and enabling the error-detector banks of a processor, and for collecting error telemetry from these banks and interpreting the impact of the observed error - whether processor context is corrupt etc. We will treat [Intel_vol3A] as the specification for the MCA since it nicely separates what is architectural (defined and available as part of the MCA) and what is model-specific. Between [AMD_vol2] and various model-specific AMD BKDG (BIOS and Kernel Developer's Guide) volumes one can see that the AMD MCA implementations do adhere to the Intel architectural aspects of the MCA, perhaps with some small deviations but remaining compatible for our purposes.

The generic architecture also provides a limited error classification mechanism that allows individual observations to be interpreted in a generic fashion for aspects such as error source - within chip or off chip, within chip cache hierarchy and which cache, etc. Model-specific documentation such as [AMD_BKDG_K8F] may provide detail that can classify an error with much more resolution, such as refining "corrected error from bus/interconnect for a memory read access" to "corrected multiple-bit ChipKill-correctable ECC from main memory, data bits 4 and 6 were bad and corrected"; AMD provides this detail in the public releases of its BKDG, but Intel does not publicly document error types in such detail.

Capturing all model-specific error interpretation is a code-intensive task. Until such support exists in Solaris for a particular cpu model the baseline MCA and FMA support for that model is provided by the generic x86 MCA cpu module cpu.generic. The present document serves as a specification for the required behaviour of cpu.generic for it to implement a respectable baseline level of MCA support for all x86 cpu models supporting MCA.

The cpu.generic cpu module must limit itself solely to MCA architectural aspects, so that it applies to any x86 cpu model that conforms to the MCA (defined as indicating both MCE and MCA features in CPUID information). It must also provide a means to call into model-specific support (if present) to augment its capabilities such that model-specific support can be achieved without having to write whole new cpu modules - simply layering a new model-specific plugin on top of the canonical cpu.generic should be adequate for most cases.

CPU Module Description
cpu.generic Default cpu module which applies if no more model-specific cpu module exists or chooses to initialize for a given cpu model. The expectation is that, with model-specific support layering on top of it, cpu.generic will be the only cpu module delivered in Solaris for the foreseeable future.
cpu.AuthenticAMD.15 CPU module for AMD family 15. This module was introduced with the original FMA/x64 project putback in which it had full FMA support while cpu.generic had very little FMA support. It is expected that this project to improve cpu.generic will eliminate this deliverable and recast the model-specific support it provides as a model-specific plugin.
CPU Module Implementations
Model-specific plugin Description
cpu_ms.AuthenticAMD.15 Detailed model-specific support for AMD family 15 (K8 revisions A through G)
cpu_ms.AuthenticAMD.15.65.2 Detailed model-specific support for a particular AMD family 15 cpu with model 65 and stepping 2. It is generally expected that no or few such fine-grained modules will be delivered - the family 15 module can special-case particular models where need be.
cpu_ms.AuthenticAMD "Generic AMD" model-specific support. Such a module may make use of AMD-specific MCA aspects, typically picking on higher-level features that are common to all existing or current AMD processor families. For example, K8 revision F introduced error thresholding registers which are continued in the next family so such support could live here.
cpu_ms.GenuineIntel.6 Detailed model-specific support for Intel family 6 (all models and steppings, unless a more-specific module exists for a given cpu in which case that is tried first).
cpu_ms.GenuineIntel "Generic Intel" model-specific support.
Some possible CPU Model-Specific Module Implementations

At most one model-specific plugin should ever augment cpu.generic for a particular chip instance - the most-specific module that exists and matches the vendor/family/model/stepping combination and which chooses to initialize for that cpu. The cpu.generic module should be able to stand alone with no model-specific support, but the intention is that where no fine-grained model-specific support yet exists for a chip at least cpu_ms.AuthenticAMD or cpu_ms.GenuineIntel should assist with some broad vendor-common model-specific aspects. A vendor-common module such as cpu_ms.GenuineIntel may choose to provide full model-specific support for some cpu models; it will require some internal model detection code to tune its behaviour to the particular model it finds itself running on.

Scope

This document will detail the required error-handling aspects of the cpu.generic cpu module together with those of the vendor-common model-specific modules cpu_ms.AuthenticAMD and cpu_ms.GenuineIntel. It will not document error-handling aspects of more-detailed model-specific support such as that for cpu_ms.AuthenticAMD.15.

We will begin by describing MCA initialization - discovering detector banks, enabling them, and choosing which error types detected by that bank are enabled for machine-check exception (#MC).

Next we detail our error event response - what action, if any, Solaris should take in addition to logging the error for diagnosis. Examples are panic, reboot, contract kill, cache flush etc.

Finally we will detail all error types that can be detected, and what error report classes will be used for these errors. We will also detail the ereport payload included for each error report.

We will not document diagnosis algorithms here - they will be covered in a separate document. We are concerned only with raising the telemetry upon which the diagnosis engines will operate.

We will also not be documenting the cpu module interface API, nor the model-specific cpu module interface which layers on top of it. These will be documented elsewhere, with the implementation of the requirements laid out in the present document. For our purposes all we need know is that the cpu module implementation that has initialized for a given cpu instance (usually cpu.generic) performs the bulk of the work, and that for each cpu module API member there is typically a corresponding model-specific API member which is called from within the former to perform additional, model-specific actions.

A Word On Virtualization

In addition to running natively on x86 hardware, Solaris may find itself as a dom0 to a Solaris xVM hypervisor or a domU for some potentially unknown hypervisor.

For the Solaris xVM dom0 case, a generic MCA implementation must be designed for the case in which it is a privileged domain above a hypervisor - i.e., the design must easily extend to reusing most code between native and dom0 contexts.

For the domU case we may be paravirtualised or running under full hardware virtualisation - Solaris may be unaware that it does not own the hardware. An implementation must be sure to fail safely in those cases in which the hardware does not appear to behave as it would natively. For example we should not assume the presence of MCA/MCE support from cpu model information alone (i.e., "we know AMD Opteron supports MCA/MCE") but must use CPUID information since the hypervisor may choose to mask some features that would normally be visible via CPUID. Similarly we may discover apparent MCA/MCE support via CPUID, but should be prepared for behaviour such as MCG_CAP indicating zero MCA banks or all MCi_{CTL,STATUS,ADDR,MISC} registers appearing to be read-only or read-as-zero.

A particular design pitfall to avoid is the assumption that every Solaris logical cpu (as listed by psrinfo) corresponds to a unique set of bank control/status/address/miscellaneous registers. This is not the case even when running natively, since in stranded/hyperthreaded designs the individual strands of a single core will typically share the MCA banks of that core. In the virtualised case the virtual cpus presented to a domain may have little or no correspondence to the real cpus, and any correlation need not be fixed (e.g., a single virtual cpu may be mapped onto different real cpus at different times according to the scheduling choices of the hypervisor).

MCA Initialization

The implementation should comply with 14.6 of [Intel_vol3A] and 9.4 of [AMD_vol2]: we should enable all detector banks in IA32_MCG_CTL if that register is present, and in IA32_MCi_CTL enable machine-check exception for all error types detected by each MCA bank, allowing for a couple of special cases.

The cpu module interface must only initiate MCA initialization for cpu instances for which CPUID information indicates both MCA (machine-check architecture) and MCE (machine-check exception) support. We therefore exclude some old cpu models from the start (e.g., AMD K6 and Intel Pentium), but all recent AMD and Intel processors support these features.
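The gating check described above can be sketched as follows. This is a minimal illustration only: the function name mca_should_init is hypothetical and the caller is assumed to supply the raw CPUID leaf 1 EDX value and the IA32_MCG_CAP contents however it obtained them. The feature bit positions (MCE is EDX bit 7, MCA is EDX bit 14) and the bank count field (IA32_MCG_CAP bits 7:0) are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1 EDX feature bits (architectural). */
#define	CPUID_EDX_MCE	(1u << 7)	/* machine-check exception */
#define	CPUID_EDX_MCA	(1u << 14)	/* machine-check architecture */

/* IA32_MCG_CAP bits 7:0 give the number of MCA banks. */
#define	MCG_CAP_COUNT(cap)	((unsigned)((cap) & 0xff))

/*
 * Hypothetical helper: decide whether MCA initialization should proceed.
 * Both features must be advertised by CPUID (a hypervisor may mask them),
 * and at least one bank must be reported.
 */
bool
mca_should_init(uint32_t cpuid1_edx, uint64_t mcg_cap)
{
	const uint32_t need = CPUID_EDX_MCE | CPUID_EDX_MCA;

	if ((cpuid1_edx & need) != need)
		return (false);		/* a feature is absent or masked */

	return (MCG_CAP_COUNT(mcg_cap) != 0);	/* zero banks: nothing to do */
}
```

Deciding from CPUID and MCG_CAP alone, rather than from vendor/family tables, keeps the check safe when a hypervisor hides features that the native cpu model would normally have.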

We must initialize via the following steps:

  1. Read the IA32_MCG_CAP (MCA global capabilities register) and note whether it indicates that the IA32_MCG_CTL is present, and how many MCA banks exist for this processor. If IA32_MCG_CTL is present, initialize it in a later step to enable MCA features - this is a deviation from [Intel_vol3A], however it seems desirable to initialize the individual banks before we enable all detectors. It is not an error for IA32_MCG_CTL to be absent - other initialization below must be performed even when it is absent. Terminate without further initialization if the number of MCA banks indicated is zero.

  2. If IA32_MCG_CTL is present, write 0 to it to disable all MCA features during configuration below. This is a deviation from [Intel_vol3A].

  3. Initialize MCA banks from bank 0 onwards (total number as per IA32_MCG_CAP.Count). For each bank that we choose to initialize (bank 0 may be skipped - see the next paragraph) we are inclined to write all 1's to IA32_MCi_CTL, however any model-specific support should be allowed to provide another value. Write 0 to IA32_MCi_STATUS unless model-specific support asks that we omit clearing the status register for this bank; before clearing bank state the inherited post-BIOS/POST bank state should be read and any valid errors logged (conditional upon model and whether this is a power-on or warm reset - see [Intel_vol3A] and [AMD_BKDG_K8F]).

    If no model-specific support is present (not even the vendor-generic support such as cpu_ms.GenuineIntel) then skip writing to IA32_MC0_CTL (bank 0) for two special cases: Intel family 0x6, in which that register controls platform-specific features (see [Intel_vol3A] 14.3.2.1); and AMD family 0x6 in which bank 0 corresponds to the Data Cache unit but folklore has it that this bank can produce spurious machine-checks, so we leave the register just as the BIOS left it.

    IA32_MCi_CTL controls which errors detected by bank i may produce a machine-check exception. Setting the bit for an error type is necessary for the #MC, but is not sufficient to guarantee a machine-check for that error type; for example correctable errors will usually not produce a #MC even if their control bit is set in this register. The actual behaviour, and which bits control what error types, are model-specific, hence the default initialization value of all 1's.

  4. Call into any model-specific support so that it may perform additional non-architectural MCA initialization.

  5. Write all 1's to IA32_MCG_CTL if present, or whatever value model-specific support cares to change that to. This enables all detectors for those processors that require this enabling (have this register).

  6. Associate a handler with vector 18 (0x12), the machine-check exception vector, in the IDT. Set CR4.MCE to enable the machine-check exception.
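The ordering of steps 1 through 5 can be illustrated with a small userland simulation. This is a sketch under stated assumptions, not the cpu.generic implementation: the sim_mca_t structure stands in for real rdmsr/wrmsr access, the names sim_mca_init and skip_bank0_ctl are hypothetical, "logging" an inherited error is reduced to a counter, and the model-specific hook (step 4) and the IDT/CR4 work (step 6) are only marked by comments. The IA32_MCG_CAP layout used (Count in bits 7:0, MCG_CTL_P in bit 8) and the IA32_MCi_STATUS.VAL bit (bit 63) are architectural.

```c
#include <stdint.h>

#define	MCA_MAX_BANKS		32	/* bound of this simulation only */
#define	MCG_CAP_COUNT(cap)	((unsigned)((cap) & 0xff))
#define	MCG_CAP_CTL_P		(1ULL << 8)	/* IA32_MCG_CTL present */
#define	MCI_STATUS_VAL		(1ULL << 63)

/* Simulated per-cpu MCA register state (stands in for rdmsr/wrmsr). */
typedef struct sim_mca {
	uint64_t mcg_cap;
	uint64_t mcg_ctl;
	uint64_t mci_ctl[MCA_MAX_BANKS];
	uint64_t mci_status[MCA_MAX_BANKS];
	unsigned nlogged;	/* inherited errors "logged" before clearing */
} sim_mca_t;

/*
 * Returns the number of banks walked.  skip_bank0_ctl models the special
 * cases in which IA32_MC0_CTL is left exactly as the BIOS set it.
 */
unsigned
sim_mca_init(sim_mca_t *m, int skip_bank0_ctl)
{
	unsigned nbanks = MCG_CAP_COUNT(m->mcg_cap);
	int ctl_p = (m->mcg_cap & MCG_CAP_CTL_P) != 0;
	unsigned i;

	if (nbanks == 0)
		return (0);		/* step 1: no banks, stop here */
	if (nbanks > MCA_MAX_BANKS)
		nbanks = MCA_MAX_BANKS;

	if (ctl_p)
		m->mcg_ctl = 0;		/* step 2: quiesce while configuring */

	for (i = 0; i < nbanks; i++) {	/* step 3 */
		if (!(i == 0 && skip_bank0_ctl))
			m->mci_ctl[i] = ~0ULL;
		if (m->mci_status[i] & MCI_STATUS_VAL)
			m->nlogged++;	/* log inherited telemetry first */
		m->mci_status[i] = 0;
	}

	/* step 4: a model-specific initialization hook would run here */

	if (ctl_p)
		m->mcg_ctl = ~0ULL;	/* step 5: enable all detectors */

	/* step 6 (IDT handler, CR4.MCE) is kernel-only and omitted */
	return (nbanks);
}
```

A caller would populate mcg_cap and any inherited status values, call sim_mca_init(), and then inspect mci_ctl, mci_status and nlogged to confirm the ordering.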

Error Polling

For error types that the implementation defines to produce a machine-check exception, such an exception will be generated when the error is detected if the corresponding error bit in IA32_MCi_CTL is set. Which bit controls which error type is model-specific, and the initialization steps above write all 1's to this register.

For errors that do not produce an #MC at detection, cpu.generic must arrange to poll all MCA banks of every cpu at a fixed interval and raise error reports for any valid errors it finds.

Error types that will be observed in poll are notionally those not enabled for machine-check in IA32_MCi_CTL; these will have IA32_MCi_STATUS.EN clear when we observe them. Some error types do not produce a machine-check even if their enabling IA32_MCi_CTL bit is set. Such errors, typically hardware-corrected, will be discovered via poll and may have IA32_MCi_STATUS.EN set; we must not treat such observations as "should have machine-checked".

It may also be the case that the error polling discovers errors that look like they should have produced a machine check exception - say errors whose status indicates uncorrected data, processor context corrupt, etc. Since by default we have enabled all detectors and written all 1's to the bank control registers (allowing model-specific code to modify those defaults) we assume that if such an error did not produce a machine check then it is to be treated as non-fatal, despite apparent indications to the contrary. This might be the case, for example, where machine check for a known spurious case (say arising from a hardware erratum) is suppressed by clearing the relevant control bit.

So in all cases we do not panic for a polled error, no matter how severe it appears. We will allow model-specific support to override this behaviour by insisting that all observations of some error type are terminal, however they are encountered.

Error Disposition

The error handler - machine-check exception handler or poller - needs to decide whether execution may continue or not as a result of the observed error. This decision must be made in cpu.generic without knowledge of the model-specific error details, but should allow any model-specific support that is present to contribute. This differs from past non-generic error philosophy documents, which have usually spelled out distinct descriptions of each error type and their required handling.

There are a number of factors to consider:

  • Whether the error was discovered by machine-check or via poll. We take the view that anything discovered by poll does not require a panic or any similar action - we assume that it did not produce a machine check because of some configuration decision made in the BIOS or model-specific code. Telemetry preserved over a warm reset and found at MCA initialization time should be treated as if discovered in poll - never terminal.

  • Whether the current processor context is corrupt as a result of this error, as indicated by PCC in the bank status register.

  • Whether uncorrected data is present as a result of the error (UC). In most circumstances we are required to panic. Some cpu models and platforms may, however, implement some means of signalling bad data such as through poisoning the ECC of the known-bad data ("data poisoning"). We therefore must give model-specific code the opportunity to further characterize a UC event: to indicate whether the uncorrected data has been poisoned in some way that will prevent it being mistaken for valid data at future use; and to indicate whether the current (interrupted) context is unaffected by the uncorrected data, as may be the case for something like a writeback to memory of bad data displaced from a cpu cache.

    Where uncorrected data has not been signalled via some form of poisoning, we view the corruption as unconstrained.

It is possible for an error to be uncorrected (UC) without corrupting the current processor context (PCC clear). For example a load that displaces a bad line from cache may supply good data to the processor but if the bad line is modified the data written back to memory is bad.

The error handler must determine three characteristics of an error for which we machine-checked, or which we found in a poll but for which EN indicates we should have machine-checked:

  1. (#MC only) Is the return instruction pointer valid? This is indicated by IA32_MCG_STATUS.RIPV. If invalid, execution is unlikely to be able to continue, but if only a userland process is affected then it may be possible to perform a contract-kill for the affected process contract.
  2. Whether there is unconstrained bad data present in the system. This is primarily indicated by IA32_MCi_STATUS.UC, but model-specific support should be given the chance to indicate that the uncorrected data has either been eliminated from the system (e.g., by cache flush of unmodified data) or has been marked via some poison indicator to prevent mistaken use. If unconstrained bad data is present in the system then we should panic in an attempt to avoid silent data corruption.
  3. (#MC only) Whether the current (interrupted) context is corrupted because of the error. This is primarily indicated by IA32_MCi_STATUS.PCC, but model-specific support should be given the chance to indicate otherwise. This does not apply for errors discovered via polling.

The following code snippet specifies how error disposition is to be determined:

	retval = 0;
	nerr = ignore = unconstrained = curctxbad = forcefatal = 0;

	ismc = (took machine check trap?) ? 1 : 0;	/* #MC or poll */

	for i in each MCA bank on the poll/#MC processor {
		if (MCi_STATUS.VAL == 0)
			continue;	/* skip banks with no valid err */
		else
			nerr++;

		pcc = MCi_STATUS.PCC;
		uc = MCi_STATUS.UC;

		/*
		 * Allow model-specific plugin to perform additional error
		 * handling and to indicate error status.
		 */
		ms_disp = model_specific_error_handler();

		/*
		 * Model-specific code may override or eliminate corrupt
		 * processor context.
		 */
		if (pcc && (ms_disp says current context is ok))
			pcc = 0;

		/*
		 * Model-specific code may override or eliminate uncorrected
		 * data.
		 */
		if (uc && (ms_disp says model-specific handler cleared data))
			uc = 0;

		/*
		 * Model-specific poisoning of uncorrected data.
		 */
		if (uc)
			poisoned = (ms_disp says UC data was poisoned);

		if (ms_disp says always to ignore this error regardless) {
			ignore++;
		} else {
			/*
			 * Our default disposition calculation.  If we took
			 * a machine check (ismc) or this is a poll but this
			 * error is configured to machine check (en) then
			 * determine whether there is uncorrected and unpoisoned
			 * data present, and whether the current context is
			 * intact or not.
			 */
			if (uc && !poisoned)
				unconstrained++;

			if (pcc && ismc)
				curctxbad++;

			/*
			 * Allow model-specific support to force an error
			 * to be fatal.
			 */
			if (ms_disp says always fatal)
				forcefatal++;
		}
	}

	/*
	 * A machine-check must have RIPV valid if we are to resume.  This
	 * applies even if we somehow counted no valid errors.
	 */
	if (ismc && MCG_STATUS.RIPV == 0)
		retval |= RIPV_INVALID;

	if (nerr > 0) {
		if (unconstrained > 0)
			retval |= UC_UNCONSTRAINED;

		if (curctxbad > 0)
			retval |= CURCTXBAD;

		if (forcefatal > 0)
			retval |= FORCEFATAL;
	}

	return (retval);
	
  
The suggested responses by the caller to these disposition values are:
  • If RIPV_INVALID we have machine-checked and do not have a valid instruction pointer pushed onto the stack to resume at, even if other error information may indicate the error to be resumable. If we interrupt the kernel then we must panic; if we interrupted userland and can be certain that the error can only have affected userland (not trivial) then it is possible that we could sever the userland contract.
  • If UC_UNCONSTRAINED then we have uncorrected data present in the system which is not recognisable as bad data. If this affects the kernel we must panic. If it can be isolated to userland we may perform a contract kill.
  • If CURCTXBAD then the current context must be terminated - a panic if the kernel is affected, otherwise a contract kill.
  • If FORCEFATAL then panic.
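The suggested responses can be condensed into a small decision function. This is a sketch only: the flag values, the mca_action_t type and the function name mca_choose_action are hypothetical, and the kernel_affected argument stands for the caller's own (non-trivial) determination of whether the error can be isolated to userland.

```c
#include <stdbool.h>

/* Disposition flags from the pseudocode; the numeric values are assumed. */
#define	RIPV_INVALID		0x1
#define	UC_UNCONSTRAINED	0x2
#define	CURCTXBAD		0x4
#define	FORCEFATAL		0x8

typedef enum {
	MCA_RESUME,		/* log telemetry and continue */
	MCA_CONTRACT_KILL,	/* terminate the affected process contract */
	MCA_PANIC
} mca_action_t;

/*
 * Map a disposition value to a response.  kernel_affected is true unless
 * the caller is certain that only userland can have been affected.
 */
mca_action_t
mca_choose_action(unsigned disp, bool kernel_affected)
{
	if (disp & FORCEFATAL)
		return (MCA_PANIC);

	if (disp & (RIPV_INVALID | UC_UNCONSTRAINED | CURCTXBAD))
		return (kernel_affected ? MCA_PANIC : MCA_CONTRACT_KILL);

	return (MCA_RESUME);
}
```

For example, mca_choose_action(CURCTXBAD, false) selects a contract kill, while any disposition that includes FORCEFATAL selects a panic regardless of which context was interrupted.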

Generic Error Classification

The generic module should only classify errors according to architectural components, as per 14.7 "Interpreting The MCA Error Codes" of [Intel_vol3A] (familiarity with which is necessary in understanding the remainder of this section). It should also hook into model-specific support to permit more detailed error classification.

The generic module should create ereports in the subclass ereport.cpu.generic-x86. In deciding an ereport class it should also permit model-specific support to specify a new subclass and/or leaf ereport class.

Simple Error Codes

Simple error codes are documented in 14.7.1 of [Intel_vol3A]. The AMD K7 (Athlon and Duron) processors implement the same set of simple error codes (with the exception of "Internal Timer"). AMD K8 and later do not appear to implement these simple error types, or they are not documented if they do. We will treat these simple error codes as architectural and report the same error class regardless of cpu type.

The following table indicates what ereport classes are used for each simple error code:

Simple Error Code                  MCA Error Code       Ereport Class
No error                           0000 0000 0000 0000  (No ereport raised)
Unclassified                       0000 0000 0000 0001  ereport.cpu.generic-x86.unclassified
Microcode ROM Parity               0000 0000 0000 0010  ereport.cpu.generic-x86.microcode_rom_parity
External                           0000 0000 0000 0011  ereport.cpu.generic-x86.external
Functional Redundancy Check (FRC)  0000 0000 0000 0100  ereport.cpu.generic-x86.frc
Internal Timer                     0000 0100 0000 0000  ereport.cpu.generic-x86.internal_timer
Internal Unclassified              0000 01xx xxxx xxxx  ereport.cpu.generic-x86.internal_unclassified
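The table above maps directly onto a lookup such as the following sketch. The function name mca_simple_leaf and its calling convention are hypothetical; only the code values and leaf class names come from the table. The exact Internal Timer value is tested before the wider Internal Unclassified pattern, since the former is one instance of the latter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if 'code' is one of the simple error
 * codes.  On success *leafp is the ereport leaf class, or NULL for
 * "No error" (no ereport is raised).
 */
bool
mca_simple_leaf(uint16_t code, const char **leafp)
{
	switch (code) {
	case 0x0000: *leafp = NULL; return (true);
	case 0x0001: *leafp = "unclassified"; return (true);
	case 0x0002: *leafp = "microcode_rom_parity"; return (true);
	case 0x0003: *leafp = "external"; return (true);
	case 0x0004: *leafp = "frc"; return (true);
	case 0x0400: *leafp = "internal_timer"; return (true);
	default: break;
	}

	/* 0000 01xx xxxx xxxx, other than the exact Internal Timer value */
	if ((code & 0xfc00) == 0x0400) {
		*leafp = "internal_unclassified";
		return (true);
	}

	return (false);		/* not a simple error code */
}
```

A false return means the code should next be tested against the compound error forms.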

Ereport Payload For Simple Error Types

The simple error types all have the same ereport payload specification (in addition to the FMA protocol-required members such as "detector"). This is the common payload data, included by all ereports for generic x86 MCA:

Payload Member Name Data Type
IA32_MCG_STATUS UINT64
machine_check_in_progress BOOLEAN_VALUE
ip UINT64
privileged BOOLEAN_VALUE
bank_number UINT8
bank_msr_offset UINT64
IA32_MCi_STATUS UINT64
overflow BOOLEAN_VALUE
error_uncorrected BOOLEAN_VALUE
error_enabled BOOLEAN_VALUE
processor_context_corrupt BOOLEAN_VALUE
threshold_based_error_status STRING
error_code UINT16
model_specific_error_code UINT16
IA32_MCi_ADDR UINT64
IA32_MCi_MISC UINT64
Payload Members Common To All Ereports (Definitions)
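Several of the boolean and code payload members are plain decodes of the two status registers. The sketch below shows one way that decode might look; the structure and function names are hypothetical, and only members whose source bits are architectural are filled in (IA32_MCG_STATUS.MCIP is bit 2; in IA32_MCi_STATUS, OVER is bit 62, UC bit 61, EN bit 60, PCC bit 57, the model-specific error code occupies bits 31:16 and the MCA error code bits 15:0).

```c
#include <stdbool.h>
#include <stdint.h>

#define	MCG_STATUS_MCIP	(1ULL << 2)	/* machine check in progress */

#define	MCI_STATUS_OVER	(1ULL << 62)
#define	MCI_STATUS_UC	(1ULL << 61)
#define	MCI_STATUS_EN	(1ULL << 60)
#define	MCI_STATUS_PCC	(1ULL << 57)

/* Hypothetical container for the decoded payload members. */
typedef struct mca_payload {
	bool machine_check_in_progress;
	bool overflow;
	bool error_uncorrected;
	bool error_enabled;
	bool processor_context_corrupt;
	uint16_t error_code;
	uint16_t model_specific_error_code;
} mca_payload_t;

mca_payload_t
mca_decode_status(uint64_t mcg_status, uint64_t mci_status)
{
	mca_payload_t p;

	p.machine_check_in_progress = (mcg_status & MCG_STATUS_MCIP) != 0;
	p.overflow = (mci_status & MCI_STATUS_OVER) != 0;
	p.error_uncorrected = (mci_status & MCI_STATUS_UC) != 0;
	p.error_enabled = (mci_status & MCI_STATUS_EN) != 0;
	p.processor_context_corrupt = (mci_status & MCI_STATUS_PCC) != 0;
	p.error_code = (uint16_t)(mci_status & 0xffff);
	p.model_specific_error_code = (uint16_t)((mci_status >> 16) & 0xffff);

	return (p);
}
```

The raw IA32_MCG_STATUS and IA32_MCi_STATUS values are still carried in the payload alongside these members, so a diagnosis engine can recover any field not decoded here.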

Compound Error Codes

Compound errors are described in [Intel_vol3A] 14.7.2 and in the various AMD model-specific BKDG; the AMD documentation does not include the "Generic Memory Hierarchy" compound errors but if an AMD implementation reports an error that fits into that classification it should be processed as such and not treated as unknown.

Compound error types are recognised as follows:

Error Type                MCA Error Code Form  Error Nature                      Interpretation
Generic Memory Hierarchy  000F 0000 0000 11LL  Generic cpu cache memory error    -
TLB                       000F 0000 0001 TTLL  TLB tag/data array errors         {TT}TLB{LL}_ERR
Memory Hierarchy Errors   000F 0001 RRRR TTLL  CPU cache hierarchy errors        {TT}CACHE{LL}_{RRRR}_ERR
Bus and Interconnect      000F 1PPT RRRR IILL  External bus/interconnect errors  BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR
Compound Error Encoding

The F in the MCA error code form is usually 0, but on some Intel processor models a 1 indicates that filtering is active for this correctable error type - that some or all subsequent corrections for this error type in this bank will not be reported. This has no bearing on error classification. The F sub-field is not used on AMD so will always match 0.
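Recognition of the four compound forms reduces to a mask-and-compare on the 16-bit MCA error code with the F bit (bit 12) ignored. The following sketch shows one way to express that; the enum and function names are hypothetical, while the bit patterns are those of the compound error encoding table.

```c
#include <stdint.h>

typedef enum {
	MCA_COMPOUND_NONE,	/* matches none of the four forms */
	MCA_COMPOUND_GENCACHE,	/* 000F 0000 0000 11LL */
	MCA_COMPOUND_TLB,	/* 000F 0000 0001 TTLL */
	MCA_COMPOUND_MEMHIER,	/* 000F 0001 RRRR TTLL */
	MCA_COMPOUND_BUS	/* 000F 1PPT RRRR IILL */
} mca_compound_t;

#define	MCA_ERRCODE_F	0x1000	/* filtering indicator, not used to classify */

mca_compound_t
mca_compound_type(uint16_t code)
{
	code &= (uint16_t)~MCA_ERRCODE_F;	/* F has no bearing on the type */

	if ((code & 0xfffc) == 0x000c)
		return (MCA_COMPOUND_GENCACHE);
	if ((code & 0xfff0) == 0x0010)
		return (MCA_COMPOUND_TLB);
	if ((code & 0xff00) == 0x0100)
		return (MCA_COMPOUND_MEMHIER);
	if ((code & 0xf800) == 0x0800)
		return (MCA_COMPOUND_BUS);

	return (MCA_COMPOUND_NONE);
}
```

The four forms are mutually exclusive, so the order of the tests does not matter. A code that returns MCA_COMPOUND_NONE and is also not a simple error code falls outside the architectural classifications.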

The error code for each type also includes an indication of one or more of the cache level (LL), transaction type (TT), request type (RRRR), participating processor (PP), timeout or not (T), and whether this was a memory or IO access (II) as in the table that follows.

TT Sub-Field Encoding
Transaction Type Mnemonic Binary Encoding Ereport Class Component
Instruction I 00 "i"
Data D 01 "d"
Generic G 10 ""
LL Sub-Field Encoding
Cache Level Mnemonic Binary Encoding Ereport Class Component
Level 0 L0 00 "l0"
Level 1 L1 01 "l1"
Level 2 L2 10 "l2"
Generic LG 11 ""
RRRR Sub-Field Encoding
Request Type Mnemonic Binary Encoding Ereport Class Component
Generic Error ERR 0000 ""
Generic Read RD 0001 ""
Generic Write WR 0010 ""
Data Read DRD 0011 ""
Data Write DWR 0100 ""
Instruction Fetch IRD 0101 ""
Prefetch PREFETCH 0110 ""
Eviction EVICT 0111 ""
Snoop SNOOP 1000 ""
PP Sub-Field Encoding
Origin Mnemonic Binary Encoding Ereport Class Component
Local processor originated request SRC 00 ""
Local processor responded to request RES 01 ""
Local processor observed as 3rd party OBS 10 ""
Generic - 11 ""
T Sub-Field Encoding
Timeout Status Mnemonic Binary Encoding Ereport Class Component
Request timed out TIMEOUT 1 ""
Request did not time out NOTIMEOUT 0 ""
II Sub-Field Encoding
Access Type Mnemonic Binary Encoding Ereport Class Component
Memory Access M 00 "memory"
Reserved - 01 ""
I/O Access IO 10 "io"
Other transaction - 11 ""
Error Code Subfield Encoding for Compound Errors
The error "interpretation", per [Intel_vol3A], is formed by substituting the sub-field mnemonics into the interpretation string for the particular compound error type. For example an L1 cache error encountered during an instruction fetch would be interpreted as an ICACHEL1_IRD_ERR. We will term this the "expanded interpretation string".

Using the expanded interpretation strings in forming ereport classes would lead to a large number of ugly ereport classes, none of which would match existing conventions for other FMA portfolios. Instead we will compress the number of possible errors into fewer ereport subclasses using a format string and the ereport class components in the above table:

Compound Error Type       Interpretation String             Ereport Leaf Class Format String
Generic Memory Hierarchy  -                                 "<l0,l1,l2,>cache[_uc]"
TLB                       {TT}TLB{LL}_ERR                   "<l0,l1,l2,><i,d,>tlb[_uc]"
Memory Hierarchy          {TT}CACHE{LL}_{RRRR}_ERR          "<l0,l1,l2,><i,d,>cache[_uc]"
Bus and Interconnect      BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR  "bus_interconnect_<memory,io>[_uc]"
Compound Error Ereport Leaf Class Formats
For example, a memory hierarchy error with expanded interpretation string of ICACHEL1_IRD_ERR and for which MCi_STATUS does not indicate UC would have an ereport leaf class of "l1icache" which is more readable and still captures all we need for diagnosis purposes; if the error were uncorrected the leaf class would be "l1icache_uc". Note that LG in error interpretations contributes an empty string to the ereport class, since things like lgcache are not very descriptive; for models in which LG is also used to represent level-3 cache errors the model-specific plugin should provide more descriptive ereport classes. The ereport classes have been chosen to facilitate diagnosis of the major functional units. For completeness, the following table lists all compound ereport classes that can be formed:
Generic Memory Hierarchy
Expanded Compound Error Interpretation Ereport Classes
- ereport.cpu.generic-x86.l0cache[_uc]
ereport.cpu.generic-x86.l1cache[_uc]
ereport.cpu.generic-x86.l2cache[_uc]
ereport.cpu.generic-x86.cache[_uc]
TLB Errors
Expanded Compound Error Interpretation Ereport Classes
DTLBL0_ERR ereport.cpu.generic-x86.l0dtlb[_uc]
DTLBL1_ERR ereport.cpu.generic-x86.l1dtlb[_uc]
DTLBL2_ERR ereport.cpu.generic-x86.l2dtlb[_uc]
DTLBLG_ERR ereport.cpu.generic-x86.dtlb[_uc]
ITLBL0_ERR ereport.cpu.generic-x86.l0itlb[_uc]
ITLBL1_ERR ereport.cpu.generic-x86.l1itlb[_uc]
ITLBL2_ERR ereport.cpu.generic-x86.l2itlb[_uc]
ITLBLG_ERR ereport.cpu.generic-x86.itlb[_uc]
GTLBL0_ERR ereport.cpu.generic-x86.l0tlb[_uc]
GTLBL1_ERR ereport.cpu.generic-x86.l1tlb[_uc]
GTLBL2_ERR ereport.cpu.generic-x86.l2tlb[_uc]
GTLBLG_ERR ereport.cpu.generic-x86.tlb[_uc]
Memory Hierarchy Errors
Expanded Compound Error Interpretation Ereport Classes
DCACHEL0_{RRRR}_ERR ereport.cpu.generic-x86.l0dcache[_uc]
DCACHEL1_{RRRR}_ERR ereport.cpu.generic-x86.l1dcache[_uc]
DCACHEL2_{RRRR}_ERR ereport.cpu.generic-x86.l2dcache[_uc]
DCACHELG_{RRRR}_ERR ereport.cpu.generic-x86.dcache[_uc]
ICACHEL0_{RRRR}_ERR ereport.cpu.generic-x86.l0icache[_uc]
ICACHEL1_{RRRR}_ERR ereport.cpu.generic-x86.l1icache[_uc]
ICACHEL2_{RRRR}_ERR ereport.cpu.generic-x86.l2icache[_uc]
ICACHELG_{RRRR}_ERR ereport.cpu.generic-x86.icache[_uc]
GCACHEL0_{RRRR}_ERR ereport.cpu.generic-x86.l0cache[_uc]
GCACHEL1_{RRRR}_ERR ereport.cpu.generic-x86.l1cache[_uc]
GCACHEL2_{RRRR}_ERR ereport.cpu.generic-x86.l2cache[_uc]
GCACHELG_{RRRR}_ERR ereport.cpu.generic-x86.cache[_uc]
Bus and Interconnect Errors
Expanded Compound Error Interpretation Ereport Classes
BUS_L{0,1,2,G}_{SRC,RES,OBS}_{RRRR}_-_{TIMEOUT,NOTIMEOUT}_ERR ereport.cpu.generic-x86.bus_interconnect[_uc]
BUS_L{0,1,2,G}_{SRC,RES,OBS}_{RRRR}_M_{TIMEOUT,NOTIMEOUT}_ERR ereport.cpu.generic-x86.bus_interconnect_memory[_uc]
BUS_L{0,1,2,G}_{SRC,RES,OBS}_{RRRR}_IO_{TIMEOUT,NOTIMEOUT}_ERR ereport.cpu.generic-x86.bus_interconnect_io[_uc]
All Generic CPU Compound Ereport Classes
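The leaf class formats can be applied mechanically to an MCA error code, as in the sketch below. The function name mca_compound_leaf and its return convention are hypothetical. The class components come from the sub-field encoding tables; treating the undocumented TT encoding 11 as contributing an empty string is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Ereport class components indexed by sub-field value. */
static const char *ll_str[] = { "l0", "l1", "l2", "" };
static const char *tt_str[] = { "i", "d", "", "" };	/* 11: assumed empty */
static const char *ii_str[] = { "_memory", "", "_io", "" };

/*
 * Write the ereport leaf class for a compound error code into buf.
 * Returns 0 on success, -1 if the code is not a compound error code.
 */
int
mca_compound_leaf(uint16_t code, bool uc, char *buf, size_t len)
{
	const char *ll = ll_str[code & 0x3];
	const char *tt = tt_str[(code >> 2) & 0x3];
	const char *ii = ii_str[(code >> 2) & 0x3];	/* bus form only */
	const char *sfx = uc ? "_uc" : "";

	code &= 0xefff;		/* ignore the F (filtering) bit */

	if ((code & 0xfffc) == 0x000c)		/* generic memory hierarchy */
		(void) snprintf(buf, len, "%scache%s", ll, sfx);
	else if ((code & 0xfff0) == 0x0010)	/* TLB */
		(void) snprintf(buf, len, "%s%stlb%s", ll, tt, sfx);
	else if ((code & 0xff00) == 0x0100)	/* memory hierarchy */
		(void) snprintf(buf, len, "%s%scache%s", ll, tt, sfx);
	else if ((code & 0xf800) == 0x0800)	/* bus and interconnect */
		(void) snprintf(buf, len, "bus_interconnect%s%s", ii, sfx);
	else
		return (-1);

	return (0);
}
```

With the ICACHEL1_IRD_ERR example (error code 0x0151) this produces "l1icache", or "l1icache_uc" when uc is true. The full class is then the ereport.cpu.generic-x86 prefix followed by the leaf.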

Ereport Payload For Compound Error Types

All compound error ereports include the common ereport payload information that the simple error classes do. For TLB, Memory Hierarchy and Bus/Interconnect errors the following should also be included:

Payload Member Name Data Type Description
compound_errorname STRING Expanded interpretation string for this compound error
Additional Payload Members For Compound Errors (Definitions)

Unknown Error Codes

If the error code does not exactly match one of the documented simple error codes and does not match any of the four compound error code forms, then it falls outside of the MCA architectural classifications and should be logged with ereport subclass unknown and include the common ereport payload information.

AMD Vendor-Common Error Classification

A "vendor-common" (or "vendor-generic") model-specific plugin should seek to implement model-specific support that is common to a large number of models from that vendor. It will form the baseline for those models where no more-specific model support is available.

A vendor-generic model-specific plugin may also choose to implement full model-specific details for particular cpu models, in addition to its vendor-generic duties. This is an alternative to delivering such model-specific support via further modules whose pathnames specify one or more of family, model and stepping. In the AMD case it is suggested that the vendor-generic module remain truly vendor-generic and that the existing AMD family 0xf cpu module support be recast as a model-specific plugin that loads instead of the generic AMD plugin.

Main Memory ECC Errors

For all of AMD family 0xf (K8, revisions B through G) and for at least the initial family 0x10 revisions (and very likely thereafter) we can recognise main memory ECC errors as:

  • detected by the on-chip NorthBridge MCA bank (bank 4),
  • having Bus/Interconnect compound error type,
  • having LL of LG (generic),
  • having II of M (memory),
  • with the CECC or UECC bit set in the bank status register.
Furthermore, we can distinguish ChipKill vs non-ChipKill ECC code events by the "extended error code" which is a sub-field of the model-specific error code on these models: an extended error code of 0 indicates regular 64/8 ECC, while a nonzero extended error code indicates a 128/16 ChipKill ECC (possibly still a single-bit event, however). The following table indicates the leaf ereport class to use for each possible error event type:
Error Type Ereport Class
Correctable (single-bit) 64/8 ECC error ereport.cpu.generic-x86.mem_ce
Uncorrectable 64/8 ECC error ereport.cpu.generic-x86.mem_ue
Correctable 128/16 ChipKill ECC error ereport.cpu.generic-x86.mem_ce
Uncorrectable (multi symbol) 128/16 ECC error ereport.cpu.generic-x86.mem_ue
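
A minimal sketch of the recognition rules above follows. It is not the plugin's actual code. The function and type names are hypothetical, and the bit positions used for CECC (bit 46), UECC (bit 45) and the extended error code (bits 19:16) are assumptions taken from a reading of [AMD_BKDG_K8F]; they should be checked against the BKDG for the model in question. The LL and II field encodings are those of the architectural Bus/Interconnect compound form.

```c
#include <stdbool.h>
#include <stdint.h>

#define AMD_NB_BANK            4u              /* on-chip NorthBridge MCA bank */
#define AMD_NB_STATUS_CECC     (1ULL << 46)    /* assumed bit position */
#define AMD_NB_STATUS_UECC     (1ULL << 45)    /* assumed bit position */
#define AMD_NB_EXT_ERRCODE(s)  (((s) >> 16) & 0xFu)  /* assumed field, bits 19:16 */

#define MCA_BUS_MASK   0xF800u   /* 0000 1PPT RRRR IILL */
#define MCA_BUS_MATCH  0x0800u
#define MCA_LL(code)   ((code) & 0x3u)
#define MCA_LL_LG      0x3u      /* generic level */
#define MCA_II(code)   (((code) >> 2) & 0x3u)
#define MCA_II_MEM     0x0u      /* memory access */

typedef struct {
    bool uncorrectable;          /* true selects the mem_ue leaf class */
    const char *ereport_class;
    const char *syndrome_type;   /* "E" for 64/8 ECC, "C" for 128/16 ChipKill */
} amd_mem_ecc_t;

/* Returns true and fills *out if (bank, status) describes a main memory ECC error. */
static bool amd_classify_mem_ecc(unsigned bank, uint64_t status, amd_mem_ecc_t *out)
{
    uint16_t code = (uint16_t)(status & 0xFFFFu);

    if (bank != AMD_NB_BANK)
        return false;
    if ((code & MCA_BUS_MASK) != MCA_BUS_MATCH)
        return false;
    if (MCA_LL(code) != MCA_LL_LG || MCA_II(code) != MCA_II_MEM)
        return false;
    if ((status & (AMD_NB_STATUS_CECC | AMD_NB_STATUS_UECC)) == 0)
        return false;

    /* If both bits are set, report the more severe (uncorrectable) class. */
    out->uncorrectable = (status & AMD_NB_STATUS_UECC) != 0;
    out->ereport_class = out->uncorrectable ?
        "ereport.cpu.generic-x86.mem_ue" : "ereport.cpu.generic-x86.mem_ce";
    out->syndrome_type = (AMD_NB_EXT_ERRCODE(status) == 0) ? "E" : "C";
    return true;
}
```

Note that the ereport class depends only on correctable vs uncorrectable; the 64/8 vs ChipKill distinction is carried in the syndrome-type payload member.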

Ereport Payload For AMD Memory ECC Errors

Payload Member Name | Data Type | Description
syndrome | UINT16 | The ECC error syndrome
syndrome-type | STRING | "E" for 64/8 ECC, "C" for 128/16 ChipKill ECC
resource | NVLIST_ARRAY | An array of FMRIs identifying the node, dimm, rank (or perhaps node, channel and chip-select) and perhaps even more fine-grained resolution such as row, column and internal bank numbers. This member can only be included if some generic address-to-resource mechanism is available, or if a full-featured memory-controller driver is present. Ideally this is a one-member array, but where the error address and syndrome do not isolate the error to a single rank or where a full-featured memory-controller driver is unavailable there may be multiple entries indicating possible locations.
resource_counts | UINT8_ARRAY | If the resource FMRIs are filled using a generic means such as the NorthBridge ECC channel/chip-select error counters in the Online Spare Control Register available in K8 revision F and later and K9 (see below) then these counts reflect the ECC count for the corresponding resource array entry.
Additional Payload Members of AMD Memory ECC Ereports (Definitions)

Address-To-Resource Resolution

Resolving a memory error address to a (node, chip-select) involves understanding the structure of various memory-controller registers for the particular cpu model and the details of memory interleaving etc; these can vary within a cpu family. In Solaris such knowledge is captured in the model-specific memory-controller driver mc-amd whose main purpose is memory topology discovery, address-to-resource translation, and resource-to-address translation.

A full-featured memory-controller driver is part of any full model-specific FMA implementation. Thus for as long as a particular model is being supported by the generic cpu module with AMD vendor-common plugin, there is likely no full-featured memory-controller driver present.

For AMD family 0xf revision F and later the Online Spare Control Register exposes 4-bit counts of the ECC errors experienced by each (channel, chip-select) combination. When an ECC error is observed, usually during a poll, we can check this register to see which combination(s) experienced ECC errors during the poll interval just passed; if we zero the counts we can see at the next memory error event which combinations have ticked on again. This therefore provides a generic, if slightly coarse, mechanism for deciding which channel and chip-select contributed an error if we do not have full translation facilities available.

The resource FMRI array and corresponding resource_counts reflect the above information. Diagnosis software must decide how it will handle the case where multiple combinations are identified because each has counted one or more ECC errors during the poll interval.
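
The read-and-zero scheme can be sketched as follows, assuming the per-combination counts have already been read from the Online Spare Control Register into an array. The register access itself is not shown, and the array dimensions (2 channels, 8 chip-selects), the type and the function name are all hypothetical choices for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define ECC_NCHAN 2   /* assumed number of DRAM channels */
#define ECC_NCS   8   /* assumed number of chip-selects */

typedef struct {
    uint8_t channel;
    uint8_t chip_select;
} mem_resource_t;

/*
 * Scan a snapshot of the 4-bit ECC counters. Every (channel, chip-select)
 * with a nonzero count becomes one entry of res[] with its count in
 * res_counts[], and its counter is zeroed so that the next error event
 * shows only combinations that have ticked on again. Returns the number
 * of entries written (at most max).
 */
static size_t collect_ecc_resources(uint8_t counts[ECC_NCHAN][ECC_NCS],
    mem_resource_t *res, uint8_t *res_counts, size_t max)
{
    size_t n = 0;
    unsigned chan, cs;

    for (chan = 0; chan < ECC_NCHAN; chan++) {
        for (cs = 0; cs < ECC_NCS; cs++) {
            uint8_t c = counts[chan][cs] & 0xFu;

            if (c == 0)
                continue;
            if (n == max)
                return n;        /* leave remaining counters untouched */
            res[n].channel = (uint8_t)chan;
            res[n].chip_select = (uint8_t)cs;
            res_counts[n] = c;
            counts[chan][cs] = 0;
            n++;
        }
    }
    return n;
}
```

A return value greater than one corresponds to the ambiguous case described above, where diagnosis software must decide how to treat several candidate locations.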

Intel Vendor-Common Error Classification

The Intel model-specific support will layer on top of cpu.generic and will be augmented by one or more memory-controller drivers for the off-chip memory-controller hub (MCH). A machine-check exception can be raised by the MCH, but the associated telemetry is not in an MCA bank that can be read by cpu.generic; it is instead available via PCI accesses that will be made by the memory-controller driver. The generic #MC handler must, therefore, call out to the memory-controller driver to allow it to reap error telemetry.

For those cpu-experienced errors that do fall within the MCA banks all handling and classification will be performed by cpu.generic - Intel does not publish more-detailed error classification information. The sole contribution from the Intel-generic module in terms of classification will be to provide a new ereport subclass such that ereports are logged in subclass ereport.cpu.intel. The memory-controller driver will be responsible for raising ereports for telemetry it reads.

Payload Member Definitions

IA32_MCG_STATUS (UINT64)
The IA32_MCG_STATUS register value at the time the error event telemetry was captured (during #MC trap handling or during poll).
machine_check_in_progress (BOOLEAN_VALUE)
Present whenever IA32_MCG_STATUS is, with value that of IA32_MCG_STATUS.MCIP. That bit indicates that a machine check is in-progress, so this member should have value 1 for those errors that are handled via a machine check exception and is expected to be 0 for errors discovered via poll.
ip (UINT64)
Only included if a machine-check is in-progress and IA32_MCG_STATUS.EIPV is set, indicating that the instruction pointer pushed onto the stack when the #MC occurred is directly associated with the error.
privileged (BOOLEAN_VALUE)
Only included if a machine-check is in-progress, and indicates whether the interrupted code was privileged kernel code (value 1) or userland code (value 0). This only indicates the nature of the code that was interrupted by the #MC and does not necessarily indicate who suffered the error.
bank_number (UINT8)
The MCA bank number from which the error telemetry was read. This is the Nth error reporting register bank, defined by a group of four control/status/addr/misc registers as follows:
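  [IA32_]MC?_{CTL,STATUS,ADDR,MISC} Naming (AMD: K7, K8, K9; Intel: Core, Pentium 4 & Xeon, Core Solo/Duo, Pentium M, P6)

Bank Number | Control/Status/Address/Misc MSR | AMD K7 | AMD K8 | AMD K9 | Intel Core | Intel Pentium 4 & Xeon | Intel Core Solo/Duo | Intel Pentium M | Intel P6
0 | MSRs 0x400, 0x401, 0x402, 0x403 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | MSRs 0x404, 0x405, 0x406, 0x407 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
2 | MSRs 0x408, 0x409, 0x40a, 0x40b | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
3 | MSRs 0x40c, 0x40d, 0x40e, 0x40f | 3 | 3 | 3 | 4 | 3 | 4 | 4 | 4
4 | MSRs 0x410, 0x411, 0x412, 0x413 | - | 4 | 4 | 3 | 4 | 3 | 3 | 3
5 | MSRs 0x414, 0x415, 0x416, 0x417 | - | - | 5 | 5 | - | 5 | - | -
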
Not all of the banks will include a MISC register, but that does not affect our bank numbering. The IA32_MCi_STATUS.MISCV bit should be checked to see whether MISC should be read. Yes, newer Intel microarchitectures name consecutive error detector banks as IA32_MC{0,1,2,4,3}, which is not at all confusing! Since we classify errors without knowledge of which bank is which (e.g., Dcache vs Icache is unspecified in the MCA) this does not affect us.
bank_msr_offset (UINT64)
This is the MSR offset of the IA32_MCi_CTL register for the bank (0x400, 0x404, 0x408, ...). This is included in an attempt to disambiguate the bank number, given the out-of-order naming from some Intel models.
IA32_MCi_STATUS (UINT64)
The bank status register raw value. We decode some pertinent components in other payload members:
overflow (BOOLEAN_VALUE)
Indicates the value of IA32_MCi_STATUS.OVER. If this is 1 then a machine-check error occurred while the valid bit of the status register was already set. In general, enabled errors (i.e., those enabled for #MC in the control register for the bank) overwrite disabled errors, and uncorrectable errors overwrite correctable errors. The bank telemetry is always that of the higher-priority error - there is never any mixing of the two errors.
error_uncorrected (BOOLEAN_VALUE)
Indicates the value of IA32_MCi_STATUS.UC. A value of 1 means that the processor was unable to correct the observed error.
error_enabled (BOOLEAN_VALUE)
The value of IA32_MCi_STATUS.EN, which reflects whether this error type was enabled for #MC in the bank control register.
processor_context_corrupt (BOOLEAN_VALUE)
The value of IA32_MCi_STATUS.PCC; a value of 1 indicates that the current processor context may have been corrupted as a result of this error.
threshold_based_error_status (STRING)
Included if IA32_MCG_CAP.MCG_TES_P indicates that bits 56:53 are to be considered architectural, and that 54:53 indicate the threshold-based error status. This will be one of the following four strings: "No tracking", "Green - Below threshold", "Yellow - Above threshold", "Reserved". AMD does not implement this thresholding feature, and IA32_MCG_CAP.MCG_TES_P will always read as 0 on AMD implementations so bits 56:53 remain model-specific on AMD.
error_code (UINT16)
The MCA error code, bits 15:0 of IA32_MCi_STATUS.
model_specific_error_code (UINT16)
The model-specific error code, bits 31:16 of IA32_MCi_STATUS.
IA32_MCi_ADDR (UINT64)
The value read from IA32_MCi_ADDR. Only included if IA32_MCi_STATUS.ADDRV is set (address valid).
IA32_MCi_MISC (UINT64)
The value read from IA32_MCi_MISC. Only included if IA32_MCi_STATUS.MISCV is set (MISC valid).
compound_errorname (STRING)
The expanded compound error interpretation string for this error; only included for TLB, Memory Hierarchy and Bus/Interconnect compound error ereports.
syndrome (UINT16)
The ECC error syndrome
syndrome-type (STRING)
"E" for regular 64/8 ECC, "C" for 128/16 ChipKill ECC
resource (NVLIST_ARRAY)
An array of FMRIs indicating the resource or resources that are the source of the memory error. This should be provided in "hc" scheme and the structure should match the diagnosis topology as per fmtopo. This member is only present if the error address can be resolved to a resource.
resource_counts (UINT8_ARRAY)
The number of error observations associated with each entry of the resource array.
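
As an illustration of how the architectural payload members above derive from the raw register values, here is a minimal decoding sketch. The struct, function names and the tes_p parameter (standing for IA32_MCG_CAP.MCG_TES_P) are hypothetical; the bit positions are the architectural ones from [Intel_vol3A]. The privileged, ip value, syndrome and resource members need information beyond these registers and are not covered.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IA32_MCG_STATUS bits. */
#define MCG_STATUS_EIPV   (1ULL << 1)
#define MCG_STATUS_MCIP   (1ULL << 2)

/* IA32_MCi_STATUS bits. */
#define MCI_STATUS_OVER   (1ULL << 62)
#define MCI_STATUS_UC     (1ULL << 61)
#define MCI_STATUS_EN     (1ULL << 60)
#define MCI_STATUS_MISCV  (1ULL << 59)
#define MCI_STATUS_ADDRV  (1ULL << 58)
#define MCI_STATUS_PCC    (1ULL << 57)

typedef struct {
    uint64_t bank_msr_offset;            /* MSR of IA32_MCi_CTL for the bank */
    bool machine_check_in_progress;
    bool ip_valid;                       /* include "ip" only when true */
    bool overflow;
    bool error_uncorrected;
    bool error_enabled;
    bool processor_context_corrupt;
    bool addr_valid;                     /* include IA32_MCi_ADDR only when true */
    bool misc_valid;                     /* include IA32_MCi_MISC only when true */
    const char *threshold_based_error_status;  /* NULL if not architectural */
    uint16_t error_code;
    uint16_t model_specific_error_code;
} mca_payload_t;

/* Each bank occupies four consecutive MSRs: CTL, STATUS, ADDR, MISC. */
static uint64_t mca_bank_ctl_msr(unsigned bank)
{
    return 0x400u + 4u * bank;
}

static mca_payload_t mca_decode(unsigned bank, uint64_t mcg_status,
    uint64_t status, bool tes_p)
{
    static const char *const tes[] = {
        "No tracking", "Green - Below threshold",
        "Yellow - Above threshold", "Reserved"
    };
    mca_payload_t p;

    p.bank_msr_offset = mca_bank_ctl_msr(bank);
    p.machine_check_in_progress = (mcg_status & MCG_STATUS_MCIP) != 0;
    p.ip_valid = p.machine_check_in_progress &&
        (mcg_status & MCG_STATUS_EIPV) != 0;
    p.overflow = (status & MCI_STATUS_OVER) != 0;
    p.error_uncorrected = (status & MCI_STATUS_UC) != 0;
    p.error_enabled = (status & MCI_STATUS_EN) != 0;
    p.processor_context_corrupt = (status & MCI_STATUS_PCC) != 0;
    p.addr_valid = (status & MCI_STATUS_ADDRV) != 0;
    p.misc_valid = (status & MCI_STATUS_MISCV) != 0;
    /* Bits 54:53 are only meaningful when MCG_TES_P is set. */
    p.threshold_based_error_status = tes_p ? tes[(status >> 53) & 0x3u] : NULL;
    p.error_code = (uint16_t)(status & 0xFFFFu);
    p.model_specific_error_code = (uint16_t)((status >> 16) & 0xFFFFu);
    return p;
}
```

For bank 4 the sketch yields a bank_msr_offset of 0x410, matching the table of bank MSRs above regardless of how a given Intel model names that bank.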

References

  • [Intel_vol3A] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, May 2007; Order Number: 253668-023US
  • [AMD_vol2] AMD64 Architecture Programmer's Manual, Volume 2: System Programming, July 2007; Publication Number: 24593
  • [AMD_BKDG_K8F] BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh Processors, December 2006; Publication Number: 32559