1、 Error Classification
Error Classification
|
Description
|
||
Correctable Errors
|
Correctable errors include those error conditions where
hardware can recover without any loss of
information. Hardware corrects these errors and software intervention is not required. For
example, an LCRC error in a TLP that might be corrected by Data Link Level Retry is considered a
correctable error. Measuring the frequency of Link-level correctable errors may be helpful for
profiling the integrity of a Link.
Correctable errors also include
transaction-level cases where one agent detects an error with a TLP,
but another agent is responsible for taking any recovery action if needed, such as re-attempting the
operation with a separate subsequent transaction. The detecting agent can be configured to report
the error as being correctable since the recovery agent may be able to correct it. If recovery action is
indeed needed, the recovery agent must report the error as uncorrectable if the recovery agent
decides not to attempt recovery.
|
Recoverable by hardware;
recoverable at the transaction level.
|
Bad TLP
Bad DLLP
Replay Timer Timeout
REPLAY_NUM Rollover
|
Fatal Errors
|
Uncorrectable errors are those error conditions that impact functionality of the interface. There is
no mechanism defined in this specification to correct these errors. Reporting an uncorrectable error
is analogous to asserting SERR# in PCI/PCI-X. For more robust error handling by the system, this
specification further classifies uncorrectable errors as Fatal and Non-fatal.
Fatal errors are uncorrectable error conditions which render the particular Link and related hardware
unreliable. For Fatal errors, a reset of the components on the Link may be required to return to
reliable operation. Platform handling of Fatal errors, and any efforts to limit the effects of these
errors, is platform implementation specific.
|
The particular Link and related hardware are unreliable.
|
Data Link Protocol Error
Surprise Down
Receiver Overflow
Flow Control Protocol Error
Malformed TLP
TLP Prefix Blocked
|
Non-Fatal Errors
|
Non-fatal errors are uncorrectable errors which cause a particular transaction to be unreliable but
the Link is otherwise fully functional. Isolating Non-fatal from Fatal errors provides
Requester/Receiver logic in a device or system management software the opportunity to recover
from the error without resetting the components on the Link and disturbing other transactions in
progress. Devices not associated with the transaction in error are not impacted by the error.
|
The particular transaction is unreliable; the Link remains fully functional.
|
Poisoned TLP Received
ECRC Check Failed
Unsupported Request (UR)
Completion Timeout
Completer Abort
Unexpected Completion
ACS Violation
MC Blocked TLP
AtomicOp Egress Blocked
|
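The classification above can be sketched as a lookup from error name to the Message a device would send for it. This is only an illustration: the table holds a subset of the errors listed in this section, grouped exactly as in these notes, and the names `Severity`, `DEFAULT_SEVERITY`, and `message_for` are hypothetical. Real devices may let software reprogram the severity of uncorrectable errors, which this sketch ignores.

```python
from enum import Enum


class Severity(Enum):
    """Error classes from the table above, mapped to their error Message names."""
    CORRECTABLE = "ERR_COR"
    NON_FATAL = "ERR_NONFATAL"
    FATAL = "ERR_FATAL"


# Subset of the errors listed above, grouped as in these notes (illustrative only).
DEFAULT_SEVERITY = {
    "Bad TLP": Severity.CORRECTABLE,
    "Bad DLLP": Severity.CORRECTABLE,
    "REPLAY_NUM Rollover": Severity.CORRECTABLE,
    "Data Link Protocol Error": Severity.FATAL,
    "Surprise Down": Severity.FATAL,
    "Receiver Overflow": Severity.FATAL,
    "Flow Control Protocol Error": Severity.FATAL,
    "Poisoned TLP Received": Severity.NON_FATAL,
    "ECRC Check Failed": Severity.NON_FATAL,
    "Completion Timeout": Severity.NON_FATAL,
    "Completer Abort": Severity.NON_FATAL,
    "Unexpected Completion": Severity.NON_FATAL,
}


def message_for(error_name: str) -> str:
    """Return the error Message name for a listed error (KeyError if unlisted)."""
    return DEFAULT_SEVERITY[error_name].value
```

For example, `message_for("Surprise Down")` returns `"ERR_FATAL"`, matching the row that marks the Link itself as unreliable.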
2、 Error Signaling
Completion Status
|
The Completion Status field (when status is not Successful Completion) in the Completion header
indicates that the associated Request failed (see Section 2.2.9). This is one method of error reporting
which enables the Requester to associate an error with a specific Request. In other words, since
Non-Posted Requests are not considered complete until after the Completion returns, the
Completion Status field gives the Requester an opportunity to “fix” the problem at some higher
level protocol (outside the scope of this specification). For example, if a Read is issued to
prefetchable Memory Space and the Completion returns with an Unsupported Request Completion
Status, the Requester would not be in violation of this specification if it chose to reissue the Read
Request. Note that from a PCI Express point of view, the reissued Read Request is a distinct
Request, and there is no relationship (on PCI Express) between the initial Request and the reissued
Request.
|
|
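A minimal sketch of how a Requester might act on the Completion Status field described above. The 3-bit encodings are the standard PCI Express values; `may_reissue_read` is a hypothetical policy that only mirrors the prefetchable-read example in the text and is not something the specification mandates.

```python
# 3-bit Completion Status encodings from the Completion header.
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",
}


def decode_completion_status(status: int) -> str:
    """Name the Completion Status; unlisted encodings are reserved."""
    return COMPLETION_STATUS.get(status & 0b111, "Reserved")


def may_reissue_read(status: int, prefetchable: bool) -> bool:
    """Hypothetical policy: retry a Read to prefetchable memory that returned UR.

    The retry is a brand-new Request; PCI Express sees no link between
    it and the original one.
    """
    return prefetchable and (status & 0b111) == 0b001
```

Restricting the retry to prefetchable space follows the example in the text, where re-reading has no side effects.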
Error Messages
|
Error Messages are sent to the Root Complex for reporting the detection of errors according to the
severity of the error.
Error messages that originate from PCI Express or Legacy Endpoints are sent to corresponding
Root Ports. Errors that originate from a Root Port itself are reported through the same Root Port. If a Root Complex Event Collector is implemented, errors that originate from a Root Complex
Integrated Endpoint may optionally be sent to the corresponding Root Complex Event Collector.
Errors that originate from a Root Complex Integrated Endpoint are reported in a Root Complex
Event Collector residing on the same Logical Bus as the Root Complex Integrated Endpoint. The
Root Complex Event Collector must explicitly declare supported Root Complex Integrated
Endpoints as part of its capabilities; each Root Complex Integrated Endpoint must be associated
with no more than one Root Complex Event Collector.
When multiple errors of the same severity are detected, the corresponding error Messages with the
same Requester ID may be merged for different errors of the same severity. At least one error
Message must be sent for detected errors of each severity level. Note, however, that the detection
of a given error in some cases will preclude the reporting of certain errors. Refer to
Section 6.2.3.2.3. Also note special rules in Section 6.2.4 regarding non-Function-specific errors in
multi-Function devices.
|
|
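The merging rule above (Messages with the same Requester ID and the same severity may be merged, but at least one Message goes out per severity level) can be illustrated with a small de-duplication sketch. The function name and the `(requester_id, severity)` tuple layout are assumptions made for illustration. Merging is optional in the specification, and this sketch simply merges as much as the rule allows.

```python
from typing import Iterable, List, Tuple

ErrorEvent = Tuple[int, str]  # (requester_id, severity), e.g. (0x0100, "ERR_COR")


def merge_error_messages(detected: Iterable[ErrorEvent]) -> List[ErrorEvent]:
    """Collapse detected errors into one Message per (Requester ID, severity).

    First-seen order is preserved, and every severity that was detected
    still produces at least one Message.
    """
    seen = set()
    messages = []
    for event in detected:
        if event not in seen:
            seen.add(event)
            messages.append(event)
    return messages
```

Two correctable errors from Requester ID 0x0100 would yield a single `ERR_COR`, while a fatal error from the same Requester still produces its own `ERR_FATAL`.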
Error Forwarding (Data Poisoning)
|
Error Forwarding, also known as data poisoning, is indicated by setting the EP bit in a TLP. Refer
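to Section 2.7.2. This is another method of error reporting in PCI Express that enables the Receiver
of a TLP to associate an error with a specific Request or Completion. In contrast to the Completion Status
mechanism, Error Forwarding can be used with either Requests or Completions that contain data.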
In addition, “intermediate” Receivers along the TLP’s route, not just the Receiver at the ultimate
destination, are required to detect and report (if enabled) receiving the poisoned TLP. This can help
software determine if a particular Switch along the path poisoned the TLP.
|
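A toy model of why reports from intermediate Receivers help locate where a TLP was poisoned. It assumes a linear path, that every Receiver downstream of the poisoning point has reporting enabled, and that the hop names are arbitrary labels. Both function names are hypothetical.

```python
from typing import List, Optional


def poisoned_tlp_reporters(path: List[str], poisoned_by: str) -> List[str]:
    """Receivers that see the EP bit set: every hop after the one that set it."""
    return path[path.index(poisoned_by) + 1:]


def likely_poisoner(path: List[str], reporters: List[str]) -> Optional[str]:
    """Infer the poisoning hop as the one just before the earliest reporter."""
    if not reporters:
        return None
    first = min(path.index(name) for name in reporters)
    return path[first - 1] if first > 0 else None
```

With `path = ["Endpoint", "SwitchA", "SwitchB", "RootPort"]` and the TLP poisoned by `"SwitchA"`, only `"SwitchB"` and `"RootPort"` report Poisoned TLP Received, which points back at `"SwitchA"`. If a hop has reporting disabled, the inference can be off by one or more hops.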
3、 Error Logging
- Device Status
- Advanced Error Reporting Capability
- PCI compatible (Type 00h and 01h) configuration registers
PCI-Compatible Configuration Registers
|
Command Register.SERR# Enable
|
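Enables reporting of Non-Fatal and Fatal errors (either through this bit or through the corresponding bits in the Device Control register); controls whether error Messages are sent.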
|
|
PCI Express Capability Structure
|
Device Control Register.
Correctable Error Reporting Enable
/Non-Fatal Error Reporting Enable
/Fatal Error Reporting Enable
/Unsupported Request Reporting Enable
|
Enables error reporting; controls whether error Messages are sent.
|
|
Root Control Register.
System Error on Correctable Error Enable
System Error on Non-Fatal Error Enable
System Error on Fatal Error Enable
|
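If Set, each of these bits indicates that a System Error should be generated when an error of the corresponding severity (Correctable, Non-Fatal, or Fatal) is reported by any of the devices in the hierarchy associated with
this Root Port, or by the Root Port itself. The mechanism for signaling a System Error to the system is system specific.
1、In OS native mode, PCIe AER errors are reported through an MSI triggered by the corresponding Root Port, so the Root Control register is not needed.
2、In firmware-first mode, Root Control must be enabled so that the error can propagate to a global AER module, which then raises an SMI to notify the BIOS. The error pin is also driven by the global AER module, so in the typical case it is firmware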
|
||
Advanced Error Reporting Capability
|
Root Error Command Register.
Correctable Error Reporting Enable
/Non-Fatal Error Reporting Enable
/Fatal Error Reporting Enable
|
Enables interrupt generation for AER error reporting.
|
|
Type 1 Configuration Space Header
|
Bridge Control Register.
SERR# Enable
|
This bit controls forwarding of ERR_COR, ERR_NONFATAL and ERR_FATAL from secondary to primary.
|
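The enable bits in the Command and Device Control registers can be combined into a simple check of which error Messages a Function is allowed to send. The bit positions are the standard ones for those registers. The function name is hypothetical, and the sketch deliberately ignores AER masks, severity programming, and the Unsupported Request Reporting Enable bit.

```python
# Standard bit positions in the PCI Command and PCIe Device Control registers.
CMD_SERR_ENABLE = 1 << 8   # Command Register: SERR# Enable
DEVCTL_CERE = 1 << 0       # Correctable Error Reporting Enable
DEVCTL_NFERE = 1 << 1      # Non-Fatal Error Reporting Enable
DEVCTL_FERE = 1 << 2       # Fatal Error Reporting Enable


def enabled_error_messages(command: int, devctl: int) -> set:
    """Return the error Messages this Function may send, given its enables."""
    serr = bool(command & CMD_SERR_ENABLE)
    messages = set()
    if devctl & DEVCTL_CERE:
        messages.add("ERR_COR")
    # Non-Fatal and Fatal reporting is enabled by SERR# Enable OR the
    # matching Device Control bit.
    if serr or (devctl & DEVCTL_NFERE):
        messages.add("ERR_NONFATAL")
    if serr or (devctl & DEVCTL_FERE):
        messages.add("ERR_FATAL")
    return messages
```

Setting only SERR# Enable (`command = 0x0100`, `devctl = 0`) enables `ERR_NONFATAL` and `ERR_FATAL` but not `ERR_COR`, which has no PCI-compatible enable.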