PCIe Error System

1、 Error Classification
Error Classification
Description
Correctable Errors
Correctable errors include those error conditions where hardware can recover without any loss of
information. Hardware corrects these errors and software intervention is not required. For
example, an LCRC error in a TLP that might be corrected by Data Link Level Retry is considered a
correctable error. Measuring the frequency of Link-level correctable errors may be helpful for
profiling the integrity of a Link.
Correctable errors also include transaction-level cases where one agent detects an error with a TLP,
but another agent is responsible for taking any recovery action if needed, such as re-attempting the
operation with a separate subsequent transaction. The detecting agent can be configured to report
the error as being correctable since the recovery agent may be able to correct it. If recovery action is
indeed needed, the recovery agent must report the error as uncorrectable if the recovery agent
decides not to attempt recovery.
Recoverable by hardware
Recoverable at the transaction level
Bad TLP
Bad DLLP
Replay Timer
REPLAY_NUM Rollover
Fatal Errors
Uncorrectable errors are those error conditions that impact functionality of the interface. There is
no mechanism defined in this specification to correct these errors. Reporting an uncorrectable error
is analogous to asserting SERR# in PCI/PCI-X. For more robust error handling by the system, this
specification further classifies uncorrectable errors as Fatal and Non-fatal.
Fatal errors are uncorrectable error conditions which render the particular Link and related hardware
unreliable. For Fatal errors, a reset of the components on the Link may be required to return to
reliable operation. Platform handling of Fatal errors, and any efforts to limit the effects of these
errors, is platform implementation specific.
The particular Link and related hardware are unreliable
Data Link Protocol Error
Surprise Down
Receiver Overflow
Flow Control Protocol Error
Malformed TLP
TLP Prefix Blocked
Non-Fatal Errors
Non-fatal errors are uncorrectable errors which cause a particular transaction to be unreliable but
the Link is otherwise fully functional. Isolating Non-fatal from Fatal errors provides
Requester/Receiver logic in a device or system management software the opportunity to recover
from the error without resetting the components on the Link and disturbing other transactions in
progress. Devices not associated with the transaction in error are not impacted by the error.
A particular transaction is unreliable; the Link remains fully functional
Poisoned TLP Received
ECRC Check Failed
Unsupported Request (UR)
Completion Timeout
Completer Abort
Unexpected Completion
ACS Violation
MC Blocked TLP
AtomicOp Egress Blocked
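The three classes above can be sketched as a lookup from error name to the Message that reports it. This is a minimal illustration, not a complete specification table: it covers only a subset of the errors listed in this section, the names `Severity`, `ERROR_CLASS`, and `error_message_for` are hypothetical, and real devices can reprogram the severity of uncorrectable errors.

```python
from enum import Enum


class Severity(Enum):
    """Error classes, valued with the Message that reports each one."""
    CORRECTABLE = "ERR_COR"
    NONFATAL = "ERR_NONFATAL"
    FATAL = "ERR_FATAL"


# Subset of the lists in this section (illustrative, not exhaustive).
ERROR_CLASS = {
    "Bad TLP": Severity.CORRECTABLE,
    "Bad DLLP": Severity.CORRECTABLE,
    "REPLAY_NUM Rollover": Severity.CORRECTABLE,
    "Data Link Protocol Error": Severity.FATAL,
    "Surprise Down": Severity.FATAL,
    "Receiver Overflow": Severity.FATAL,
    "Flow Control Protocol Error": Severity.FATAL,
    "Malformed TLP": Severity.FATAL,
    "Poisoned TLP Received": Severity.NONFATAL,
    "ECRC Check Failed": Severity.NONFATAL,
    "Unsupported Request (UR)": Severity.NONFATAL,
    "Completion Timeout": Severity.NONFATAL,
    "Completer Abort": Severity.NONFATAL,
    "Unexpected Completion": Severity.NONFATAL,
}


def error_message_for(error_name: str) -> str:
    """Return the Message name used to report the given error."""
    return ERROR_CLASS[error_name].value
```

For example, `error_message_for("Surprise Down")` looks up the Fatal class and returns its Message name.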
2、 Error Signaling
Completion Status
The Completion Status field (when status is not Successful Completion) in the Completion header
indicates that the associated Request failed (see Section 2.2.9). This is one method of error reporting
which enables the Requester to associate an error with a specific Request. In other words, since
Non-Posted Requests are not considered complete until after the Completion returns, the
Completion Status field gives the Requester an opportunity to “fix” the problem at some higher
level protocol (outside the scope of this specification). For example, if a Read is issued to
prefetchable Memory Space and the Completion returns with an Unsupported Request Completion
Status, the Requester would not be in violation of this specification if it chose to reissue the Read
Request. Note that from a PCI Express point of view, the reissued Read Request is a distinct
Request, and there is no relationship (on PCI Express) between the initial Request and the reissued
Request.
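As a rough illustration of the paragraph above, the sketch below decodes the 3-bit Completion Status field and applies the read-retry example from the text. The encodings are the standard ones for this field; the function names and the retry policy in `may_reissue_read` are assumptions made for illustration, since the specification leaves the decision to the Requester.

```python
# Completion Status encodings (3-bit field in the Completion header).
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",
}


def decode_completion_status(field: int) -> str:
    """Name the status carried in the low 3 bits; other values are reserved."""
    return COMPLETION_STATUS.get(field & 0b111, "Reserved")


def may_reissue_read(field: int, prefetchable: bool) -> bool:
    """Hypothetical policy: retry a Read only when it targeted
    prefetchable Memory Space and completed with UR."""
    return (field & 0b111) == 0b001 and prefetchable
```

A reissued Read built on this check is still a new, unrelated Request as far as PCI Express is concerned.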
Error Messages
Error Messages are sent to the Root Complex for reporting the detection of errors according to the
severity of the error.
Error messages that originate from PCI Express or Legacy Endpoints are sent to corresponding
Root Ports. Errors that originate from a Root Port itself are reported through the same Root Port.
If a Root Complex Event Collector is implemented, errors that originate from a Root Complex
Integrated Endpoint may optionally be sent to the corresponding Root Complex Event Collector.
Errors that originate from a Root Complex Integrated Endpoint are reported in a Root Complex
Event Collector residing on the same Logical Bus as the Root Complex Integrated Endpoint. The
Root Complex Event Collector must explicitly declare supported Root Complex Integrated
Endpoints as part of its capabilities; each Root Complex Integrated Endpoint must be associated
with no more than one Root Complex Event Collector.
When multiple errors of the same severity are detected, the corresponding error Messages with the
same Requester ID may be merged for different errors of the same severity. At least one error
Message must be sent for detected errors of each severity level. Note, however, that the detection
of a given error in some cases will preclude the reporting of certain errors. Refer to
Section 6.2.3.2.3. Also note special rules in Section 6.2.4 regarding non-Function-specific errors in
multi-Function devices.
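The merging rule above can be sketched as de-duplication on the pair (Requester ID, severity). This is only an illustration of the "at least one Message per severity level" requirement: the function name and the tuple representation are hypothetical, and merging is optional, so a real device may send more Messages than this.

```python
from typing import Iterable, List, Tuple

ErrorEvent = Tuple[int, str]  # (Requester ID, Message name such as "ERR_COR")


def merge_error_messages(detected: Iterable[ErrorEvent]) -> List[ErrorEvent]:
    """Collapse repeated (Requester ID, severity) pairs into one Message each,
    keeping the order in which each pair was first detected."""
    seen = set()
    merged = []
    for event in detected:
        if event not in seen:
            seen.add(event)
            merged.append(event)
    return merged
```

Two correctable errors and one fatal error from the same Requester ID therefore yield two Messages, one ERR_COR and one ERR_FATAL.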
Error Forwarding (Data Poisoning)
Error Forwarding, also known as data poisoning, is indicated by setting the EP bit in a TLP. Refer
to Section 2.7.2. This is another method of error reporting in PCI Express that enables the Receiver
of a TLP to associate an error with a specific Request or Completion. In contrast to the Completion Status
mechanism, Error Forwarding can be used with either Requests or Completions that contain data.
In addition, “intermediate” Receivers along the TLP’s route, not just the Receiver at the ultimate
destination, are required to detect and report (if enabled) receiving the poisoned TLP. This can help
software determine if a particular Switch along the path poisoned the TLP.
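One way software might use these per-hop reports is sketched below. It assumes every Receiver on the route has reporting enabled and that software has already collected which components logged Poisoned TLP Received; the function name and the route representation are hypothetical.

```python
from typing import List, Optional, Set


def suspect_poisoner(route: List[str], logged: Set[str]) -> Optional[str]:
    """Guess which component poisoned a TLP.

    route:  component names ordered from the TLP's source to its destination.
    logged: names of components that logged Poisoned TLP Received.

    The first Receiver that saw the poison points at the component
    immediately before it on the route. Returns None if nobody logged it.
    """
    for i in range(1, len(route)):  # route[0] is the source, not a Receiver
        if route[i] in logged:
            return route[i - 1]
    return None
```

If the first Switch on the route received the TLP clean but the second logged it as poisoned, the first Switch is the suspect; if every Receiver logged it, the TLP was already poisoned when it left the source.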
3、 Error Logging
  • Device Status
  • Advanced Error Reporting Capability
  • PCI compatible (Type 00h and 01h) configuration registers
PCI-Compatible Configuration Registers
Command Register. SERR# Enable
Enables reporting of Non-Fatal and Fatal errors (through this bit or the corresponding bits in the Device Control register); controls whether error Messages are sent
PCI Express Capability Structure
Device Control Register.
Correctable Error Reporting Enable
Non-Fatal Error Reporting Enable
Fatal Error Reporting Enable
Unsupported Request Reporting Enable
Enable error reporting; control whether error Messages are sent
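The gating described for the Command and Device Control registers can be sketched as follows. Bit positions follow the standard register layouts (SERR# Enable is bit 8 of the Command register; the four enables are bits 0 to 3 of Device Control). The function name is hypothetical, and the sketch ignores the extra rules that apply to Unsupported Request and advisory non-fatal cases.

```python
# Command register
CMD_SERR_ENABLE = 1 << 8

# Device Control register
DEVCTL_CORRECTABLE_EN = 1 << 0
DEVCTL_NONFATAL_EN = 1 << 1
DEVCTL_FATAL_EN = 1 << 2
DEVCTL_UNSUPPORTED_REQ_EN = 1 << 3


def error_message_sent(message: str, command: int, devctl: int) -> bool:
    """Decide whether a Function may send the given error Message."""
    if message == "ERR_COR":
        # Correctable errors are gated by Device Control only.
        return bool(devctl & DEVCTL_CORRECTABLE_EN)
    if message == "ERR_NONFATAL":
        return bool(command & CMD_SERR_ENABLE or devctl & DEVCTL_NONFATAL_EN)
    if message == "ERR_FATAL":
        return bool(command & CMD_SERR_ENABLE or devctl & DEVCTL_FATAL_EN)
    raise ValueError(f"unknown error Message: {message}")
```

With SERR# Enable set and Device Control cleared, Non-Fatal and Fatal Messages are still sent but correctable ones are not.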
Root Control Register.
System Error on Correctable Error Enable
System Error on Non-Fatal Error Enable
System Error on Fatal Error Enable
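If Set, this bit indicates that a System Error should be generated if an error of the corresponding severity (Correctable, Non-Fatal, or Fatal) is reported by any of the devices in the hierarchy associated with
this Root Port, or by the Root Port itself. The mechanism for signaling a System Error to the system is system specific.
1. In OS native mode, PCIe AER errors are reported by the corresponding Root Port raising an MSI, so the Root Control register is not needed.
2. If firmware-first mode is used to handle AER, the Root Control bits must be enabled so that errors propagate to a global AER module, which then raises an SMI to notify the BIOS. The error pin is also driven by the global AER module, so the usual case is firmware...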
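The three Root Control enables can be sketched as a simple mask check. The bit positions (bits 0 to 2) follow the standard Root Control register layout; the function name is hypothetical, and how a System Error is actually delivered (for example an SMI in firmware-first mode) is platform specific and not modeled here.

```python
# Root Control register
RTCTL_SYSERR_ON_CORRECTABLE = 1 << 0
RTCTL_SYSERR_ON_NONFATAL = 1 << 1
RTCTL_SYSERR_ON_FATAL = 1 << 2

_ENABLE_FOR_MESSAGE = {
    "ERR_COR": RTCTL_SYSERR_ON_CORRECTABLE,
    "ERR_NONFATAL": RTCTL_SYSERR_ON_NONFATAL,
    "ERR_FATAL": RTCTL_SYSERR_ON_FATAL,
}


def system_error_requested(message: str, root_control: int) -> bool:
    """True if the Root Port should raise a System Error for this Message."""
    return bool(root_control & _ENABLE_FOR_MESSAGE[message])
```

With all three bits clear, as in the OS native case described above, no Message produces a System Error and reporting relies on the Root Port interrupt instead.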
Advanced Error Reporting Capability
Root Error Command Register.
Correctable Error Reporting Enable
Non-Fatal Error Reporting Enable
Fatal Error Reporting Enable
Enable interrupt generation for AER error reporting
Type 1 Configuration Space Header
Bridge Control Register. 
SERR# Enable
This bit controls forwarding of ERR_COR, ERR_NONFATAL and ERR_FATAL from secondary to primary.