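A Guided Reading of Computer Organization and Architecture: Designing for Performance (English Edition, Original 10th Edition), Part 3: A Top-Level View of Computer Function and Interconnection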


Chapter 3

A Top-Level View of Computer Function and Interconnection


At a top level, a computer consists of CPU (central processing unit), memory, and I/O components, with one or more modules of each type. These components are interconnected in some fashion to achieve the basic function of the computer, which is to execute programs. Thus, at a top level, we can characterize a computer system by describing (1) the external behavior of each component, that is, the data and control signals that it exchanges with other components, and (2) the interconnection structure and the controls required to manage the use of the interconnection structure.
This top-level view of structure and function is important because of its explanatory power in understanding the nature of a computer. Equally important is its use to understand the increasingly complex issues of performance evaluation. A grasp of the top-level structure and function offers insight into system bottlenecks, alternate pathways, the magnitude of system failures if a component fails, and the ease of adding performance enhancements. In many cases, requirements for greater system power and fail-safe capabilities are being met by changing the design rather than merely increasing the speed and reliability of individual components.
This chapter focuses on the basic structures used for computer component interconnection. As background, the chapter begins with a brief examination of the basic components and their interface requirements. Then a functional overview is provided. We are then prepared to examine the use of buses to interconnect system components.

3.1 COMPUTER COMPONENTS

As discussed in Chapter 1, virtually all contemporary computer designs are based on concepts developed by John von Neumann at the Institute for Advanced Studies, Princeton. Such a design is referred to as the von Neumann architecture and is based on three key concepts:

  • Data and instructions are stored in a single read-write memory.
  • The contents of this memory are addressable by location, without regard to the type of data contained there.
  • Execution occurs in a sequential fashion (unless explicitly modified) from one instruction to the next.

The reasoning behind these concepts was discussed in Chapter 2 but is worth summarizing here. There is a small set of basic logic components that can be combined in various ways to store binary data and perform arithmetic and logical operations on that data. If there is a particular computation to be performed, a configuration of logic components designed specifically for that computation could be constructed. We can think of the process of connecting the various components in the desired configuration as a form of programming. The resulting "program" is in the form of hardware and is termed a hardwired program.
Now consider this alternative. Suppose we construct a general-purpose configuration of arithmetic and logic functions. This set of hardware will perform various functions on data depending on control signals applied to the hardware. In the original case of customized hardware, the system accepts data and produces results (Figure 3.1a). With general-purpose hardware, the system accepts data and control signals and produces results. Thus, instead of rewiring the hardware for each new program, the programmer merely needs to supply a new set of control signals.
How shall control signals be supplied? The answer is simple but subtle. The entire program is actually a sequence of steps. At each step, some arithmetic or logical operation is performed on some data. For each step, a new set of control signals is needed. Let us provide a unique code for each possible set of control signals, and let us add to the general-purpose hardware a segment that can accept a code and generate control signals (Figure 3.1b).


Programming is now much easier. Instead of rewiring the hardware for each new program, all we need to do is provide a new sequence of codes. Each code is, in effect, an instruction, and part of the hardware interprets each instruction and generates control signals. To distinguish this new method of programming, a sequence of codes or instructions is called software.
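The idea that a code selects the operation performed by general-purpose hardware can be sketched in a few lines of Python. This is only an illustration: the code numbers and the names `CONTROL_TABLE` and `run` are invented here and do not come from the text.

```python
import operator

# Hypothetical code -> operation table. It stands in for the hardware segment
# that accepts a code and generates control signals.
CONTROL_TABLE = {
    0: operator.add,
    1: operator.sub,
    2: operator.and_,
    3: operator.or_,
}

def run(program, value):
    """Apply a sequence of (code, operand) steps to a starting value."""
    for code, operand in program:
        operation = CONTROL_TABLE[code]      # interpret the code
        value = operation(value, operand)    # general-purpose logic does the work
    return value

# A new program is a new list of codes; the "hardware" above is unchanged.
result_a = run([(0, 5), (1, 2)], 10)                  # (10 + 5) - 2
result_b = run([(2, 0b1100), (3, 0b0001)], 0b1010)    # (0b1010 & 0b1100) | 0b0001
```

Both calls reuse the same table and loop; only the list of codes differs, which is the sense in which the program has become software.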
Figure 3.1b indicates two major components of the system: an instruction interpreter and a module of general-purpose arithmetic and logic functions. These two constitute the CPU. Several other components are needed to yield a functioning computer. Data and instructions must be put into the system. For this we need some sort of input module. This module contains basic components for accepting data and instructions in some form and converting them into an internal form of signals usable by the system. A means of reporting results is needed, and this is in the form of an output module. Taken together, these are referred to as I/O components.
One more component is needed. An input device will bring instructions and data in sequentially. But a program is not invariably executed sequentially; it may jump around (e.g., the IAS jump instruction). Similarly, operations on data may require access to more than just one element at a time in a predetermined sequence. Thus, there must be a place to temporarily store both instructions and data. That module is called memory, or main memory, to distinguish it from external storage or peripheral devices. Von Neumann pointed out that the same memory could be used to store both instructions and data.
Figure 3.2 illustrates these top-level components and suggests the interactions among them. The CPU exchanges data with memory. For this purpose, it typically makes use of two internal (to the CPU) registers: a memory address register (MAR), which specifies the address in memory for the next read or write, and a memory buffer register (MBR), which contains the data to be written into memory or receives the data read from memory. Similarly, an I/O address register (I/OAR) specifies a particular I/O device. An I/O buffer register (I/OBR) is used for the exchange of data between an I/O module and the CPU.
A memory module consists of a set of locations, defined by sequentially numbered addresses. Each location contains a binary number that can be interpreted as either an instruction or data. An I/O module transfers data from external devices to CPU and memory, and vice versa. It contains internal buffers for temporarily holding these data until they can be sent on.
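The role of the MAR and MBR in a memory read or write can be sketched as follows. The class and method names are illustrative assumptions, and the model ignores timing and control signals entirely.

```python
class Memory:
    """A set of locations with sequentially numbered addresses."""
    def __init__(self, size):
        self.cells = [0] * size          # addresses 0 .. size-1

class CPU:
    def __init__(self, memory):
        self.memory = memory
        self.mar = 0                     # memory address register
        self.mbr = 0                     # memory buffer register

    def read(self, address):
        self.mar = address                       # address of the next read
        self.mbr = self.memory.cells[self.mar]   # MBR receives the data read
        return self.mbr

    def write(self, address, value):
        self.mar = address                       # address of the next write
        self.mbr = value                         # MBR holds the data to be written
        self.memory.cells[self.mar] = self.mbr

cpu = CPU(Memory(16))
cpu.write(3, 42)
value = cpu.read(3)
```

Every exchange with memory passes through the two registers, so after any transfer they record the last address used and the last word moved.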
Having looked briefly at these major components, we now turn to an overview of how these components function together to execute programs.

3.2 COMPUTER FUNCTION

The basic function performed by a computer is execution of a program, which consists of a set of instructions stored in memory. The processor does the actual work by executing instructions specified in the program. This section provides an overview of the key elements of program execution. In its simplest form, instruction processing consists of two steps: The processor reads (fetches) instructions from memory one at a time and executes each instruction. Program execution consists of repeating the process of instruction fetch and instruction execution. The instruction execution may involve several operations and depends on the nature of the instruction (see, for example, the lower portion of Figure 2.4).


The processing required for a single instruction is called an instruction cycle. Using the simplified two-step description given previously, the instruction cycle is depicted in Figure 3.3. The two steps are referred to as the fetch cycle and the execute cycle. Program execution halts only if the machine is turned off, some sort of unrecoverable error occurs, or a program instruction that halts the computer is encountered.
Instruction Fetch and Execute
At the beginning of each instruction cycle, the processor fetches an instruction from memory. In a typical processor, a register called the program counter (PC) holds the address of the instruction to be fetched next. Unless told otherwise, the processor always increments the PC after each instruction fetch so that it will fetch the next instruction in sequence (i.e., the instruction located at the next higher memory address). So, for example, consider a computer in which each instruction occupies one 16-bit word of memory. Assume that the program counter is set to memory location 300, where the location address refers to a 16-bit word. The processor will next fetch the instruction at location 300. On succeeding instruction cycles, it will fetch instructions from locations 301, 302, 303, and so on. This sequence may be altered, as explained presently.


The fetched instruction is loaded into a register in the processor known as the instruction register (IR). The instruction contains bits that specify the action the processor is to take. The processor interprets the instruction and performs the required action. In general, these actions fall into four categories:

  • Processor-memory: Data may be transferred from processor to memory or from memory to processor.
  • Processor-I/O: Data may be transferred to or from a peripheral device by transferring between the processor and an I/O module.
  • Data processing: The processor may perform some arithmetic or logic operation on data.
  • Control: An instruction may specify that the sequence of execution be altered. For example, the processor may fetch an instruction from location 149, which specifies that the next instruction be from location 182. The processor will remember this fact by setting the program counter to 182. Thus, on the next fetch cycle, the instruction will be fetched from location 182 rather than 150.

An instruction's execution may involve a combination of these actions.
Consider a simple example using a hypothetical machine that includes the characteristics listed in Figure 3.4. The processor contains a single data register, called an accumulator (AC). Both instructions and data are 16 bits long. Thus, it is convenient to organize memory using 16-bit words. The instruction format provides 4 bits for the opcode, so that there can be as many as 2^4 = 16 different opcodes, and up to 2^12 = 4096 (4K) words of memory can be directly addressed.
Figure 3.5 illustrates a partial program execution, showing the relevant portions of memory and processor registers. The program fragment shown adds the contents of the memory word at address 940 to the contents of the memory word at address 941 and stores the result in the latter location. Three instructions, which can be described as three fetch and three execute cycles, are required:
1. The PC contains 300, the address of the first instruction. This instruction (the value 1940 in hexadecimal) is loaded into the instruction register IR, and the PC is incremented. Note that this process involves the use of a memory address register and a memory buffer register. For simplicity, these intermediate registers are ignored.
2. The first 4 bits (first hexadecimal digit) in the IR indicate that the AC is to be loaded. The remaining 12 bits (three hexadecimal digits) specify the address (940) from which data are to be loaded.
3. The next instruction (5941) is fetched from location 301, and the PC is incremented.
4. The old contents of the AC and the contents of location 941 are added, and the result is stored in the AC.
5. The next instruction (2941) is fetched from location 302, and the PC is incremented.
6. The contents of the AC are stored in location 941.
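The six steps above can be replayed with a small simulator. This is a sketch under stated assumptions: the opcode meanings (1 = load AC, 5 = add to AC, 2 = store AC) are read off the steps, all addresses are treated as hexadecimal, and the two operand values are sample data chosen for illustration.

```python
LOAD, STORE, ADD = 0x1, 0x2, 0x5    # opcodes of the three instructions used

def run(memory, pc, steps):
    """Execute `steps` instruction cycles and return the final (pc, ac)."""
    ac = 0
    for _ in range(steps):
        ir = memory[pc]             # fetch cycle: load IR from the address in PC
        pc += 1                     # PC is incremented after each fetch
        opcode = ir >> 12           # first 4 bits of the 16-bit word
        address = ir & 0x0FFF       # remaining 12 bits
        if opcode == LOAD:          # execute cycle
            ac = memory[address]
        elif opcode == ADD:
            ac = (ac + memory[address]) & 0xFFFF   # keep the sum within 16 bits
        elif opcode == STORE:
            memory[address] = ac
        else:
            raise ValueError(f"unknown opcode {opcode:#x}")
    return pc, ac

memory = {
    0x300: 0x1940, 0x301: 0x5941, 0x302: 0x2941,   # the program fragment
    0x940: 0x0003, 0x941: 0x0002,                  # sample operands (assumed)
}
pc, ac = run(memory, pc=0x300, steps=3)
```

After three cycles, location 941 holds the sum of the two operands and the PC points at location 303, ready for the next fetch.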


In this example, three instruction cycles, each consisting of a fetch cycle and an execute cycle, are needed to add the contents of location 940 to the contents of 941. With a more complex set of instructions, fewer cycles would be needed. Some older processors, for example, included instructions that contain more than one memory address. Thus, the execution cycle for a particular instruction on such processors could involve more than one reference to memory. Also, instead of memory references, an instruction may specify an I/O operation.
For example, the PDP-11 processor includes an instruction, expressed symbolically as ADD B,A, that stores the sum of the contents of memory locations B and A into memory location A. A single instruction cycle with the following steps occurs:

  • Fetch the ADD instruction.
  • Read the contents of memory location A into the processor.
  • Read the contents of memory location B into the processor. In order that the contents of A are not lost, the processor must have at least two registers for storing memory values, rather than a single accumulator.
  • Add the two values.
  • Write the result from the processor to memory location A.

Thus, the execution cycle for a particular instruction may involve more than one reference to memory. Also, instead of memory references, an instruction may specify an I/O operation. With these additional considerations in mind, Figure 3.6 provides a more detailed look at the basic instruction cycle of Figure 3.3. The figure is in the form of a state diagram. For any given instruction cycle, some states may be null and others may be visited more than once. The states can be described as follows:


  • Instruction address calculation (iac): Determine the address of the next instruction to be executed. Usually, this involves adding a fixed number to the address of the previous instruction. For example, if each instruction is 16 bits long and memory is organized into 16-bit words, then add 1 to the previous address. If, instead, memory is organized as individually addressable 8-bit bytes, then add 2 to the previous address.
  • Instruction fetch (if): Read instruction from its memory location into the processor.
  • Instruction operation decoding (iod): Analyze instruction to determine type of operation to be performed and operand(s) to be used.
  • Operand address calculation (oac): If the operation involves reference to an operand in memory or available via I/O, then determine the address of the operand.
  • Operand fetch (of): Fetch the operand from memory or read it in from I/O.
  • Data operation (do): Perform the operation indicated in the instruction.
  • Operand store (os): Write the result into memory or out to I/O.

States in the upper part of Figure 3.6 involve an exchange between the processor and either memory or an I/O module. States in the lower part of the diagram involve only internal processor operations. The oac state appears twice, because an instruction may involve a read, a write, or both. However, the action performed during that state is fundamentally the same in both cases, and so only a single state identifier is needed.
Also note that the diagram allows for multiple operands and multiple results, because some instructions on some machines require this. For example, the PDP-11 instruction ADD A,B results in the following sequence of states: iac, if, iod, oac, of, oac, of, do, oac, os.
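The way the state sequence grows with the number of operands and results can be sketched as a small generator, together with the address increment described for the iac state. The function names and parameters below are illustrative assumptions, not notation from the book.

```python
def state_sequence(sources, results):
    """List the instruction cycle states visited by one instruction that
    fetches `sources` operands and stores `results` results."""
    states = ["iac", "if", "iod"]
    states += ["oac", "of"] * sources     # address calculation + fetch per operand
    states.append("do")
    states += ["oac", "os"] * results     # address calculation + store per result
    return states

def next_instruction_address(previous, instruction_bits=16, unit_bits=16):
    """Add the instruction length, measured in addressable units."""
    return previous + instruction_bits // unit_bits

add_states = state_sequence(sources=2, results=1)          # e.g. a two-operand ADD
word_addressed = next_instruction_address(300)             # 16-bit words: add 1
byte_addressed = next_instruction_address(300, 16, 8)      # 8-bit bytes: add 2
```

With two source operands and one result, the generator reproduces the ten-state sequence given for the PDP-11 ADD instruction.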
Finally, on some machines, a single instruction can specify an operation to be performed on a vector (one-dimensional array) of numbers or a string (one-dimensional array) of characters. As Figure 3.6 indicates, this would involve repetitive operand fetch and/or store operations.
Interrupts
Virtually all computers provide a mechanism by which other modules (I/O, memory) may interrupt the normal processing of the processor. Table 3.1 lists the most common classes of interrupts. The specific nature of these interrupts is examined later in this book, especially in Chapters 7 and 14. However, we need to introduce the concept now to understand more clearly the nature of the instruction cycle and the implications of interrupts on the interconnection structure. The reader need not be concerned at this stage about the details of the generation and processing of interrupts, but only focus on the communication between modules that results from interrupts.
Interrupts are provided primarily as a way to improve processing efficiency. For example, most external devices are much slower than the processor. Suppose that the processor is transferring data to a printer using the instruction cycle scheme of Figure 3.3. After each write operation, the processor must pause and remain idle until the printer catches up. The length of this pause may be on the order of many hundreds or even thousands of instruction cycles that do not involve memory. Clearly, this is a very wasteful use of the processor.
Figure 3.7a illustrates this state of affairs. The user program performs a series of WRITE calls interleaved with processing. Code segments 1, 2, and 3 refer to sequences of instructions that do not involve I/O. The WRITE calls are to an I/O program that is a system utility and that will perform the actual I/O operation. The I/O program consists of three sections:

  • A sequence of instructions, labeled 4 in the figure, to prepare for the actual I/O operation. This may include copying the data to be output into a special buffer and preparing the parameters for a device command.
  • The actual I/O command. Without the use of interrupts, once this command is issued, the program must wait for the I/O device to perform the requested function (or periodically poll the device). The program might wait by simply repeatedly performing a test operation to determine if the I/O operation is done.
  • A sequence of instructions, labeled 5 in the figure, to complete the operation. This may include setting a flag indicating the success or failure of the operation.


Because the I/O operation may take a relatively long time to complete, the I/O program is hung up waiting for the operation to complete; hence, the user program is stopped at the point of the WRITE call for some considerable period of time.
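The three sections of the I/O program, and the cost of waiting without interrupts, can be sketched as below. `SlowDevice`, its `latency` parameter, and the `write` function are hypothetical stand-ins; a real device advances on its own, whereas here the polling loop also advances the simulated device by one time unit per test.

```python
class SlowDevice:
    """Hypothetical output device that needs `latency` time units per operation."""
    def __init__(self, latency):
        self.latency = latency
        self.remaining = 0
        self.printed = []

    def start(self, data):
        self.printed.append(data)
        self.remaining = self.latency

    def tick(self):
        """Advance the simulated device by one time unit."""
        if self.remaining > 0:
            self.remaining -= 1

    def is_done(self):
        return self.remaining == 0

def write(device, text):
    """WRITE call without interrupts; returns (success flag, wasted test operations)."""
    buffer = text.encode("ascii")     # section 4: copy data into a special buffer
    device.start(buffer)              # the actual I/O command
    wasted = 0
    while not device.is_done():       # repeatedly test whether the operation is done
        device.tick()                 # simulation only: time passes while we wait
        wasted += 1
    success = True                    # section 5: set a flag for the caller
    return success, wasted

device = SlowDevice(latency=1000)
ok, wasted_polls = write(device, "hello")
```

Every iteration of the loop is processor time spent on nothing but asking the device whether it has finished.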
INTERRUPTS AND THE INSTRUCTION CYCLE
With interrupts, the processor can be engaged in executing other instructions while an I/O operation is in process. Consider the flow of control in Figure 3.7b. As before, the user program reaches a point at which it makes a system call in the form of a WRITE call. The I/O program that is invoked in this case consists only of the preparation code and the actual I/O command. After these few instructions have been executed, control returns to the user program. Meanwhile, the external device is busy accepting data from computer memory and printing it. This I/O operation is conducted concurrently with the execution of instructions in the user program.
When the external device becomes ready to be serviced (that is, when it is ready to accept more data from the processor), the I/O module for that external device sends an interrupt request signal to the processor. The processor responds by suspending operation of the current program, branching off to a program to service that particular I/O device, known as an interrupt handler, and resuming the original execution after the device is serviced. The points at which such interrupts occur are indicated by an asterisk in Figure 3.7b.
Let us try to clarify what is happening in Figure 3.7. We have a user program that contains two WRITE commands. There is a segment of code at the beginning, then one WRITE command, then a second segment of code, then a second WRITE command, then a third and final segment of code. The WRITE command invokes the I/O program provided by the OS. Similarly, the I/O program consists of a segment of code, followed by an I/O command, followed by another segment of code. The I/O command invokes a hardware I/O operation.


From the point of view of the user program, an interrupt is just that: an interruption of the normal sequence of execution. When the interrupt processing is completed, execution resumes (Figure 3.8). Thus, the user program does not have to contain any special code to accommodate interrupts; the processor and the operating system are responsible for suspending the user program and then resuming it at the same point.
To accommodate interrupts, an interrupt cycle is added to the instruction cycle, as shown in Figure 3.9. In the interrupt cycle, the processor checks to see if any interrupts have occurred, indicated by the presence of an interrupt signal. If no interrupts are pending, the processor proceeds to the fetch cycle and fetches the next instruction of the current program. If an interrupt is pending, the processor does the following:

  • It suspends execution of the current program being executed and saves its context. This means saving the address of the next instruction to be executed (current contents of the program counter) and any other data relevant to the processor's current activity.
  • It sets the program counter to the starting address of an interrupt handler routine.


The processor now proceeds to the fetch cycle and fetches the first instruction in the interrupt handler program, which will service the interrupt. The interrupt handler program is generally part of the operating system. Typically, this program determines the nature of the interrupt and performs whatever actions are needed. In the example we have been using, the handler determines which I/O module generated the interrupt and may branch to a program that will write more data out to that I/O module. When the interrupt handler routine is completed, the processor can resume execution of the user program at the point of interruption.
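The instruction cycle with an added interrupt cycle can be sketched as a loop. This is a simplified model, not the book's design: instructions are just labels, the saved context is only the current routine and its program counter, and `interrupt_after` (the instruction counts after which a request arrives) is a device invented for the demonstration.

```python
def run(program, handler, interrupt_after):
    """Return the order in which instruction labels are executed."""
    trace = []
    code, pc = program, 0        # current routine and its program counter
    saved = None                 # saved context of the suspended program
    executed = 0
    pending = 0                  # number of interrupt requests waiting
    while True:
        if pc == len(code):
            if saved is None:
                break                    # user program has finished
            code, pc = saved             # handler done: restore the context
            saved = None
        else:
            trace.append(code[pc])       # fetch cycle and execute cycle
            pc += 1
            executed += 1
            if executed in interrupt_after:
                pending += 1             # a device raises its request line
        # Interrupt cycle: checked once per pass, ignored while a handler runs.
        if pending and saved is None:
            pending -= 1
            saved = (code, pc)           # save the address of the next instruction
            code, pc = handler, 0        # PC := start of the interrupt handler
    return trace

trace = run(["u1", "u2", "u3"], ["h1", "h2"], interrupt_after={1})
```

The user program contains no code for the interrupt; it simply resumes at `u2` once the handler has run.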
It is clear that there is some overhead involved in this process. Extra instructions must be executed (in the interrupt handler) to determine the nature of the interrupt and to decide on the appropriate action. Nevertheless, because of the relatively large amount of time that would be wasted by simply waiting on an I/O operation, the processor can be employed much more efficiently with the use of interrupts.
To appreciate the gain in efficiency, consider Figure 3.10, which is a timing diagram based on the flow of control in Figures 3.7a and 3.7b. In this figure, user program code segments are shaded dark gray, and I/O program code segments are shaded light gray. Figure 3.10a shows the case in which interrupts are not used. The processor must wait while an I/O operation is performed.


Figures 3.7b and 3.10b assume that the time required for the I/O operation is relatively short: less than the time to complete the execution of instructions between write operations in the user program. In this case, the segment of code labeled code segment 2 is interrupted. A portion of the code (2a) executes (while the I/O operation is performed) and then the interrupt occurs (upon the completion of the I/O operation). After the interrupt is serviced, execution resumes with the remainder of code segment 2 (2b).
The more typical case, especially for a slow device such as a printer, is that the I/O operation will take much more time than executing a sequence of user instructions. Figure 3.7c indicates this state of affairs. In this case, the user program reaches the second WRITE call before the I/O operation spawned by the first call is complete. The result is that the user program is hung up at that point. When the preceding I/O operation is completed, this new WRITE call may be processed, and a new I/O operation may be started. Figure 3.11 shows the timing for this situation with and without the use of interrupts. We can see that there is still a gain in efficiency because part of the time during which the I/O operation is under way overlaps with the execution of user instructions.
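The gain for both the short and the long I/O wait can be estimated with a rough timing model. It rests on several assumptions that are not in the text: at most one I/O operation is outstanding, the completion code (section 5) runs as the interrupt handler, a WRITE call is placed between consecutive code segments, and all durations are made-up numbers in arbitrary time units.

```python
def elapsed_without_interrupts(segments, prep, io_time, finish):
    """Each WRITE costs preparation + the full I/O wait + completion."""
    writes = len(segments) - 1
    return sum(segments) + writes * (prep + io_time + finish)

def elapsed_with_interrupts(segments, prep, io_time, finish):
    """User code overlaps with the I/O operation started by the previous WRITE."""
    t = 0
    io_done = None                    # completion time of the outstanding I/O
    for i, seg in enumerate(segments):
        if i > 0:                     # a WRITE call precedes this segment
            if io_done is not None:   # previous I/O still running: program waits,
                t = max(t, io_done) + finish   # then its interrupt is serviced
            t += prep
            io_done = t + io_time
        end = t + seg
        if io_done is not None and io_done <= end:
            end += finish             # interrupt serviced partway through the segment
            io_done = None
        t = end
    if io_done is not None:           # last I/O finishes after the program's code
        t = max(t, io_done) + finish
    return t

segments = [10, 10, 10]               # code segments 1, 2, 3 (assumed durations)
short = (elapsed_without_interrupts(segments, 2, 5, 1),
         elapsed_with_interrupts(segments, 2, 5, 1))
long_ = (elapsed_without_interrupts(segments, 2, 20, 1),
         elapsed_with_interrupts(segments, 2, 20, 1))
```

In both pairs the second figure is smaller: with a short I/O operation the wait disappears entirely, and with a long one only the part that overlaps with user code is recovered.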


Figure 3.12 shows a revised instruction cycle state diagram that includes interrupt cycle processing.
MULTIPLE INTERRUPTS
The discussion so far has focused only on the occurrence of a single interrupt. Suppose, however, that multiple interrupts can occur. For example, a program may be receiving data from a communications line and printing results. The printer will generate an interrupt every time it completes a print operation. The communication line controller will generate an interrupt every time a unit of data arrives. The unit could either be a single character or a block, depending on the nature of the communications discipline. In any case, it is possible for a communications interrupt to occur while a printer interrupt is being processed.
Two approaches can be taken to dealing with multiple interrupts. The first is to disable interrupts while an interrupt is being processed. A disabled interrupt simply means that the processor can and will ignore that interrupt request signal. If an interrupt occurs during this time, it generally remains pending and will be checked by the processor after the processor has enabled interrupts. Thus, when a user program is executing and an interrupt occurs, interrupts are disabled immediately. After the interrupt handler routine completes, interrupts are enabled before resuming the user program, and the processor checks to see if additional interrupts have occurred. This approach is nice and simple, as interrupts are handled in strict sequential order (Figure 3.13a).
The drawback to the preceding approach is that it does not take into account relative priority or time-critical needs. For example, when input arrives from the communications line, it may need to be absorbed rapidly to make room for more input. If the first batch of input has not been processed before the second batch arrives, data may be lost.
A second approach is to define priorities for interrupts and to allow an interrupt of higher priority to cause a lower-priority interrupt handler to be itself interrupted (Figure 3.13b). As an example of this second approach, consider a system with three I/O devices: a printer, a disk, and a communications line, with increasing priorities of 2, 4, and 5, respectively. Figure 3.14 illustrates a possible sequence. A user program begins at t = 0. At t = 10, a printer interrupt occurs; user information is placed on the system stack and execution continues at the printer interrupt service routine (ISR). While this routine is still executing, at t = 15, a communications interrupt occurs. Because the communications line has higher priority than the printer, the interrupt is honored. The printer ISR is interrupted, its state is pushed onto the stack, and execution continues at the communications ISR. While this routine is executing, a disk interrupt occurs (t = 20). Because this interrupt is of lower priority, it is simply held, and the communications ISR runs to completion.
When the communications ISR is complete (t = 25), the previous processor state is restored, which is the execution of the printer ISR. However, before even a single instruction in that routine can be executed, the processor honors the higher-priority disk interrupt and control transfers to the disk ISR. Only when that routine is complete (t = 35) is the printer ISR resumed. When that routine completes (t = 40), control finally returns to the user program.
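The time sequence just described can be reproduced with a small simulation of nested, prioritized interrupts. Two things are assumptions here: each ISR is given 10 time units of work, which is what the quoted start and completion times imply, and the user program is modeled as priority 0. The function and variable names are illustrative.

```python
def simulate(arrivals, service_time, end_time):
    """arrivals: list of (time, device, priority). Returns device -> ISR completion time."""
    stack = []       # running and suspended ISRs as [device, priority, remaining]
    pending = []     # requests that are being held
    finished = {}
    for t in range(end_time):
        pending += [[dev, prio, service_time]
                    for (at, dev, prio) in arrivals if at == t]
        if pending:
            best = max(pending, key=lambda request: request[1])
            current = stack[-1][1] if stack else 0   # user program: priority 0
            if best[1] > current:                    # honor a higher-priority request
                pending.remove(best)
                stack.append(best)                   # interrupted state stays below it
        if stack:
            stack[-1][2] -= 1                        # top ISR runs for one time unit
            if stack[-1][2] == 0:
                finished[stack[-1][0]] = t + 1
                stack.pop()                          # restore the previous state
    return finished

arrivals = [(10, "printer", 2), (15, "communications", 5), (20, "disk", 4)]
finished = simulate(arrivals, service_time=10, end_time=50)
```

The disk request raised at t = 20 is held until the communications ISR ends, then served ahead of the half-finished printer ISR, which matches the order in the text.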


I/O Function
Thus far, we have discussed the operation of the computer as controlled by the processor, and we have looked primarily at the interaction of processor and memory. The discussion has only alluded to the role of the I/O component. This role is discussed in detail in Chapter 7, but a brief summary is in order here.
An I/O module (e.g., a disk controller) can exchange data directly with the processor. Just as the processor can initiate a read or write with memory, designating the address of a specific location, the processor can also read data from or write data to an I/O module. In this latter case, the processor identifies a specific device that is controlled by a particular I/O module. Thus, an instruction sequence similar in form to that of Figure 3.5 could occur, with I/O instructions rather than memory-referencing instructions.
In some cases, it is desirable to allow I/O exchanges to occur directly with memory. In such a case, the processor grants to an I/O module the authority to read from or write to memory, so that the I/O-memory transfer can occur without tying up the processor. During such a transfer, the I/O module issues read or write commands to memory, relieving the processor of responsibility for the exchange. This operation is known as direct memory access (DMA) and is examined in Chapter 7.

3.3 INTERCONNECTION STRUCTURES

A computer consists of a set of components or modules of three basic types (processor, memory, I/O) that communicate with each other. In effect, a computer is a network of basic modules. Thus, there must be paths for connecting the modules.
The collection of paths connecting the various modules is called the interconnection structure. The design of this structure will depend on the exchanges that must be made among modules.
Figure 3.15 suggests the types of exchanges that are needed by indicating the major forms of input and output for each module type:

  • Memory: Typically, a memory module will consist of N words of equal length. Each word is assigned a unique numerical address (0, 1, ..., N-1). A word of data can be read from or written into the memory. The nature of the operation is indicated by read and write control signals. The location for the operation is specified by an address.

  • I/O module: From an internal (to the computer system) point of view, I/O is functionally similar to memory. There are two operations: read and write. Further, an I/O module may control more than one external device. We can refer to each of the interfaces to an external device as a port and give each a unique address (e.g., 0, 1, ..., M-1). In addition, there are external data paths for the input and output of data with an external device. Finally, an I/O module may be able to send interrupt signals to the processor.
  • Processor: The processor reads in instructions and data, writes out data after processing, and uses control signals to control the overall operation of the system. It also receives interrupt signals.

The preceding list defines the data to be exchanged. The interconnection structure must support the following types of transfers:

  • Memory to processor: The processor reads an instruction or a unit of data from memory.
  • Processor to memory: The processor writes a unit of data to memory.
  • I/O to processor: The processor reads data from an I/O device via an I/O module.
  • Processor to I/O: The processor sends data to the I/O device.
  • I/O to or from memory: For these two cases, an I/O module is allowed to exchange data directly with memory, without going through the processor, using direct memory access.

Over the years, a number of interconnection structures have been tried. By far the most common are (1) the bus and various multiple-bus structures, and (2) point-to-point interconnection structures with packetized data transfer. We devote the remainder of this chapter to a discussion of these structures.

3.4 BUS INTERCONNECTION

The bus was the dominant means of computer system component interconnection for decades. For general-purpose computers, it has gradually given way to various point-to-point interconnection structures, which now dominate computer system design. However, bus structures are still commonly used for embedded systems, particularly microcontrollers. In this section, we give a brief overview of bus structure. Appendix C provides more detail.
A bus is a communication pathway connecting two or more devices. A key characteristic of a bus is that it is a shared transmission medium. Multiple devices connect to the bus, and a signal transmitted by any one device is available for reception by all other devices attached to the bus. If two devices transmit during the same time period, their signals will overlap and become garbled. Thus, only one device at a time can successfully transmit.
Typically, a bus consists of multiple communication pathways, or lines. Each line is capable of transmitting signals representing binary 1 and binary 0. Over time, a sequence of binary digits can be transmitted across a single line. Taken together, several lines of a bus can be used to transmit binary digits simultaneously (in parallel). For example, an 8-bit unit of data can be transmitted over eight bus lines.
Computer systems contain a number of different buses that provide pathways between components at various levels of the computer system hierarchy. A bus that connects major computer components (processor, memory, I/O) is called a system bus. The most common computer interconnection structures are based on the use of one or more system buses.
A system bus consists, typically, of from about fifty to hundreds of separate lines. Each line is assigned a particular meaning or function. Although there are many different bus designs, on any bus the lines can be classified into three functional groups (Figure 3.16): data, address, and control lines. In addition, there may be power distribution lines that supply power to the attached modules.

The data lines provide a path for moving data among system modules. These lines, collectively, are called the data bus. The data bus may consist of 32, 64, 128, or even more separate lines, the number of lines being referred to as the width of the data bus. Because each line can carry only one bit at a time, the number of lines determines how many bits can be transferred at a time. The width of the data bus is a key factor in determining overall system performance. For example, if the data bus is 32 bits wide and each instruction is 64 bits long, then the processor must access the memory module twice during each instruction cycle.
The address lines are used to designate the source or destination of the data on the data bus. For example, if the processor wishes to read a word (8, 16, or 32 bits) of data from memory, it puts the address of the desired word on the address lines. Clearly, the width of the address bus determines the maximum possible memory capacity of the system. Furthermore, the address lines are generally also used to address I/O ports. Typically, the higher-order bits are used to select a particular module on the bus, and the lower-order bits select a memory location or I/O port within the module. For example, on an 8-bit address bus, address 01111111 and below might reference locations in a memory module (module 0) with 128 words of memory, and address 10000000 and above refer to devices attached to an I/O module (module 1).
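The two examples above (the 64-bit instruction on a 32-bit data bus, and the 8-bit address split between a memory module and an I/O module) can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text; the function names `decode_address` and `bus_transfers` are hypothetical and do not correspond to any real hardware interface.

```python
def decode_address(addr: int) -> tuple[int, int]:
    """Split an 8-bit bus address into (module, offset).

    Assumes the layout of the example in the text: the high-order bit
    selects the module (0 = memory, 1 = I/O module) and the low-order
    7 bits select a word or port within that module.
    """
    if not 0 <= addr <= 0xFF:
        raise ValueError("address must fit in 8 bits")
    module = addr >> 7      # high-order bit
    offset = addr & 0x7F    # low-order 7 bits
    return module, offset


def bus_transfers(item_bits: int, bus_width_bits: int) -> int:
    """Number of data-bus transfers needed to move one item (ceiling division)."""
    return -(-item_bits // bus_width_bits)


print(decode_address(0b01111111))  # (0, 127): last word of the memory module
print(decode_address(0b10000000))  # (1, 0): first port of the I/O module
print(bus_transfers(64, 32))       # 2: a 64-bit instruction over a 32-bit bus
```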
The control lines are used to control the access to and the use of the data and address lines. Because the data and address lines are shared by all components, there must be a means of controlling their use. Control signals transmit both command and timing information among system modules. Timing signals indicate the validity of data and address information. Command signals specify operations to be performed. Typical control lines include:

  • Memory write: causes data on the bus to be written into the addressed location.
  • Memory read: causes data from the addressed location to be placed on the bus.
  • I/O write: causes data on the bus to be output to the addressed I/O port.
  • I/O read: causes data from the addressed I/O port to be placed on the bus.
  • Transfer ACK: indicates that data have been accepted from or placed on the bus.
  • Bus request: indicates that a module needs to gain control of the bus.
  • Bus grant: indicates that a requesting module has been granted control of the bus.
  • Interrupt request: indicates that an interrupt is pending.
  • Interrupt ACK: acknowledges that the pending interrupt has been recognized.
  • Clock: is used to synchronize operations.
  • Reset: initializes all modules.

The operation of the bus is as follows. If one module wishes to send data to another, it must do two things: (1) obtain the use of the bus, and (2) transfer data via the bus. If one module wishes to request data from another module, it must (1) obtain the use of the bus, and (2) transfer a request to the other module over the appropriate control and address lines. It must then wait for that second module to send the data.

3.5 POINT-TO-POINT INTERCONNECT

The shared bus architecture was the standard approach to interconnection between the processor and other components (memory, I/O, and so on) for decades. But contemporary systems increasingly rely on point-to-point interconnection rather than shared buses.
The principal reason driving the change from bus to point-to-point interconnect was the electrical constraints encountered with increasing the frequency of wide synchronous buses. At higher and higher data rates, it becomes increasingly difficult to perform the synchronization and arbitration functions in a timely fashion. Further, with the advent of multicore chips, with multiple processors and significant memory on a single chip, it was found that the use of a conventional shared bus on the same chip magnified the difficulties of increasing bus data rate and reducing bus latency to keep up with the processors. Compared to the shared bus, the point-to-point interconnect has lower latency, higher data rate, and better scalability.
In this section, we look at an important and representative example of the point-to-point interconnect approach: Intel's QuickPath Interconnect (QPI), which was introduced in 2008.
The following are significant characteristics of QPI and other point-to-point interconnect schemes:

  • Multiple direct connections: Multiple components within the system enjoy direct pairwise connections to other components. This eliminates the need for arbitration found in shared transmission systems.
  • Layered protocol architecture: As found in network environments, such as TCP/IP-based data networks, these processor-level interconnects use a layered protocol architecture, rather than the simple use of control signals found in shared bus arrangements.
  • Packetized data transfer: Data are not sent as a raw bit stream. Rather, data are sent as a sequence of packets, each of which includes control headers and error control codes.

Figure 3.17 illustrates a typical use of QPI on a multicore computer. The QPI links form a switching fabric that enables data to move throughout the network. Direct QPI connections can be established between each pair of core processors. If core A in Figure 3.17 needs to access the memory controller in core D, it sends its request through either core B or core C, which must in turn forward that request on to the memory controller in core D. Similarly, larger systems with eight or more processors can be built using processors with three links and routing traffic through intermediate processors.

In addition, QPI is used to connect to an I/O module, called an I/O hub (IOH). The IOH acts as a switch directing traffic to and from I/O devices. Typically in newer systems, the link from the IOH to the I/O device controller uses an interconnect technology called PCI Express (PCIe), described later in this chapter. The IOH translates between the QPI protocols and formats and the PCIe protocols and formats. A core also links to a main memory module (typically the memory uses dynamic random access memory (DRAM) technology) using a dedicated memory bus.
QPI is defined as a four-layer protocol architecture, encompassing the following layers (Figure 3.18):

  • Physical: Consists of the actual wires carrying the signals, as well as circuitry and logic to support ancillary features required in the transmission and receipt of the 1s and 0s. The unit of transfer at the Physical layer is 20 bits, which is called a Phit (physical unit).
  • Link: Responsible for reliable transmission and flow control. The Link layer's unit of transfer is an 80-bit Flit (flow control unit).
  • Routing: Provides the framework for directing packets through the fabric.
  • Protocol: The high-level set of rules for exchanging packets of data between devices. A packet is comprised of an integral number of Flits.

QPI Physical Layer
Figure 3.19 shows the physical architecture of a QPI port. The QPI port consists of 84 individual links grouped as follows. Each data path consists of a pair of wires that transmits data one bit at a time; the pair is referred to as a lane. There are 20 data lanes in each direction (transmit and receive), plus a clock lane in each direction. Thus, QPI is capable of transmitting 20 bits in parallel in each direction. The 20-bit unit is referred to as a phit. Typical signaling speeds of the link in current products call for operation at 6.4 GT/s (transfers per second). At 20 bits per transfer, that adds up to 16 GB/s, and since QPI links involve dedicated bidirectional pairs, the total capacity is 32 GB/s.
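The figures quoted in this paragraph follow from simple arithmetic, which the short sketch below reproduces. It uses only the numbers given in the text (20 data lanes and one clock lane per direction, two wires per lane, 6.4 GT/s); the variable names are illustrative.

```python
DATA_LANES = 20                       # data lanes per direction
CLOCK_LANES = 1                       # clock lane per direction
WIRES_PER_LANE = 2                    # each lane is a differential pair
DIRECTIONS = 2                        # transmit and receive
TRANSFERS_PER_SECOND = 6_400_000_000  # 6.4 GT/s

# Individual links (wires) in one QPI port.
wires = (DATA_LANES + CLOCK_LANES) * WIRES_PER_LANE * DIRECTIONS

# One 20-bit phit moves per transfer in each direction.
bits_per_second = TRANSFERS_PER_SECOND * DATA_LANES
gbytes_one_way = bits_per_second / 8 / 1e9
gbytes_total = gbytes_one_way * DIRECTIONS

print(wires)           # 84
print(gbytes_one_way)  # 16.0
print(gbytes_total)    # 32.0
```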

The lanes in each direction are grouped into four quadrants of 5 lanes each. In some applications, the link can also operate at half or quarter widths in order to reduce power consumption or work around failures.
The form of transmission on each lane is known as differential signaling, or balanced transmission. With balanced transmission, signals are transmitted as a current that travels down one conductor and returns on the other. The binary value depends on the voltage difference. Typically, one line has a positive voltage value and the other line has zero voltage, and one line is associated with binary 1 and one line is associated with binary 0. Specifically, the technique used by QPI is known as low-voltage differential signaling (LVDS). In a typical implementation, the transmitter injects a small current into one wire or the other, depending on the logic level to be sent. The current passes through a resistor at the receiving end, and then returns in the opposite direction along the other wire. The receiver senses the polarity of the voltage across the resistor to determine the logic level.
Another function performed by the physical layer is that it manages the translation between 80-bit flits and 20-bit phits using a technique known as multilane distribution. The flits can be considered as a bit stream that is distributed across the data lanes in a round-robin fashion (first bit to first lane, second bit to second lane, etc.), as illustrated in Figure 3.20. This approach enables QPI to achieve very high data rates by implementing the physical link between two ports as multiple parallel channels.
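As a rough illustration of multilane distribution, the sketch below deals the bits of one 80-bit flit round-robin across 20 lanes, so that the flit leaves the port as four 20-bit phits. The function names `distribute_flit` and `reassemble` are hypothetical, and bits are modeled as list elements rather than electrical signals.

```python
def distribute_flit(flit_bits, lanes=20):
    """Deal flit bits round-robin across the lanes.

    Returns (lane_streams, phits): lane_streams[i] is the serial stream
    carried by lane i, and phits[t] is the group of bits sent in
    parallel during transfer t.
    """
    if len(flit_bits) % lanes != 0:
        raise ValueError("flit length must be a multiple of the lane count")
    lane_streams = [flit_bits[lane::lanes] for lane in range(lanes)]
    phits = [flit_bits[t:t + lanes] for t in range(0, len(flit_bits), lanes)]
    return lane_streams, phits


def reassemble(lane_streams):
    """Receiver side: interleave the lane streams back into one bit stream."""
    return [bit for phit in zip(*lane_streams) for bit in phit]


# Label each bit position 0..79 so the distribution is easy to follow.
flit = list(range(80))
lane_streams, phits = distribute_flit(flit)
print(len(phits))       # 4 phits per flit
print(lane_streams[0])  # [0, 20, 40, 60]
print(reassemble(lane_streams) == flit)  # True
```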
QPI Link Layer
The QPI link layer performs two key functions: flow control and error control. These functions are performed as part of the QPI link layer protocol, and operate on the level of the flit (flow control unit). Each flit consists of a 72-bit message payload and an 8-bit error control code called a cyclic redundancy check (CRC). We discuss error control codes in Chapter 5.

A flit payload may consist of data or message information. The data flits transfer the actual bits of data between cores or between a core and an IOH. The message flits are used for such functions as flow control, error control, and cache coherence. We discuss cache coherence in Chapters 5 and 17.
The flow control function is needed to ensure that a sending QPI entity does not overwhelm a receiving QPI entity by sending data faster than the receiver can process the data and clear buffers for more incoming data. To control the flow of data, QPI makes use of a credit scheme. During initialization, a sender is given a set number of credits to send flits to a receiver. Whenever a flit is sent to the receiver, the sender decrements its credit counters by one credit. Whenever a buffer is freed at the receiver, a credit is returned to the sender for that buffer. Thus, the receiver controls the pace at which data is transmitted over a QPI link.
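The credit scheme can be modeled in a few lines. The sketch below is a simplified, single-threaded model, not the QPI protocol itself: the class name `CreditedLink` and its methods are hypothetical, and the return of a credit is modeled as an immediate counter update rather than as a message flit.

```python
from collections import deque


class CreditedLink:
    """Toy model of credit-based flow control between one sender and one receiver."""

    def __init__(self, initial_credits: int):
        self.credits = initial_credits  # granted to the sender at initialization
        self.rx_buffer = deque()        # receiver buffers currently in use

    def send(self, flit) -> bool:
        """Sender side: transmit only if a credit is available."""
        if self.credits == 0:
            return False                # sender must wait for a credit
        self.credits -= 1               # one credit consumed per flit
        self.rx_buffer.append(flit)
        return True

    def receiver_consume(self):
        """Receiver side: free one buffer and return its credit to the sender."""
        flit = self.rx_buffer.popleft()
        self.credits += 1
        return flit


link = CreditedLink(initial_credits=2)
print(link.send("flit-0"))      # True
print(link.send("flit-1"))      # True
print(link.send("flit-2"))      # False: no credits left
print(link.receiver_consume())  # flit-0
print(link.send("flit-2"))      # True: a credit was returned
```

Because the sender can never hold more outstanding flits than it has credits, the receiver's buffer cannot overflow, which is the property the text describes.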
Occasionally, a bit transmitted at the physical layer is changed during transmission, due to noise or some other phenomenon. The error control function at the link layer detects and recovers from such bit errors, and so isolates higher layers from experiencing bit errors. The procedure works as follows for a flow of data from system A to system B:
1. As mentioned, each 80-bit flit includes an 8-bit CRC field. The CRC is a function of the value of the remaining 72 bits. On transmission, A calculates a CRC value for each flit and inserts that value into the flit.
2. When a flit is received, B calculates a CRC value for the 72-bit payload and compares this value with the incoming CRC value in the flit. If the two CRC values do not match, an error has been detected.
3. When B detects an error, it sends a request to A to retransmit the flit that is in error. However, because A may have had sufficient credit to send a stream of flits, additional flits may have been transmitted after the flit in error and before A receives the request to retransmit. Therefore, the request is for A to back up and retransmit the damaged flit plus all subsequent flits.
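The three steps above can be sketched as follows. The text does not specify which CRC polynomial QPI uses, so the generator 0x07 (x^8 + x^2 + x + 1) below is an assumption chosen purely for illustration, and the helper names are hypothetical. A flit is modeled as 9 payload bytes (72 bits) followed by 1 CRC byte.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8. The polynomial is an illustrative assumption, not QPI's."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def make_flit(payload: bytes) -> bytes:
    """Step 1: A appends the CRC of the 72-bit payload to form an 80-bit flit."""
    if len(payload) != 9:
        raise ValueError("payload must be 72 bits (9 bytes)")
    return payload + bytes([crc8(payload)])


def flit_ok(flit: bytes) -> bool:
    """Step 2: B recomputes the CRC and compares it with the received one."""
    return crc8(flit[:9]) == flit[9]


def first_bad_index(flits):
    """Index of the first flit that fails the CRC check, or None."""
    for i, flit in enumerate(flits):
        if not flit_ok(flit):
            return i
    return None


# A sends four flits; one bit of flit 1 is flipped in transit.
sent = [make_flit(bytes([i]) * 9) for i in range(4)]
received = list(sent)
damaged = bytearray(received[1])
damaged[0] ^= 0x01
received[1] = bytes(damaged)

# Step 3: B asks A to back up; A resends the damaged flit and all later ones.
bad = first_bad_index(received)
delivered = received[:bad] + sent[bad:]
print(bad)                # 1
print(delivered == sent)  # True
```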
QPI Routing Layer
The routing layer is used to determine the course that a packet will traverse across the available system interconnects. Routing tables are defined by firmware and describe the possible paths that a packet can follow. In small configurations, such as a two-socket platform, the routing options are limited and the routing tables quite simple. For larger systems, the routing table options are more complex, giving the flexibility of routing and rerouting traffic depending on how (1) devices are populated in the platform, (2) system resources are partitioned, and (3) reliability events result in mapping around a failing resource.
QPI Protocol Layer
In this layer, the packet is defined as the unit of transfer. The packet contents definition is standardized with some flexibility allowed to meet differing market segment requirements. One key function performed at this level is a cache coherency protocol, which deals with making sure that main memory values held in multiple caches are consistent. A typical data packet payload is a block of data being sent to or from a cache.

3.6 PCI EXPRESS

The peripheral component interconnect (PCI) is a popular high-bandwidth, processor-independent bus that can function as a mezzanine or peripheral bus. Compared with other common bus specifications, PCI delivers better system performance for high-speed I/O subsystems (e.g., graphic display adapters, network interface controllers, and disk controllers).
Intel began work on PCI in 1990 for its Pentium-based systems. Intel soon released all the patents to the public domain and promoted the creation of an industry association, the PCI Special Interest Group (SIG), to develop further and maintain the compatibility of the PCI specifications. The result is that PCI has been widely adopted and is finding increasing use in personal computer, workstation, and server systems. Because the specification is in the public domain and is supported by a broad cross-section of the microprocessor and peripheral industry, PCI products built by different vendors are compatible.
As with the system bus discussed in the preceding sections, the bus-based PCI scheme has not been able to keep pace with the data rate demands of attached devices. Accordingly, a new version, known as PCI Express (PCIe), has been developed. PCIe, as with QPI, is a point-to-point interconnect scheme intended to replace bus-based schemes such as PCI.
A key requirement for PCIe is high capacity to support the needs of higher data rate I/O devices, such as Gigabit Ethernet. Another requirement deals with the need to support time-dependent data streams. Applications such as video-on-demand and audio redistribution are putting real-time constraints on servers too. Many communications applications and embedded PC control systems also process data in real-time. Today's platforms must also deal with multiple concurrent transfers at ever-increasing data rates. It is no longer acceptable to treat all data as equal; it is more important, for example, to process streaming data first, since late real-time data is as useless as no data. Data needs to be tagged so that an I/O system can prioritize its flow throughout the platform.
PCI Physical and Logical Architecture
Figure 3.21 shows a typical configuration that supports the use of PCIe. A root complex device, also referred to as a chipset or a host bridge, connects the processor and memory subsystem to the PCI Express switch fabric comprising one or more PCIe and PCIe switch devices. The root complex acts as a buffering device, to deal with difference in data rates between I/O controllers and memory and processor components. The root complex also translates between PCIe transaction formats and the processor and memory signal and control requirements. The chipset will typically support multiple PCIe ports, some of which attach directly to a PCIe device, and one or more that attach to a switch that manages multiple PCIe streams. PCIe links from the chipset may attach to the following kinds of devices that implement PCIe:

  • Switch: The switch manages multiple PCIe streams.
  • PCIe endpoint: An I/O device or controller that implements PCIe, such as a Gigabit Ethernet switch, a graphics or video controller, disk interface, or a communications controller.

  • Legacy endpoint: The legacy endpoint category is intended for existing designs that have been migrated to PCI Express, and it allows legacy behaviors such as use of I/O space and locked transactions. PCI Express endpoints are not permitted to require the use of I/O space at runtime and must not use locked transactions. By distinguishing these categories, it is possible for a system designer to restrict or eliminate legacy behaviors that have negative impacts on system performance and robustness.
  • PCIe/PCI bridge: Allows older PCI devices to be connected to PCIe-based systems.

As with QPI, PCIe interactions are defined using a protocol architecture. The PCIe protocol architecture encompasses the following layers (Figure 3.22):

  • Physical: Consists of the actual wires carrying the signals, as well as circuitry and logic to support ancillary features required in the transmission and receipt of the 1s and 0s.
  • Data link: Is responsible for reliable transmission and flow control. Data packets generated and consumed by the data link layer (DLL) are called Data Link Layer Packets (DLLPs).
  • Transaction: Generates and consumes data packets used to implement load/store data transfer mechanisms and also manages the flow control of those packets between the two components on a link. Data packets generated and consumed by the transaction layer (TL) are called Transaction Layer Packets (TLPs).

Above the TL are software layers that generate read and write requests that are transported by the transaction layer to the I/O devices using a packet-based transaction protocol.
PCIe Physical Layer
Similar to QPI, PCIe is a point-to-point architecture. Each PCIe port consists of a number of bidirectional lanes (note that in QPI, the lane refers to transfer in one direction only). Transfer in each direction in a lane is by means of differential signaling over a pair of wires. A PCIe port can provide 1, 4, 6, 16, or 32 lanes. In what follows, we refer to the PCIe 3.0 specification, introduced in late 2010.
As with QPI, PCIe uses a multilane distribution technique. Figure 3.23 shows an example for a PCIe port consisting of four lanes. Data are distributed to the four lanes 1 byte at a time using a simple round-robin scheme. At each physical lane, data are buffered and processed 16 bytes (128 bits) at a time. Each block of 128 bits is encoded into a unique 130-bit codeword for transmission; this is referred to as 128b/130b encoding. Thus, the effective data rate of an individual lane is reduced by a factor of 128/130.
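The byte-wise round-robin distribution over four lanes can be sketched as below. This is a simplified model that ignores buffering into 128-bit blocks and any framing; the function names `stripe_bytes` and `unstripe_bytes` are hypothetical.

```python
def stripe_bytes(data: bytes, lanes: int = 4) -> list[bytes]:
    """Distribute a byte stream across lanes, one byte at a time, round-robin."""
    return [data[lane::lanes] for lane in range(lanes)]


def unstripe_bytes(lane_data: list[bytes]) -> bytes:
    """Receiver side: interleave the per-lane byte streams back together."""
    lanes = len(lane_data)
    out = bytearray(sum(len(chunk) for chunk in lane_data))
    for lane, chunk in enumerate(lane_data):
        out[lane::lanes] = chunk
    return bytes(out)


stream = bytes(range(8))            # bytes B0..B7
per_lane = stripe_bytes(stream)
print([list(chunk) for chunk in per_lane])
# [[0, 4], [1, 5], [2, 6], [3, 7]]
print(unstripe_bytes(per_lane) == stream)  # True
```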

To understand the rationale for the 128b/130b encoding, note that unlike QPI, PCIe does not use its clock line to synchronize the bit stream. That is, the clock line is not used to determine the start and end point of each incoming bit; it is used for other signaling purposes only. However, it is necessary for the receiver to be synchronized with the transmitter, so that the receiver knows when each bit begins and ends. If there is any drift between the clocks used for bit transmission and reception of the transmitter and receiver, errors may occur. To compensate for the possibility of drift, PCIe relies on the receiver synchronizing with the transmitter based on the transmitted signal. As with QPI, PCIe uses differential signaling over a pair of wires. Synchronization can be achieved by the receiver looking for transitions in the data and synchronizing its clock to the transition. However, consider that with a long string of 1s or 0s using differential signaling, the output is a constant voltage over a long period of time. Under these circumstances, any drift between the clocks of transmitter and receiver will result in loss of synchronization between the two.
A common approach, and the one used in PCIe 3.0, to overcoming the problem of a long string of bits of one value is scrambling. Scrambling, which does not increase the number of bits to be transmitted, is a mapping technique that tends to make the data appear more random. The scrambling tends to spread out the number of transitions so that they appear at the receiver more uniformly spaced, which is good for synchronization. Also, other transmission properties, such as spectral properties, are enhanced if the data are more nearly of a random nature rather than constant or repetitive. For more discussion of scrambling, see Appendix E.
Another technique that can aid in synchronization is encoding, in which additional bits are inserted into the bit stream to force transitions. For PCIe 3.0, each group of 128 bits of input is mapped into a 130-bit block by adding a 2-bit block sync header. The value of the header is 10 for a data block and 01 for what is called an ordered set block, which refers to a link-level information block.
Figure 3.24 illustrates the use of scrambling and encoding. Data to be transmitted are fed into a scrambler. The scrambled output is then fed into a 128b/130b encoder, which buffers 128 bits and then maps the 128-bit block into a 130-bit block. This block then passes through a parallel-to-serial converter and is transmitted one bit at a time using differential signaling.
At the receiver, a clock is synchronized to the incoming data to recover the bit stream. This then passes through a serial-to-parallel converter to produce a stream of 130-bit blocks. Each block is passed through a 128b/130b decoder to recover the original scrambled bit pattern, which is then descrambled to produce the original bit stream.
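The transmit and receive pipeline just described (scramble, add a sync header, strip the header, descramble) can be sketched as follows. This is a toy model under loud assumptions: the 16-bit LFSR and its seed are chosen only for illustration and are not the scrambler polynomial defined by PCIe 3.0, the scrambler is restarted for every block, and a 128-bit block is represented as a Python integer. Only the header values (10 for data, 01 for an ordered set) come from the text.

```python
DATA_HEADER = 0b10         # sync header for a data block (from the text)
ORDERED_SET_HEADER = 0b01  # sync header for an ordered set block (from the text)
MASK_128 = (1 << 128) - 1


def lfsr_keystream(nbits: int, state: int = 0xACE1) -> int:
    """Pseudo-random bits from an illustrative 16-bit LFSR (NOT the PCIe one)."""
    out = 0
    for _ in range(nbits):
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        out = (out << 1) | bit
    return out


def scramble(block128: int) -> int:
    """XOR with the keystream; applying it twice restores the input."""
    return (block128 ^ lfsr_keystream(128)) & MASK_128


def encode_block(block128: int, header: int = DATA_HEADER) -> int:
    """128b/130b: prepend the 2-bit sync header to a 128-bit block."""
    return (header << 128) | (block128 & MASK_128)


def decode_block(block130: int) -> tuple[int, int]:
    """Split a 130-bit block into (sync header, 128-bit body)."""
    return block130 >> 128, block130 & MASK_128


original = 0                                 # worst case: 128 identical bits
on_wire = encode_block(scramble(original))   # transmitter side
header, body = decode_block(on_wire)         # receiver side
recovered = scramble(body)                   # descrambling is the same XOR
print(on_wire.bit_length())                  # 130
print(header == DATA_HEADER, recovered == original)  # True True
```

Note that the scrambled body is no longer a constant run of zeros, which is the property that helps the receiver's clock recovery.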
Using these techniques, a data rate of 16 GB/s can be achieved. One final detail to mention: each transmission of a block of data over a PCI link begins and ends with an 8-bit framing sequence intended to give the receiver time to synchronize with the incoming physical layer bit stream.
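One way to arrive at a figure of roughly 16 GB/s is sketched below. Two inputs are assumptions that the text above does not state: the PCIe 3.0 signaling rate of 8 GT/s per lane, and a 16-lane link. Under those assumptions, the 128/130 encoding factor brings the payload rate to just under 16 GB/s in each direction.

```python
TRANSFERS_PER_SECOND = 8_000_000_000  # assumed: 8 GT/s per lane (PCIe 3.0)
LANES = 16                            # assumed: a 16-lane link
ENCODING_EFFICIENCY = 128 / 130       # 128b/130b encoding, from the text

raw_bits_per_second = TRANSFERS_PER_SECOND * LANES
payload_bits_per_second = raw_bits_per_second * ENCODING_EFFICIENCY
gbytes_per_second = payload_bits_per_second / 8 / 1e9

print(round(ENCODING_EFFICIENCY, 4))  # 0.9846
print(round(gbytes_per_second, 2))    # 15.75
```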

PCIe Transaction Layer
The transaction layer (TL) receives read and write requests from the software above the TL and creates request packets for transmission to a destination via the link layer. Most transactions use a split transaction technique, which works in the following fashion. A request packet is sent out by a source PCIe device, which then waits for a response, called a completion packet. The completion following a request is initiated by the completer only when it has the data and/or status ready for delivery. Each packet has a unique identifier that enables completion packets to be directed to the correct originator. With the split transaction technique, the completion is separated in time from the request, in contrast to a typical bus operation in which both sides of a transaction must be available to seize and use the bus. Between the request and the completion, other PCIe traffic may use the link.
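The role of the unique identifier in a split transaction can be illustrated with a small model. The class `Requester`, the dictionary-based packets, and the field names `tag`, `address`, and `data` are all hypothetical and do not reflect the actual TLP header layout; the point is only that completions may arrive later, and in any order, and are still matched to the right request.

```python
class Requester:
    """Toy requester that matches completions to outstanding requests by tag."""

    def __init__(self):
        self._next_tag = 0
        self.pending = {}   # tag -> address of the outstanding request

    def send_read(self, address: int) -> dict:
        """Issue a read request and remember it until its completion arrives."""
        tag = self._next_tag
        self._next_tag += 1
        self.pending[tag] = address
        return {"type": "read_request", "tag": tag, "address": address}

    def on_completion(self, completion: dict) -> tuple[int, bytes]:
        """Match a completion to its request; returns (address, data)."""
        address = self.pending.pop(completion["tag"])
        return address, completion["data"]


req = Requester()
first = req.send_read(0x1000)
second = req.send_read(0x2000)

# The completer answers the second request first; the link was free in between.
print(req.on_completion({"tag": second["tag"], "data": b"\xbb"}))  # (8192, b'\xbb')
print(req.on_completion({"tag": first["tag"], "data": b"\xaa"}))   # (4096, b'\xaa')
print(req.pending)  # {}
```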
TL messages and some write transactions are posted transactions, meaning that no response is expected.
The TL packet format supports 32-bit memory addressing and extended 64-bit memory addressing. Packets also have attributes such as "no-snoop," "relaxed ordering," and "priority," which may be used to optimally route these packets through the I/O subsystem.
ADDRESS SPACES AND TRANSACTION TYPES The TL supports four address spaces:

  • Memory: The memory space includes system main memory. It also includes PCIe I/O devices. Certain ranges of memory addresses map into I/O devices.
  • I/O: This address space is used for legacy PCI devices, with reserved memory address ranges used to address legacy I/O devices.
  • Configuration: This address space enables the TL to read/write configuration registers associated with I/O devices.
  • Message: This address space is for control signals related to interrupts, error handling, and power management.

Table 3.2 shows the transaction types provided by the TL. For memory, I/O, and configuration address spaces, there are read and write transactions. In the case of memory transactions, there is also a read lock request function. Locked operations occur as a result of device drivers requesting atomic access to registers on a PCIe device. A device driver, for example, can atomically read, modify, and then write to a device register. To accomplish this, the device driver causes the processor to execute an instruction or set of instructions. The root complex converts these processor instructions into a sequence of PCIe transactions, which perform individual read and write requests for the device driver. If these transactions must be executed atomically, the root complex locks the PCIe link while executing the transactions. This locking prevents transactions that are not part of the sequence from occurring. This sequence of transactions is called a locked operation. The particular set of processor instructions that can cause a locked operation to occur depends on the system chipset and processor architecture.

To maintain compatibility with PCI, PCIe supports both Type 0 and Type 1 configuration cycles. A Type 1 cycle propagates downstream until it reaches the bridge interface hosting the bus (link) that the target device resides on. The configuration transaction is converted on the destination link from Type 1 to Type 0 by the bridge.
Finally, completion messages are used with split transactions for memory, I/O, and configuration transactions.
TLP PACKET ASSEMBLY PCIe transactions are conveyed using transaction layer packets, which are illustrated in Figure 3.25a. A TLP originates in the transaction layer of the sending device and terminates at the transaction layer of the receiving device.

Upper layer software sends to the TL the information needed for the TL to create the core of the TLP, which consists of the following fields:

  • Header: The header describes the type of packet and includes information needed by the receiver to process the packet, including any needed routing information. The internal header format is discussed subsequently.
  • Data: A data field of up to 4096 bytes may be included in the TLP. Some TLPs do not contain a data field.
  • ECRC: An optional end-to-end CRC field enables the destination TL to check for errors in the header and data portions of the TLP.

PCIe Data Link Layer
The purpose of the PCIe data link layer is to ensure reliable delivery of packets across the PCIe link. The DLL participates in the formation of TLPs and also transmits DLLPs.
DATA LINK LAYER PACKETS Data link layer packets originate at the data link layer of a transmitting device and terminate at the DLL of the device on the other end of the link. Figure 3.25b shows the format of a DLLP. There are three important groups of DLLPs used in managing a link: flow control packets, power management packets, and TLP ACK and NAK packets. Power management packets are used in managing power platform budgeting. Flow control packets regulate the rate at which TLPs and DLLPs can be transmitted across a link. The ACK and NAK packets are used in TLP processing, discussed in the following paragraphs.
TRANSACTION LAYER PACKET PROCESSING The DLL adds two fields to the core of the TLP created by the TL (Figure 3.25a): a 16-bit sequence number and a 32-bit link-layer CRC (LCRC). Whereas the core fields created at the TL are only used at the destination TL, the two fields added by the DLL are processed at each intermediate node on the way from source to destination.
When a TLP arrives at a device, the DLL strips off the sequence number and LCRC fields and checks the LCRC. There are two possibilities:
1. If no errors are detected, the core portion of the TLP is handed up to the local transaction layer. If this receiving device is the intended destination, then the TL processes the TLP. Otherwise, the TL determines a route for the TLP and passes it back down to the DLL for transmission over the next link on the way to the destination.
2. If an error is detected, the DLL schedules an NAK DLL packet to return to the remote transmitter. The TLP is eliminated.
When the DLL transmits a TLP, it retains a copy of the TLP. If it receives an NAK for the TLP with this sequence number, it retransmits the TLP. When it receives an ACK, it discards the buffered TLP.
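The transmitter-side behavior described in the last paragraph can be sketched as a small retry buffer. This is a simplified model under stated assumptions: the class `DllTransmitter` and its methods are hypothetical, each ACK or NAK is taken to refer to exactly one sequence number, and the sequence number simply wraps at 16 bits as suggested by the field width given in the text.

```python
class DllTransmitter:
    """Toy data link layer transmitter with a retry buffer."""

    SEQ_MODULUS = 1 << 16   # 16-bit sequence number field, from the text

    def __init__(self):
        self.next_seq = 0
        self.retry_buffer = {}   # sequence number -> retained copy of the TLP

    def transmit(self, tlp: bytes) -> tuple[int, bytes]:
        """Assign a sequence number, retain a copy, and send the TLP."""
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) % self.SEQ_MODULUS
        self.retry_buffer[seq] = tlp
        return seq, tlp

    def on_ack(self, seq: int) -> None:
        """ACK received: the buffered copy is no longer needed."""
        self.retry_buffer.pop(seq, None)

    def on_nak(self, seq: int) -> tuple[int, bytes]:
        """NAK received: retransmit the retained copy with the same number."""
        return seq, self.retry_buffer[seq]


tx = DllTransmitter()
seq0, _ = tx.transmit(b"TLP-A")
seq1, _ = tx.transmit(b"TLP-B")

tx.on_ack(seq0)                 # TLP-A arrived intact
print(tx.on_nak(seq1))          # (1, b'TLP-B'): resend after an LCRC error
tx.on_ack(seq1)                 # the retransmission was accepted
print(tx.retry_buffer)          # {}
```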

3.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS

Key Terms

Review Questions
3.1 What general categories of functions are specified by computer instructions?
3.2 List and briefly define the possible states that define an instruction execution.
3.3 List and briefly define two approaches to dealing with multiple interrupts.
3.4 What types of transfers must a computer's interconnection structure (e.g., bus) support?
3.5 List and briefly define the QPI protocol layers.
3.6 List and briefly define the PCIe protocol layers.
Problems
3.1 The hypothetical machine of Figure 3.4 also has two I/O instructions:

0011 = Load AC from I/O
0111 = Store AC to I/O

In these cases, the 12-bit address identifies a particular I/O device. Show the program execution (using the format of Figure 3.5) for the following program:
  1. Load AC from device 5.
  2. Add contents of memory location 940.
  3. Store AC to device 6.
Assume that the next value retrieved from device 5 is 3 and that location 940 contains a value of 2.
3.2 The program execution of Figure 3.5 is described in the text using six steps. Expand this description to show the use of the MAR and MBR.
3.3 Consider a hypothetical 32-bit microprocessor having 32-bit instructions composed of two fields: the first byte contains the opcode and the remainder the immediate operand or an operand address.
  a. What is the maximum directly addressable memory capacity (in bytes)?
  b. Discuss the impact on the system speed if the microprocessor bus has:
    1. a 32-bit local address bus and a 16-bit local data bus, or
    2. a 16-bit local address bus and a 16-bit local data bus.
  c. How many bits are needed for the program counter and the instruction register?
3.4 Consider a hypothetical microprocessor generating a 16-bit address (for example, assume that the program counter and the address registers are 16 bits wide) and having a 16-bit data bus.
  a. What is the maximum memory address space that the processor can access directly if it is connected to a “16-bit memory”?
  b. What is the maximum memory address space that the processor can access directly if it is connected to an “8-bit memory”?
  c. What architectural features will allow this microprocessor to access a separate “I/O space”?
  d. If an input and an output instruction can specify an 8-bit I/O port number, how many 8-bit I/O ports can the microprocessor support? How many 16-bit I/O ports? Explain.
3.5 Consider a 32-bit microprocessor, with a 16-bit external data bus, driven by an 8-MHz input clock. Assume that this microprocessor has a bus cycle whose minimum duration equals four input clock cycles. What is the maximum data transfer rate across the bus that this microprocessor can sustain in bytes/sec? To increase its performance, would it be better to make its external data bus 32 bits or to double the external clock frequency supplied to the microprocessor? State any other assumptions you make, and explain. Hint: Determine the number of bytes that can be transferred per bus cycle.
3.6 Consider a computer system that contains an I/O module controlling a simple keyboard/printer teletype. The following registers are contained in the processor and connected directly to the system bus:
  INPR: Input Register, 8 bits
  OUTR: Output Register, 8 bits
  FGI: Input Flag, 1 bit
  FGO: Output Flag, 1 bit
  IEN: Interrupt Enable, 1 bit
Keystroke input from the teletype and printer output to the teletype are controlled by the I/O module. The teletype is able to encode an alphanumeric symbol to an 8-bit word and decode an 8-bit word into an alphanumeric symbol.
  a. Describe how the processor, using the first four registers listed in this problem, can achieve I/O with the teletype.
  b. Describe how the function can be performed more efficiently by also employing IEN.
3.7 Consider two microprocessors having 8- and 16-bit-wide external data buses, respectively. The two processors are identical otherwise and their bus cycles take just as long.
  a. Suppose all instructions and operands are two bytes long. By what factor do the maximum data transfer rates differ?
  b. Repeat assuming that half of the operands and instructions are one byte long.
3.8 Figure 3.26 indicates a distributed arbitration scheme that can be used with an obsolete bus scheme known as Multibus I. Agents are daisy-chained physically in priority order. The left-most agent in the diagram receives a constant bus priority in (BPRN) signal indicating that no higher-priority agent desires the bus. If the agent does not require the bus, it asserts its bus priority out (BPRO) line. At the beginning of a clock cycle, any agent can request control of the bus by lowering its BPRO line. This lowers the BPRN line of the next agent in the chain, which is in turn required to lower its BPRO line. Thus, the signal is propagated the length of the chain. At the end of this chain reaction, there should be only one agent whose BPRN is asserted and whose BPRO is not. This agent has priority. If, at the beginning of a bus cycle, the bus is not busy (BUSY inactive), the agent that has priority may seize control of the bus by asserting the BUSY line.
It takes a certain amount of time for the BPR signal to propagate from the highest-priority agent to the lowest. Must this time be less than the clock cycle? Explain.

3.9 The VAX SBI bus uses a distributed, synchronous arbitration scheme. Each SBI device (e.g., processor, memory, I/O module) has a unique priority and is assigned a unique transfer request (TR) line. The SBI has 16 such lines (TR0, TR1, ..., TR15), with TR0 having the highest priority. When a device wants to use the bus, it places a reservation for a future time slot by asserting its TR line during the current time slot. At the end of the current time slot, each device with a pending reservation examines the TR lines; the highest-priority device with a reservation uses the next time slot.
A maximum of 17 devices can be attached to the bus. The device with priority 16 has no TR line. Why not?
3.10 On the VAX SBI, the lowest-priority device usually has the lowest average wait time. For this reason, the processor is usually given the lowest priority on the SBI. Why does the priority 16 device usually have the lowest average wait time? Under what circumstances would this not be true?
3.11 For a synchronous read operation (Figure 3.18), the memory module must place the data on the bus sufficiently ahead of the falling edge of the Read signal to allow for signal settling. Assume a microprocessor bus is clocked at 10 MHz and that the Read signal begins to fall in the middle of the second half of T3.
  a. Determine the length of the memory read instruction cycle.
  b. When, at the latest, should memory data be placed on the bus? Allow 20 ns for the settling of data lines.
3.12 Consider a microprocessor that has a memory read timing as shown in Figure 3.18. After some analysis, a designer determines that the memory falls short of providing read data on time by about 180 ns.
  a. How many wait states (clock cycles) need to be inserted for proper system operation if the bus clocking rate is 8 MHz?
  b. To enforce the wait states, a Ready status line is employed. Once the processor has issued a Read command, it must wait until the Ready line is asserted before attempting to read data. At what time interval must we keep the Ready line low in order to force the processor to insert the required number of wait states?
3.13 A microprocessor has a memory write timing as shown in Figure 3.18. Its manufacturer specifies that the width of the Write signal can be determined by T - 50, where T is the clock period in ns.
  a. What width should we expect for the Write signal if the bus clocking rate is 5 MHz?
  b. The data sheet for the microprocessor specifies that the data remain valid for 20 ns after the falling edge of the Write signal. What is the total duration of valid data presentation to memory?
  c. How many wait states should we insert if memory requires valid data presentation for at least 190 ns?
3.14 A microprocessor has an increment memory direct instruction, which adds 1 to the value in a memory location. The instruction has five stages: fetch opcode (four bus clock cycles), fetch operand address (three cycles), fetch operand (three cycles), add 1 to operand (three cycles), and store operand (three cycles).
  a. By what amount (in percent) will the duration of the instruction increase if we have to insert two bus wait states in each memory read and memory write operation?
  b. Repeat assuming that the increment operation takes 13 cycles instead of 3 cycles.
3.15 The Intel 8088 microprocessor has a read bus timing similar to that of Figure 3.18 but requires four processor clock cycles. The valid data is on the bus for an amount of time that extends into the fourth processor clock cycle. Assume a processor clock rate of 8 MHz.
  a. What is the maximum data transfer rate?
  b. Repeat, but assume the need to insert one wait state per byte transferred.
3.16 The Intel 8086 is a 16-bit processor similar in many ways to the 8-bit 8088. The 8086 uses a 16-bit bus that can transfer 2 bytes at a time, provided that the lower-order byte has an even address. However, the 8086 allows both even- and odd-aligned word operands. If an odd-aligned word is referenced, two memory cycles, each consisting of four bus cycles, are required to transfer the word. Consider an instruction that involves two 16-bit operands. How long does it take to fetch the operands? Give the range of possible answers. Assume a clocking rate of 4 MHz and no wait states.
3.17 Consider a 32-bit microprocessor whose bus cycle is the same duration as that of a 16-bit microprocessor. Assume that, on average, 20% of the operands and instructions are 32 bits long, 40% are 16 bits long, and 40% are only 8 bits long. Calculate the improvement achieved when fetching instructions and operands with the 32-bit microprocessor.
3.18 The microprocessor of Problem 3.14 initiates the fetch operand stage of the increment memory direct instruction at the same time that a keyboard activates an interrupt request line. After how long does the processor enter the interrupt processing cycle? Assume a bus clocking rate of 10 MHz.
