Computer Architecture: A Quantitative Approach, Sixth Edition
John L. Hennessy and David A. Patterson
1 Fundamentals of Quantitative Design and Analysis
1.1 Introduction
Computer technology has made incredible progress in the roughly 70 years since the first general-purpose electronic computer was created. Today, less than $500 will purchase a cell phone that has as much performance as the world's fastest computer bought in 1993 for $50 million. This rapid improvement has come both from advances in the technology used to build computers and from innovations in computer design.
Although technological improvements historically have been fairly steady, progress arising from better computer architectures has been much less consistent. During the first 25 years of electronic computers, both forces made a major contribution, delivering performance improvement of about 25% per year. The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology led to a higher rate of performance improvement—roughly 35% growth per year.
This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to an increasing fraction of the computer business being based on microprocessors. In addition, two significant changes in the computer marketplace made it easier than ever before to succeed commercially with a new architecture. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux, lowered the cost and risk of bringing out a new architecture.
These changes made it possible to develop successfully a new set of architectures with simpler instructions, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-based machines focused the attention of designers on two critical performance techniques, the exploitation of instruction-level parallelism (initially through pipelining and later through multiple instruction issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations).
The RISC-based computers raised the performance bar, forcing prior architectures to keep up or disappear. The Digital Equipment Vax could not, and so it was replaced by a RISC architecture. Intel rose to the challenge, primarily by translating 80x86 instructions into RISC-like instructions internally, allowing it to adopt many of the innovations first pioneered in the RISC designs. As transistor counts soared in the late 1990s, the hardware overhead of translating the more complex x86 architecture became negligible. In low-end applications, such as cell phones, the cost in power and silicon area of the x86-translation overhead helped lead to a RISC architecture, ARM, becoming dominant.
Figure 1.1 shows that the combination of architectural and organizational enhancements led to 17 years of sustained growth in performance at an annual rate of over 50%—a rate that is unprecedented in the computer industry.
The effect of this dramatic growth rate during the 20th century was fourfold. First, it has significantly enhanced the capability available to computer users. For many applications, the highest-performance microprocessors outperformed the supercomputer of less than 20 years earlier.
Second, this dramatic improvement in cost-performance led to new classes of computers. Personal computers and workstations emerged in the 1980s with the availability of the microprocessor. The past decade saw the rise of smart cell phones and tablet computers, which many people are using as their primary computing platforms instead of PCs. These mobile client devices are increasingly using the Internet to access warehouses containing 100,000 servers, which are being designed as if they were a single gigantic computer.
Third, improvement of semiconductor manufacturing as predicted by Moore's Law has led to the dominance of microprocessor-based computers across the entire range of computer design. Minicomputers, which were traditionally made from off-the-shelf logic or from gate arrays, were replaced by servers made by using microprocessors. Even mainframe computers and high-performance supercomputers are all collections of microprocessors.
The preceding hardware innovations led to a renaissance in computer design, which emphasized both architectural innovation and efficient use of technology improvements. This rate of growth compounded so that by 2003, high-performance microprocessors were 7.5 times as fast as what would have been obtained by relying solely on technology, including improved circuit design, that is, 52% per year versus 35% per year.
This hardware renaissance led to the fourth impact, which was on software development. This 50,000-fold performance improvement since 1978 (see Figure 1.1) allowed modern programmers to trade performance for productivity. In place of performance-oriented languages like C and C++, much more programming today is done in managed programming languages like Java and Scala. Moreover, scripting languages like JavaScript and Python, which are even more productive, are gaining in popularity along with programming frameworks like AngularJS and Django. To maintain productivity and try to close the performance gap, interpreters with just-in-time compilers and trace-based compiling are replacing the traditional compiler and linker of the past. Software deployment is changing as well, with Software as a Service (SaaS) used over the Internet replacing shrink-wrapped software that must be installed and run on a local computer.
The nature of applications is also changing. Speech, sound, images, and video are becoming increasingly important, along with predictable response time that is so critical to the user experience. An inspiring example is Google Translate. This application lets you hold up your cell phone to point its camera at an object, and the image is sent wirelessly over the Internet to a warehouse-scale computer (WSC) that recognizes the text in the photo and translates it into your native language. You can also speak into it, and it will translate what you said into audio output in another language. It translates text in 90 languages and voice in 15 languages.
Alas, Figure 1.1 also shows that this 17-year hardware renaissance is over. The fundamental reason is that two characteristics of semiconductor processes that were true for decades no longer hold.
In 1974 Robert Dennard observed that power density was constant for a given area of silicon even as you increased the number of transistors because of smaller dimensions of each transistor. Remarkably, transistors could go faster but use less power. Dennard scaling ended around 2004 because current and voltage couldn't keep dropping and still maintain the dependability of integrated circuits.
This change forced the microprocessor industry to use multiple efficient processors or cores instead of a single inefficient processor. Indeed, in 2004 Intel canceled its high-performance uniprocessor projects and joined others in declaring that the road to higher performance would be via multiple processors per chip rather than via faster uniprocessors. This milestone signaled a historic switch from relying solely on instruction-level parallelism (ILP), the primary focus of the first three editions of this book, to data-level parallelism (DLP) and thread-level parallelism (TLP), which were featured in the fourth edition and expanded in the fifth edition. The fifth edition also added WSCs and request-level parallelism (RLP), which is expanded in this edition. Whereas the compiler and hardware conspire to exploit ILP implicitly without the programmer's attention, DLP, TLP, and RLP are explicitly parallel, requiring the restructuring of the application so that it can exploit explicit parallelism. In some instances, this is easy; in many, it is a major new burden for programmers.
Amdahl's Law (Section 1.9) prescribes practical limits to the number of useful cores per chip. If 10% of the task is serial, then the maximum performance benefit from parallelism is 10 no matter how many cores you put on the chip.
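As a quick check of this claim (the worked bound below is ours, using the standard statement of Amdahl's Law from Section 1.9), suppose a fraction f = 0.9 of the task can be spread across N cores while the remaining 10% stays serial:

$$\text{Speedup} = \frac{1}{(1-f) + f/N} \;<\; \frac{1}{1-f} = \frac{1}{0.10} = 10,$$

so no matter how large N grows, the speedup never reaches 10.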
The second observation that ended recently is Moore's Law. In 1965 Gordon Moore famously predicted that the number of transistors per chip would double every year, which was amended in 1975 to every two years. That prediction lasted for about 50 years, but no longer holds. For example, in the 2010 edition of this book, the most recent Intel microprocessor had 1,170,000,000 transistors. If Moore's Law had continued, we could have expected microprocessors in 2016 to have 18,720,000,000 transistors. Instead, the equivalent Intel microprocessor has just 1,750,000,000 transistors, or off by a factor of 10 from what Moore's Law would have predicted.
The combination of
- transistors no longer getting much better because of the slowing of Moore's Law and the end of Dennard scaling,
- the unchanging power budgets for microprocessors,
- the replacement of the single power-hungry processor with several energy-efficient processors, and
- the limits to multiprocessing set by Amdahl's Law
caused improvements in processor performance to slow down, that is, to double every 20 years, rather than every 1.5 years as it did between 1986 and 2003 (see Figure 1.1).
The only path left to improve energy-performance-cost is specialization. Future microprocessors will include several domain-specific cores that perform only one class of computations well, but they do so remarkably better than general-purpose cores. The new Chapter 7 in this edition introduces domain-specific architectures.
This text is about the architectural ideas and accompanying compiler improvements that made the incredible growth rate possible over the past century, the reasons for the dramatic change, and the challenges and initial promising approaches to architectural ideas, compilers, and interpreters for the 21st century. At the core is a quantitative approach to computer design and analysis that uses empirical observations of programs, experimentation, and simulation as its tools. It is this style and approach to computer design that is reflected in this text. The purpose of this chapter is to lay the quantitative foundation on which the following chapters and appendices are based.
This book was written not only to explain this design style but also to stimulate you to contribute to this progress. We believe this approach will serve the computers of the future just as it worked for the implicitly parallel computers of the past.
1.2 Classes of Computers
These changes have set the stage for a dramatic change in how we view computing, computing applications, and the computer markets in this new century. Not since the creation of the personal computer have we seen such striking changes in the way computers appear and in how they are used. These changes in computer use have led to five diverse computing markets, each characterized by different applications, requirements, and computing technologies. Figure 1.2 summarizes these mainstream classes of computing environments and their important characteristics.
Internet of Things/Embedded Computers
Embedded computers are found in everyday machines: microwaves, washing machines, most printers, networking switches, and all automobiles. The phrase Internet of Things (IoT) refers to embedded computers that are connected to the Internet, typically wirelessly. When augmented with sensors and actuators, IoT devices collect useful data and interact with the physical world, leading to a wide variety of "smart" applications, such as smart watches, smart thermostats, smart speakers, smart cars, smart homes, smart grids, and smart cities.
Embedded computers have the widest spread of processing power and cost. They include 8-bit to 32-bit processors that may cost one penny, and high-end 64-bit processors for cars and network switches that cost $100. Although the range of computing power in the embedded computing market is very large, price is a key factor in the design of computers for this space. Performance requirements do exist, of course, but the primary goal is often meeting the performance need at a minimum price, rather than achieving more performance at a higher price. The projections for the number of IoT devices in 2020 range from 20 to 50 billion.
Most of this book applies to the design, use, and performance of embedded processors, whether they are off-the-shelf microprocessors or microprocessor cores that will be assembled with other special-purpose hardware.
Unfortunately, the data that drive the quantitative design and evaluation of other classes of computers have not yet been extended successfully to embedded computing (see the challenges with EEMBC, for example, in Section 1.8). Hence we are left for now with qualitative descriptions, which do not fit well with the rest of the book. As a result, the embedded material is concentrated in Appendix E. We believe a separate appendix improves the flow of ideas in the text while allowing readers to see how the differing requirements affect embedded computing.
Personal Mobile Device
Personal mobile device (PMD) is the term we apply to a collection of wireless devices with multimedia user interfaces such as cell phones, tablet computers, and so on. Cost is a prime concern given the consumer price for the whole product is a few hundred dollars. Although the emphasis on energy efficiency is frequently driven by the use of batteries, the need to use less expensive packaging—plastic versus ceramic—and the absence of a fan for cooling also limit total power consumption. We examine the issue of energy and power in more detail in Section 1.5. Applications on PMDs are often web-based and media-oriented, like the previously mentioned Google Translate example. Energy and size requirements lead to use of Flash memory for storage (Chapter 2) instead of magnetic disks.
The processors in a PMD are often considered embedded computers, but we are keeping them as a separate category because PMDs are platforms that can run externally developed software, and they share many of the characteristics of desktop computers. Other embedded devices are more limited in hardware and software sophistication. We use the ability to run third-party software as the dividing line between nonembedded and embedded computers.
Responsiveness and predictability are key characteristics for media applications. A real-time performance requirement means a segment of the application has an absolute maximum execution time. For example, in playing a video on a PMD, the time to process each video frame is limited, since the processor must accept and process the next frame shortly. In some applications, a more nuanced requirement exists: the average time for a particular task is constrained as well as the number of instances when some maximum time is exceeded. Such approaches—sometimes called soft real-time—arise when it is possible to miss the time constraint on an event occasionally, as long as not too many are missed. Real-time performance tends to be highly application-dependent.
Other key characteristics in many PMD applications are the need to minimize memory and the need to use energy efficiently. Energy efficiency is driven by both battery power and heat dissipation. The memory can be a substantial portion of the system cost, and it is important to optimize memory size in such cases. The importance of memory size translates to an emphasis on code size, since data size is dictated by the application.
Desktop Computing
The first, and possibly still the largest market in dollar terms, is desktop computing. Desktop computing spans from low-end netbooks that sell for under $300 to high-end, heavily configured workstations that may sell for $2500. Since 2008, more than half of the desktop computers made each year have been battery operated laptop computers. Desktop computing sales are declining.
Throughout this range in price and capability, the desktop market tends to be driven to optimize price-performance. This combination of performance (measured primarily in terms of compute performance and graphics performance) and price of a system is what matters most to customers in this market, and hence to computer designers. As a result, the newest, highest-performance microprocessors and cost-reduced microprocessors often appear first in desktop systems (see Section 1.6 for a discussion of the issues affecting the cost of computers).
Desktop computing also tends to be reasonably well characterized in terms of applications and benchmarking, though the increasing use of web-centric, interactive applications poses new challenges in performance evaluation.
Servers
As the shift to desktop computing occurred in the 1980s, the role of servers grew to provide larger-scale and more reliable file and computing services. Such servers have become the backbone of large-scale enterprise computing, replacing the traditional mainframe.
For servers, different characteristics are important. First, availability is critical. (We discuss availability in Section 1.7.) Consider the servers running ATM machines for banks or airline reservation systems. Failure of such server systems is far more catastrophic than failure of a single desktop, since these servers must operate seven days a week, 24 hours a day. Figure 1.3 estimates revenue costs of downtime for server applications.
A second key feature of server systems is scalability. Server systems often grow in response to an increasing demand for the services they support or an expansion in functional requirements. Thus the ability to scale up the computing capacity, the memory, the storage, and the I/O bandwidth of a server is crucial.
Finally, servers are designed for efficient throughput. That is, the overall performance of the server—in terms of transactions per minute or web pages served per second—is what is crucial. Responsiveness to an individual request remains important, but overall efficiency and cost-effectiveness, as determined by how many requests can be handled in a unit time, are the key metrics for most servers. We return to the issue of assessing performance for different types of computing environments in Section 1.8.
Clusters/Warehouse-Scale Computers
The growth of Software as a Service (SaaS) for applications like search, social networking, video viewing and sharing, multiplayer games, online shopping, and so on has led to the growth of a class of computers called clusters. Clusters are collections of desktop computers or servers connected by local area networks to act as a single larger computer. Each node runs its own operating system, and nodes communicate using a networking protocol. WSCs are the largest of the clusters, in that they are designed so that tens of thousands of servers can act as one. Chapter 6 describes this class of extremely large computers.
Price-performance and power are critical to WSCs since they are so large. As Chapter 6 explains, the majority of the cost of a warehouse is associated with power and cooling of the computers inside the warehouse. The annual amortized cost of the computers themselves and the networking gear in a WSC is $40 million, because they are usually replaced every few years. When you are buying that much computing, you need to buy wisely, because a 10% improvement in price-performance means an annual savings of $4 million (10% of $40 million) per WSC; a company like Amazon might have 100 WSCs!
WSCs are related to servers in that availability is critical. For example, Amazon.com had $136 billion in sales in 2016. As there are about 8800 hours in a year, the average revenue per hour was about $15 million. During a peak hour for Christmas shopping, the potential loss would be many times higher. As Chapter 6 explains, the difference between WSCs and servers is that WSCs use redundant, inexpensive components as the building blocks, relying on a software layer to catch and isolate the many failures that will happen with computing at this scale to deliver the availability needed for such applications. Note that scalability for a WSC is handled by the local area network connecting the computers and not by integrated computer hardware, as in the case of servers.
Supercomputers are related to WSCs in that they are equally expensive, costing hundreds of millions of dollars, but supercomputers differ by emphasizing floating-point performance and by running large, communication-intensive batch programs that can run for weeks at a time. In contrast, WSCs emphasize interactive applications, large-scale storage, dependability, and high Internet bandwidth.
Classes of Parallelism and Parallel Architectures
Parallelism at multiple levels is now the driving force of computer design across all four classes of computers, with energy and cost being the primary constraints. There are basically two kinds of parallelism in applications:
1.Data-level parallelism (DLP) arises because there are many data items that can be operated on at the same time.
2.Task-level parallelism (TLP) arises because tasks of work are created that can operate independently and largely in parallel.
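To make the two kinds concrete, here is a small C sketch (not from the text; the functions and data are invented for illustration). The loop in scale_array is a DLP candidate, because every iteration operates on a different, independent data item; the two unrelated computations in main are TLP candidates, because they could be handed to separate threads or cores.

```c
#include <stdio.h>

#define N 8

/* Data-level parallelism: the same operation is applied to many
   independent data items, so vector/SIMD hardware (or a vectorizing
   compiler) can process several iterations at once. */
static void scale_array(float *x, float a, int n) {
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];              /* iterations are independent */
}

/* Task-level parallelism: two unrelated pieces of work that could be
   handed to different threads or cores. */
static long long sum_of_squares(int n) {
    long long s = 0;
    for (int i = 1; i <= n; i++)
        s += (long long)i * i;
    return s;
}

static long long factorial(int n) {
    long long f = 1;
    for (int i = 2; i <= n; i++)
        f *= i;
    return f;
}

int main(void) {
    float x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    scale_array(x, 2.0f, N);              /* DLP candidate           */
    long long a = sum_of_squares(1000);   /* TLP candidates: a and b */
    long long b = factorial(15);          /* could run concurrently  */
    printf("x[7]=%.1f a=%lld b=%lld\n", x[7], a, b);
    return 0;
}
```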
Computer hardware in turn can exploit these two kinds of application parallelism in four major ways:
1.Instruction-level parallelism exploits data-level parallelism at modest levels with compiler help using ideas like pipelining and at medium levels using ideas like speculative execution.
2.Vector architectures, graphic processor units (GPUs), and multimedia instruction sets exploit data-level parallelism by applying a single instruction to a collection of data in parallel.
3.Thread-level parallelism exploits either data-level parallelism or task-level parallelism in a tightly coupled hardware model that allows for interaction between parallel threads.
4.Request-level parallelism exploits parallelism among largely decoupled tasks specified by the programmer or the operating system.
When Flynn (1966) studied the parallel computing efforts in the 1960s, he found a simple classification whose abbreviations we still use today. They target data-level parallelism and task-level parallelism. He looked at the parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor and placed all computers in one of four categories:
1.Single instruction stream, single data stream (SISD)—This category is the uniprocessor. The programmer thinks of it as the standard sequential computer, but it can exploit ILP. Chapter 3 covers SISD architectures that use ILP techniques such as superscalar and speculative execution.
2.Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Each processor has its own data memory (hence, the MD of SIMD), but there is a single instruction memory and control processor, which fetches and dispatches instructions. Chapter 4 covers DLP and three different architectures that exploit it: vector architectures, multimedia extensions to standard instruction sets, and GPUs.
3.Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of this type has been built to date, but it rounds out this simple classification.
4.Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own instructions and operates on its own data, and it targets task-level parallelism. In general, MIMD is more flexible than SIMD and thus more generally applicable, but it is inherently more expensive than SIMD. For example, MIMD computers can also exploit data-level parallelism, although the overhead is likely to be higher than would be seen in an SIMD computer. This overhead means that grain size must be sufficiently large to exploit the parallelism efficiently. Chapter 5 covers tightly coupled MIMD architectures, which exploit thread-level parallelism because multiple cooperating threads operate in parallel. Chapter 6 covers loosely coupled MIMD architectures—specifically, clusters and warehouse-scale computers—that exploit request-level parallelism, where many independent tasks can proceed in parallel naturally with little need for communication or synchronization.
This taxonomy is a coarse model, as many parallel processors are hybrids of the SISD, SIMD, and MIMD classes. Nonetheless, it is useful to put a framework on the design space for the computers we will see in this book.
1.3 Defining Computer Architecture
The task the computer designer faces is a complex one: determine what attributes are important for a new computer, then design a computer to maximize performance and energy efficiency while staying within cost, power, and availability constraints. This task has many aspects, including instruction set design, functional organization, logic design, and implementation. The implementation may encompass integrated circuit design, packaging, power, and cooling. Optimizing the design requires familiarity with a very wide range of technologies, from compilers and operating systems to logic design and packaging.
A few decades ago, the term computer architecture generally referred to only instruction set design. Other aspects of computer design were called implementation, often insinuating that implementation is uninteresting or less challenging.
We believe this view is incorrect. The architect’s or designer’s job is much more than instruction set design, and the technical hurdles in the other aspects of the project are likely more challenging than those encountered in instruction set design. We’ll quickly review instruction set architecture before describing the larger challenges for the computer architect.
Instruction Set Architecture: The Myopic View of Computer Architecture
We use the term instruction set architecture (ISA) to refer to the actual programmer-visible instruction set in this book. The ISA serves as the boundary between the software and hardware. This quick review of ISA will use examples from 80x86, ARMv8, and RISC-V to illustrate the seven dimensions of an ISA. The most popular RISC processors come from ARM (Advanced RISC Machine), which were in 14.8 billion chips shipped in 2015, or roughly 50 times as many chips as shipped with 80x86 processors. Appendices A and K give more details on the three ISAs.
RISC-V (“RISC Five”) is a modern RISC instruction set developed at the University of California, Berkeley, which was made free and openly adoptable in response to requests from industry. In addition to a full software stack (compilers, operating systems, and simulators), there are several RISC-V implementations freely available for use in custom chips or in field-programmable gate arrays. Developed 30 years after the first RISC instruction sets, RISC-V inherits its ancestors' good ideas—a large set of registers, easy-to-pipeline instructions, and a lean set of operations—while avoiding their omissions or mistakes. It is a free and open, elegant example of the RISC architectures mentioned earlier, which is why more than 60 companies have joined the RISC-V foundation, including AMD, Google, HP Enterprise, IBM, Microsoft, Nvidia, Qualcomm, Samsung, and Western Digital. We use the integer core ISA of RISC-V as the example ISA in this book.
1.Class of ISA—Nearly all ISAs today are classified as general-purpose register architectures, where the operands are either registers or memory locations. The 80x86 has 16 general-purpose registers and 16 that can hold floating-point data, while RISC-V has 32 general-purpose and 32 floating-point registers (see Figure 1.4). The two popular versions of this class are register-memory ISAs, such as the 80x86, which can access memory as part of many instructions, and load-store ISAs, such as ARMv8 and RISC-V, which can access memory only with load or store instructions. All ISAs announced since 1985 are load-store.
2.Memory addressing—Virtually all desktop and server computers, including the 80x86, ARMv8, and RISC-V, use byte addressing to access memory operands. Some architectures, like ARMv8, require that objects must be aligned. An access to an object of size s bytes at byte address A is aligned if A mod s = 0; a small C sketch applying this rule appears after this list. (See Figure A.5 on page A-8.) The 80x86 and RISC-V do not require alignment, but accesses are generally faster if operands are aligned.
3.Addressing modes—In addition to specifying registers and constant operands, addressing modes specify the address of a memory object. RISC-V addressing modes are Register, Immediate (for constants), and Displacement, where a constant offset is added to a register to form the memory address. The 80x86 supports those three modes, plus three variations of displacement: no register (absolute), two registers (based indexed with displacement), and two registers where one register is multiplied by the size of the operand in bytes (based with scaled index and displacement). It has more like the last three modes, minus the displacement field, plus register indirect, indexed, and based with scaled index. ARMv8 has the three RISC-V addressing modes plus PC-relative addressing, the sum of two registers, and the sum of two registers where one register is multiplied by the size of the operand in bytes. It also has autoincrement and autodecrement addressing, where the calculated address replaces the contents of one of the registers used in forming the address.
4.Types and sizes of operands—Like most ISAs, 80x86, ARMv8, and RISC-V support operand sizes of 8-bit (ASCII character), 16-bit (Unicode character or half word), 32-bit (integer or word), 64-bit (double word or long integer), and IEEE 754 floating point in 32-bit (single precision) and 64-bit (double precision). The 80x86 also supports 80-bit floating point (extended double precision).
5.Operations—The general categories of operations are data transfer, arithmetic logical, control (discussed next), and floating point. RISC-V is a simple and easy-to-pipeline instruction set architecture, and it is representative of the RISC architectures being used in 2017. Figure 1.5 summarizes the integer RISC-V ISA, and Figure 1.6 lists the floating-point ISA. The 80x86 has a much richer and larger set of operations (see Appendix K).
7.Control flow instructions—Virtually all ISAs, including these three, support conditional branches, unconditional jumps, procedure calls, and returns. All three use PC-relative addressing, where the branch address is specified by an address field that is added to the PC. There are some small differences. RISC-V conditional branches (BEQ, BNE, etc.) test the contents of registers, and the 80x86 and ARMv8 branches test condition code bits set as side effects of arithmetic/logic operations. The ARMv8 and RISC-V procedure call places the return address in a register, whereas the 80x86 call (CALLF) places the return address on a stack in memory.
7.Encoding an ISA—There are two basic choices on encoding: fixed length and variable length. All ARMv8 and RISC-V instructions are 32 bits long, which simplifies instruction decoding. Figure 1.7 shows the RISC-V instruction formats. The 80x86 encoding is variable length, ranging from 1 to 18 bytes. Variable-length instructions can take less space than fixed-length instructions, so a program compiled for the 80x86 is usually smaller than the same program compiled for RISC-V. Note that choices mentioned previously will affect how the instructions are encoded into a binary representation. For example, the number of registers and the number of addressing modes both have a significant impact on the size of instructions, because the register field and addressing mode field can appear many times in a single instruction. (Note that ARMv8 and RISC-V later offered extensions, called Thumb-2 and RV64IC, that provide a mix of 16-bit and 32-bit length instructions, respectively, to reduce program size. Code size for these compact versions of RISC architectures is smaller than that of the 80x86. See Appendix K.)
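To make the alignment rule from dimension 2 concrete (this sketch is ours; the addresses are invented purely for illustration), the test A mod s = 0 looks like this in C:

```c
#include <stdint.h>
#include <stdio.h>

/* An access of size s bytes at byte address a is aligned if a mod s == 0. */
static int is_aligned(uintptr_t a, size_t s) {
    return (a % s) == 0;
}

int main(void) {
    /* Hypothetical byte addresses, chosen only to exercise the rule. */
    printf("%d\n", is_aligned(0x1000, 8)); /* 1: 0x1000 is a multiple of 8      */
    printf("%d\n", is_aligned(0x1006, 4)); /* 0: a 4-byte access here misaligns */
    printf("%d\n", is_aligned(0x1006, 2)); /* 1: a 2-byte access here is fine   */
    return 0;
}
```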
The other challenges facing the computer architect beyond ISA design are particularly acute at the present, when the differences among instruction sets are small and when there are distinct application areas. Therefore, starting with the fourth edition of this book, beyond this quick review, the bulk of the instruction set material is found in the appendices (see Appendices A and K).
Genuine Computer Architecture: Designing the Organization and Hardware to Meet Goals and Functional Requirements
The implementation of a computer has two components: organization and hardware. The term organization includes the high-level aspects of a computer's design, such as the memory system, the memory interconnect, and the design of the internal processor or CPU (central processing unit—where arithmetic, logic, branching, and data transfer are implemented). The term microarchitecture is also used instead of organization. For example, two processors with the same instruction set architectures but different organizations are the AMD Opteron and the Intel Core i7. Both processors implement the 80x86 instruction set, but they have very different pipeline and cache organizations.
The switch to multiple processors per microprocessor led to the term core also being used for processors. Instead of saying multiprocessor microprocessor, the term multicore caught on. Given that virtually all chips have multiple processors, the term central processing unit, or CPU, is fading in popularity.
Hardware refers to the specifics of a computer, including the detailed logic design and the packaging technology of the computer. Often a line of computers contains computers with identical instruction set architectures and very similar organizations, but they differ in the detailed hardware implementation. For example, the Intel Core i7 (see Chapter 3) and the Intel Xeon E7 (see Chapter 5) are nearly identical but offer different clock rates and different memory systems, making the Xeon E7 more effective for server computers.
In this book, the word architecture covers all three aspects of computer design—instruction set architecture, organization or microarchitecture, and hardware.
Computer architects must design a computer to meet functional requirements as well as price, power, performance, and availability goals. Figure 1.8 summarizes requirements to consider in designing a new computer. Often, architects also must determine what the functional requirements are, which can be a major task. The requirements may be specific features inspired by the market. Application software typically drives the choice of certain functional requirements by determining how the computer will be used. If a large body of software exists for a particular instruction set architecture, the architect may decide that a new computer should implement an existing instruction set. The presence of a large market for a particular class of applications might encourage the designers to incorporate requirements that would make the computer competitive in that market. Later chapters examine many of these requirements and features in depth.
Architects must also be aware of important trends in both the technology and the use of computers because such trends affect not only the future cost but also the longevity of an architecture.
1.4 Trends in Technology
If an instruction set architecture is to prevail, it must be designed to survive rapid changes in computer technology. After all, a successful new instruction set architecture may last decades—for example, the core of the IBM mainframe has been in use for more than 50 years. An architect must plan for technology changes that can increase the lifetime of a successful computer.
To plan for the evolution of a computer, the designer must be aware of rapid changes in implementation technology. Five implementation technologies, which change at a dramatic pace, are critical to modern implementations:
- Integrated circuit logic technology—Historically, transistor density increased by about 35% per year, quadrupling in somewhat over four years. Increases in die size are less predictable and slower, ranging from 10% to 20% per year. The combined effect was a traditional growth rate in transistor count on a chip of about 40%–55% per year, or doubling every 18–24 months; a quick arithmetic check of these rates appears after this list. This trend is popularly known as Moore's Law. Device speed scales more slowly, as we discuss below. Shockingly, Moore's Law is no more. The number of devices per chip is still increasing, but at a decelerating rate. Unlike in the Moore's Law era, we expect the doubling time to be stretched with each new technology generation.
- Semiconductor DRAM (dynamic random-access memory)—This technology is the foundation of main memory, and we discuss it in Chapter 2. The growth of DRAM has slowed dramatically, from quadrupling every three years as in the past. The 8-gigabit DRAM was shipping in 2014, but the 16-gigabit DRAM won’t reach that state until 2019, and it looks like there will be no 32-gigabit DRAM (Kim, 2005). Chapter 2 mentions several other technologies that may replace DRAM when it hits its capacity wall.
- Semiconductor Flash (electrically erasable programmable read-only memory)—This nonvolatile semiconductor memory is the standard storage device in PMDs, and its rapidly increasing popularity has fueled its rapid growth rate in capacity. In recent years, the capacity per Flash chip increased by about 50%–60% per year, doubling roughly every 2 years. Currently, Flash memory is 8–10 times cheaper per bit than DRAM. Chapter 2 describes Flash memory.
- Magnetic disk technology—Prior to 1990, density increased by about 30% per year, doubling in three years. It rose to 60% per year thereafter, and increased to 100% per year in 1996. Between 2004 and 2011, it dropped back to about 40% per year, or doubled every two years. Recently, disk improvement has slowed to less than 5% per year. One way to increase disk capacity is to add more platters at the same areal density, but there are already seven platters within the one-inch depth of the 3.5-inch form factor disks. There is room for at most one or two more platters. The last hope for areal density increase is to use a small laser on each disk read-write head to heat a 30 nm spot to 400°C so that it can be written magnetically before it cools. It is unclear whether Heat Assisted Magnetic Recording can be manufactured economically and reliably, although Seagate announced plans to ship HAMR in limited production in 2018. HAMR is the last chance for continued improvement in areal density of hard disk drives, which are now 8–10 times cheaper per bit than Flash and 200–300 times cheaper per bit than DRAM. This technology is central to server- and warehouse-scale storage, and we discuss the trends in detail in Appendix D.
- Network technology—Network performance depends both on the performance of switches and on the performance of the transmission system. We discuss the trends in networking in Appendix F.
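A quick arithmetic check of the growth rates in the first bullet (our calculation, using only the numbers quoted there): compounding 35% per year in density with 10%–15% per year in die size gives

$$1.35 \times 1.10 \approx 1.49 \qquad\text{and}\qquad 1.35 \times 1.15 \approx 1.55,$$

which lands in the quoted 40%–55% range, and at 40%–55% annual growth the doubling time is

$$\frac{\log 2}{\log 1.40} \approx 2.1 \text{ years} \quad\text{to}\quad \frac{\log 2}{\log 1.55} \approx 1.6 \text{ years},$$

roughly consistent with doubling every 18–24 months.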
These rapidly changing technologies shape the design of a computer that, with speed and technology enhancements, may have a lifetime of 3–5 years. Key technologies such as Flash change sufficiently that the designer must plan for these changes. Indeed, designers often design for the next technology, knowing that, when a product begins shipping in volume, the following technology may be the most cost-effective or may have performance advantages. Traditionally, cost has decreased at about the rate at which density increases.
Although technology improves continuously, the impact of these increases can be in discrete leaps, as a threshold that allows a new capability is reached. For example, when MOS technology reached a point in the early 1980s where between 25,000 and 50,000 transistors could fit on a single chip, it became possible to build a single-chip, 32-bit microprocessor. By the late 1980s, first-level caches could go on a chip. By eliminating chip crossings within the processor and between the processor and the cache, a dramatic improvement in cost-performance and energy-performance was possible. This design was simply unfeasible until the technology reached a certain point. With multicore microprocessors and increasing numbers of cores each generation, even server computers are increasingly headed toward a single chip for all processors. Such technology thresholds are not rare and have a significant impact on a wide variety of design decisions.
Performance Trends: Bandwidth Over Latency
As we shall see in Section 1.8, bandwidth or throughput is the total amount of work done in a given time, such as megabytes per second for a disk transfer. In contrast, latency or response time is the time between the start and the completion of an event, such as milliseconds for a disk access. Figure 1.9 plots the relative improvement in bandwidth and latency for technology milestones for microprocessors, memory, networks, and disks. Figure 1.10 describes the examples and milestones in more detail.
Performance is the primary differentiator for microprocessors and networks, so they have seen the greatest gains: 32,000–40,000× in bandwidth and 50–90× in latency. Capacity is generally more important than performance for memory and disks, so capacity has improved more, yet bandwidth advances of 400–2400× are still much greater than gains in latency of 8–9×.
Clearly, bandwidth has outpaced latency across these technologies and will likely continue to do so. A simple rule of thumb is that bandwidth grows by at least the square of the improvement in latency. Computer designers should plan accordingly.
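Written as a formula and checked against the milestone gains above (a rough consistency check, not an exact law):

$$\text{Bandwidth improvement} \;\gtrsim\; (\text{Latency improvement})^2.$$

For microprocessors and networks, $50^2 = 2500$ and $90^2 = 8100$, both well below the observed 32,000–40,000× bandwidth gains; for memory and disks, $8^2 = 64$ and $9^2 = 81$, again below the observed 400–2400×.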
Scaling of Transistor Performance and Wires
Integrated circuit processes are characterized by the feature size, which is the minimum size of a transistor or a wire in either the x or y dimension. Feature sizes decreased from 10 μm in 1971 to 0.016 μm in 2017; in fact, we have switched units, so production in 2017 is referred to as “16 nm,” and 7 nm chips are underway. Since the transistor count per square millimeter of silicon is determined by the surface area of a transistor, the density of transistors increases quadratically with a linear decrease in feature size.
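In symbols (a simple idealization that ignores the many second-order effects discussed next):

$$\text{Transistor density} \;\propto\; \frac{1}{(\text{feature size})^2},$$

so the linear shrink from 10 μm to 0.016 μm, a factor of 625, corresponds to a density increase of roughly $625^2 \approx 390{,}000$ in this idealized model.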
The increase in transistor performance, however, is more complex. As feature sizes shrink, devices shrink quadratically in the horizontal dimension and also shrink in the vertical dimension. The shrink in the vertical dimension requires a reduction in operating voltage to maintain correct operation and reliability of the transistors. This combination of scaling factors leads to a complex interrelationship between transistor performance and process feature size. To a first approximation, in the past the transistor performance improved linearly with decreasing feature size. The fact that transistor count improves quadratically with a linear increase in transistor performance is both the challenge and the opportunity for which computer architects were created! In the early days of microprocessors, the higher rate of improvement in density was used to move quickly from 4-bit, to 8-bit, to 16-bit, to 32-bit, to 64-bit microprocessors. More recently, density improvements have supported the introduction of multiple processors per chip, wider SIMD units, and many
of the innovations in speculative execution and caches found in Chapters 2–5.
Although transistors generally improve in performance with decreased feature size, wires in an integrated circuit do not. In particular, the signal delay for a wire increases in proportion to the product of its resistance and capacitance. Of course, as feature size shrinks, wires get shorter, but the resistance and capacitance per unit length get worse. This relationship is complex, since both resistance and capacitance depend on detailed aspects of the process, the geometry of a wire, the loading on a wire, and even the adjacency to other structures. There are occasional process enhancements, such as the introduction of copper, which provide one-time improvements in wire delay.
In general, however, wire delay scales poorly compared to transistor performance, creating additional challenges for the designer. In addition to the power dissipation limit, wire delay has become a major design obstacle for large integrated circuits and is often more critical than transistor switching delay. Larger and larger fractions of the clock cycle have been consumed by the propagation delay of signals on wires, but power now plays an even greater role than wire delay.
1.5 Trends in Power and Energy in Integrated Circuits
Today, energy is the biggest challenge facing the computer designer for nearly every class of computer. First, power must be brought in and distributed around the chip, and modern microprocessors use hundreds of pins and multiple interconnect layers just for power and ground. Second, power is dissipated as heat and must be removed.
Power and Energy: A Systems Perspective
How should a system architect or a user think about performance, power, and energy? From the viewpoint of a system designer, there are three primary concerns. First, what is the maximum power a processor ever requires? Meeting this demand can be important to ensuring correct operation. For example, if a processor attempts to draw more power than a power-supply system can provide (by drawing more current than the system can supply), the result is typically a voltage drop, which can cause devices to malfunction. Modern processors can vary widely in power consumption with high peak currents; hence they provide voltage indexing methods that allow the processor to slow down and regulate voltage within a wider margin. Obviously, doing so decreases performance.
Second, what is the sustained power consumption? This metric is widely called the thermal design power (TDP) because it determines the cooling requirement. TDP is neither peak power, which is often 1.5 times higher, nor is it the actual average power that will be consumed during a given computation, which is likely to be lower still. A typical power supply for a system is typically sized to exceed the TDP, and a cooling system is usually designed to match or exceed TDP. Failure to provide adequate cooling will allow the junction temperature in the processor to exceed its maximum value, resulting in device failure and possibly permanent damage. Modern processors provide two features to assist in managing heat, since the highest power (and hence heat and temperature rise) can exceed the long-term average specified by the TDP. First, as the thermal temperature approaches the junction temperature limit, circuitry lowers the clock rate, thereby reducing power. Should this technique not be successful, a second thermal overload trap is activated to power down the chip.
The third factor that designers and users need to consider is energy and energy efficiency. Recall that power is simply energy per unit time: 1 watt = 1 joule per second. Which metric is the right one for comparing processors: energy or power? In general, energy is always a better metric because it is tied to a specific task and the time required for that task. In particular, the energy to complete a workload is equal to the average power times the execution time for the workload.
Thus, if we want to know which of two processors is more efficient for a given task, we need to compare energy consumption (not power) for executing the task. For example, processor A may have a 20% higher average power consumption than processor B, but if A executes the task in only 70% of the time needed by B, its energy consumption will be 1.2 × 0.7 = 0.84, which is clearly better.
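A minimal C sketch of that comparison (the 10 W and 1 s baseline for processor B is invented; only the 1.2 and 0.7 ratios come from the example above):

```c
#include <stdio.h>

/* Energy (joules) = average power (watts) * execution time (seconds). */
static double energy_joules(double avg_power_w, double exec_time_s) {
    return avg_power_w * exec_time_s;
}

int main(void) {
    double power_b = 10.0, time_b = 1.0;          /* hypothetical baseline: 10 J */
    double power_a = 1.2 * power_b;               /* A draws 20% more power      */
    double time_a  = 0.7 * time_b;                /* but finishes in 70% of time */

    double e_a = energy_joules(power_a, time_a);
    double e_b = energy_joules(power_b, time_b);
    printf("E_A / E_B = %.2f\n", e_a / e_b);      /* prints 0.84 */
    return 0;
}
```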
One might argue that in a large server or cloud, it is sufficient to consider the average power, since the workload is often assumed to be infinite, but this is misleading. If our cloud were populated with processor Bs rather than As, then the cloud would do less work for the same amount of energy expended. Using energy to compare the alternatives avoids this pitfall. Whenever we have a fixed workload, whether for a warehouse-size cloud or a smartphone, comparing energy will be the right way to compare computer alternatives, because the electricity bill for the cloud and the battery lifetime for the smartphone are both determined by the energy consumed.
When is power consumption a useful measure? The primary legitimate use is as a constraint: for example, an air-cooled chip might be limited to 100 W. It can be used as a metric if the workload is fixed, but then it’s just a variation of the true metric of energy per task.
Energy and Power Within a Microprocessor
For CMOS chips, the traditional primary energy consumption has been in switching transistors, also called dynamic energy. The energy required per transistor is proportional to the product of the capacitive load driven by the transistor and the square of the voltage:

$$\text{Energy}_{\text{dynamic}} \;\propto\; \text{Capacitive load} \times \text{Voltage}^2$$
This equation is the energy of a pulse of the logic transition of 0→1→0 or 1→0→1. The energy of a single transition (0→1 or 1→0) is then:

$$\text{Energy}_{\text{dynamic}} \;\propto\; \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$$
The power required per transistor is just the product of the energy of a transition multiplied by the frequency of transitions:

$$\text{Power}_{\text{dynamic}} \;\propto\; \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$
For a fixed task, slowing clock rate reduces power, but not energy.
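The following C sketch evaluates these proportionalities for an invented 1 fF load switching at 1 V (the values are assumptions for illustration, and the constants of proportionality are taken as 1). It illustrates the point just made: halving the clock frequency halves dynamic power, but the energy for a fixed number of transitions, that is, for a fixed task, does not change.

```c
#include <stdio.h>

/* Dynamic energy per 0->1->0 pulse: C * V^2; per single transition: C * V^2 / 2.
   Dynamic power: 1/2 * C * V^2 * f, where f is the switching frequency. */
static double energy_per_transition(double c_farads, double v_volts) {
    return 0.5 * c_farads * v_volts * v_volts;
}
static double dynamic_power(double c_farads, double v_volts, double f_hz) {
    return 0.5 * c_farads * v_volts * v_volts * f_hz;
}

int main(void) {
    double c = 1e-15, v = 1.0;          /* hypothetical 1 fF load at 1 V  */
    double transitions = 1e9;           /* a fixed task: 10^9 transitions */

    for (int half = 0; half < 2; half++) {
        double f = half ? 2e9 : 4e9;    /* 4 GHz, then half the clock rate */
        printf("f = %.0f Hz: power = %.2e W, task energy = %.2e J\n",
               f, dynamic_power(c, v, f),
               energy_per_transition(c, v) * transitions);
    }
    return 0;
}
```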
Clearly, dynamic power and energy are greatly reduced by lowering the voltage, so voltages have dropped from 5 V to just under 1 V in 20 years. The capacitive load is a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors.
As we move from one process to the next, the increase in the number of transistors switching and the frequency with which they change dominate the decrease in load capacitance and voltage, leading to an overall growth in power consumption and energy. The first microprocessors consumed less than a watt, and the first 32-bit microprocessors (such as the Intel 80386) used about 2 W, whereas a 4.0 GHz Intel Core i7-6700K consumes 95 W. Given that this heat must be dissipated from a chip that is about 1.5 cm on a side, we are near the limit of what can be cooled by air, and this is where we have been stuck for nearly a decade.
Given the preceding equation, you would expect clock frequency growth to slow down if we can’t reduce voltage or increase power per chip. Figure 1.11 shows that this has indeed been the case since 2003, even for the microprocessors in Figure 1.1 that were the highest performers each year. Note that this period of flatter clock rates corresponds to the period of slow performance improvement in Figure 1.1.
Distributing the power, removing the heat, and preventing hot spots have become increasingly difficult challenges. Energy is now the major constraint to using transistors; in the past, it was the raw silicon area. Therefore modern microprocessors offer many techniques to try to improve energy efficiency despite flat clock rates and constant supply voltages:
1.Do nothing well. Most microprocessors today turn off the clock of inactive modules to save energy and dynamic power. For example, if no floating-point instructions are executing, the clock of the floating-point unit is disabled. If some cores are idle, their clocks are stopped.
2.Dynamic voltage-frequency scaling (DVFS). The second technique comes directly from the preceding formulas. PMDs, laptops, and even servers have periods of low activity where there is no need to operate at the highest clock frequency and voltages. Modern microprocessors typically offer a few clock frequencies and voltages in which to operate that use lower power and energy. Figure 1.12 plots the potential power savings via DVFS for a server as the workload shrinks for three different clock rates: 2.4, 1.8, and 1 GHz. The overall server power savings is about 10%–15% for each of the two steps.
3.Design for the typical case. Given that PMDs and laptops are often idle, memory and storage offer low power modes to save energy. For example, DRAMs have a series of increasingly lower power modes to extend battery life in PMDs and laptops, and there have been proposals for disks that have a mode that spins more slowly when unused to save power. However, you cannot access DRAMs or disks in these modes, so you must return to fully active mode to read or write, no matter how low the access rate. As mentioned, microprocessors for PCs have been designed instead for heavy use at high operating temperatures, relying on on-chip temperature sensors to detect when activity should be reduced automatically to avoid overheating. This “emergency slowdown” allows manufacturers to design for a more typical case and then rely on this safety mechanism if someone really does run programs that consume much more power than is typical.
4.Overclocking. Intel started offering Turbo mode in 2008, where the chip decides that it is safe to run at a higher clock rate for a short time, possibly on just a few cores, until temperature starts to rise. For example, the 3.3 GHz Core i7 can run in short bursts at 3.6 GHz. Indeed, the highest-performing microprocessors each year since 2008 shown in Figure 1.1 have all offered temporary overclocking of about 10% over the nominal clock rate. For single-threaded code, these microprocessors can turn off all cores but one and run it faster. Note that, although the operating system can turn off Turbo mode, there is no notification once it is enabled, so the programmers may be surprised to see their programs vary in performance because of room temperature!
Although dynamic power is traditionally thought of as the primary source of power dissipation in CMOS, static power is becoming an important issue because leakage current flows even when a transistor is off:

$$\text{Power}_{\text{static}} \;\propto\; \text{Current}_{\text{static}} \times \text{Voltage}$$
That is, static power is proportional to the number of devices.
Thus increasing the number of transistors increases power even if they are idle, and current leakage increases in processors with smaller transistor sizes. As a result, very low-power systems are even turning off the power supply (power gating) to inactive modules in order to control loss because of leakage. In 2011 the goal for leakage was 25% of the total power consumption, with leakage in high-performance designs sometimes far exceeding that goal. Leakage can be as high as 50% for such chips, in part because of the large SRAM caches that need power to maintain the storage values. (The S in SRAM is for static.) The only hope to stop leakage is to turn off power to subsets of the chip.
Finally, because the processor is just a portion of the whole energy cost of a sys- tem, it can make sense to use a faster, less energy-efficient processor to allow the rest of the system to go into a sleep mode. This strategy is known as race-to-halt.
The importance of power and energy has increased the scrutiny on the efficiency of an innovation, so the primary evaluation now is tasks per joule or performance per watt, contrary to performance per mm² of silicon as in the past. This new metric affects approaches to parallelism, as we will see in Chapters 4 and 5.
The Shift in Computer Architecture Because of Limits of Energy
As transistor improvement decelerates, computer architects must look elsewhere for improved energy efficiency. Indeed, given the energy budget, it is easy today to design a microprocessor with so many transistors that they cannot all be turned on at the same time. This phenomenon has been called dark silicon, in that much of a chip cannot be used (“dark”) at any moment in time because of thermal constraints. This observation has led architects to reexamine the fundamentals of processors’ design in the search for a greater energy-cost performance.
Figure 1.13, which lists the energy cost and area cost of the building blocks of a modern computer, reveals surprisingly large ratios. For example, a 32-bit floating-point addition uses 30 times as much energy as an 8-bit integer add. The area difference is even larger, by 60 times. However, the biggest difference is in memory; a 32-bit DRAM access takes 20,000 times as much energy as an 8-bit addition. A small SRAM is 125 times more energy-efficient than DRAM, which demonstrates the importance of careful uses of caches and memory buffers.
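A back-of-the-envelope consequence of these ratios (our arithmetic, using only the numbers quoted above): fetching one operand from DRAM costs the energy of about 20,000 8-bit adds, whereas hitting in a small SRAM costs about

$$\frac{20{,}000}{125} = 160 \text{ add-equivalents},$$

so keeping data in caches and buffers recovers more than two orders of magnitude in energy per access.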
The new design principle of minimizing energy per task combined with the relative energy and area costs in Figure 1.13 have inspired a new direction for computer architecture, which we describe in Chapter 7. Domain-specific processors save energy by reducing wide floating-point operations and deploying special-purpose memories to reduce accesses to DRAM. They use those savings to provide 10–100 more (narrower) integer arithmetic units than a traditional processor. Although such processors perform only a limited set of tasks, they perform them remarkably faster and more energy efficiently than a general-purpose processor. Like a hospital with general practitioners and medical specialists, computers in this energy-aware world will likely be combinations of general-purpose cores that can perform any task and special-purpose cores that do a few things extremely well and even more cheaply.
1.6 Trends in Cost
Although costs tend to be less important in some computer designs—specifically supercomputers—cost-sensitive designs are of growing significance. Indeed, in the past 35 years, the use of technology improvements to lower cost, as well as increase performance, has been a major theme in the computer industry.
Textbooks often ignore the cost half of cost-performance because costs change, thereby dating books, and because the issues are subtle and differ across industry segments. Nevertheless, it's essential for computer architects to have an understanding of cost and its factors in order to make intelligent decisions about whether a new feature should be included in designs where cost is an issue. (Imagine architects designing skyscrapers without any information on costs of steel beams and concrete!)
This section discusses the major factors that influence the cost of a computer and how these factors are changing over time.
The Impact of Time, Volume, and Commoditization
The cost of a manufactured computer component decreases over time even without significant improvements in the basic implementation technology. The underlying principle that drives costs down is the learning curve—manufacturing costs decrease over time. The learning curve itself is best measured by change in yield—the percentage of manufactured devices that survives the testing procedure. Whether it is a chip, a board, or a system, designs that have twice the yield will have half the cost.
Understanding how the learning curve improves yield is critical to projecting costs over a product's life. One example is that the price per megabyte of DRAM has dropped over the long term. Since DRAMs tend to be priced in close relationship to cost—except for periods when there is a shortage or an oversupply—price and cost of DRAM track closely.
Microprocessor prices also drop over time, but because they are less standardized than DRAMs, the relationship between price and cost is more complex. In a period of significant competition, price tends to track cost closely, although microprocessor vendors probably rarely sell at a loss.
Volume is a second key factor in determining cost. Increasing volumes affect cost in several ways. First, they decrease the time needed to get through the learning curve, which is partly proportional to the number of systems (or chips) manufactured. Second, volume decreases cost because it increases purchasing and manufacturing efficiency. As a rule of thumb, some designers have estimated that costs decrease about 10% for each doubling of volume. Moreover, volume decreases the amount of development costs that must be amortized by each computer, thus allowing cost and selling price to be closer and still make a profit.
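As a rough sketch of that 10%-per-doubling rule of thumb, the short Python fragment below projects how unit cost falls as cumulative volume grows; the function name, base volume, and base cost are our own illustrative choices, not values from the text.

import math

def unit_cost(volume, base_volume=10_000, base_cost=100.0, drop_per_doubling=0.10):
    # Rule of thumb: cost falls about 10% for each doubling of cumulative volume.
    doublings = math.log2(volume / base_volume)
    return base_cost * (1.0 - drop_per_doubling) ** doublings

for v in (10_000, 20_000, 40_000, 80_000):
    print(v, round(unit_cost(v), 2))   # 100.0, 90.0, 81.0, 72.9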
Commodities are products that are sold by multiple vendors in large volumes and are essentially identical. Virtually all the products sold on the shelves of grocery stores are commodities, as are standard DRAMs, Flash memory, monitors, and keyboards. In the past 30 years, much of the personal computer industry has become a commodity business focused on building desktop and laptop computers running Microsoft Windows.
Because many vendors ship virtually identical products, the market is highly competitive. Of course, this competition decreases the gap between cost and selling price, but it also decreases cost. Reductions occur because a commodity market has both volume and a clear product definition, which allows multiple suppliers to compete in building components for the commodity product. As a result, the overall product cost is lower because of the competition among the suppliers of the components and the volume efficiencies the suppliers can achieve. This rivalry has led to the low end of the computer business being able to achieve better price-performance than other sectors and has yielded greater growth at the low end, although with very limited profits (as is typical in any commodity business).
Cost of an Integrated Circuit
Why would a computer architecture book have a section on integrated circuit costs? In an increasingly competitive computer marketplace where standard parts—disks, Flash memory, DRAMs, and so on—are becoming a significant portion of any system's cost, integrated circuit costs are becoming a greater portion of the cost that varies between computers, especially in the high-volume, cost-sensitive portion of the market. Indeed, with PMDs' increasing reliance on whole systems on a chip (SOCs), the cost of the integrated circuits is much of the cost of the PMD. Thus computer designers must understand the costs of chips in order to understand the costs of current computers.
Although the costs of integrated circuits have dropped exponentially, the basic process of silicon manufacture is unchanged: A wafer is still tested and chopped into dies that are packaged (see Figures 1.14–1.16). Therefore the cost of a packaged integrated circuit is
Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
In this section, we focus on the cost of dies, summarizing the key issues in testing and packaging at the end.
Learning how to predict the number of good chips per wafer requires first learning how many dies fit on a wafer and then learning how to predict the percentage of those that will work. From there it is simple to predict cost:
Cost of die = Cost of wafer / (Dies per wafer × Die yield)
The most interesting feature of this initial term of the chip cost equation is its sensitivity to die size, shown below.
The number of dies per wafer is approximately the area of the wafer divided by the area of the die. It can be more accurately estimated by
Dies per wafer = (π × (Wafer diameter / 2)²) / Die area − (π × Wafer diameter) / √(2 × Die area)
The first term is the ratio of wafer area (πr²) to die area. The second compensates for the "square peg in a round hole" problem—rectangular dies near the periphery of round wafers. Dividing the circumference (πd) by the diagonal of a square die is approximately the number of dies along the edge.
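For concreteness, here is a minimal Python sketch of the dies-per-wafer estimate just described; the function name and the 300 mm / 2.25 cm² example values are our own.

import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # Wafer area divided by die area, minus the loss of rectangular dies at the round edge.
    radius = wafer_diameter_cm / 2
    area_term = (math.pi * radius ** 2) / die_area_cm2
    edge_term = (math.pi * wafer_diameter_cm) / math.sqrt(2 * die_area_cm2)
    return area_term - edge_term

# A 300 mm (30 cm) wafer and a 2.25 cm^2 die give roughly 270 candidate dies.
print(round(dies_per_wafer(30, 2.25)))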
However, this formula gives only the maximum number of dies per wafer. The critical question is: What is the fraction of good dies on a wafer, or the die yield? A simple model of integrated circuit yield, which assumes that defects are randomly distributed over the wafer and that yield is inversely proportional to the complexity of the fabrication process, leads to the following:
Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N
This Bose-Einstein formula is an empirical model developed by looking at the yield of many manufacturing lines (Sydow, 2006), and it still applies today. Wafer yield accounts for wafers that are completely bad and so need not be tested. For simplicity, we'll just assume the wafer yield is 100%. Defects per unit area is a measure of the random manufacturing defects that occur. In 2017 the value was typically 0.08–0.10 defects per square inch for a 28-nm node and 0.10–0.30 for the newer 16-nm node because it depends on the maturity of the process (recall the learning curve mentioned earlier). The metric versions are 0.012–0.016 defects per square centimeter for 28 nm and 0.016–0.047 for 16 nm. Finally, N is a parameter called the process-complexity factor, a measure of manufacturing difficulty. For 28-nm processes in 2017, N is 7.5–9.5. For a 16-nm process, N ranges from 10 to 14.
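A small Python sketch of this yield model; the specific defect density and complexity factor below are our own picks from within the ranges just quoted.

def die_yield(die_area_cm2, defects_per_cm2, process_complexity_n, wafer_yield=1.0):
    # Bose-Einstein model: yield falls as (1 + defect density x die area)^-N.
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** process_complexity_n

# Illustrative 16-nm values (our choice): 0.03 defects per cm^2 and N = 12.
print(round(die_yield(1.00, 0.03, 12), 2))   # about 0.70
print(round(die_yield(2.25, 0.03, 12), 2))   # about 0.46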
Although many microprocessors fall between 1.00 and 2.25 cm², low-end embedded 32-bit processors are sometimes as small as 0.05 cm², processors used for embedded control (for inexpensive IoT devices) are often less than 0.01 cm², and high-end server and GPU chips can be as large as 8 cm².
Given the tremendous price pressures on commodity products such as DRAM and SRAM, designers have included redundancy as a way to raise yield. For a number of years, DRAMs have regularly included some redundant memory cells so that a certain number of flaws can be accommodated. Designers have used similar techniques in both standard SRAMs and in large SRAM arrays used for caches within microprocessors. GPUs have 4 redundant processors out of 84 for the same reason. Obviously, the presence of redundant entries can be used to boost the yield significantly.
In 2017 processing of a 300 mm (12-inch) diameter wafer in a 28-nm technology costs between $4000 and $5000, and a 16-nm wafer costs about $7000. Assuming a processed wafer cost of $7000, the cost of the 1.00 cm² die would be around $16, but the cost per die of the 2.25 cm² die would be about $58, or almost four times the cost for a die that is a little over twice as large.
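Putting the three formulas together reproduces these numbers fairly closely. The defect density and complexity factor in the sketch below are our own assumptions (taken from within the 16-nm ranges given earlier), so the result is illustrative rather than the book's exact calculation.

import math

def cost_per_good_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2, n):
    radius = wafer_diameter_cm / 2
    dies = (math.pi * radius ** 2) / die_area_cm2 \
           - (math.pi * wafer_diameter_cm) / math.sqrt(2 * die_area_cm2)
    die_yield = 1.0 / (1.0 + defects_per_cm2 * die_area_cm2) ** n
    return wafer_cost / (dies * die_yield)

# $7000 16-nm wafer, 0.03 defects per cm^2, N = 12 (our assumptions).
print(round(cost_per_good_die(7000, 30, 1.00, 0.03, 12)))   # about $16
print(round(cost_per_good_die(7000, 30, 2.25, 0.03, 12)))   # about $57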
What should a computer designer remember about chip costs? The manufacturing process dictates the wafer cost, wafer yield, and defects per unit area, so the sole control of the designer is die area. In practice, because the number of defects per unit area is small, the cost per die grows roughly as the square of the die area. The computer designer affects die size, and thus cost, both by what functions are included on or excluded from the die and by the number of I/O pins.
Before we have a part that is ready for use in a computer, the die must be tested (to separate the good dies from the bad), packaged, and tested again after packaging. These steps all add significant costs, increasing the total by half.
The preceding analysis focused on the variable costs of producing a functional die, which is appropriate for high-volume integrated circuits. There is, however, one very important part of the fixed costs that can significantly affect the cost of an integrated circuit for low volumes (less than 1 million parts), namely, the cost of a mask set. Each step in the integrated circuit process requires a separate mask. Therefore, for modern high-density fabrication processes with up to 10 metal layers, mask costs are about $4 million for 16 nm and $1.5 million for 28 nm.
The good news is that semiconductor companies offer "shuttle runs" to dramatically lower the costs of tiny test chips. They lower costs by putting many small designs onto a single die to amortize the mask costs, and then later split the dies into smaller pieces for each project. Thus TSMC delivers 80–100 untested dies that are 1.57 × 1.57 mm in a 28-nm process for $30,000 in 2017. Although these dies are tiny, they offer the architect millions of transistors to play with. For example, several RISC-V processors would fit on such a die.
Although shuttle runs help with prototyping and debugging runs, they don't address small-volume production of tens to hundreds of thousands of parts. Because mask costs are likely to continue to increase, some designers are incorporating reconfigurable logic to enhance the flexibility of a part and thus reduce the cost implications of masks.
Cost Versus Price
With the commoditization of computers, the margin between the cost to manufacture a product and the price the product sells for has been shrinking. Those margins pay for a company's research and development (R&D), marketing, sales, manufacturing equipment maintenance, building rental, cost of financing, pretax profits, and taxes. Many engineers are surprised to find that most companies spend only 4% (in the commodity PC business) to 12% (in the high-end server business) of their income on R&D, which includes all engineering.
Cost of Manufacturing Versus Cost of Operation
For the first four editions of this book, cost meant the cost to build a computer and price meant price to purchase a computer. With the advent of WSCs, which contain tens of thousands of servers, the cost to operate the computers is significant in addition to the cost of purchase. Economists refer to these two costs as capital expenses (CAPEX) and operational expenses (OPEX).
As Chapter 6 shows, the amortized purchase price of servers and networks is about half of the monthly cost to operate a WSC, assuming a short lifetime of the IT equipment of 3–4 years. About 40% of the monthly operational costs are for power use and the amortized infrastructure to distribute power and to cool the IT equipment, despite this infrastructure being amortized over 10–15 years. Thus, to lower operational costs in a WSC, computer architects need to use energy efficiently.
1.7 Dependability
Historically, integrated circuits were one of the most reliable components of a computer. Although their pins may be vulnerable, and faults may occur over communication channels, the failure rate inside the chip was very low. That conventional wisdom is changing as we head to feature sizes of 16 nm and smaller, because both transient faults and permanent faults are becoming more commonplace, so architects must design systems to cope with these challenges. This section gives a quick overview of the issues in dependability, leaving the official definition of the terms and approaches to Section D.3 in Appendix D.
Computers are designed and constructed at different layers of abstraction. We can descend recursively down through a computer seeing components enlarge themselves to full subsystems until we run into individual transistors. Although some faults are widespread, like the loss of power, many can be limited to a single component in a module. Thus utter failure of a module at one level may be considered merely a component error in a higher-level module. This distinction is helpful in trying to find ways to build dependable computers.
One difficult question is deciding when a system is operating properly. This theoretical point became concrete with the popularity of Internet services. Infrastructure providers started offering service level agreements (SLAs) or service level objectives (SLOs) to guarantee that their networking or power service would be dependable. For example, they would pay the customer a penalty if they did not meet an agreement of some hours per month. Thus an SLA could be used to decide whether the system was up or down.
Systems alternate between two states of service with respect to an SLA:
1. Service accomplishment, where the service is delivered as specified.
2. Service interruption, where the delivered service is different from the SLA.
Transitions between these two states are caused by failures (from state 1 to state 2) or restorations (2 to 1). Quantifying these transitions leads to the two main measures of dependability:
- Module reliability is a measure of the continuous service accomplishment (or, equivalently, of the time to failure) from a reference initial instant. Therefore the mean time to failure (MTTF) is a reliability measure. The reciprocal of MTTF is a rate of failures, generally reported as failures per billion hours of operation, or FIT (for failures in time). Thus an MTTF of 1,000,000 hours equals 10⁹/10⁶ or 1000 FIT. Service interruption is measured as mean time to repair (MTTR). Mean time between failures (MTBF) is simply the sum of MTTF + MTTR. Although MTBF is widely used, MTTF is often the more appropriate term. If a collection of modules has exponentially distributed lifetimes—meaning that the age of a module is not important in probability of failure—the overall failure rate of the collection is the sum of the failure rates of the modules (see the sketch after this list).
- Module availability is a measure of the service accomplishment with respect to the alternation between the two states of accomplishment and interruption. For nonredundant systems with repair, module availability is
Module availability = MTTF / (MTTF + MTTR)
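A minimal Python sketch of these two measures; the four module MTTFs and the 24-hour MTTR below are hypothetical values chosen only for illustration.

def mttf_to_fit(mttf_hours):
    # FIT = failures per billion hours of operation.
    return 1e9 / mttf_hours

def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

print(mttf_to_fit(1_000_000))        # 1000.0 FIT, matching the text

# Failure rates of independent modules add; the system MTTF is the reciprocal.
module_mttfs = [1_000_000, 500_000, 200_000, 1_000_000]
system_fit = sum(mttf_to_fit(m) for m in module_mttfs)
print(system_fit)                    # 9000.0 FIT
print(round(1e9 / system_fit))       # system MTTF of about 111,111 hours

print(availability(1_000_000, 24))   # about 0.999976 for a 24-hour MTTR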
Note that reliability and availability are now quantifiable metrics, rather than synonyms for dependability. From these definitions, we can estimate reliability of a system quantitatively if we make some assumptions about the reliability of components and that failures are independent.
The primary way to cope with failure is redundancy, either in time (repeat the operation to see if it still is erroneous) or in resources (have other components to take over from the one that failed). Once the component is replaced and the system is fully repaired, the dependability of the system is assumed to be as good as new. Let’s quantify the benefits of redundancy with an example.
Having quantified the cost, power, and dependability of computer technology, we are ready to quantify performance.
1.8 Measuring, Reporting, and Summarizing Performance
When we say one computer is faster than another one is, what do we mean? The user of a cell phone may say a computer is faster when a program runs in less time, while an Amazon.com administrator may say a computer is faster when it completes more transactions per hour. The cell phone user wants to reduce response time—the time between the start and the completion of an event—also referred to as execution time. The operator of a WSC wants to increase throughput—the total amount of work done in a given time.
In comparing design alternatives, we often want to relate the performance of two different computers, say, X and Y. The phrase "X is faster than Y" is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, "X is n times as fast as Y" will mean
Execution time_Y / Execution time_X = n
Since execution time is the reciprocal of performance, the following relationship holds:
n = Execution time_Y / Execution time_X = (1 / Performance_Y) / (1 / Performance_X) = Performance_X / Performance_Y
The phrase “the throughput of X is 1.3 times as fast as Y” signifies here that the number of tasks completed per unit time on computer X is 1.3 times the number completed on Y.
Unfortunately, time is not always the metric quoted in comparing the performance of computers. Our position is that the only consistent and reliable measure of performance is the execution time of real programs, and that all proposed alternatives to time as the metric or to real programs as the items measured have eventually led to misleading claims or even mistakes in computer design.
Even execution time can be defined in different ways depending on what we count. The most straightforward definition of time is called wall-clock time, response time, or elapsed time, which is the latency to complete a task, including storage accesses, memory accesses, input/output activities, operating system overhead—everything. With multiprogramming, the processor works on another program while waiting for I/O and may not necessarily minimize the elapsed time of one program. Thus we need a term to consider this activity. CPU time recognizes this distinction and means the time the processor is computing, not including the time waiting for I/O or running other programs. (Clearly, the response time seen by the user is the elapsed time of the program, not the CPU time.)
Computer users who routinely run the same programs would be the perfect candidates to evaluate a new computer. To evaluate a new system, these users would simply compare the execution time of their workloads—the mixture of programs and operating system commands that users run on a computer. Few are in this happy situation, however. Most must rely on other methods to evaluate computers, and often other evaluators, hoping that these methods will predict performance for their usage of the new computer. One approach is benchmark programs, which are programs that many companies use to establish the relative performance of their computers.
Benchmarks
The best choice of benchmarks to measure performance is real applications, such as Google Translate mentioned in Section 1.1. Attempts at running programs that are much simpler than a real application have led to performance pitfalls. Examples include
- Kernels, which are small, key pieces of real applications.
- Toy programs, which are 100-line programs from beginning programming assignments, such as Quicksort.
- Synthetic benchmarks, which are fake programs invented to try to match the profile and behavior of real applications, such as Dhrystone.
All three are discredited today, usually because the compiler writer and architect can conspire to make the computer appear faster on these stand-in programs than on real applications. Regrettably for your authors—who dropped the fallacy about using synthetic benchmarks to characterize performance in the fourth edition of this book since we thought all computer architects agreed it was disreputable—the synthetic program Dhrystone is still the most widely quoted benchmark for embedded processors in 2017!
Another issue is the conditions under which the benchmarks are run. One way to improve the performance of a benchmark has been with benchmark-specific compiler flags; these flags often caused transformations that would be illegal on many programs or would slow down performance on others. To restrict this process and increase the significance of the results, benchmark developers typically require the vendor to use one compiler and one set of flags for all the programs in the same language (such as C++ or C). In addition to the question of compiler flags, another question is whether source code modifications are allowed. There are three different approaches to addressing this question:
1. No source code modifications are allowed.
2. Source code modifications are allowed but are essentially impossible. For example, database benchmarks rely on standard database programs that are tens of millions of lines of code. The database companies are highly unlikely to make changes to enhance the performance for one particular computer.
3. Source modifications are allowed, as long as the altered version produces the same output.
The key issue that benchmark designers face in deciding to allow modification of the source is whether such modifications will reflect real practice and provide useful insight to users, or whether these changes simply reduce the accuracy of the benchmarks as predictors of real performance. As we will see in Chapter 7, domain-specific architects often follow the third option when creating processors for well-defined tasks.
To overcome the danger of placing too many eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of performance of processors with a variety of applications. Of course, such collections are only as good as the constituent individual benchmarks. Nonetheless, a key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of the other benchmarks. The goal of a benchmark suite is that it will characterize the real relative performance of two computers, particularly for programs not in the suite that customers are likely to run.
A cautionary example is the Electronic Design News Embedded Microprocessor Benchmark Consortium (or EEMBC, pronounced "embassy") benchmarks.
It is a set of 41 kernels used to predict performance of different embedded applications: automotive/industrial, consumer, networking, office automation, and telecommunications. EEMBC reports unmodified performance and “full fury” performance, where almost anything goes. Because these benchmarks use small kernels, and because of the reporting options, EEMBC does not have the reputation of being a good predictor of relative performance of different embedded computers in the field. This lack of success is why Dhrystone, which EEMBC was trying to replace, is sadly still used.
One of the most successful attempts to create standardized benchmark application suites has been the SPEC (Standard Performance Evaluation Corporation), which had its roots in efforts in the late 1980s to deliver better benchmarks for workstations. Just as the computer industry has evolved over time, so has the need for different benchmark suites, and there are now SPEC benchmarks to cover many application classes. All the SPEC benchmark suites and their reported results are found at http://www.spec.org.
Although we focus our discussion on the SPEC benchmarks in many of the following sections, many benchmarks have also been developed for PCs running the Windows operating system.
Desktop Benchmarks
Desktop benchmarks divide into two broad classes: processor-intensive benchmarks and graphics-intensive benchmarks, although many graphics benchmarks include intensive processor activity. SPEC originally created a benchmark set focusing on processor performance (initially called SPEC89), which has evolved into its sixth generation: SPEC CPU2017, which follows SPEC2006, SPEC2000, SPEC95, SPEC92, and SPEC89. SPEC CPU2017 consists of a set of 10 integer benchmarks (CINT2017) and 17 floating-point benchmarks (CFP2017). Figure 1.17 describes the current SPEC CPU benchmarks and their ancestry.
SPEC benchmarks are real programs modified to be portable and to minimize the effect of I/O on performance. The integer benchmarks vary from part of a C compiler to a go program to a video compression. The floating-point benchmarks include molecular dynamics, ray tracing, and weather forecasting. The SPEC CPU suite is useful for processor benchmarking for both desktop systems and single-processor servers. We will see data on many of these programs throughout this book. However, these programs share little with modern programming languages and environments and the Google Translate application that Section 1.1 describes. Nearly half of them are written at least partially in Fortran! They are even statically linked instead of being dynamically linked like most real programs. Alas, the SPEC2017 applications themselves may be real, but they are not inspiring. It's not clear that SPECINT2017 and SPECFP2017 capture what is exciting about computing in the 21st century.
In Section 1.11, we describe pitfalls that have occurred in developing the SPEC CPU benchmark suite, as well as the challenges in maintaining a useful and predictive benchmark suite.
SPEC CPU2017 is aimed at processor performance, but SPEC offers many other benchmarks. Figure 1.18 lists the 17 SPEC benchmarks that are active in 2017.
Server Benchmarks
Just as servers have multiple functions, so are there multiple types of benchmarks. The simplest benchmark is perhaps a processor throughput-oriented benchmark. SPEC CPU2017 uses the SPEC CPU benchmarks to construct a simple throughput benchmark where the processing rate of a multiprocessor can be measured by running multiple copies (usually as many as there are processors) of each SPEC CPU benchmark and converting the CPU time into a rate. This leads to a measurement called the SPECrate, and it is a measure of request-level parallelism from Section 1.2. To measure thread-level parallelism, SPEC offers what they call high-performance computing benchmarks around OpenMP and MPI as well as for accelerators such as GPUs (see Figure 1.18).
Other than SPECrate, most server applications and benchmarks have significant I/O activity arising from either storage or network traffic, including benchmarks for file server systems, for web servers, and for database and transaction-processing systems. SPEC offers both a file server benchmark (SPECSFS) and a Java server benchmark. (Appendix D discusses some file and I/O system benchmarks in detail.) SPECvirt_Sc2013 evaluates end-to-end performance of virtualized data center servers. Another SPEC benchmark measures power, which we examine in Section 1.10.
Transaction-processing (TP) benchmarks measure the ability of a system to handle transactions that consist of database accesses and updates. Airline reservation systems and bank ATM systems are typical simple examples of TP; more sophisticated TP systems involve complex databases and decision-making.
In the mid-1980s, a group of concerned engineers formed the vendor-independent Transaction Processing Council (TPC) to try to create realistic and fair benchmarks for TP. The TPC benchmarks are described at http://www.tpc.org.
The first TPC benchmark, TPC-A, was published in 1985 and has since been replaced and enhanced by several different benchmarks. TPC-C, initially created in 1992, simulates a complex query environment. TPC-H models ad hoc decision support—the queries are unrelated and knowledge of past queries cannot be used to optimize future queries. The TPC-DI benchmark, a new data integration (DI) task also known as ETL, is an important part of data warehousing. TPC-E is an online transaction processing (OLTP) workload that simulates a brokerage firm’s customer accounts.
Recognizing the controversy between traditional relational databases and “No SQL” storage solutions, TPCx-HS measures systems using the Hadoop file system running MapReduce programs, and TPC-DS measures a decision support system that uses either a relational database or a Hadoop-based system. TPC-VMS and TPCx-V measure database performance for virtualized systems, and TPC-Energy adds energy metrics to all the existing TPC benchmarks.
All the TPC benchmarks measure performance in transactions per second. In addition, they include a response time requirement so that throughput performance is measured only when the response time limit is met. To model real-world systems, higher transaction rates are also associated with larger systems, in terms of both users and the database to which the transactions are applied. Finally, the system cost for a benchmark system must be included as well to allow accurate comparisons of cost-performance. TPC modified its pricing policy so that there is a single specification for all the TPC benchmarks and to allow verification of the prices that TPC publishes.
Reporting Performance Results
The guiding principle of reporting performance measurements should be reproducibility—list everything another experimenter would need to duplicate the results. A SPEC benchmark report requires an extensive description of the computer and the compiler flags, as well as the publication of both the baseline and the optimized results. In addition to hardware, software, and baseline tuning parameter descriptions, a SPEC report contains the actual performance times, shown both in tabular form and as a graph. A TPC benchmark report is even more complete, because it must include results of a benchmarking audit and cost information. These reports are excellent sources for finding the real costs of computing systems, since manufacturers compete on high performance and cost-performance.
Summarizing Performance Results
In practical computer design, one must evaluate myriad design choices for their relative quantitative benefits across a suite of benchmarks believed to be relevant. Likewise, consumers trying to choose a computer will rely on performance measurements from benchmarks, which ideally are similar to the users' applications. In both cases, it is useful to have measurements for a suite of benchmarks so that the performance of important applications is similar to that of one or more benchmarks in the suite and so that variability in performance can be understood. In the best case, the suite resembles a statistically valid sample of the application space, but such a sample requires more benchmarks than are typically found in most suites and requires a randomized sampling, which essentially no benchmark suite uses.
Once we have chosen to measure performance with a benchmark suite, we want to be able to summarize the performance results of the suite in a unique number. A simple approach to computing a summary result would be to compare the arithmetic means of the execution times of the programs in the suite. An alternative would be to add a weighting factor to each benchmark and use the weighted arithmetic mean as the single number to summarize performance. One approach is to use weights that make all programs execute an equal time on some reference computer, but this biases the results toward the performance characteristics of the reference computer.
Rather than pick weights, we could normalize execution times to a reference computer by dividing the time on the reference computer by the time on the computer being rated, yielding a ratio proportional to performance. SPEC uses this approach, calling the ratio the SPECRatio. It has a particularly useful property that matches the way we benchmark computer performance throughout this text—namely, comparing performance ratios. For example, suppose that the SPECRatio of computer A on a benchmark is 1.25 times that of computer B; then we know
1.25 = SPECRatio_A / SPECRatio_B = (Execution time_reference / Execution time_A) / (Execution time_reference / Execution time_B) = Execution time_B / Execution time_A = Performance_A / Performance_B
Notice that the execution times on the reference computer drop out and the choice of the reference computer is irrelevant when the comparisons are made as a ratio, which is the approach we consistently use. Figure 1.19 gives an example.
Because a SPECRatio is a ratio rather than an absolute execution time, the mean must be computed using the geometric mean. (Because SPECRatios have no units, comparing SPECRatios arithmetically is meaningless.) The formula is
Geometric mean = nth root of (sample_1 × sample_2 × … × sample_n)
In the case of SPEC, sample_i is the SPECRatio for program i. Using the geometric mean ensures two important properties, as the sketch following this list illustrates:
1. The geometric mean of the ratios is the same as the ratio of the geometric means.
2. The ratio of the geometric means is equal to the geometric mean of the performance ratios, which implies that the choice of the reference computer is irrelevant.
Therefore the motivations to use the geometric mean are substantial, especially when we use performance ratios to make comparisons.
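The following Python sketch, with made-up execution times, illustrates both properties: the two prints produce the same number no matter what reference times are used.

import math

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical execution times (seconds) on a reference machine and on computers A and B.
reference = [100.0, 200.0, 400.0]
times_a   = [ 20.0,  50.0,  40.0]
times_b   = [ 25.0,  40.0,  80.0]

specratio_a = [r / t for r, t in zip(reference, times_a)]
specratio_b = [r / t for r, t in zip(reference, times_b)]

# The ratio of the geometric means equals the geometric mean of the
# per-program performance ratios, so the reference times cancel out.
print(geometric_mean(specratio_a) / geometric_mean(specratio_b))       # ~1.26
print(geometric_mean([tb / ta for ta, tb in zip(times_a, times_b)]))   # ~1.26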
1.9 Quantitative Principles of Computer Design
Now that we have seen how to define, measure, and summarize performance, cost, dependability, energy, and power, we can explore guidelines and principles that are useful in the design and analysis of computers. This section introduces important observations about design, as well as two equations to evaluate alternatives.
Take Advantage of Parallelism
Using parallelism is one of the most important methods for improving performance. Every chapter in this book has an example of how performance is enhanced through the exploitation of parallelism. We give three brief examples here, which are expounded on in later chapters.
Our first example is the use of parallelism at the system level. To improve the throughput performance on a typical server benchmark, such as SPECSFS or TPC-C, multiple processors and multiple storage devices can be used. The workload of handling requests can then be spread among the processors and storage devices, resulting in improved throughput. Being able to expand memory and the number of processors and storage devices is called scalability, and it is a valuable asset for servers. Spreading of data across many storage devices for parallel reads and writes enables data-level parallelism. SPECSFS also relies on request-level parallelism to use many processors, whereas TPC-C uses thread-level parallelism for faster processing of database queries.
At the level of an individual processor, taking advantage of parallelism among instructions is critical to achieving high performance. One of the simplest ways to do this is through pipelining. (Pipelining is explained in more detail in Appendix C and is a major focus of Chapter 3.) The basic idea behind pipelining is to overlap instruction execution to reduce the total time to complete an instruction sequence. A key insight into pipelining is that not every instruction depends on its immediate predecessor, so executing the instructions completely or partially in parallel may be possible. Pipelining is the best-known example of ILP.
Parallelism can also be exploited at the level of detailed digital design. For example, set-associative caches use multiple banks of memory that are typically searched in parallel to find a desired item. Arithmetic-logical units use carry-lookahead, which uses parallelism to speed the process of computing sums from linear to logarithmic in the number of bits per operand. These are more examples of data-level parallelism.
Principle of Locality
Important fundamental observations have come from properties of programs. The most important program property that we regularly exploit is the principle of locality: programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code. An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. The principle of locality also applies to data accesses, though not as strongly as to code accesses.
Two different types of locality have been observed. Temporal locality states that recently accessed items are likely to be accessed soon. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. We will see these principles applied in Chapter 2.
Focus on the Common Case
Perhaps the most important and pervasive principle of computer design is to focus on the common case: in making a design trade-off, favor the frequent case over the infrequent case. This principle applies when determining how to spend resources, because the impact of the improvement is higher if the occurrence is commonplace. Focusing on the common case works for energy as well as for resource allocation and performance. The instruction fetch and decode unit of a processor may be used much more frequently than a multiplier, so optimize it first. It works on dependability as well. If a database server has 50 storage devices for every processor, storage dependability will dominate system dependability.
In addition, the common case is often simpler and can be done faster than the infrequent case. For example, when adding two numbers in the processor, we can expect overflow to be a rare circumstance and can therefore improve performance by optimizing the more common case of no overflow. This emphasis may slow down the case when overflow occurs, but if that is rare, then overall performance will be improved by optimizing for the normal case.
We will see many cases of this principle throughout this text. In applying this simple principle, we have to decide what the frequent case is and how much performance can be improved by making that case faster. A fundamental law, called Amdahl's Law, can be used to quantify this principle.
Amdahl’s Law
The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl's Law. Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Amdahl's Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a computer that will improve performance when it is used. Speedup is the ratio
Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement
Alternatively,
Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible
Speedup tells us how much faster a task will run using the computer with the enhancement as opposed to the original computer.
Amdahl's Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:
1. The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement—For example, if 40 seconds of the execution time of a program that takes 100 seconds in total can use an enhancement, the fraction is 40/100. This value, which we call Fraction_enhanced, is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode, that is, how much faster the task would run if the enhanced mode were used for the entire program—This value is the time of the original mode over the time of the enhanced mode. If the enhanced mode takes, say, 4 seconds for a portion of the program, while it is 40 seconds in the original mode, the improvement is 40/4 or 10. We call this value, which is always greater than 1, Speedup_enhanced.
The execution time using the original computer with the enhanced mode will be the time spent using the unenhanced portion of the computer plus the time spent using the enhancement:
Execution time_new = Execution time_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
The overall speedup is the ratio of the execution times:
Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
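A minimal Python sketch of the overall speedup formula, using the 40-out-of-100-seconds example from the two factors listed above:

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 40 of 100 seconds can use the enhancement, and that portion runs 10 times faster.
print(amdahl_speedup(0.40, 10))    # 1.5625
# Even an infinitely fast enhancement cannot beat 1 / (1 - fraction) = 1.667.
print(amdahl_speedup(0.40, 1e12))  # ~1.6667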
Amdahl's Law expresses the law of diminishing returns: The incremental improvement in speedup gained by an improvement of just a portion of the computation diminishes as improvements are added. An important corollary of Amdahl's Law is that if an enhancement is usable only for a fraction of a task, then we can't speed up the task by more than the reciprocal of 1 minus that fraction.
A common mistake in applying Amdahl's Law is to confuse "fraction of time converted to use an enhancement" and "fraction of time after enhancement is in use." If, instead of measuring the time that we could use the enhancement in a computation, we measure the time after the enhancement is in use, the results will be incorrect!
Amdahl's Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost-performance. The goal, clearly, is to spend resources proportional to where time is spent. Amdahl's Law is particularly useful for comparing the overall system performance of two alternatives, but it can also be applied to compare two processor design alternatives, as the following example shows.
Amdahl’s Law is applicable beyond performance. Let’s redo the reliability example from page 39 after improving the reliability of the power supply via redundancy from 200,000-hour to 830,000,000-hour MTTF, or 4150 × better.
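The 830,000,000-hour figure is consistent with the usual model for a redundant pair, MTTF_pair = MTTF² / (2 × MTTR), if we assume it takes about 24 hours to notice and replace a failed supply; the MTTR value is our assumption and is not restated in this passage.

def redundant_pair_mttf(mttf_hours, mttr_hours):
    # One supply fails, and the second must also fail before the first is repaired;
    # failures are assumed independent and exponentially distributed.
    return mttf_hours ** 2 / (2 * mttr_hours)

improved = redundant_pair_mttf(200_000, 24)
print(improved)             # about 8.3e8 hours
print(improved / 200_000)   # about 4167x, in line with the 4150x quoted above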
In the preceding examples, we needed the fraction consumed by the new and improved version; often it is difficult to measure these times directly. In the next section, we will see another way of doing such comparisons based on the use of an equation that decomposes the CPU execution time into three separate components. If we know how an alternative affects these three components, we can determine its overall performance. Furthermore, it is often possible to build simulators that measure these components before the hardware is actually designed.
The Processor Performance Equation
Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed two ways:
CPU time = CPU clock cycles for a program × Clock cycle time
or
CPU time = CPU clock cycles for a program / Clock rate
In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed—the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI). Because it is easier to work with, and because we will deal with simple processors in this chapter, we use CPI. Designers sometimes also use instructions per clock (IPC), which is the inverse of CPI.
CPI is computed as
CPI = CPU clock cycles for a program / Instruction count
This processor figure of merit provides insight into different styles of instruction sets and implementations, and we will use it extensively in the next four chapters.
By transposing the instruction count in the preceding formula, clock cycles can be defined as IC × CPI. This allows us to use CPI in the execution time formula:
CPU time = Instruction count × Cycles per instruction × Clock cycle time
Expanding the first formula into the units of measurement shows how the pieces fit together:
CPU time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle) = Seconds / Program
As this formula demonstrates, processor performance is dependent upon three characteristics: clock cycle (or rate), clock cycles per instruction, and instruction count. Furthermore, CPU time is equally dependent on these three characteristics; for example, a 10% improvement in any one of them leads to a 10% improvement in CPU time.
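A tiny Python sketch of the equation; the instruction count, CPI, and clock rate below are made-up numbers for illustration only.

def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 1 billion instructions, CPI of 1.5, 2 GHz clock.
print(cpu_time_seconds(1e9, 1.5, 2e9))        # 0.75 s
# A 10% improvement in any single factor gives a 10% improvement in CPU time.
print(cpu_time_seconds(1e9, 1.5 / 1.1, 2e9))  # ~0.68 s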
Unfortunately, it is difficult to change one parameter in complete isolation from others because the basic technologies involved in changing each characteristic are interdependent:
- Clock cycle time—Hardware technology and organization
- CPI—Organization and instruction set architecture
- Instruction count—Instruction set architecture and compiler technology
Luckily, many potential performance improvement techniques primarily enhance one component of processor performance with small or predictable impacts on the other two.
In designing the processor, sometimes it is useful to calculate the number of total processor clock cycles as
CPU clock cycles = Σ (IC_i × CPI_i), summed over the n instruction types i
where IC_i represents the number of times instruction i is executed in a program and CPI_i represents the average number of clocks per instruction for instruction i. This form can be used to express CPU time as
CPU time = (Σ IC_i × CPI_i) × Clock cycle time
and overall CPI as
CPI = (Σ IC_i × CPI_i) / Instruction count = Σ (IC_i / Instruction count) × CPI_i
The latter form of the CPI calculation uses each individual CPI_i and the fraction of occurrences of that instruction in a program (i.e., IC_i divided by the instruction count). Because it must include pipeline effects, cache misses, and any other memory system inefficiencies, CPI_i should be measured and not just calculated from a table in the back of a reference manual.
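A short Python sketch of the summation form; the instruction classes, counts, and per-class CPIs below are invented for illustration, not measured values.

# Hypothetical instruction mix: (count, CPI) per instruction class.
mix = {
    "ALU":    (500e6, 1.0),
    "load":   (200e6, 2.0),
    "store":  (100e6, 2.0),
    "branch": (200e6, 1.5),
}

instruction_count = sum(ic for ic, _ in mix.values())
clock_cycles = sum(ic * cpi for ic, cpi in mix.values())

print(clock_cycles)                      # 1.4e9 cycles
print(clock_cycles / instruction_count)  # overall CPI = 1.4
# Equivalent weighted form: sum of (IC_i / instruction count) x CPI_i.
print(sum((ic / instruction_count) * cpi for ic, cpi in mix.values()))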
Consider our performance example on page 52, here modified to use measurements of the frequency of the instructions and of the instruction CPI values, which, in practice, are obtained by simulation or by hardware instrumentation.
It is often possible to measure the constituent parts of the processor performance equation. Such isolated measurements are a key advantage of using the processor performance equation versus Amdahl's Law in the previous example. In particular, it may be difficult to measure things such as the fraction of execution time for which a set of instructions is responsible. In practice, this would probably be computed by summing the product of the instruction count and the CPI for each of the instructions in the set. Since the starting point is often individual instruction count and CPI measurements, the processor performance equation is incredibly useful.
To use the processor performance equation as a design tool, we need to be able to measure the various factors. For an existing processor, it is easy to obtain the execution time by measurement, and we know the default clock speed. The challenge lies in discovering the instruction count or the CPI. Most processors include counters for both instructions executed and clock cycles. By periodically monitoring these counters, it is also possible to attach execution time and instruction count to segments of the code, which can be helpful to programmers trying to understand and tune the performance of an application. Often designers or programmers will want to understand performance at a more fine-grained level than what is available from the hardware counters. For example, they may want to know why the CPI is what it is. In such cases, the simulation techniques used are like those for processors that are being designed.
Techniques that help with energy efficiency, such as dynamic voltage frequency scaling and overclocking (see Section 1.5), make this equation harder to use, because the clock speed may vary while we measure the program. A simple approach is to turn off those features to make the results reproducible. Fortunately, as performance and energy efficiency are often highly correlated—taking less time to run a program generally saves energy—it's probably safe to consider performance without worrying about the impact of DVFS or overclocking on the results.
1.10 Putting It All Together: Performance, Price, and Power
In the “Putting It All Together” sections that appear near the end of every chapter, we provide real examples that use the principles in that chapter. In this section, we look at measures of performance and power-performance in small servers using the SPECpower benchmark.
Figure 1.20 shows the three multiprocessor servers we are evaluating along with their price. To keep the price comparison fair, all are Dell PowerEdge servers. The first is the PowerEdge R730, which is based on the Intel Xeon 85670 microprocessor with a clock rate of 2.93 GHz. Unlike the Intel Core i7-6700 in Chapters 2–5, which has 20 cores and a 40 MB L3 cache, this Intel chip has 22 cores and a 55 MB L3 cache, although the cores themselves are identical. We selected a two-socket system—so 44 cores total—with 128 GB of ECC-protected 2400 MHz DDR4 DRAM. The next server is the PowerEdge R630, with the same processor, number of sockets, and DRAM. The main difference is a smaller rack-mountable package: "2U" high (3.5 inches) for the 730 versus "1U" (1.75 inches) for the 630.
The third server is a cluster of 16 of the PowerEdge R630s that is connected together with a 1 Gbit/s Ethernet switch. All are running the Oracle Java HotSpot version 1.7 Java Virtual Machine (JVM) and the Microsoft Windows Server 2012 R2 Datacenter version 6.3 operating system.
Note that because of the forces of benchmarking (see Section 1.11), these are unusually configured servers. The systems in Figure 1.20 have little memory relative to the amount of computation, and just a tiny 120 GB solid-state disk. It is inexpensive to add cores if you don't need to add commensurate increases in memory and storage!
Rather than run statically linked C programs of SPEC CPU, SPECpower uses a more modern software stack written in Java. It is based on SPECjbb, and it represents the server side of business applications, with performance measured as the number of transactions per second, called ssj_ops for server side Java operations per second. It exercises not only the processor of the server, as does SPEC CPU, but also the caches, memory system, and even the multiprocessor interconnection system. In addition, it exercises the JVM, including the JIT runtime compiler and garbage collector, as well as portions of the underlying operating system.
As the last two rows of Figure 1.20 show, the performance winner is the cluster of 16 R630s, which is hardly a surprise since it is by far the most expensive. The price-performance winner is the PowerEdge R630, but it barely beats the cluster at 213 versus 211 ssj-ops/$. Amazingly, the 16-node cluster is within 1% of the price-performance of a single node despite being 16 times as large.
While most benchmarks (and most computer architects) care only about performance of systems at peak load, computers rarely run at peak load. Indeed, Figure 6.2 in Chapter 6 shows the results of measuring the utilization of tens of thousands of servers over 6 months at Google, and less than 1% operate at an average utilization of 100%. The majority have an average utilization of between 10% and 50%. Thus the SPECpower benchmark captures power as the target workload varies from its peak in 10% intervals all the way to 0%, which is called Active Idle. Figure 1.21 plots the ssj_ops (SSJ operations/second) per watt and the average power as the target load varies from 100% to 0%. The R730 always has the lowest power and the single-node R630 has the best ssj_ops per watt across each target workload level. Since watts = joules/second, this metric is proportional to SSJ operations per joule:
ssj_ops / watt = (ssj_ops / second) / watt = (ssj_ops / second) / (joules / second) = ssj_ops / joule
To calculate a single number to use to compare the power efficiency of systems, SPECpower uses
Overall ssj_ops per watt = (sum of the ssj_ops measured at every target load level) / (sum of the average power measured at every target load level)
The overall ssj_ops/watt of the three servers is 10,802 for the R730, 11,157 for the R630, and 10,062 for the cluster of 16 R630s. Therefore the single-node R630 has the best power-performance. Dividing by the price of the servers, the ssj_ops/watt/$1,000 is 879 for the R730, 899 for the R630, and 789 (per node) for the 16-node cluster of R630s. Thus, after adding power, the single-node R630 is still in first place in performance/price, but now the single-node R730 is significantly more efficient than the 16-node cluster.
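To make the metric concrete, the Python sketch below forms the single SPECpower number from per-load-level measurements and then divides by price; every number in it is invented for illustration and is not taken from Figure 1.20 or 1.21.

# (ssj_ops, average watts) at target loads 100%, 90%, ..., 10%, and Active Idle.
levels = [
    (3_000_000, 270), (2_700_000, 250), (2_400_000, 230), (2_100_000, 215),
    (1_800_000, 200), (1_500_000, 185), (1_200_000, 170), (900_000, 160),
    (600_000, 150), (300_000, 140), (0, 130),
]

overall_ssj_ops_per_watt = sum(ops for ops, _ in levels) / sum(w for _, w in levels)
print(round(overall_ssj_ops_per_watt))   # about 7857 for these made-up numbers

# Dividing by the server price in thousands of dollars gives ssj_ops/watt/$1,000,
# the cost-adjusted figure used in the text; the price here is hypothetical.
hypothetical_price = 12_000
print(round(overall_ssj_ops_per_watt / (hypothetical_price / 1000)))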
1.11 Fallacies and Pitfalls
The purpose of this section, which will be found in every chapter, is to explain some commonly held misbeliefs or misconceptions that you should avoid. We call such misbeliefs fallacies. When discussing a fallacy, we try to give a counterexample. We also discuss pitfalls—easily made mistakes. Often pitfalls are generalizations of principles that are true in a limited context. The purpose of these sections is to help you avoid making these errors in computers that you design.
Pitfall: All exponential laws must come to an end.
The first to go was Dennard scaling. Dennard's 1974 observation was that power density was constant as transistors got smaller. If a transistor's linear region shrank by a factor of 2, then both the current and voltage were also reduced by a factor of 2, and so the power it used fell by 4. Thus chips could be designed to operate faster and still use less power. Dennard scaling ended 30 years after it was observed, not because transistors didn't continue to get smaller but because integrated circuit dependability limited how far current and voltage could drop. The threshold voltage was driven so low that static power became a significant fraction of overall power. The next deceleration was hard disk drives. Although there was no law for disks, in the past 30 years the maximum areal density of hard drives—which determines disk capacity—improved by 30%–100% per year. In more recent years, it has been less than 5% per year. Increasing density per drive has come primarily from adding more platters to a hard disk drive.
Next up was the venerable Moore's Law. It's been a while since the number of transistors per chip doubled every one to two years. For example, the DRAM chip introduced in 2014 contained 8B transistors, and we won't have a 16B transistor DRAM chip in mass production until 2019, but Moore's Law predicts a 64B transistor DRAM chip.
Moreover, the actual end of scaling of the planar logic transistor was even predicted to occur by 2021. Figure 1.22 shows the predictions of the physical gate length of the logic transistor from two editions of the International Technology Roadmap for Semiconductors (ITRS). Unlike the 2013 report that projected gate lengths to reach 5 nm by 2028, the 2015 report projects the length stopping at 10 nm by 2021. Density improvements thereafter would have to come from ways other than shrinking the dimensions of transistors. It's not as dire as the ITRS suggests, as companies like Intel and TSMC have plans to shrink to 3 nm gate lengths, but the rate of change is decreasing.
Figure 1.23 shows the changes in increases in bandwidth over time for microprocessors and DRAM—which are affected by the end of Dennard scaling and Moore's Law—as well as for disks. The slowing of technology improvements is apparent in the dropping curves. The continued networking improvement is due to advances in fiber optics and a planned change in pulse amplitude modulation (PAM-4) allowing two-bit encoding so as to transmit information at 400 Gbit/s.
Fallacy: Multiprocessors are a silver bullet.
The switch to multiple processors per chip around 2005 did not come from some breakthrough that dramatically simplified parallel programming or made it easy to build multicore computers. The change occurred because there was no other option due to the ILP walls and power walls. Multiple processors per chip do not guarantee lower power; it's certainly feasible to design a multicore chip that uses more power. The potential is just that it's possible to continue to improve performance by replacing a high-clock-rate, inefficient core with several lower-clock-rate, efficient cores. As technology to shrink transistors improves, it can shrink both capacitance and the supply voltage a bit so that we can get a modest increase in the number of cores per generation. For example, for the past few years, Intel has been adding two cores per generation in their higher-end chips.
As we will see in Chapters 4 and 5, performance is now a programmer's burden. The programmers' La-Z-Boy era of relying on a hardware designer to make their programs go faster without lifting a finger is officially over. If programmers want their programs to go faster with each generation, they must make their programs more parallel.
The popular version of Moore's Law—increasing performance with each generation of technology—is now up to programmers.
Pitfall: Falling prey to Amdahl's heartbreaking law.
Virtually every practicing computer architect knows Amdahl’s Law. Despite this, we almost all occasionally expend tremendous effort optimizing some feature before we measure its usage. Only when the overall speedup is disappointing do we recall that we should have measured first before we spent so much effort enhancing it!
Pitfall: A single point of failure.
The calculations of reliability improvement using Amdahl’s Law on page 53 show that dependability is no stronger than the weakest link in a chain. No matter how much more dependable we make the power supplies, as we did in our example, the single fan will limit the reliability of the disk subsystem. This Amdahl’s Law observation led to a rule of thumb for fault-tolerant systems to make sure that every component was redundant so that no single component failure could bring down the whole system. Chapter 6 shows how a software layer avoids single points of failure inside WSCs.
Fallacy: Hardware enhancements that increase performance also improve energy efficiency, or are at worst energy neutral.
Esmaeilzadeh et al. (2011) measured SPEC2006 on just one core of a 2.67 GHz Intel Core i7 using Turbo mode (Section 1.5). Performance increased by a factor of 1.07 when the clock rate increased to 2.94 GHz (or a factor of 1.10), but the i7 used a factor of 1.37 more joules and a factor of 1.47 more watt hours!
Fallacy: Benchmarks remain valid indefinitely.
Several factors influence the usefulness of a benchmark as a predictor of real performance, and some change over time. A big factor influencing the usefulness of a benchmark is its ability to resist “benchmark engineering” or “benchmarketing.” Once a benchmark becomes standardized and popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark. Short kernels or programs that spend their time in a small amount of code are particularly vulnerable.
For example, despite the best intentions, the initial SPEC89 benchmark suite included a small kernel, called matrix300, which consisted of eight different 300 × 300 matrix multiplications. In this kernel, 99% of the execution time was in a single line (see SPEC, 1989). When an IBM compiler optimized this inner loop (using a good idea called blocking, discussed in Chapters 2 and 4), performance improved by a factor of 9 over a prior version of the compiler! This benchmark tested compiler tuning and was not, of course, a good indication of overall performance, nor of the typical value of this particular optimization.
Figure 1.19 shows that if we ignore history, we may be forced to repeat it. SPEC Cint2006 had not been updated for a decade, giving compiler writers substantial time to hone their optimizers to this suite. Note that the SPEC ratios of all benchmarks but libquantum fall within the range of 16–52 for the AMD computer and from 22 to 78 for Intel. Libquantum runs about 250 times faster on AMD and 7300 times faster on Intel! This “miracle” is a result of optimizations by the Intel compiler that automatically parallelizes the code across 22 cores and optimizes memory by using bit packing, which packs together multiple narrow-range integers to save memory space and thus memory bandwidth. If we drop this benchmark and recalculate the geometric means, AMD SPEC Cint2006 falls from 31.9 to 26.5 and Intel from 63.7 to 41.4. The Intel computer is now about 1.5 times as fast as the AMD computer instead of 2.0 if we include libquantum, which is surely closer to their real relative performances. SPEC CPU2017 dropped libquantum.
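The leverage that a single engineered score has on the summary number is easy to reproduce. Here is a small Python sketch with hypothetical SPEC ratios (the per-benchmark values behind Figure 1.19 are not reproduced here), showing how one libquantum-style outlier inflates a geometric mean:

from math import prod

def geometric_mean(ratios):
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical SPEC ratios for 12 benchmarks: 11 "normal" scores plus one
# libquantum-style outlier that has been benchmark-engineered.
normal  = [22, 25, 30, 33, 35, 40, 44, 50, 55, 60, 70]
outlier = 7300

with_outlier    = geometric_mean(normal + [outlier])
without_outlier = geometric_mean(normal)
print(f"geometric mean with the outlier:    {with_outlier:.1f}")
print(f"geometric mean without the outlier: {without_outlier:.1f}")
# One inflated score lifts the summary number substantially, which is why
# dropping libquantum moves the Intel result from 63.7 to 41.4.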
To illustrate the short lives of benchmarks, Figure 1.17 on page 43 lists the status of all 82 benchmarks from the various SPEC releases; Gcc is the lone survivor from SPEC89. Amazingly, about 70% of all programs from SPEC2000 or earlier were dropped from the next release.
Fallacy: The rated mean time to failure of disks is 1,200,000 hours or almost 140 years, so disks practically never fail.
The current marketing practices of disk manufacturers can mislead users. How is such an MTTF calculated? Early in the process, manufacturers will put thousands of disks in a room, run them for a few months, and count the number that fail. They compute MTTF as the total number of hours that the disks worked cumulatively divided by the number that failed.
One problem is that this number far exceeds the lifetime of a disk, which is commonly assumed to be five years or 43,800 hours. For this large MTTF to make some sense, disk manufacturers argue that the model corresponds to a user who buys a disk and then keeps replacing the disk every 5 years—the planned lifetime of the disk. The claim is that if many customers (and their great-grandchildren) did this for the next century, on average they would replace a disk 27 times before a failure, or about 140 years.
A more useful measure is the percentage of disks that fail, which is called the annual failure rate. Assume 1000 disks with a 1,000,000-hour MTTF and that the disks are used 24 hours a day. If you replaced failed disks with a new one having the same reliability characteristics, the number that would fail in a year (8760 hours) is

Failed disks = (Number of disks × Time period) / MTTF = (1000 disks × 8760 hours/disk) / 1,000,000 hours = 9
Stated alternatively, 0.9% would fail per year, or 4.4% over a 5-year lifetime.
Moreover, those high numbers are quoted assuming limited ranges of temperature and vibration; if they are exceeded, then all bets are off. A survey of disk drives in real environments (Gray and van Ingen, 2005) found that 3%–7% of drives failed per year, for an MTTF of about 125,000–300,000 hours. An even larger study found annual disk failure rates of 2%–10% (Pinheiro et al., 2007). Therefore the real-world MTTF is about 2–10 times worse than the manufacturer’s MTTF.
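The annual-failure-rate arithmetic above is simple enough to write down directly. A minimal Python sketch of the calculation, using the same 1000-disk, 1,000,000-hour example:

# Annual failure rate implied by a rated MTTF, per the calculation above.
n_disks    = 1000
mttf_hours = 1_000_000          # manufacturer-rated MTTF
hours_year = 8760               # 24 hours x 365 days

failures_per_year = n_disks * hours_year / mttf_hours   # ~8.8 disks
afr = hours_year / mttf_hours                            # per-disk annual failure rate

print(f"expected failures per year among {n_disks} disks: {failures_per_year:.1f}")
print(f"annual failure rate: {afr:.1%}; over a 5-year lifetime: {5 * afr:.1%}")
# Prints roughly 0.9% per year and 4.4% over 5 years, matching the text.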
Fallacy: Peak performance tracks observed performance.
The only universally true definition of peak performance is “the performance level a computer is guaranteed not to exceed.” Figure 1.24 shows the percentage of peak performance for four programs on four multiprocessors. It varies from 5% to 58%. Since the gap is so large and can vary significantly by benchmark, peak performance is not generally useful in predicting observed performance.
Pitfall: Fault detection can lower availability.
This apparently ironic pitfall is because computer hardware has a fair amount of state that may not always be critical to proper operation. For example, it is not fatal if an error occurs in a branch predictor, because only performance may suffer.
In processors that try to exploit ILP aggressively, not all the operations are needed for correct execution of the program. Mukherjee et al. (2003) found that less than 30% of the operations were potentially on the critical path for the SPEC2000 benchmarks.
The same observation is true about programs. If a register is “dead” in a program—that is, the program will write the register before it is read again—then errors do not matter. If you were to crash the program upon detection of a transient fault in a dead register, it would lower availability unnecessarily.
The Sun Microsystems Division of Oracle lived this pitfall in 2000 with an L2 cache that included parity, but not error correction, in its Sun E3000 to Sun E10000 systems. The SRAMs they used to build the caches had intermittent faults, which parity detected. If the data in the cache were not modified, the processor would simply reread the data from the cache. Because the designers did not protect the cache with ECC (error-correcting code), the operating system had no choice but to report an error to dirty data and crash the program. Field engineers found no problems on inspection in more than 90% of the cases.
To reduce the frequency of such errors, Sun modified the Solaris operating system to “scrub” the cache by having a process that proactively wrote dirty data to memory. Because the processor chips did not have enough pins to add ECC, the only hardware option for dirty data was to duplicate the external cache, using the copy without the parity error to correct the error.
The pitfall is in detecting faults without providing a mechanism to correct them. These engineers are unlikely to design another computer without ECC on external caches.
1.12 Concluding Remarks
This chapter has introduced a number of concepts and provided a quantitative framework that we will expand on throughout the book. Starting with the last edition, energy efficiency is the constant companion to performance.
In Chapter 2, we start with the all-important area of memory system design. We will examine a wide range of techniques that conspire to make memory look infinitely large while still being as fast as possible. (Appendix B provides introductory material on caches for readers without much experience and background with them.) As in later chapters, we will see that hardware-software cooperation has become a key to high-performance memory systems, just as it has to high-performance pipelines. This chapter also covers virtual machines, an increasingly important technique for protection.
In Chapter 3, we look at ILP, of which pipelining is the simplest and most common form. Exploiting ILP is one of the most important techniques for building high-speed uniprocessors. Chapter 3 begins with an extensive discussion of basic concepts that will prepare you for the wide range of ideas examined in both chapters. Chapter 3 uses examples that span about 40 years, drawing from one of the first supercomputers (IBM 360/91) to the fastest processors on the market in 2017. It emphasizes what is called the dynamic or runtime approach to exploiting ILP. It also talks about the limits to ILP ideas and introduces multithreading, which is further developed in both Chapters 4 and 5. Appendix C provides introductory material on pipelining for readers without much experience and background in pipelining. (We expect it to be a review for many readers, including those of our introductory text, Computer Organization and Design: The Hardware/Software Interface.)
Chapter 4 explains three ways to exploit data-level parallelism. The classic and oldest approach is vector architecture, and we start there to lay down the principles of SIMD design. (Appendix G goes into greater depth on vector architectures.) We next explain the SIMD instruction set extensions found in most desktop microprocessors today. The third piece is an in-depth explanation of how modern graphics processing units (GPUs) work. Most GPU descriptions are written from the programmer’s perspective, which usually hides how the computer really works. This section explains GPUs from an insider’s perspective, including a mapping between GPU jargon and more traditional architecture terms.
Chapter 5 focuses on the issue of achieving higher performance using multiple processors, or multiprocessors. Instead of using parallelism to overlap individual instructions, multiprocessing uses parallelism to allow multiple instruction streams to be executed simultaneously on different processors. Our focus is on the dominant form of multiprocessors, shared-memory multiprocessors, though we introduce other types as well and discuss the broad issues that arise in any multiprocessor. Here again we explore a variety of techniques, focusing on the important ideas first introduced in the 1980s and 1990s.
Chapter 6 introduces clusters and then goes into depth on WSCs, which computer architects help design. The designers of WSCs are the professional descendants of the pioneers of supercomputers, such as Seymour Cray, in that they are designing extreme computers. WSCs contain tens of thousands of servers, and the equipment and the building that holds them cost nearly $200 million. The concerns of price-performance and energy efficiency of the earlier chapters apply to WSCs, as does the quantitative approach to making decisions.
Chapter 7 is new to this edition. It introduces domain-specific architectures as the only path forward for improved performance and energy efficiency given the end of Moore’s Law and Dennard scaling. It offers guidelines on how to build effective domain-specific architectures, introduces the exciting domain of deep neural networks, describes four recent examples that take very different approaches to accelerating neural networks, and then compares their cost-performance.
This book comes with an abundance of material online (see Preface for more details), both to reduce cost and to introduce readers to a variety of advanced topics. Figure 1.25 shows them all. Appendices A–C, which appear in the book, will be a review for many readers.
In Appendix D, we move away from a processor-centric view and discuss issues in storage systems. We apply a similar quantitative approach, but one based on observations of system behavior and using an end-to-end approach to performance analysis. This appendix addresses the important issue of how to store and retrieve data efficiently using primarily lower-cost magnetic storage technologies. Our focus is on examining the performance of disk storage systems for typical I/O-intensive workloads, such as the OLTP benchmarks mentioned in this chapter. We extensively explore advanced topics in RAID-based systems, which use redundant disks to achieve both high performance and high availability. Finally, Appendix D introduces queuing theory, which gives a basis for trading off utilization and latency.
Appendix E applies an embedded computing perspective to the ideas of each of the chapters and early appendices.
Appendix F explores the topic of system interconnect broadly, including wide area and system area networks that allow computers to communicate.
Appendix H reviews VLIW hardware and software, which, in contrast, are less popular than when EPIC appeared on the scene just before the last edition.
Appendix I describes large-scale multiprocessors for use in high-performance computing.
Appendix J is the only appendix that remains from the first edition, and it covers computer arithmetic.
Appendix K provides a survey of instruction set architectures, including the 80x86, the IBM 360, the VAX, and many RISC architectures, including ARM, MIPS, Power, RISC-V, and SPARC.
Appendix L is new and discusses advanced techniques for memory management, focusing on support for virtual machines and design of address translation for very large address spaces. With the growth in cloud processors, these architectural enhancements are becoming more important.
We describe Appendix M next.
1.13 Historical Perspectives and References
Appendix M (available online) includes historical perspectives on the key ideas presented in each of the chapters in this text. These historical perspective sections allow us to trace the development of an idea through a series of machines or to describe significant projects. If you’re interested in examining the initial development of an idea or processor or want further reading, references are provided at the end of each history. For this chapter, see Section M.2, “The Early Development of Computers,” for a discussion on the early development of digital computers and performance measurement methodologies.
As you read the historical material, you’ll soon come to realize that one of the important benefits of the youth of computing, compared to many other engineering fields, is that some of the pioneers are still alive—we can learn the history by simply asking them!
Case Studies and Exercises by Diana Franklin
Case Study 1: Chip Fabrication Cost
Concepts illustrated by this case study
- Fabrication Cost
- Fabrication Yield
- Defect Tolerance Through Redundancy
Many factors are involved in the price of a computer chip. Intel is spending $7 billion to complete its Fab 42 fabrication facility for 7 nm technology. In this case study, we explore a hypothetical company in the same situation and how different design decisions involving fabrication technology, area, and redundancy affect the cost of chips.
1.1[10/10] <1.6> Figure 1.26 gives hypothetical relevant chip statistics that influence the cost of several current chips. In the next few exercises, you will be exploring the effect of different possible design decisions for the Intel chips.
a.[10] <1.6> What is the yield for the Phoenix chip?
b.[10] <1.6> Why does Phoenix have a higher defect rate than BlueDragon?
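Because Figure 1.26 is not reproduced here, the Python sketch below applies the die-cost model of Section 1.6 to placeholder numbers: the dies-per-wafer approximation and the die-yield formula are the chapter’s, while the die area, defect density, and process-complexity factor N are hypothetical stand-ins for a Figure 1.26 row.

import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Dies per wafer, using the approximation from Section 1.6."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss  = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return int(wafer_area / die_area_cm2 - edge_loss)

def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    """Die yield = wafer yield * 1 / (1 + defect density * die area)^N."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

# Placeholder values standing in for a Figure 1.26 row (not the real chip data):
die_area = 2.0    # cm^2
defects  = 0.04   # defects per cm^2
n_factor = 12     # process-complexity factor
diameter = 45     # cm (the case study's 450 mm wafer)

print("dies per wafer:", dies_per_wafer(diameter, die_area))
print(f"die yield: {die_yield(defects, die_area, n_factor):.2%}")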
1.2[20/20/20/20] <1.6> They will sell a range of chips from that factory, and they need to decide how much capacity to dedicate to each chip. Imagine that they will sell two chips. Phoenix is a completely new architecture designed with 7 nm technology in mind, whereas RedDragon is the same architecture as their 10 nm BlueDragon. Imagine that RedDragon will make a profit of $15 per defect-free chip. Phoenix will make a profit of $30 per defect-free chip. Each wafer has a 450 mm diameter.
a.[20] <1.6> How much profit do you make on each wafer of Phoenix chips?
b.[20] <1.6> How much profit do you make on each wafer of RedDragon chips?
c.[20] <1.6> If your demand is 50,000 RedDragon chips per month and 25,000 Phoenix chips per month, and your facility can fabricate 70 wafers a month, how many wafers should you make of each chip?
1.3[20/20] <1.6> Your colleague at AMD suggests that, since the yield is so poor, you might make chips more cheaply if you released multiple versions of the same chip, just with different numbers of cores. For example, you could sell Phoenix8, Phoenix4, Phoenix2, and Phoenix1, which contain 8, 4, 2, and 1 cores on each chip, respectively. If all eight cores are defect-free, then it is sold as Phoenix8. Chips with four to seven defect-free cores are sold as Phoenix4, and those with two or three defect-free cores are sold as Phoenix2. For simplification, calculate the yield for a single core as the yield for a chip that is 1/8 the area of the original Phoenix chip. Then view that yield as an independent probability of a single core being defect free. Calculate the yield for each configuration as the probability of at least the corresponding number of cores being defect free.
a.[20] <1.6> What is the yield for a single core being defect free as well as the yield for Phoenix4, Phoenix2 and Phoenix1?
b.[5] <1.6> Using your results from part a, determine which chips you think it would be worthwhile to package and sell, and why.
c.[10] <1.6> If it previously cost $20 per chip to produce Phoenix8, what will be the cost of the new Phoenix chips, assuming that there are no additional costs associated with rescuing them from the trash?
d.[20] <1.6> You currently make a profit of $30 for each defect-free Phoenix8, and you will sell each Phoenix4 chip for $25. How much is your profit per Phoenix8 chip if you consider (i) the purchase price of Phoenix4 chips to be entirely profit and (ii) apply the profit of Phoenix4 chips to each Phoenix8 chip in proportion to how many are produced? Use the yields calculated in Problem 1.3(a), not those from Problem 1.1(a).
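The binning in Exercise 1.3 is a binomial calculation: if p is the probability that a single core is defect free, the chance of exactly k good cores out of 8 is C(8, k)·p^k·(1 − p)^(8 − k), and each product bin sums the appropriate terms. A short Python sketch with a hypothetical per-core yield p (the real value comes from part a):

from math import comb

def bin_probabilities(p_core, n_cores=8):
    """Probability of exactly k defect-free cores out of n_cores, k = 0..n_cores."""
    return [comb(n_cores, k) * p_core ** k * (1 - p_core) ** (n_cores - k)
            for k in range(n_cores + 1)]

p_core = 0.85                      # hypothetical single-core yield, not the exercise's value
probs  = bin_probabilities(p_core)

phoenix8 = probs[8]                # all 8 cores good
phoenix4 = sum(probs[4:8])         # 4-7 good cores
phoenix2 = sum(probs[2:4])         # 2-3 good cores
print(f"Phoenix8: {phoenix8:.3f}  Phoenix4: {phoenix4:.3f}  Phoenix2: {phoenix2:.3f}")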
Case Study 2: Power Consumption in Computer Systems
Concepts illustrated by this case study
- Amdahl’s Law
- Redundancy
- MTTF
- Power Consumption
Power consumption in modern systems is dependent on a variety of factors, including the chip clock frequency, efficiency, and voltage. The following exercises explore the impact on power and energy that different design decisions and use scenarios have.
1.4[10/10/10/10] <1.5> A cell phone performs very different tasks, including streaming music, streaming video, and reading email. These tasks require very different amounts of computation. Battery life and overheating are two common problems for cell phones, so reducing power and energy consumption are critical. In this problem, we consider what to do when the user is not using the phone to its full computing capacity. For these problems, we will evaluate an unrealistic scenario in which the cell phone has no specialized processing units. Instead, it has a quad-core, general-purpose processing unit. Each core uses 0.5 W at full use. For email-related tasks, the quad-core is 8× as fast as necessary.
a.[10] <1.5> How much dynamic energy and power are required compared to running at full power? First, suppose that the quad-core operates for 1/8 of the time and is idle for the rest of the time. That is, the clock is disabled for 7/8 of the time, with no leakage occurring during that time. Compare total dynamic energy as well as dynamic power while the core is running.
b.[10] <1.5> How much dynamic energy and power are required using frequency and voltage scaling? Assume frequency and voltage are both reduced to 1/8 the entire time.
c.[10] <1.6, 1.9> Now assume the voltage may not decrease below 50% of the original voltage. This voltage is referred to as the voltage floor, and any voltage lower than that will lose the state. Therefore, while the frequency can keep decreasing, the voltage cannot. What are the dynamic energy and power savings in this case?
d.[10] <1.5> How much energy is used with a dark silicon approach? This involves creating specialized ASIC hardware for each major task and power gating those elements when not in use. Only one general-purpose core would be provided, and the rest of the chip would be filled with specialized units. For email, the one core would operate for 25% of the time and be turned completely off with power gating for the other 75% of the time. During the other 75% of the time, a specialized ASIC unit that requires 20% of the energy of a core would be running.
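All of the scenarios in Exercise 1.4 rest on two relations from Section 1.5: dynamic power scales as C·V²·f, while the dynamic energy to complete a fixed amount of work scales as C·V² (frequency changes how long the work takes, not the energy per switching event). The Python sketch below illustrates those relations in relative units; it is a reference for the method, not a worked answer to the exercise:

# The relations behind Exercise 1.4 (Section 1.5): dynamic power ~ C * V^2 * f,
# and dynamic energy for a fixed amount of work ~ C * V^2.
# All quantities are relative to nominal voltage and frequency.

def dynamic_power(v, f, c=1.0):
    return c * v ** 2 * f

def dynamic_energy_fixed_work(v, c=1.0):
    return c * v ** 2

# Scale voltage and frequency both to 1/8 of nominal (as in part b):
print("power  scales by", dynamic_power(1 / 8, 1 / 8))        # (1/8)^3, about 0.002
print("energy scales by", dynamic_energy_fixed_work(1 / 8))   # (1/8)^2, about 0.016

# Voltage floor at 50% of nominal (as in part c): frequency can keep dropping,
# but energy per unit of work bottoms out at V^2 = 0.25.
print("energy floor at ", dynamic_energy_fixed_work(1 / 2))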
1.5[10/10/10] <1.5> As mentioned in Exercise 1.4, cell phones run a wide variety of applications. We’ll make the same assumptions for this exercise as the previous one, that it is 0.5 W per core and that a quad core runs email 3× as fast.
a.[10] <1.5> Imagine that 80% of the code is parallelizable. By how much would the frequency and voltage on a single core need to be increased in order to exe- cute at the same speed as the four-way parallelized code?
b.[10] <1.5> What is the reduction in dynamic energy from using frequency and voltage scaling in part a?
c.[10] <1.5> How much energy is used with a dark silicon approach? In this approach, all hardware units are power gated, allowing them to turn off entirely (causing no leakage). Specialized ASICs are provided that perform the same computation for 20% of the power as the general-purpose processor. Imagine that each core is power gated. The video game requires two ASICs and two cores. How much dynamic energy does it require compared to the baseline of running parallelized on four cores?
1.6[10/10/10/10/10/20] <1.5, 1.9> General-purpose processors are optimized for general-purpose computing. That is, they are optimized for behavior that is generally found across a large number of applications. However, once the domain is restricted somewhat, the behavior that is found across a large number of the target applications may be different from general-purpose applications. One such application is deep learning or neural networks. Deep learning can be applied to many different applications, but the fundamental building block of inference—using the learned information to make decisions—is the same across them all. Inference operations are largely parallel, so they are currently performed on graphics processing units, which are specialized more toward this type of computation, and not to inference in particular. In a quest for more performance per watt, Google has created a custom chip using tensor processing units to accelerate inference operations in deep learning. This approach can be used for speech recognition and image recognition, for example. This problem explores the trade-offs between this processor, a general-purpose processor (Haswell E5-2699 v3), and a GPU (NVIDIA K80), in terms of performance and cooling. If heat is not removed from the computer efficiently, the fans will blow hot air back onto the computer, not cold air. Note: The differences are more than processor—on-chip memory and DRAM also come into play. Therefore statistics are at a system level, not a chip level.
a.[10] <1.9> If Google’s data center spends 70% of its time on workload A and 30% of its time on workload B when running GPUs, what is the speedup of the TPU system over the GPU system?
b.[10] <1.9> If Google’s data center spends 70% of its time on workload A and 30% of its time on workload B when running GPUs, what percentage of Max IPS does it achieve for each of the three systems?
c.[15] <1.5, 1.9> Building on (b), assuming that the power scales linearly from idle to busy power as IPS grows from 0% to 100%, what is the performance per watt of the TPU system over the GPU system?
d.[10] <1.9> If another data center spends 40% of its time on workload A, 10% of its time on workload B, and 50% of its time on workload C, what are the speedups of the GPU and TPU systems over the general-purpose system?
e.[10] <1.5> A cooling door for a rack costs $4000 and dissipates 14 kW (into the room; additional cost is required to get it out of the room). How many Haswell-, NVIDIA-, or Tensor-based servers can you cool with one cooling door, assuming TDP in Figures 1.27 and 1.28?
f.[20] <1.5> Typical server farms can dissipate a maximum of 200 W per square foot. Given that a server rack requires 11 square feet (including front and back clearance), how many servers from part (e) can be placed on a single rack, and how many cooling doors are required?
Exercises
1.7[10/15/15/10/10] <1.4, 1.5> One challenge for architects is that the design created today will require several years of implementation, verification, and testing before appearing on the market. This means that the architect must project what the technology will be like several years in advance. Sometimes, this is difficult to do.
a.[10] <1.4> According to the trend in device scaling historically observed by Moore’s Law, the number of transistors on a chip in 2025 should be how many times the number in 2015?
b.[15] <1.5> The increase in performance once mirrored this trend. Had performance continued to climb at the same rate as in the 1990s, approximately what performance would chips have over the VAX-11/780 in 2025?
c.[15] <1.5> At the current rate of increase of the mid-2000s, what is a more updated projection of performance in 2025?
d.[10] <1.4> What has limited the rate of growth of the clock rate, and what are architects doing with the extra transistors now to increase performance?
e.[10] <1.4> The rate of growth for DRAM capacity has also slowed down. For 20 years, DRAM capacity improved by 60% each year. If 8 Gbit DRAM was first available in 2015, and 16 Gbit is not available until 2019, what is the current DRAM growth rate?
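Parts (a) and (e) of Exercise 1.7 are compound-growth calculations. A short Python sketch using only the dates and capacities given above; the doubling-every-two-years reading of Moore’s Law in part (a) is an assumption here:

# Compound-growth arithmetic for parts (a) and (e) of Exercise 1.7.

# (a) Transistor count, assuming the doubling-every-two-years reading of Moore's Law:
years = 2025 - 2015
transistor_factor = 2 ** (years / 2)
print(f"transistor growth, 2015 to 2025: {transistor_factor:.0f}x")      # 32x

# (e) DRAM capacity: 8 Gbit in 2015, 16 Gbit in 2019 -> one doubling in 4 years.
dram_rate = 2 ** (1 / 4) - 1
print(f"implied DRAM growth rate: {dram_rate:.1%} per year")             # ~18.9%
print("compared with the historical rate of 60% per year")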
1.8[10/10] <1.5> You are designing a system for a real-time application in which specific deadlines must be met. Finishing the computation faster gains nothing. You find that your system can execute the necessary code, in the worst case, twice as fast as necessary.
a.[10] <1.5> How much energy do you save if you execute at the current speed and turn off the system when the computation is complete?
b.[10] <1.5> How much energy do you save if you set the voltage and frequency to be half as much?
1.9[10/10/20/20] <1.5> Server farms such as Google and Yahoo! provide enough compute capacity for the highest request rate of the day. Imagine that most of the time these servers operate at only 60% capacity. Assume further that the power does not scale linearly with the load; that is, when the servers are operating at 60% capacity, they consume 90% of maximum power. The servers could be turned off, but they would take too long to restart in response to more load. A new system has been proposed that allows for a quick restart but requires 20% of the maximum power while in this “barely alive” state.
a.[10] <1.5> How much power savings would be achieved by turning off 60% of the servers?
b.[10] <1.5> How much power savings would be achieved by placing 60% of the servers in the “barely alive” state?
c.[20] <1.5> How much power savings would be achieved by reducing the voltage by 20% and frequency by 40%?
d.[20] <1.5> How much power savings would be achieved by placing 30% of the servers in the “barely alive” state and 30% off?
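The bookkeeping for Exercise 1.9 is easiest when every quantity is expressed as a fraction of the farm’s maximum power. The Python sketch below walks through parts (a), (b), and (d) under one simplifying assumption of ours: the servers left running keep drawing the same 90% of maximum power as in the baseline.

# Fractions of maximum power for Exercise 1.9. Baseline: every server busy at
# 60% capacity, drawing 90% of its maximum power. Assumption: servers that
# remain on keep drawing that same 90% in every scenario.
busy_power   = 0.90     # fraction of max power when serving load
barely_alive = 0.20     # fraction of max power in the "barely alive" state
off_power    = 0.00

baseline = 1.0 * busy_power

def farm_power(frac_busy, frac_barely_alive, frac_off):
    return (frac_busy * busy_power
            + frac_barely_alive * barely_alive
            + frac_off * off_power)

for label, cfg in [("(a) 60% off",                  (0.4, 0.0, 0.6)),
                   ("(b) 60% barely alive",         (0.4, 0.6, 0.0)),
                   ("(d) 30% barely alive, 30% off", (0.4, 0.3, 0.3))]:
    p = farm_power(*cfg)
    print(f"{label}: {p:.2f} of max power, saving {1 - p / baseline:.0%}")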
1.10[10/10/20] <1.7> Availability is the most important consideration for designing servers, followed closely by scalability and throughput.
a.[10] <1.7> We have a single processor with a failure in time (FIT) of 100. What is the mean time to failure (MTTF) for this system?
b.[10] <1.7> If it takes one day to get the system running again, what is the avail- ability of the system?
c.[20] <1.7> Imagine that the government, to cut costs, is going to build a supercomputer out of inexpensive computers rather than expensive, reliable computers. What is the MTTF for a system with 1000 processors? Assume that if one fails, they all fail.
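The conversions Exercise 1.10 relies on are: FIT counts failures per 10⁹ hours of operation, so MTTF = 10⁹/FIT; availability is MTTF/(MTTF + MTTR); and when any single failure brings the system down, failure rates add. A brief Python sketch:

# FIT / MTTF / availability conversions used in Exercise 1.10.
BILLION_HOURS = 1e9

def mttf_from_fit(fit):
    """FIT counts failures per 10^9 device-hours."""
    return BILLION_HOURS / fit

def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

fit_single  = 100
mttf_single = mttf_from_fit(fit_single)                  # 10^7 hours
print(f"single-processor MTTF: {mttf_single:.0f} hours")
print(f"availability with a 1-day repair: {availability(mttf_single, 24):.7f}")

# 1000 processors, any single failure takes the system down -> failure rates add.
mttf_system = mttf_from_fit(fit_single * 1000)           # 10^4 hours
print(f"1000-processor system MTTF: {mttf_system:.0f} hours")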
1.11[20/20/20] <1.1, 1.2, 1.7> In a server farm such as that used by Amazon or eBay, a single failure does not cause the entire system to crash. Instead, it will reduce the number of requests that can be satisfied at any one time.
a.[20] <1.7> If a company has 10,000 computers, each with an MTTF of 35 days, and it experiences catastrophic failure only if 1/3 of the computers fail, what is the MTTF for the system?
b.[20] <1.1, 1.7> If it costs an extra $1000 per computer to double the MTTF, would this be a good business decision? Show your work.
c.[20] <1.2> Figure 1.3 shows, on average, the cost of downtimes, assuming that the cost is equal at all times of the year. For retailers, however, the Christmas season is the most profitable (and therefore the most costly time to lose sales). If a catalog sales center has twice as much traffic in the fourth quarter as every other quarter, what is the average cost of downtime per hour during the fourth quarter and the rest of the year?
1.12[20/10/10/10/15] <1.9> In this exercise, assume that we are considering enhancing a quad-core machine by adding encryption hardware to it. When computing encryption operations, it is 20 times faster than the normal mode of execution. We will define percentage of encryption as the percentage of time in the original execution that is spent performing encryption operations. The specialized hardware increases power consumption by 2%.
a.[20] <1.9> Draw a graph that plots the speedup as a percentage of the computation spent performing encryption. Label the y-axis “Net speedup” and label the x-axis “Percent encryption.”
b.[10] <1.9> With what percentage of encryption will adding encryption hardware result in a speedup of 2?
c.[10] <1.9> What percentage of time in the new execution will be spent on encryption operations if a speedup of 2 is achieved?
d.[15] <1.9> Suppose you have measured the percentage of encryption to be 50%. The hardware design group estimates it can speed up the encryption hardware even more with significant additional investment. You wonder whether adding a second unit in order to support parallel encryption operations would be more useful. Imagine that in the original program, 90% of the encryption operations could be performed in parallel. What is the speedup of providing two or four encryption units, assuming that the parallelization allowed is limited to the number of encryption units?
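For part (a) of Exercise 1.12, Amdahl’s Law gives speedup = 1 / ((1 − f) + f/20), where f is the fraction of the original execution time spent encrypting. A few lines of Python tabulate the curve the exercise asks you to plot:

# Amdahl's Law for Exercise 1.12: encryption hardware is 20x faster on the
# fraction f of original execution time spent encrypting.
def speedup(f, enhancement=20):
    return 1.0 / ((1.0 - f) + f / enhancement)

for pct in range(0, 101, 10):
    f = pct / 100
    print(f"{pct:3d}% encryption -> net speedup {speedup(f):.2f}")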
1.13[15/10] <1.9> Assume that we make an enhancement to a computer that improves some mode of execution by a factor of 10. Enhanced mode is used 50% of the time, measured as a percentage of the execution time when the enhanced mode is in use. Recall that Amdahl’s Law depends on the fraction of the original, unenhanced execution time that could make use of enhanced mode. Thus we cannot directly use this 50% measurement to compute speedup with Amdahl’s Law.
a.[15] <1.9> What is the speedup we have obtained from fast mode?
b.[10] <1.9> What percentage of the original execution time has been converted to fast mode?
1.14[20/20/15] <1.9> When making changes to optimize part of a processor, it is often the case that speeding up one type of instruction comes at the cost of slowing down something else. For example, if we put in a complicated fast floating-point unit, that takes space, and something might have to be moved farther away from the middle to accommodate it, adding an extra cycle in delay to reach that unit. The basic Amdahl’s Law equation does not take into account this trade-off.
a.[20] <1.9> If the new fast floating-point unit speeds up floating-point operations by, on average, 2x, and floating-point operations take 20% of the original program’s execution time, what is the overall speedup (ignoring the penalty to any other instructions)?
b.[20] <1.9> Now assume that speeding up the floating-point unit slowed down data cache accesses, resulting in a 1.5x slowdown (or 2/3 speedup). Data cache accesses consume 10% of the execution time. What is the overall speedup now?
c.[15] <1.9> After implementing the new floating-point operations, what percentage of execution time is spent on floating-point operations? What percentage is spent on data cache accesses?
1.15[10/10/20/20] <1.10> Your company has just bought a new 22-core processor, and you have been tasked with optimizing your software for this processor. You will run four applications on this system, but the resource requirements are not equal. Assume the system and application characteristics listed in Table 1.1. The percentage of resources listed there is each application’s share of the system, assuming the applications are all run serially. Assume that when you parallelize a portion of the program by X, the speedup for that portion is X.
a.[10] <1.10> How much speedup would result from running application A on the entire 22-core processor, as compared to running it serially?
b.[10] <1.10> How much speedup would result from running application D on the entire 22-core processor, as compared to running it serially?
c.[20] <1.10> Given that application A requires 41% of the resources, if we statically assign it 41% of the cores, what is the overall speedup if A is run parallelized but everything else is run serially?
d.[20] <1.10> What is the overall speedup if all four applications are statically assigned some of the cores, relative to their percentage of resource needs, and all run parallelized?
e.[10] <1.10> Given acceleration through parallelization, what new percentage of the resources are the applications receiving, considering only active time on their statically-assigned cores?
1.16[10/20/20/20/25] <1.10> When parallelizing an application, the ideal speedup is speeding up by the number of processors. This is limited by two things: percentage of the application that can be parallelized and the cost of communication. Amdahl’s Law takes into account the former but not the latter.
a.[10] <1.10> What is the speedup with N processors if 80% of the application is parallelizable, ignoring the cost of communication?
b.[20] <1.10> What is the speedup with eight processors if, for every processor added, the communication overhead is 0.5% of the original execution time?
c.[20] <1.10> What is the speedup with eight processors if, for every time the number of processors is doubled, the communication overhead is increased by 0.5% of the original execution time?
d.[20] <1.10> What is the speedup with N processors if, for every time the number of processors is doubled, the communication overhead is increased by 0.5% of the original execution time?
e.[25] <1.10> Write the general equation that solves this question: What is the number of processors with the highest speedup in an application in which P% of the original execution time is parallelizable, and, for every time the number of processors is doubled, the communication is increased by 0.5% of the original execution time?
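The communication term in Exercise 1.16 can be folded into Amdahl’s Law directly: with a parallel fraction P and an extra 0.5% of the original execution time added each time the processor count doubles, the time with N processors is (1 − P) + P/N + 0.005·log2(N). The Python sketch below evaluates this model and scans for the processor count with the highest speedup; the 80% parallel fraction is the one used in part (a), so the result is an illustration of the model rather than the general equation part (e) asks for:

from math import log2

def speedup(n, parallel_fraction=0.80, overhead_per_doubling=0.005):
    """Amdahl's Law plus a communication term that grows with each doubling of n."""
    time = ((1 - parallel_fraction)
            + parallel_fraction / n
            + overhead_per_doubling * log2(n))
    return 1.0 / time

# Scan powers of two for the best processor count under this model.
candidates = [2 ** k for k in range(0, 11)]            # 1, 2, 4, ..., 1024
best = max(candidates, key=speedup)
for n in candidates:
    print(f"N = {n:5d}: speedup {speedup(n):.2f}")
print(f"best N under this model: {best}")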
2.1 Introduction 78
2.2 Memory Technology and Optimizations 84
2.3 Ten Advanced Optimizations of Cache Performance 94
2.4 Virtual Memory and Virtual Machines 118
2.5 Cross-Cutting Issues: The Design of Memory Hierarchies 126
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700 129
2.7 Fallacies and Pitfalls 142
2.8 Concluding Remarks: Looking Ahead 146
2.9 Historical Perspectives and References 148
Case Studies and Exercises by Norman P. Jouppi, Rajeev Balasubramonian, Naveen Muralimanohar, and Sheng Li 148