GFS: Evolution on Fast-forward

GFS: Evolution on Fast-forward

by Marshall Kirk McKusick, Sean Quinlan | August 7, 2009

A discussion between Kirk McKusick and Sean Quinlan aboutthe origin and evolution of the Google File System.

During the early stages of development at Google, the initial thinking didnot include plans for building a new file system. While work was still beingdone on one of the earliest versions of the company's crawl and indexingsystem, however, it
became quite clear to the core engineers that they reallyhad no other choice, and GFS (Google File System) was born.

First, given that Google's goal was to build a vast storage network out ofinexpensive commodity hardware, it had to be assumed that component failureswould be the norm—meaning that constant monitoring, error detection, faulttolerance, and automatic
recovery would have to be an integral part of the filesystem. Also, even by Google's earliest estimates, the system's throughputrequirements were going to be daunting by anybody's standards—featuringmulti-gigabyte files and data sets containing terabytes of
information andmillions of objects. Clearly, this meant traditional assumptions about I/Ooperations and block sizes would have to be revisited. There was also thematter of scalability. This was a file system that would surely need to scalelike no other. Of
course, back in those earliest days, no one could havepossibly imagined just how much scalability would be required. They would learnabout that soon enough.

Still, nearly a decade later, most of Google's mind-boggling store of dataand its ever-growing array of applications continue to rely upon GFS. Manyadjustments have been made to the file system along the way, and—together witha fair number of
accommodations implemented within the applications that useGFS—they have made the journey possible.

To explore the reasoning behind a few of the more crucial initial designdecisions as well as some of the incremental adaptations that have been madesince then, ACM asked Sean Quinlan to pull back the covers on the changingfile-system requirements
and the evolving thinking at Google. Since Quinlanserved as the GFS tech leader for a couple of years and continues now as aprincipal engineer at Google, he's in a good position to offer thatperspective. As a grounding point beyond the Googleplex, ACM asked
KirkMcKusick to lead the discussion. He is best known for his work on BSD (BerkeleySoftware Distribution) Unix, including the original design of the Berkeley FFS(Fast File System).

The discussion starts, appropriately enough, at the beginning—with theunorthodox decision to base the initial GFS implementation on a single-masterdesign. At first blush, the risk of a single centralized master becoming abandwidth bottleneck—or,
worse, a single point of failure—seems fairly obvious,but it turns out Google's engineers had their reasons for making this choice.

MCKUSICK One of the moreinteresting—and significant—aspects of the original GFS architecture was thedecision to base it on a single master. Can you walk us through what led tothat decision?

QUINLAN The decision to go with asingle master was actually one of the very first decisions,
mostly just tosimplify the overall design problem. That is, building a distributed masterright from the outset was deemed too difficult and would take too much time
. Also, by going with the single-master
approach, theengineers were able to simplify a lot of problems. Having a central place to control replicationand garbage collection and many other activities was
definitely simpler thanhandling it all on a distributed basis. So the decision was made to centralizethat in one machine.

MCKUSICK Was this mostly about being able to rollout something within a reasonably short time frame?

QUINLAN Yes. In fact, some of theengineers who were involved in that early effort later went on to buildBigTable, a distributed storage system, but that effort took many years. Thedecision to build the original GFS around the
single master really helped getsomething out into the hands of users much more rapidly than would haveotherwise been possible.

Also, in sketching out the use cases they anticipated, it didn't seem thesingle-master design would cause much of a problem. The scale they werethinking about back then was framed in terms of hundreds of terabytes and a fewmillion files. In
fact, the system worked just fine to start with.

MCKUSICK But then what?

QUINLAN Problems started to occur once the size of the underlying storage increased. Goingfrom a few hundred terabytes up to petabytes, and then up to tens of petabytes�
that really required a proportionate increase in the amount of metadatathe master had to maintain. Also,
operations such as scanning the metadatato look for recoveries all scaled linearly with the volume of data. So theamount of work required of the master grew substantially. The amount of storageneeded
to retain all that information grew as well.

In addition, this proved to be a bottleneck for the clients, even thoughthe clients issue few metadata operations themselves—for example, a clienttalks to the master whenever it does an open. When you have thousands ofclients all talking to
the master at the same time, given that the master iscapable of doing only a few thousand operations a second, the average clientisn't able to command all that many operations per second. Also bear in mindthat there are applications such as MapReduce, where
you might suddenly have athousand tasks, each wanting to open a number of files. Obviously, it would takea long time to handle all those requests, and the master would be under a fairamount of duress.

MCKUSICK Now, under the current schema for GFS, youhave one master per cell, right?

QUINLAN That's correct.

MCKUSICK And historically you've hadone cell per data center, right?

QUINLAN That was initially the goal,but it didn't work out like that to a large extent—partly because of thelimitations of the single-master design and partly because isolation proved tobe difficult. As a consequence, people
generally ended up with more than onecell per data center. We also ended up doing what we call a"multi-cell" approach, which basically made it possible to putmultiple GFS masters on top of a pool of chunkservers. That way, thechunkservers could be configured
to have, say, eight GFS masters assigned tothem, and that would give you at least one pool of underlying storage—withmultiple master heads on it, if you will. Then the application was responsiblefor partitioning data across those different cells.

MCKUSICK Presumably each applicationwould then essentially have its own master that would be responsible formanaging its own little file system. Was that basically the idea?

QUINLAN Well, yes and no.Applications would tend to use either one master or a small set of the masters.We also have something we called Name Spaces, which are just a very static wayof partitioning a namespace that people can
use to hide all of this from theactual application. The Logs Processing System offers an example of this approach:once logs exhaust their ability to use just one cell, they move to multiple GFScells; a namespace file describes how the log data is partitioned
across thosedifferent cells and basically serves to hide the exact partitioning from theapplication. But this is all fairly static.

MCKUSICK What's the performance like,in light of all that?

QUINLAN We ended up putting a fairamount of effort into tuning master performance, and it's atypical of Google toput a lot of work into tuning any one particular binary. Generally, ourapproach is just to get things working reasonably
well and then turn our focusto scalability—which usually works well in that you can generally get yourperformance back by scaling things. Because in this instance we had a singlebottleneck that was starting to have an impact on operations, however, we feltthat
investing a bit of additional effort into making the master lighter weightwould be really worthwhile. In the course of scaling from thousands ofoperations to tens of thousands and beyond, the single master had becomesomewhat less of a bottleneck. That was
a case where paying more attention tothe efficiency of that one binary definitely helped keep GFS going for quite abit longer than would have otherwise been possible.

It could be argued that managing to get GFS ready for production in recordtime constituted a victory in its own right and that, by speeding Google tomarket, this ultimately contributed mightily to the company's success. A teamof three was responsible
for all of that—for the core of GFS—and for the systembeing readied for deployment in less than a year.

But then came the price that so often befalls any successful system—thatis, once the scale and use cases have had time to expand far beyond what anyonecould have possibly imagined. In Google's case, those pressures proved to beparticularly intense.

Although organizations don't make a habit of exchanging file-systemstatistics, it's safe to assume that GFS is the largest file system inoperation (in fact, that was probably true even before Google's acquisition ofYouTube). Hence, even though
the original architects of GFS felt they hadprovided adequately for at least a couple of orders of magnitude of growth,Google quickly zoomed right past that.

In addition, the number of applications GFS was called upon to supportsoon ballooned. In an interview with one of the original GFS architects, HowardGobioff (conducted just prior to his surprising death in early 2008), herecalled, "The original
consumer of all our earliest GFS versions wasbasically this tremendously large crawling and indexing system. The second wavecame when our quality team and research groups started using GFS ratheraggressively—and basically, they were all looking to use GFS
to store large datasets. And then, before long, we had 50 users, all of whom required a littlesupport from time to time so they'd all keep playing nicely with eachother."

One thing that helped tremendously was that Google built not only the filesystem but also all of the applications running on top of it. While adjustmentswere continually made in GFS to make it more accommodating to all the new usecases, the
applications themselves were also developed with the variousstrengths and weaknesses of GFS in mind. "Because we built everything, wewere free to cheat whenever we wanted to," Gobioff neatly summarized."We could push problems back and forth between the application
space andthe file-system space, and then work out accommodations between the two."

The matter of sheer scale, however, called for some more substantialadjustments. One coping strategy had to do with the use of multiple"cells" across the network, functioning essentially as related butdistinct file systems. Besides helping to
deal with the immediate problem ofscale, this proved to be a more efficient arrangement for the operations ofwidely dispersed data centers.

Rapid growth also put pressure on another key parameter of the originalGFS design: the choice to establish 64 MB as the standard chunk size. That, ofcourse, was much larger than the typical file-system block size, but onlybecause the files generated
by Google's crawling and indexing system wereunusually large. As the application mix changed over time, however, ways had tobe found to let the system deal efficiently with large numbers of filesrequiring far less than 64 MB (think in terms of Gmail, for example).
Theproblem was not so much with the number of files itself, but rather with thememory demands all of those files made on the centralized master, thus exposingone of the bottleneck risks inherent in the original GFS design.

MCKUSICK I gather from the originalGFS paper [Ghemawat, S., Gobioff, H., Leung, S-T. 2003. The Google File System. SOSP (ACM Symposium onOperating Systems Principles)] that file counts have been a significant issuefor you right
along. Can you go into that a little bit?

QUINLAN The file-count issue came up fairlyearly because of the way people ended up designing their systems around GFS.Let me cite a specific example. Early in my time at Google, I was
involved inthe design of the Logs Processing system. We initially had a model where afront-end server would write a log, which we would then basically copy into GFSfor processing and archival. That was fine to start with, but then the numberof front-end servers
increased, each rolling logs every day. At the same time,the number of log types was going up, and then you'd have front-end serversthat would go through crash loops and generate lots more logs. So we ended upwith a lot more files than we had anticipated based
on our initialback-of-the-envelope estimates.

This became an area we really had to keep an eye on. Finally, we just hadto concede there was no way we were going to survive a continuation of the sortof file-count growth we had been experiencing.

MCKUSICK Let me make sure I'mfollowing this correctly:
your issue with file-count growth is a result of yourneeding to have a piece of metadata on the master for each file, and thatmetadata has to fit in the master's memory.

QUINLAN That's correct.

MCKUSICK And there are only a finitenumber of files you can accommodate before the master runs out of memory?

QUINLAN Exactly. And there aretwo bits of metadata. One identifies the file, and the other points out thechunks that back that file. If you had a chunk that containedonly
1 MB, it would take up only 1 MB of disk space, but it still would requirethose two bits of metadata on the master. If your average file size ends updipping below 64 MB, the ratio of the number of objects on your master to whatyou have in storage starts to
go down. That's where you run into problems.

Going back to that logs example, it quickly became apparent that thenatural mapping we had thought of—and which seemed to make perfect sense backwhen we were doing our back-of-the-envelope estimates—turned out not to beacceptable at all. We
needed to find a way to work around this by figuring outhow we could combine some number of underlying objects into larger files. Inthe case of the logs, that wasn't exactly rocket science, but it did require alot of effort.

MCKUSICK That sounds like the old dayswhen IBM had only a minimum disk allocation, so it provided you with a utilitythat let you pack a bunch of files together and then create a table of contentsfor that.

QUINLAN Exactly. For us, eachapplication essentially ended up doing that to varying degrees. That proved tobe less burdensome for some applications than for others. In the case of ourlogs, we hadn't really been planning to delete
individual log files. It wasmore likely that we would end up rewriting the logs to anonymize them or dosomething else along those lines. That way, you don't get thegarbage-collection problems that can come up if you delete only some of thefiles within a bundle.

For some other applications, however, the file-count problem was moreacute. Many times, the most natural design for some application just wouldn'tfit into GFS—even though at first glance you would think the file count wouldbe perfectly acceptable,
it would turn out to be a problem. When we startedusing more shared cells, we put quotas on both file counts and storage space.The limit that people have ended up running into most has been, by far, thefile-count quota. In comparison, the underlying storage
quota rarely proves tobe a problem.

MCKUSICK What longer-term strategyhave you come up with for dealing with the file-count issue? Certainly, itdoesn't seem that a distributed master is really going to help with that—not ifthe master still has to keep all the
metadata in memory, that is.

QUINLAN The distributed master certainly allowsyou to grow file counts, in line with the number of machines you're willing tothrow at it. That certainly helps.

One of the appeals of the distributed multimaster model is that if youscale everything up by two orders of magnitude, then getting down to a 1-MBaverage file size is going to be a lot different from having a 64-MB averagefile size. If you end
up going below 1 MB, then you're also going to run intoother issues that you really need to be careful about. For example, if you endup having to read 10,000 10-KB files, you're going to be doing a lot moreseeking than if you're just reading 100 1-MB files.

My gut feeling is that if you design for an average 1-MB file size, thenthat should provide for a much larger class of things than does a design thatassumes a 64-MB average file size. Ideally, you would like to imagine a systemthat goes all
the way down to much smaller file sizes, but 1 MB seems areasonable compromise in our environment.

MCKUSICK What have you been doing todesign GFS to work with 1-MB files?

QUINLAN We haven't been doinganything with the existing GFS design. Our distributed master system that willprovide for 1-MB files is essentially a whole new design. That way, we can aimfor something on the order of 100 million
files per master. You can also havehundreds of masters.

MCKUSICK So, essentially no singlemaster would have all this data on it?

QUINLAN That's the idea.

With the recent emergence within Google of BigTable, a distributed storagesystem for managing structured data, one potential remedy for the file-countproblem—albeit perhaps not the very best one—is now available.

The significance of BigTable goes far beyond file counts, however.Specifically, it was designed to scale into the petabyte range across hundredsor thousands of machines, as well as to make it easy to add more machines tothe system and automatically
start taking advantage of those resources withoutreconfiguration. For a company predicated on the notion of employing thecollective power, potential redundancy, and economies of scale inherent in amassive deployment of commodity hardware, these rate as significant
advantagesindeed.

Accordingly, BigTable is now used in conjunction with a growing number ofGoogle applications. Although it represents a departure of sorts from the past,it also must be said that BigTable was built on GFS, runs on GFS, and wasconsciously designed
to remain consistent with most GFS principles. Considerit, therefore, as one of the major adaptations made along the way to help keepGFS viable in the face of rapid and widespread change.

MCKUSICK You now have this thingcalled BigTable. Do you view that as an application in its own right?

QUINLAN From the GFS point of view,it's an application, but it's clearly more of an infrastructure piece.

MCKUSICK If I understand thiscorrectly, BigTable is essentially a lightweight relational database.

QUINLAN It's not really a relational database. Imean, we're not doing SQL and it doesn't really support joins and such. ButBigTable is a structured storage system that lets you have lots of key-valuepairs
and a schema
.

MCKUSICK Who are the real clients ofBigTable?

QUINLAN BigTable is increasinglybeing used within Google for crawling and indexing systems, and we use it a lotwithin many of our client-facing applications. The truth of the matter is thatthere are tons of BigTable clients.
Basically, any app with lots of small dataitems tends to use BigTable. That's especially true wherever there's fairlystructured data.

MCKUSICK I guess the question I'mreally trying to pose here is: Did BigTable just get stuck into a lot of theseapplications as an attempt to deal with the small-file problem, basically bytaking a whole bunch of small things
and then aggregating them together?

QUINLAN That has certainly been oneuse case for BigTable, but it was actually intended for a much more generalsort of problem. If you're using BigTable in that way—that is, as a way offighting the file-count problem where you
might have otherwise used a filesystem to handle that—then you would not end up employing all of BigTable'sfunctionality by any means. BigTable isn't really ideal for that purpose inthat it requires resources for its own operations that are nontrivial. Also,
ithas a garbage-collection policy that's not super-aggressive, so that might notbe the most efficient way to use your space. I'd say that the people who havebeen using BigTable purely to deal with the file-count problem probably haven'tbeen terribly happy,
but there's no question that it is one way for people tohandle that problem.

MCKUSICK What I've read about GFSseems to suggest that the idea was to have only two basic data structures: logsand SSTables (Sorted String Tables). Since I'm guessing the SSTables must beused to handle key-value pairs and that
sort of thing, how is that differentfrom BigTable?

QUINLAN The main difference is thatSSTables are immutable, while BigTable provides mutable key value storage, anda whole lot more. BigTable itself is actually built on top of logs andSSTables. Initially, it stores incoming data
into transaction log files. Thenit gets compacted—as we call it—into a series of SSTables, which in turnget compacted together over time. In some respects, it's reminiscent of alog-structure file system. Anyway, as you've observed, logs and SSTables
doseem to be the two data structures underlying the way we structure most of ourdata. We have log files for mutable stuff as it's being recorded. Then, onceyou have enough of that, you sort it and put it into this structure that has anindex.

Even though GFS does not provide a Posix interface, it still has a prettygeneric file-system interface, so people are essentially free to write any sortof data they like. It's just that, over time, the majority of our users haveended up using
these two data structures. We also have something called protocolbuffers, which is our data description language. The majority of data endsup being protocol buffers in these two structures.

Both provide for compression and checksums. Even though there are somepeople internally who end up reinventing these things, most people are contentjust to use those two basic building blocks.

Because GFSwas designed initially to enable a crawling and indexing system, throughput waseverything. In fact, the original paper written about the systemmakes this quite explicit: "High
sustained bandwidth is more importantthan low latency. Most of our target applications place a premium on processingdata in bulk at a high rate, while few have stringent response-timerequirements for an individual read and write."

But then Google either developed or embraced many user-facing Internetservices for which this is most definitely not the case.

One GFS shortcoming that this immediately exposed had to do with theoriginal single-master design. A single point of failure may not have been adisaster for batch-oriented applications, but it was certainly unacceptable forlatency-sensitive
applications, such as video serving. The later addition ofautomated failover capabilities helped, but even then service could be out forup to a minute.

The other major challenge for GFS, of course, has revolved around findingways to build latency-sensitive applications on top of a file system designedaround an entirely different set of priorities.

MCKUSICK It's well documented that theinitial emphasis in designing GFS was on batch efficiency as opposed to lowlatency. Now that has come back to cause you trouble, particularly in terms ofhandling things such as videos. How
are you handling that?

QUINLAN The GFS design model from the get-go wasall about achieving throughput, not about the latency at which that might beachieved. To give you a concrete example, if you're writing
afile, it will typically be written in triplicate—meaning you'll actually bewriting to three chunkservers. Should one of those chunkservers die or hiccupfor a long period of time, the GFS master will notice the problem and schedulewhat we call a
pullchunk, which means it will basically replicate one ofthose chunks. That will get you back up to three copies, and then the systemwill pass control back to the client, which will continue writing.

When we do a pullchunk we limit it to something on the order of 5-10 MB asecond. So, for 64 MB, you're talking about 10 seconds for this recovery totake place. There are lots of other things like this that might take 10 secondsto a minute, which
works just fine for batch-type operations. If you're doing alarge MapReduce operation, you're OK just so long as one of the items is not areal straggler, in which case you've got yourself a different sort of problem.Still, generally speaking, a hiccup on the
order of a minute over the course ofan hour-long batch job doesn't really show up. If you are working on Gmail,however, and you're trying to write a mutation that represents some useraction, then getting stuck for a minute is really going to mess you up.

We've had similar issues with our master failover. Initially, GFS had noprovision for automatic master failover. It was a manual process. Although itdidn't happen a lot, whenever it did, the cell might be down for an hour. Evenour initial master-failover
implementation required on the order of minutes.Over the past year, however, we've taken that down to something on the order oftens of seconds.

MCKUSICK Still, for user-facing applications, that's notacceptable.

QUINLAN Right. While theseinstances—where you have to provide for failover and error recovery—may havebeen acceptable in the batch situation, they're definitely not OK from alatency point of view for a user-facing application.
Another issue here is thatthere are places in the design where we've tried to optimize for throughput bydumping thousands of operations into a queue and then just processing throughthem. That leads to fine throughput, but it's not great for latency. You caneasily
get into situations where you might be stuck for seconds at a time in aqueue just waiting to get to the head of the queue.

Our user base has definitely migrated from being a MapReduce-based worldto more of an interactive world that relies on things such as BigTable. Gmailis an obvious example of that. Videos aren't quite as bad where GFS isconcerned because you
get to stream data, meaning you can buffer. Still, tryingto build an interactive database on top of a file system that was designed fromthe start to support more batch-oriented operations has certainly proved to bea pain point.

MCKUSICK How exactly have you managedto deal with that?

QUINLAN Within GFS, we've managed toimprove things to a certain degree, mostly by designing the applications todeal with the problems that come up. Take BigTable as a good concrete example.The BigTable transaction log is actually
the biggest bottleneck for getting atransaction logged. In effect, we decided, "Well, we're going to seehiccups in these writes, so what we'll do is to have two logs open at any onetime. Then we'll just basically merge the two. We'll write to one and if thatgets
stuck, we'll write to the other. We'll merge those logs once we do areplay—if we need to do a replay, that is." We tended to design our applicationsto function like that—which is to say they basically try to hide that latencysince they know the system underneath
isn't really all that great.

The guys who built Gmail went to a multihomed model, so if one instance ofyour Gmail account got stuck, you would basically just get moved to anotherdata center. Actually, that capability was needed anyway just to ensureavailability. Still,
part of the motivation was that they wanted to hide theGFS problems.

MCKUSICK I think it's fair to say that, by movingto a distributed-master file system, you're definitely going to be able toattack some of those latency issues.

QUINLAN That was certainly one of ourdesign goals. Also, BigTable itself is a very failure-aware system that triesto respond to failures far more rapidly than we were able to before. Using thatas our metadata storage helps with
some of those latency issues as well.

The engineers who worked on the earliest versions of GFS weren'tparticularly shy about departing from traditional choices in file-system designwhenever they felt the need to do so. It just so happens that the approachtaken to consistency is
one of the aspects of the system where this isparticularly evident.

Part of this, of course, was driven by necessity. Since Google's plansrested largely on massive deployments of commodity hardware, failures andhardware-related faults were a given. Beyond that, according to the originalGFS paper, there were
a few compatibility issues. "Many of our disksclaimed to the Linux driver that they supported a range of IDE protocolversions but in fact responded reliably only to the more recent ones. Since theprotocol versions are very similar, these drives mostly worked
but occasionallythe mismatches would cause the drive and the kernel to disagree about thedrive's state. This would corrupt data silently due to problems in the kernel.This problem motivated our use of checksums to detect data corruption."

That didn't mean just any checksumming, however, but instead rigorousend-to-end checksumming, with an eye to everything from disk corruption toTCP/IP corruption to machine backplane corruption.

Interestingly, for all that checksumming vigilance, the GFS engineeringteam also opted for an approach to consistency that's relatively loose byfile-system standards. Basically, GFS simply accepts that there will be timeswhen people will end
up reading slightly stale data. Since GFS is used mostly as an append-onlysystem as opposed to an overwriting system, this generally means those peoplemight end up missing something that was appended to the end of the
file afterthey'd already opened it
. To the GFS designers, this seemed anacceptable cost (although it turns out that there are applications for whichthis proves problematic).

Also, as Gobioff explained, "The risk of stale data in certaincircumstances is just inherent to a highly distributed architecture thatdoesn't ask the master to maintain all that much information. We definitelycould have made things a lot tighter
if we were willing to dump a lot more datainto the master and then have it maintain more state. But that just reallywasn't all that critical to us."

Perhaps an even more important issue here is that the engineers makingthis decision owned not just the file system but also the applications intendedto run on the file system. According to Gobioff, "The thing is that wecontrolled both the horizontal
and the vertical—the file system and theapplication. So we could be sure our applications would know what to expectfrom the file system. And we just decided to push some of the complexity out tothe applications to let them deal with it."

Still, there are some at Google who wonder whether that was the right callif only because people can sometimes obtain different data in the course ofreading a given file multiple times, which tends to be so strongly at odds withtheir whole notion
of how data storage is supposed to work.

MCKUSICK Let's talk about consistency.The issue seems to be that it presumably takes some amount of time to geteverything fully written to all the replicas. I think you said somethingearlier
to the effect that GFS essentially requires that this all be fullywritten before you can continue.

QUINLAN That's correct.

MCKUSICK If that's the case, then howcan you possibly end up with things that aren't consistent?

QUINLAN Client failures have a way offouling things up. Basically, the model in GFS is that the client justcontinues to push the write until it succeeds.
If the client ends up crashing in themiddle of an operation, things are left in a bit of an indeterminate state.

Early on, that was sort of considered to be OK, but over time, wetightened the window for how long that inconsistency could be tolerated, andthen we slowly continued to reduce that. Otherwise, whenever the data is inthat inconsistent state,
you may get different lengths for the file. That canlead to some confusion. We had to have some backdoor interfaces for checkingthe consistency of the file data in those instances. We also have somethingcalled RecordAppend, which is an interface designed for
multiple writers toappend to a log concurrently. There the consistency was designed to be very loose. Inretrospect, that turned out to be a lot more painful than anyone expected.

MCKUSICK What exactly was loose? Ifthe primary replica picks what the offset is for each write and then makes surethat actually occurs, I don't see where the inconsistencies are going to comeup.

QUINLAN What happens is that the primary willtry. It will pick an offset, it will do the writes, but then one of them won'tactually get written. Then the primary might change, at which point it can picka
different offset
. RecordAppend does not offer any replay protection either.You could end up getting the data multiple times in the file.

There were even situations where you could get the data in a differentorder. It might appear multiple times in one chunk replica, but not necessarilyin all of them. If you were reading the file, you could discover the data indifferent ways at
different times. At the record level, you could discover therecords in different orders depending on which chunks you happened to bereading.

MCKUSICK Was this done by design?

QUINLAN At the time, it must haveseemed like a good idea, but in retrospect
I think the consensus is that it proved to be morepainful than it was worth. It just doesn't meet the expectationspeople have of a file system, so they end up getting surprised. Then they hadto figure
out work-arounds.

MCKUSICK In retrospect, how would youhandle this differently?

QUINLAN I think it makes more senseto have a single writer per file.

MCKUSICK All right, but what happenswhen you have multiple people wanting to append to a log?

QUINLAN You serialize the writes through asingle process that can ensure the replicas are consistent.

MCKUSICK There's also this businesswhere you essentially snapshot a chunk. Presumably, that's something you usewhen you're essentially replacing a replica, or whenever some chunkserver goesdown and you need to replace some of
its files.

QUINLAN Actually, two things aregoing on there. One, as you suggest, is the recovery mechanism, whichdefinitely involves copying around replicas of the file. The way that works inGFS is that we basically revoke the lock so that
the client can't write itanymore, and this is part of that latency issue we were talking about.

There's also a separate issue, which is to support the snapshot feature ofGFS. GFS has the most general-purpose snapshot capability you can imagine. Youcould snapshot any directory somewhere, and then both copies would be entirelyequivalent.
They would share the unchanged data. You could change either oneand you could further snapshot either one. So it was really more of a clonethan what most people think of as a snapshot. It's an interesting thing, but itmakes for difficulties—especially as you
try to build more distributed systemsand you want potentially to snapshot larger chunks of the file tree.

I also think it's interesting that the snapshot feature hasn't been usedmore since it's actually a very powerful feature. That is, from a file-systempoint of view, it really offers a pretty nice piece of functionality.
But putting snapshots into file systems, as I'm sure you know, is a real pain.

MCKUSICK:  I know. I've done it. It'sexcruciating—especially in an overwriting file system.

QUINLAN Exactly. This is a case wherewe didn't cheat, but from an implementation perspective, it's hard to createtrue snapshots. Still, it seems that in this case, going the full deal was theright decision. Just the same, it's
an interesting contrast to some of theother decisions that were made early on in terms of the semantics.

All in all, the report card on GFS nearly 10 years later seems positive.There have been problems and shortcomings, to be sure, but there's surely noarguing with Google's success and GFS has without a doubt played an importantrole in that. What's
more, its staying power has been nothing short ofremarkable given that Google's operations have scaled orders of magnitudebeyond anything the system had been designed to handle, while the applicationmix Google currently supports is not one that anyone could
have possiblyimagined back in the late '90s.

Still, there's no question that GFS faces many challenges now. For onething, the awkwardness of supporting an ever-growing fleet of user-facing,latency-sensitive applications on top of a system initially designed forbatch-system throughput is
something that's obvious to all.

The advent of BigTable has helped somewhat in this regard. As it turnsout, however, BigTable isn't actually all that great a fit for GFS. In fact, itjust makes the bottleneck limitations of the system's single-master design moreapparent than
would otherwise be the case.

For these and other reasons, engineers at Google have been working formuch of the past two years on a new distributed master system designed to takefull advantage of BigTable to attack some of those problems that have provedparticularly difficult
for GFS.

Accordingly, it now seems that beyond all the adjustments made to ensurethe continued survival of GFS, the newest branch on the evolutionary tree willcontinue to grow in significance over the years to come.

原文地址:http://queue.acm.org/detail.cfm?id=1594206

上一篇:GEOS库的学习之一:介绍和编译


下一篇:*HDU3172 并查集