I need the following after reviewing the paper:
Problem statement – issues discussed by the authors
Approach & design – how the authors approach the issue & what ideas they propose
Strengths and weaknesses – strengths & weaknesses of the proposed approach & design, and of the paper
Evaluation (performance) – how the authors evaluated the proposed system, and what parameters they used to test performance
Conclusion (from the reader's perspective)
Along with these, I need a detailed section-wise explanation of the paper covering:
Mechanism for scaling
Use, surprises, and design errors
Comparison with related work
The Chubby lock service for loosely-coupled distributed systems
Mike Burrows, Google Inc.
We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.
This paper describes a lock service called Chubby. It is intended for use within a loosely-coupled distributed system consisting of moderately large numbers of small machines connected by a high-speed network. For example, a Chubby instance (also known as a Chubby cell) might serve ten thousand 4-processor machines connected by 1Gbit/s Ethernet. Most Chubby cells are confined to a single data centre or machine room, though we do run at least one Chubby cell whose replicas are separated by thousands of kilometres.
The purpose of the lock service is to allow its clients to synchronize their activities and to agree on basic information about their environment. The primary goals included reliability, availability to a moderately large set of clients, and easy-to-understand semantics; throughput and storage capacity were considered secondary. Chubby’s client interface is similar to that of a simple file system that performs whole-file reads and writes, augmented with advisory locks and with notification of various events such as file modification.
We expected Chubby to help developers deal with coarse-grained synchronization within their systems, and in particular to deal with the problem of electing a leader from among a set of otherwise equivalent servers. For
example, the Google File System uses a Chubby lock to appoint a GFS master server, and Bigtable uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master. In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of meta-data; in effect they use Chubby as the root of their distributed data structures. Some services use locks to partition work (at a coarse grain) between several servers.
Before Chubby was deployed, most distributed systems at Google used ad hoc methods for primary election (when work could be duplicated without harm), or required operator intervention (when correctness was essential). In the former case, Chubby allowed a small saving in computing effort. In the latter case, it achieved a significant improvement in availability in systems that no longer required human intervention on failure.
Readers familiar with distributed computing will recognize the election of a primary among peers as an instance of the distributed consensus problem, and realize we require a solution using asynchronous communication; this term describes the behaviour of the vast majority of real networks, such as Ethernet or the Internet, which allow packets to be lost, delayed, and reordered. (Practitioners should normally beware of protocols based on models that make stronger assumptions on the environment.) Asynchronous consensus is solved by the Paxos protocol [12, 13]. The same protocol was used by Oki and Liskov (see their paper on viewstamped replication [19, §4]), an equivalence noted by others [14, §6]. Indeed, all working protocols for asynchronous consensus we have so far encountered have Paxos at their core. Paxos maintains safety without timing assumptions, but clocks must be introduced to ensure liveness; this overcomes the impossibility result of Fischer et al. [5, §1].
OSDI ’06: 7th USENIX Symposium on Operating Systems Design and Implementation (USENIX Association)

Building Chubby was an engineering effort required to fill the needs mentioned above; it was not research. We claim no new algorithms or techniques. The purpose of this paper is to describe what we did and why, rather than to advocate it. In the sections that follow, we describe Chubby’s design and implementation, and how it has changed in the light of experience. We describe unexpected ways in which Chubby has been used, and features that proved to be mistakes. We omit details that are covered elsewhere in the literature, such as the details of a consensus protocol or an RPC system.
2.1 Rationale

One might argue that we should have built a library embodying Paxos, rather than a library that accesses a centralized lock service, even a highly reliable one. A client Paxos library would depend on no other servers (besides the name service), and would provide a standard framework for programmers, assuming their services can be implemented as state machines. Indeed, we provide such a client library that is independent of Chubby.
Nevertheless, a lock service has some advantages over a client library. First, our developers sometimes do not plan for high availability in the way one would wish. Often their systems start as prototypes with little load and loose availability guarantees; invariably the code has not been specially structured for use with a consensus protocol. As the service matures and gains clients, availability becomes more important; replication and primary election are then added to an existing design. While this could be done with a library that provides distributed consensus, a lock server makes it easier to maintain existing program structure and communication patterns. For example, to elect a master which then writes to an existing file server requires adding just two statements and one RPC parameter to an existing system: one would acquire a lock to become master, pass an additional integer (the lock acquisition count) with the write RPC, and add an if-statement to the file server to reject the write if the acquisition count is lower than the current value (to guard against delayed packets). We have found this technique easier than making existing servers participate in a consensus protocol, and especially so if compatibility must be maintained during a transition period.
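The election pattern described above can be sketched in a few lines. This is an illustrative toy, not Chubby code: the class names, method signatures, and the in-memory lock are all invented; only the acquisition-count idea comes from the text.

```python
# Hypothetical sketch of the "two statements and one RPC parameter" pattern.
# LockService and FileServer are invented stand-ins, not Chubby's API.

class LockService:
    """A toy lock with a monotonically increasing acquisition count."""
    def __init__(self):
        self.count = 0
        self.holder = None

    def acquire(self, client):
        if self.holder is None:
            self.count += 1
            self.holder = client
            return self.count        # the new master remembers this count
        return None

    def release(self, client):
        if self.holder == client:
            self.holder = None

class FileServer:
    """An existing file server, guarded by one added if-statement."""
    def __init__(self):
        self.highest_count = 0
        self.data = None

    def write(self, payload, acquisition_count):
        # Reject writes carrying a stale count (e.g. delayed packets
        # from a deposed master).
        if acquisition_count < self.highest_count:
            return False
        self.highest_count = acquisition_count
        self.data = payload
        return True

lock = LockService()
fs = FileServer()

old = lock.acquire("A")          # A is elected master
lock.release("A")
new = lock.acquire("B")          # B is elected master later

assert fs.write("from B", new)                  # current master's write succeeds
assert not fs.write("late packet from A", old)  # stale write is rejected
```

The point is that the file server needs no knowledge of the consensus protocol; it only compares one integer.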
Second, many of our services that elect a primary or that partition data between their components need a mechanism for advertising the results. This suggests that we should allow clients to store and fetch small quantities of data—that is, to read and write small files. This could be done with a name service, but our experience has been that the lock service itself is well-suited for this task, both because this reduces the number of servers on which a client depends, and because the consistency features of the protocol are shared. Chubby’s success as a name server owes much to its use of consistent client caching, rather than time-based caching. In particular, we found that developers greatly appreciated not having
to choose a cache timeout such as the DNS time-to-live value, which if chosen poorly can lead to high DNS load, or long client fail-over times.
Third, a lock-based interface is more familiar to our programmers. Both the replicated state machine of Paxos and the critical sections associated with exclusive locks can provide the programmer with the illusion of sequential programming. However, many programmers have come across locks before, and think they know how to use them. Ironically, such programmers are usually wrong, especially when they use locks in a distributed system; few consider the effects of independent machine failures on locks in a system with asynchronous communications. Nevertheless, the apparent familiarity of locks overcomes a hurdle in persuading programmers to use a reliable mechanism for distributed decision making.
Last, distributed-consensus algorithms use quorums to make decisions, so they use several replicas to achieve high availability. For example, Chubby itself usually has five replicas in each cell, of which three must be running for the cell to be up. In contrast, if a client system uses a lock service, even a single client can obtain a lock and make progress safely. Thus, a lock service reduces the number of servers needed for a reliable client system to make progress. In a loose sense, one can view the lock service as a way of providing a generic electorate that allows a client system to make decisions correctly when less than a majority of its own members are up.

One might imagine solving this last problem in a different way: by providing a “consensus service”, using a number of servers to provide the “acceptors” in the Paxos protocol. Like a lock service, a consensus service would allow clients to make progress safely even with only one active client process; a similar technique has been used to reduce the number of state machines needed for Byzantine fault tolerance. However, assuming a consensus service is not used exclusively to provide locks (which reduces it to a lock service), this approach solves none of the other problems described above.
These arguments suggest two key design decisions:
• We chose a lock service, as opposed to a library or service for consensus, and
• we chose to serve small files to permit elected primaries to advertise themselves and their parameters, rather than build and maintain a second service.
Some decisions follow from our expected use and from our environment:
• A service advertising its primary via a Chubby file may have thousands of clients. Therefore, we must allow thousands of clients to observe this file, preferably without needing many servers.
• Clients and replicas of a replicated service may wish to know when the service’s primary changes. This suggests that an event notification mechanism would be useful to avoid polling.
• Even if clients need not poll files periodically, many will; this is a consequence of supporting many developers. Thus, caching of files is desirable.
• Our developers are confused by non-intuitive caching semantics, so we prefer consistent caching.
• To avoid both financial loss and jail time, we provide security mechanisms, including access control.

A choice that may surprise some readers is that we
do not expect lock use to be fine-grained, in which locks might be held only for a short duration (seconds or less); instead, we expect coarse-grained use. For example, an application might use a lock to elect a primary, which would then handle all access to that data for a considerable time, perhaps hours or days. These two styles of use suggest different requirements from a lock server.
Coarse-grained locks impose far less load on the lock server. In particular, the lock-acquisition rate is usually only weakly related to the transaction rate of the client applications. Coarse-grained locks are acquired only rarely, so temporary lock server unavailability delays clients less. On the other hand, the transfer of a lock from client to client may require costly recovery procedures, so one would not wish a fail-over of a lock server to cause locks to be lost. Thus, it is good for coarse-grained locks to survive lock server failures, there is little concern about the overhead of doing so, and such locks allow many clients to be adequately served by a modest number of lock servers with somewhat lower availability.
Fine-grained locks lead to different conclusions. Even brief unavailability of the lock server may cause many clients to stall. Performance and the ability to add new servers at will are of great concern because the transaction rate at the lock service grows with the combined transaction rate of clients. It can be advantageous to reduce the overhead of locking by not maintaining locks across lock server failure, and the time penalty for dropping locks every so often is not severe because locks are held for short periods. (Clients must be prepared to lose locks during network partitions, so the loss of locks on lock server fail-over introduces no new recovery paths.)
Chubby is intended to provide only coarse-grained locking. Fortunately, it is straightforward for clients to implement their own fine-grained locks tailored to their application. An application might partition its locks into groups and use Chubby’s coarse-grained locks to allocate these lock groups to application-specific lock servers. Little state is needed to maintain these fine-grained locks; the servers need only keep a non-volatile, monotonically-increasing acquisition counter that is rarely updated. Clients can learn of lost locks at unlock time, and if a simple fixed-length lease is used, the protocol can be simple and efficient. The most important benefits of this scheme are that our client developers become responsible for the provisioning of the servers needed to support their load, yet are relieved of the complexity of implementing consensus themselves.

[Figure 1: System structure. Client applications, linked against the chubby library, communicate by RPC with the master among the five servers of a Chubby cell.]
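An application-specific fine-grained lock server of the kind described — a rarely-updated acquisition counter plus fixed-length leases, with lost locks discovered at unlock time — might look roughly like this. The class, lease length, and method names are invented for illustration; only the counter-and-lease scheme comes from the text.

```python
import time

# Illustrative sketch (not Chubby code) of an application-specific
# fine-grained lock server: a non-volatile, monotonically-increasing
# acquisition counter plus fixed-length leases.

LEASE_SECONDS = 10.0   # hypothetical fixed lease length

class FineGrainedLockServer:
    def __init__(self):
        self.counter = 0   # would be non-volatile in a real system
        self.locks = {}    # name -> (holder, acquisition, lease expiry)

    def acquire(self, name, holder, now=None):
        now = time.monotonic() if now is None else now
        held = self.locks.get(name)
        if held and held[2] > now:
            return None    # still leased to another holder
        self.counter += 1
        self.locks[name] = (holder, self.counter, now + LEASE_SECONDS)
        return self.counter

    def unlock(self, name, holder, acquisition, now=None):
        """Returns False if the lock was lost (lease expired, re-acquired)."""
        now = time.monotonic() if now is None else now
        held = self.locks.get(name)
        if not held or held[0] != holder or held[1] != acquisition:
            return False   # client learns of the lost lock at unlock time
        del self.locks[name]
        return True

s = FineGrainedLockServer()
a = s.acquire("g1", "clientA", now=0.0)
assert s.acquire("g1", "clientB", now=1.0) is None      # lease still valid
b = s.acquire("g1", "clientB", now=11.0)                # lease expired
assert s.unlock("g1", "clientA", a, now=12.0) is False  # A lost the lock
assert s.unlock("g1", "clientB", b, now=12.0) is True
```

Note how little state the server keeps: one counter and a small table, matching the paper's claim that such servers are cheap to provision.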
2.2 System structure
Chubby has two main components that communicate via RPC: a server, and a library that client applications link against; see Figure 1. All communication between Chubby clients and the servers is mediated by the client library. An optional third component, a proxy server, is discussed in Section 3.1.
A Chubby cell consists of a small set of servers (typically five) known as replicas, placed so as to reduce the likelihood of correlated failure (for example, in different racks). The replicas use a distributed consensus protocol to elect a master; the master must obtain votes from a majority of the replicas, plus promises that those replicas will not elect a different master for an interval of a few seconds known as the master lease. The master lease is periodically renewed by the replicas provided the master continues to win a majority of the vote.
The replicas maintain copies of a simple database, but only the master initiates reads and writes of this database. All other replicas simply copy updates from the master, sent using the consensus protocol.
Clients find the master by sending master location requests to the replicas listed in the DNS. Non-master replicas respond to such requests by returning the identity of the master. Once a client has located the master, the client directs all requests to it either until it ceases to respond, or until it indicates that it is no longer the master. Write requests are propagated via the consensus protocol to all replicas; such requests are acknowledged when the write has reached a majority of the replicas in the cell. Read requests are satisfied by the master alone; this is safe provided the master lease has not expired, as no other master can possibly exist. If a master fails, the other replicas run the election protocol when their master leases expire; a new master will typically be elected in a few seconds. For example, two recent elections took 6s and 4s, but we see values as high as 30s (§4.1).
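The master-location step can be sketched as follows. This is a toy model with invented classes; in particular, every replica here simply reports the current master, which stands in for both the "I am the master" and "the master is X" replies described above.

```python
# Toy model of master location; Replica, Cell and find_master are invented.

class Replica:
    def __init__(self, name, cell):
        self.name = name
        self.cell = cell

    def locate_master(self):
        # Non-master replicas answer with the master's identity.
        return self.cell.master

class Cell:
    def __init__(self, replica_names):
        self.replicas = [Replica(n, self) for n in replica_names]
        self.master = replica_names[0]

def find_master(dns_replica_list):
    """Ask replicas from the DNS list until one reports the master."""
    for replica in dns_replica_list:
        master = replica.locate_master()
        if master is not None:
            return master
    return None

cell = Cell(["r1", "r2", "r3", "r4", "r5"])
assert find_master(cell.replicas) == "r1"
cell.master = "r3"                       # a new master after an election
assert find_master(cell.replicas) == "r3"
```

A real client would then direct all traffic to the returned master until it stops responding or disclaims mastership.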
If a replica fails and does not recover for a few hours, a simple replacement system selects a fresh machine from a free pool and starts the lock server binary on it. It then updates the DNS tables, replacing the IP address of the failed replica with that of the new one. The current master polls the DNS periodically and eventually notices the change. It then updates the list of the cell’s members in the cell’s database; this list is kept consistent across all the members via the normal replication protocol. In the meantime, the new replica obtains a recent copy of the database from a combination of backups stored on file servers and updates from active replicas. Once the new replica has processed a request that the current master is waiting to commit, the replica is permitted to vote in the elections for new master.
2.3 Files, directories, and handles
Chubby exports a file system interface similar to, but simpler than, that of UNIX. It consists of a strict tree of files and directories in the usual way, with name components separated by slashes. A typical name is:

/ls/foo/wombat/pouch
The ls prefix is common to all Chubby names, and stands for lock service. The second component (foo) is the name of a Chubby cell; it is resolved to one or more Chubby servers via DNS lookup. A special cell name local indicates that the client’s local Chubby cell should be used; this is usually one in the same building and thus the one most likely to be accessible. The remainder of the name, /wombat/pouch, is interpreted within the named Chubby cell. Again following UNIX, each directory contains a list of child files and directories, while each file contains a sequence of uninterpreted bytes.
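The name structure just described is easy to make concrete with a small parser. This is illustrative code, not Chubby's; the function name is invented, and only the /ls/<cell>/<path> structure comes from the text.

```python
# Illustrative parser for Chubby names of the form /ls/<cell>/<path>.

def parse_chubby_name(name):
    parts = name.split("/")
    # Expect: "", "ls", the cell name, then at least one path component.
    if len(parts) < 4 or parts[0] != "" or parts[1] != "ls":
        raise ValueError("not a Chubby name: " + name)
    cell = parts[2]                     # resolved to servers via DNS
    path = "/" + "/".join(parts[3:])    # interpreted within the named cell
    return cell, path

assert parse_chubby_name("/ls/foo/wombat/pouch") == ("foo", "/wombat/pouch")
assert parse_chubby_name("/ls/local/wombat/pouch")[0] == "local"
```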
Because Chubby’s naming structure resembles a file system, we were able to make it available to applications both with its own specialized API, and via interfaces used by our other file systems, such as the Google File System. This significantly reduced the effort needed to write basic browsing and name space manipulation tools, and reduced the need to educate casual Chubby users.
The design differs from UNIX in ways that ease distribution. To allow the files in different directories to be served from different Chubby masters, we do not expose operations that can move files from one directory to another, we do not maintain directory modified times, and we avoid path-dependent permission semantics (that is, access to a file is controlled by the permissions on the file itself rather than on directories on the path leading to the file). To make it easier to cache file meta-data, the system does not reveal last-access times.
The name space contains only files and directories, collectively called nodes. Every such node has only one name within its cell; there are no symbolic or hard links.
Nodes may be either permanent or ephemeral. Any node may be deleted explicitly, but ephemeral nodes are also deleted if no client has them open (and, for directories, they are empty). Ephemeral files are used as temporary files, and as indicators to others that a client is alive. Any node can act as an advisory reader/writer lock; these locks are described in more detail in Section 2.4.
Each node has various meta-data, including three names of access control lists (ACLs) used to control reading, writing and changing the ACL names for the node. Unless overridden, a node inherits the ACL names of its parent directory on creation. ACLs are themselves files located in an ACL directory, which is a well-known part of the cell’s local name space. These ACL files consist of simple lists of names of principals; readers may be reminded of Plan 9’s groups. Thus, if file F’s write ACL name is foo, and the ACL directory contains a file foo that contains an entry bar, then user bar is permitted to write F. Users are authenticated by a mechanism built into the RPC system. Because Chubby’s ACLs are simply files, they are automatically available to other services that wish to use similar access control mechanisms.
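The F/foo/bar example above reduces to a two-level lookup, which this small sketch makes explicit. The data structures are invented stand-ins (a dict in place of the ACL directory's files).

```python
# Sketch of the ACL check described in the text: a node's write ACL is the
# *name* of a file in the cell's ACL directory listing permitted principals.

def may_write(user, node_acl_name, acl_directory):
    """acl_directory maps ACL-file names to lists of principal names."""
    principals = acl_directory.get(node_acl_name, [])
    return user in principals

acl_dir = {"foo": ["bar", "baz"]}    # ACL file "foo" lists two principals
assert may_write("bar", "foo", acl_dir)       # user bar may write file F
assert not may_write("quux", "foo", acl_dir)
```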
The per-node meta-data includes four monotonically-increasing 64-bit numbers that allow clients to detect changes easily:
• an instance number; greater than the instance number of any previous node with the same name.
• a content generation number (files only); this increases when the file’s contents are written.
• a lock generation number; this increases when the node’s lock transitions from free to held.
• an ACL generation number; this increases when the node’s ACL names are written.
Chubby also exposes a 64-bit file-content checksum so clients may tell whether files differ.
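A client-side record of this meta-data, and one way a client might use it to detect content changes without re-reading, could look like the sketch below. The field names and the change-detection function are mine; the four counters and the checksum come from the text.

```python
from dataclasses import dataclass

# Illustrative per-node meta-data record; field names are invented.

@dataclass
class NodeMeta:
    instance: int      # greater than any previous node with this name
    content_gen: int   # bumped when the file's contents are written
    lock_gen: int      # bumped when the lock goes free -> held
    acl_gen: int       # bumped when the ACL names are written
    checksum: int      # 64-bit file-content checksum

def content_changed(old: NodeMeta, new: NodeMeta) -> bool:
    # Detect a rewrite, or a deleted-and-recreated node, without re-reading.
    return (new.instance != old.instance
            or new.content_gen != old.content_gen
            or new.checksum != old.checksum)

a = NodeMeta(1, 7, 3, 1, 0xABCD)
b = NodeMeta(1, 8, 3, 1, 0xBEEF)   # contents were rewritten
assert content_changed(a, b)
assert not content_changed(a, a)
```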
Clients open nodes to obtain handles that are analogous to UNIX file descriptors. Handles include:
• check digits that prevent clients from creating or guessing handles, so full access control checks need be performed only when handles are created (compare with UNIX, which checks its permissions bits at open time, but not at each read/write because file descriptors cannot be forged).
• a sequence number that allows a master to tell whether a handle was generated by it or by a previous master.
• mode information provided at open time to allow the master to recreate its state if an old handle is presented to a newly restarted master.
2.4 Locks and sequencers
Each Chubby file and directory can act as a reader-writer lock: either one client handle may hold the lock in exclusive (writer) mode, or any number of client handles may hold the lock in shared (reader) mode. Like the mutexes known to most programmers, locks are advisory. That is, they conflict only with other attempts to acquire the same lock: holding a lock called F is neither necessary to access the file F, nor does it prevent other clients from doing so. We rejected mandatory locks, which make locked objects inaccessible to clients not holding their locks:
• Chubby locks often protect resources implemented by
other services, rather than just the file associated with the lock. To enforce mandatory locking in a meaningful way would have required us to make more extensive modification of these services.
• We did not wish to force users to shut down applications when they needed to access locked files for debugging or administrative purposes. In a complex system, it is harder to use the approach employed on most personal computers, where administrative software can break mandatory locks simply by instructing the user to shut down his applications or to reboot.
• Our developers perform error checking in the conventional way, by writing assertions such as “lock X is held”, so they benefit little from mandatory checks. Buggy or malicious processes have many opportunities to corrupt data when locks are not held, so we find the extra guards provided by mandatory locking to be of no significant value.
In Chubby, acquiring a lock in either mode requires write permission so that an unprivileged reader cannot prevent a writer from making progress.
Locking is complex in distributed systems because communication is typically uncertain, and processes may fail independently. Thus, a process holding a lock L may issue a request R, but then fail. Another process may acquire L and perform some action before R arrives at its destination. If R later arrives, it may be acted on without the protection of L, and potentially on inconsistent data. The problem of receiving messages out of order has been well studied; solutions include virtual time, and virtual synchrony, which avoids the problem by ensuring that messages are processed in an order consistent with the observations of every participant.
It is costly to introduce sequence numbers into all the interactions in an existing complex system. Instead, Chubby provides a means by which sequence numbers can be introduced into only those interactions that make use of locks. At any time, a lock holder may request a sequencer, an opaque byte-string that describes the state of the lock immediately after acquisition. It contains the name of the lock, the mode in which it was acquired (exclusive or shared), and the lock generation number. The client passes the sequencer to servers (such as file servers) if it expects the operation to be protected by the lock. The recipient server is expected to test whether the sequencer is still valid and has the appropriate mode;
if not, it should reject the request. The validity of a sequencer can be checked against the server’s Chubby cache or, if the server does not wish to maintain a session with Chubby, against the most recent sequencer that the server has observed. The sequencer mechanism requires only the addition of a string to affected messages, and is easily explained to our developers.
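A recipient server's sequencer check — the "most recent sequencer observed" variant — can be sketched as below. The string encoding and class are invented; the paper says only that the sequencer carries the lock name, the mode, and the lock generation number.

```python
# Hedged sketch of sequencer checking at a recipient server; the encoding
# (colon-separated string) is an assumption, not Chubby's actual format.

def make_sequencer(lock_name, mode, generation):
    return f"{lock_name}:{mode}:{generation}"

class ResourceServer:
    """Tracks the most recent sequencer observed per lock."""
    def __init__(self):
        self.latest_gen = {}   # lock name -> highest generation seen

    def check(self, sequencer, required_mode):
        name, mode, gen = sequencer.split(":")
        gen = int(gen)
        if mode != required_mode:
            return False
        if gen < self.latest_gen.get(name, 0):
            return False       # request from an older lock holder; reject
        self.latest_gen[name] = gen
        return True

srv = ResourceServer()
s1 = make_sequencer("/ls/cell/db-lock", "exclusive", 7)
s2 = make_sequencer("/ls/cell/db-lock", "exclusive", 8)
assert srv.check(s2, "exclusive")       # current holder accepted
assert not srv.check(s1, "exclusive")   # delayed request from old holder rejected
assert not srv.check(s2, "shared")      # wrong mode rejected
```

Because the lock generation number only increases when the lock goes from free to held, a delayed request from a deposed holder always carries a smaller generation.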
Although we find sequencers simple to use, important protocols evolve slowly. Chubby therefore provides an imperfect but easier mechanism to reduce the risk of delayed or re-ordered requests to servers that do not support sequencers. If a client releases a lock in the normal way, it is immediately available for other clients to claim, as one would expect. However, if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a period called the lock-delay. Clients may specify any lock-delay up to some bound, currently one minute; this limit prevents a faulty client from making a lock (and thus some resource) unavailable for an arbitrarily long time. While imperfect, the lock-delay protects unmodified servers and clients from everyday problems caused by message delays and restarts.
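The asymmetry between a normal release and a holder failure is the essence of the lock-delay, and a toy model makes it concrete. The class and its methods are invented; the one-minute bound comes from the text.

```python
# Toy model of the lock-delay; not Chubby's implementation.

MAX_LOCK_DELAY = 60.0   # the one-minute bound mentioned in the text

class Lock:
    def __init__(self, lock_delay=10.0):
        self.delay = min(lock_delay, MAX_LOCK_DELAY)
        self.holder = None
        self.blocked_until = 0.0

    def acquire(self, client, now):
        if self.holder is None and now >= self.blocked_until:
            self.holder = client
            return True
        return False

    def release(self, client):
        # Normal release: the lock is immediately available to others.
        if self.holder == client:
            self.holder = None

    def holder_failed(self, now):
        # Abnormal release: impose the lock-delay before anyone may claim it.
        self.holder = None
        self.blocked_until = now + self.delay

lk = Lock(lock_delay=10.0)
assert lk.acquire("A", now=0.0)
lk.holder_failed(now=5.0)              # A crashed while holding the lock
assert not lk.acquire("B", now=10.0)   # still within the lock-delay
assert lk.acquire("B", now=16.0)       # delay elapsed; B may claim it
```

During the delay, any of A's requests still in flight will drain or be rejected before B can act on the protected resource, which is exactly the everyday protection the text describes.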
2.5 Events

Chubby clients may subscribe to a range of events when they create a handle. These events are delivered to the client asynchronously via an up-call from the Chubby library. Events include:
• file contents modified—often used to monitor the location of a service advertised via the file.
• child node added, removed, or modified—used to implement mirroring (§2.12). (In addition to allowing new files to be discovered, returning events for child nodes makes it possible to monitor ephemeral files without affecting their reference counts.)
• Chubby master failed over—warns clients that other events may have been lost, so data must be rescanned.
• a handle (and its lock) has become invalid—this typically suggests a communications problem.
• lock acquired—can be used to determine when a primary has been elected.
• conflicting lock request from another client—allows the caching of locks.

Events are delivered after the corresponding action has
taken place. Thus, if a client is informed that file contents have changed, it is guaranteed to see the new data (or data that is yet more recent) if it subsequently reads the file.
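The delivery-after-action guarantee can be illustrated with a toy event queue: because the write is applied before the event is queued, a client that reads in response to a "modified" event necessarily sees the new (or newer) data. Everything here is an invented stand-in.

```python
# Toy stand-in for the guarantee that events follow their actions.

class ToyCell:
    def __init__(self):
        self.files = {}
        self.events = []   # delivered to clients in order

    def set_contents(self, name, data):
        self.files[name] = data                   # the action happens first...
        self.events.append(("modified", name))    # ...then the event is queued

cell = ToyCell()
cell.set_contents("/ls/foo/primary", "10.0.0.5:8080")

seen = []
for kind, name in cell.events:        # the client's up-call loop
    if kind == "modified":
        seen.append(cell.files[name]) # guaranteed to observe the new data

assert seen == ["10.0.0.5:8080"]
```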
The last two events mentioned are rarely used, and with hindsight could have been omitted. After primary election, for example, clients typically need to communicate with the new primary, rather than simply know that a primary exists; thus, they wait for a file-modification event indicating that the new primary has written its address in a file. The conflicting lock event in theory permits clients to cache data held on other servers, using Chubby locks to maintain cache consistency. A notification of a conflicting lock request would tell a client to finish using data associated with the lock: it would finish pending operations, flush modifications to a home location, discard cached data, and release. So far, no one has adopted this style of use.
2.6 API

Clients see a Chubby handle as a pointer to an opaque structure that supports various operations. Handles are created only by Open(), and destroyed with Close().
Open() opens a named file or directory to produce a handle, analogous to a UNIX file descriptor. Only this call takes a node name; all others operate on handles.
The name is evaluated relative to an existing directory handle; the library provides a handle on "/" that is always valid. Directory handles avoid the difficulties of using a program-wide current directory in a multi-threaded program that contains many layers of abstraction.
The client indicates various options:
• how the handle will be used (reading; writing and locking; changing the ACL); the handle is created only if the client has the appropriate permissions.
• events that should be delivered (see §2.5).
• the lock-delay (§2.4).
• whether a new file or directory should (or must) be created. If a file is created, the caller may supply initial contents and initial ACL names. The return value indicates whether the file was in fact created.

Close() closes an open handle. Further use of the handle is not permitted. This call never fails. A related call, Poison(), causes outstanding and subsequent operations on the handle to fail without closing it; this allows a client to cancel Chubby calls made by other threads without fear of deallocating the memory being accessed by them.
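The distinction between Close() and Poison() — one destroys the handle, the other only fails its operations while leaving the memory valid for other threads — can be sketched as follows. The class and its fields are invented for illustration.

```python
# Illustrative handle with Close()/Poison() semantics as described:
# Poison() fails all operations but does not free the handle.

class Handle:
    def __init__(self, name):
        self.name = name
        self.poisoned = False
        self.closed = False

    def get_contents(self, store):
        if self.poisoned or self.closed:
            raise RuntimeError("handle unusable")
        return store[self.name]

    def poison(self):
        self.poisoned = True   # outstanding/subsequent calls fail; no free

    def close(self):
        self.closed = True     # never fails; handle may not be used again

store = {"/ls/foo/wombat/pouch": b"data"}
h = Handle("/ls/foo/wombat/pouch")
assert h.get_contents(store) == b"data"
h.poison()
try:
    h.get_contents(store)      # another thread's call now fails safely
    raise AssertionError("should have failed")
except RuntimeError:
    pass
h.close()                      # closing a poisoned handle is still safe
```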
The main calls that act on a handle are:
GetContentsAndStat() returns both the contents and meta-data of a file. The contents of a file are read atomically and in their entirety. We avoided partial reads and writes to discourage large files. A related call GetStat() returns just the meta-data, while ReadDir() returns the names and meta-data for the children of a directory.
SetContents() writes the contents of a file. Optionally, the client may provide a content generation number to allow the client to simulate compare-and-swap on a file; the contents are changed only if the generation number is current. The contents of a file are always written atomically and in their entirety. A related call SetACL() performs a similar operation on the ACL names associated with the node.