Dog Food

For a long time now, I’ve been running a server in the Rackspace cloud to do various things for me, including a sort of Dropbox equivalent to which I sync various files I want to be accessible from everywhere. Historically this has involved a combination of sshfs and encfs, but it’s about time I started eating my own dog food and using (some parts of) CloudFS for this. Yes, I know some people don’t like the “dog food” terminology, but the oft-cited “drinking our own champagne” alternative is even less applicable in this case. I wrote it. It’s still dog food until I say otherwise. Even as I write this, I’m copying files from the old setup to the new one based on pretty vanilla GlusterFS plus the at-rest encryption translator from CloudFS, all mounted on my desktop at work. I’ll have to run that through an ssh tunnel for now – until I finish writing the SSL translator – to deal with authentication and in-flight encryption issues. Similarly, I need to finish the ID-mapping translator before I could recommend this for use by more than one person per machine. With those caveats, though, I’d still say that the result is usable and secure enough for my own purposes (including compliance with Red Hat’s infosec policies). If anybody else is interested in getting on the “personal CloudFS” bandwagon, I’ll post some detailed instructions on how you can do this yourself.

GlusterFS Extended Attributes

Extended attributes are one of the best kept secrets in modern filesystems. Here you have a fully general feature to attach additional information to files, supported by most modern filesystems, and yet hardly anybody seems to use it. As it turns out, though, GlusterFS uses extended attributes – xattrs – quite extensively to do almost all of the things that it does from replication to distribution to striping. Because they’re such a key part of how GlusterFS works, and yet so little understood outside of that context, xattrs have become the subject of quite a few questions which I’ll try to answer. First, though, let’s just review what you can do with xattrs. At the most basic level, an xattr consists of a string-valued key and a string or binary value – usually on the order of a few bytes up to a few dozen. There are operations to get/set xattrs, and to list them. This alone is sufficient to support all sorts of functionality. For example, SELinux security contexts and POSIX ACLs are both stored as xattrs, with the underlying filesystems not needing to know anything about their interpretation. In fact, I was just dealing with some issues around these kinds of xattrs today . . . but that’s a story for another time.

The sneaky bit here is that the act of getting or setting xattrs on a file can trigger any kind of action at the server where it lives, with the potential to pass information both in (via setxattr) and out (via getxattr). That amounts to a form of RPC which components at any level in the system can use without requiring special support from any of the other components in between, and this trick is used extensively throughout GlusterFS. For example, the rebalancing code uses a magic xattr call to trigger recalculation of the “layouts” that determine which files get placed on which servers (more about this in a minute). The “quick-read” translator uses a magic xattr call to simulate an open/read/close sequence – saving two out of three round trips for small files. There are several others, but I’m going to concentrate on just two: the trusted.glusterfs.dht xattr used by the DHT (distribution) translator, and the trusted.afr.* xattrs used by the AFR (replication) translator.

The way that DHT works is via consistent hashing, in which file names are hashed and the hashes looked up in a table where each range of hashes is assigned exactly one “brick” (volume on some server). This assignment is done on each directory when it’s created. Directories must exist on all bricks, with each copy having a distinct trusted.glusterfs.dht xattr describing what range of hash values it’s responsible for. This xattr contains the following (all 32-bit values):

  • The count of ranges assigned to the brick (for this directory). This is always one currently, and other values simply won’t work.
  • A format designator for the following ranges. Currently zero, but this time it’s not even checked so it doesn’t matter.
  • For each range, a starting and ending hash value. Note that there’s no way in this scheme to specify a zero-length range, nor can ranges “wrap around” from 0xffffffff to 0.

When a directory is looked up, therefore, all the code needs to do is collect these xattrs and combine the ranges they contain into a table. There’s also code to look for gaps and overlaps, which seem to have been quite a problem lately. It doesn’t take long to see that there are some serious scalability issues with this approach, such as the requirement for directories to exist on every brick or the need to recalculate xattrs on every brick whenever new bricks are added or removed. I have to address these issues, but for now the scheme works pretty well.

The most complicated usage of xattrs is not in DHT but in AFR. Here, the key is the trusted.afr.* xattrs, where the * can be the name of any brick in the replica set other than the one where the xattr appears. Huh? Well, let’s say you have an AFR volume consisting of subvolumes test1-client-0 (on server0) and test1-client-1 (on server1). A file on test1-client-0 might therefore have an xattr called trusted.afr.test1-client-1. The reason for this is that the purpose of AFR is to recover from failures. Therefore, the state of an operation can’t just be recorded in the same place where a failure can wipe out both the operation and the record of it. Instead, operations are done one place and recorded everywhere else. (Yes, this is wasteful when there are more than two replicas; that’s another thing I plan to address some day). The way this information is stored is as “pending operation counts” with xattrs recording counts (each 32-bit) for three different kinds of operations:

  • Data operations – mostly writes but also e.g. truncates
  • Metadata operations – e.g. chmod/chown/chgrp, and xattrs (yes, this gets recursive)
  • Namespace operations – create, delete, rename, etc.

Whenever a modifcation is made to the filesystem, the counters are updated everywhere first. In fact, GlusterFS defines a few extra xattr operations (e.g. atomic increments) just to support AFR. Once all of the counters have been incremented, the actual operation is sent to all replicas. As each node completes the operation, the counters everywhere else are decremented once more. Ultimately, all of the counters should go back to zero. If a node X crashes in the middle, or is unavailable to begin with, then every other replica’s counter for X will remain non-zero. This state can easily be recognized the next time the counters are fetched and compared – experienced GlusterFS users probably know that a stat() call will do this. The exact relationships between all of the counters will usually indicate which brick is out of date, so that “self-heal” can go in the right direction. The most fearsome case, it should be apparent, is when the xattrs for the same file/directory on two bricks have non-zero counters for each other. This is the infamous “split brain” case, which can be difficult or even impossible for the system to resolve automatically.

Those are the two most visible uses of xattrs in GlusterFS but, as I said before, there are others. For example, trusted.gfid is a sort of generation number used to detect duplication of inode numbers (because DHT’s local-to-global mapping function is prone to such duplication whenever the server set changes). My personal favorite is trusted.glusterfs.test which appears (with the value “working”) in the root directory of every brick. This is used as a “probe” to determine whether xattrs are supported, but then never cleaned up even after the probe has yielded its result. The result of all this xattr use and abuse is, of course, a confusing plethora of xattrs attached to everything in a GlusterFS volume. That’s why it’s so important when saving/restoring to use a method that handles xattrs properly. Hopefully, I’ve managed to show how crucial these little “extra” bits of information are and perhaps given people some ideas for how to spot or fix problems related to their use.

CloudFS: Why?

As the name implies, CloudFS is a file system for the cloud. What does that mean? First, it means that it’s a filesystem, with the behaviors that people – and programs – expect of filesystems, and not some completely different set of behaviors characteristic of a database or blob store or something else. Here are some examples:

  • You access data in a filesystem by mounting it and issuing a familiar set of open/close/read/write calls so that every language and library and program under the sun that can use a filesystem can use this one.
  • Files are arranged into directories which can be nested arbitrarily, without requiring the user to establish and follow some separate convention on top of a single-level hierarchy.
  • The data model is a byte stream – not blocks, not records or rows – in which reads and writes can be done at any offset for any length.
  • Performance and consistency for small writes in large files are not reduced to near zero by doing read/modify/write on whole files.
  • Files and directories have owners, permissions, and other information associated with them besides their contents.

You might notice some things that are missing – e.g. locks or atomic cross-directory rename. That’s because most applications don’t rely on these features, often because they’re supported poorly or inconsistently by existing filesystems, and they’re things that anyone working specifically in the cloud should try to avoid. That last part, in turn, is because some features are impossible (or at least nearly so) to implement acceptable in the cloud. If nobody would be satisfied with the result anyway then trying is a waste of time – time that can be better spent implementing other features that really will be needed.

Many of the things that need a filesystem are not whole applications being developed from the ground up. using every possible filesystem feature. They’re libraries and frameworks that are used by other applications, and they just need basic filesystem functionality as I’ve outlined above. If you’ve constructed your application out of a dozen such pieces, the very last thing you want to do or should be doing is to dive into each and every one of these alien bits of code to make them use some other kind of data store instead. If you can build your entire application from the ground up to use something else that’s a better fit for your needs than a traditional filesystem, then that’s great. More power to you. Meanwhile, I believe that a much larger number of programmers would be better served by having a plain old-fashioned filesystem . . . albeit one that’s based on the latest technology.

OK, so much for the filesystem part. What about the cloud part? Aren’t there already filesystems you can use in the cloud? There sure are. In fact, I do exactly that myself all the time. It’s a fine thing. However, there are a few problems with doing things this way. One is that you have to manage the servers and their configuration yourself. Not only is that just one more burden as you’re trying to do Something Else, but it doesn’t allow you to take advantage of shared-service economies. Part of the cloud value proposition is supposed to be that when you aggregate resources across many users their individually unpredictable growth or bursts balance out (James Hamilton’s “non-correlated peaks” idea). The resulting aggregate predictability (plus the ability to amortize the cost of hiring real experts) allows things like capacity provisioning and monitoring to be done more efficiently in one place than if they had to be done separately by each user. Also, when you run heavy I/O within your computer resources you’re using the wrong tool for the job. It’s much better to run physical I/O on machines provisioned and tuned for that kind of thing than to run virtual I/O on machines provisioned and tuned for something else entirely. For all of these reasons, having the provider set up a filesystem as a permanent, shared resource is preferable to having each user set up their own . . . so long as it’s done in a way that preserves the users’ and providers’ needs. On the user side, you want to share between all of a user’s machines but you don’t want users reading each others’ data (let alone writing it). On the provider side, you need features such as quotas and accurate billing so users can actually be charged for what they use.

This brings us to the “why” of CloudFS: because sometimes people need a filesystem, because anything you put in the cloud as a shared service needs to meet certain requirements, and because none of the filesystems that might otherwise fit people’s needs meet those requirements. Without getting too deep into the details of exactly what features CloudFS providers to fill this gap – that will be my next post – I think it’s safe to say a big gap that needs filling.

CloudFS: What?

Yesterday, I wrote about why there’s a need for CloudFS. Today, I’m going to write about how CloudFS attempts to satisfy that need. As before, there’s a “filesystem” part and a “cloud” part, which I’ll address in that order. Let’s start with the idea that writing a whole new cloud filesystem from scratch would be a mistake. It takes a lot of time and resources to develop, then it’s another long struggle to have it accepted by users. Instead, let’s start by considering what an existing filesystem would need to be like for us to use it as a base for further cloud-specific enhancements.

  • It must support basic filesystem functionality, as outlined in the last post.
  • It must be open source.
  • It must be mature enough that users will at least consider trusting their data to it.
  • It must be shared, i.e. accessible over a network.
  • It must be horizontally scalable. Cloud economics are based on maximizing the ratio between users and resource pools, so the bigger we can make the pools the more everyone will benefit. Since this is a general-purpose and not HPC-specific environment, we need to consider metadata scalability (ops/second) as well as data (MB/second) and we should be very wary of “solutions” that centralize any kind of operation in one place ( distributing too broadly either).
  • It must provide decent data protection, even on commodity hardware with only internal storage. I’m not even going to get into the technical merits of shared storage vs. “shared nothing” because the “fact on the ground” is that the people building large clouds prefer the latter regardless of what we might think.
  • It should ideally support RDMA as well as TCP/IP. This is not strictly necessary, but is highly useful for a server-private “back-end” network and is becoming more and more relevant for “front end” networks as higher-performance cloud environments gain traction.

There are really only three production-level distributed/parallel filesystems that come close to these requirements: Lustre, PVFS2, and GlusterFS. There’s also pNFS, which is a protocol rather than a ready-to-use implementation and addresses neither metadata scaling nor data protection, and Ceph, which is in many ways more advanced than any of the others but is still too young to meet the “trust it in production” standard for most people. Then there are dozens of other projects, most of which either don’t provide basic filesystem behavior, aren’t sufficiently mature, and/or don’t provide real horizontal scalability. Some day one of those might well emerge and distinguish itself, but there’s work to be done now.

As tempting as it might be to go into detail about the precise reasons why I consider various alternatives unsuitable, I’m going to refrain. The key point is that GlusterFS is the only one that can indisputably claim to meet all of the requirements above. It does so at a cost, though. It is almost certainly the slowest of those I’ve mentioned, per node on equivalent hardware, at least as such things are traditionally measured. This is not primarily the result of it being based on FUSE, which is often maligned – without a shred of supporting evidence – as the source of every problem in filesystems based on it. The difference is minimal unless you’re using replication, which is necessary to provide data protection within the filesystem itself instead of falling back to the “just use RAID” and “just install failover software (even though we can’t tell you how)” excuses used by some of the others. Every one of these options has at least one major “deficiency” making it an imperfect fit for CloudFS, so the question becomes which problems you’d rather be left with when it comes time to deploy your filesystem as shared infrastructure. When you already have horizontal scalability, low per-node performance is a startlingly easy problem to work around.

The other main advantage of GlusterFS, compared to the others, is extensibility. Precisely because it’s FUSE-based, and because it has a modular “translator” interface besides, it makes extension such as we need to do much easier than if the new code had to be strongly intertwined with old code and debugged in the kernel. Many of the libraries we’ll need to use, such as encryption, are trivially available out in user-space but would require a major LKML fight to use from the kernel. No, thanks. The advantages of higher development “velocity” are hard to understate and are already apparent in how far GlusterFS has come relative to its competitors since about two years ago (when I tried it and it was awful). If somebody decides later on that they do want to wring every last bit of performance out of CloudFS by moving the implementation into the kernel, that’s still easier after the algorithms and protocols have been thoroughly debugged. I’ve been developing kernel code by starting in user space for a decade and a half, and don’t intend to stop now.

At this point, we’re ready to ask: what features are still missing? If we have decent scalability and decent data protection and so on, what must we still add before we can deploy the result as a shared resource in a cloud environment? It turns out that most of these features have to do with protection – with making it safe to attempt such a deployment.

  • We need strong authentication, not so much for its own sake but because the cloud user’s identity is an essential input to some of the other features we need.
  • We need encryption. Specifically, we need encryption to protect not only against other users seeing our data but also against the much more likely “insider threat” of the cloud provider doing so. This rules out the easy approaches that involve giving the provider either symmetric or private keys, no matter how securely such a thing might be done. Volume level encryption is out, both for that reason but also because users share volumes so there’s no one key to use. The only encryption that really works here is at the filesystem level and on the client side.
  • We need namespace isolation – e.g. separate subdirectories for each cloud user.
  • We need UID/GID isolation. Since we’re using a filesystem interface, each request is going to be tagged with a user ID and one or more group IDs which serve as the basis for access control. However, my UID=50 is not the same as your UID=50, nor is it the same as the provider’s UID=50. Nonetheless, my UID=50 needs to be stored as some unique UID within CloudFS on the server side and then converted back later.
  • We need to be able to set and enforce a quota, to prevent users from consuming more space than they’re entitled to.
  • We need to track usage, precisely, including space used and data transferred in/out, on a per-user basis for billing purposes.

This is where the translator architecture really comes into its own. Every one of these features can be implemented as a separate translator, although in some cases it might make sense to combine several functions into one translator. For example, the namespace and UID/GID isolation are similar enough that they can be done together, as are a quota and usage tracking. What we end up with is a system offering each user similar performance and similar security to what they’d have with a privately deployed filesystem using volume-level encryption on top of virtual block devices, but with only one actual filesystem that needs to be managed by the provider.