Extended attributes are one of the best kept secrets in modern filesystems. Here you have a fully general feature to attach additional information to files, supported by most modern filesystems, and yet hardly anybody seems to use it. As it turns out, though, GlusterFS uses extended attributes – xattrs – quite extensively to do almost all of the things that it does from replication to distribution to striping. Because they’re such a key part of how GlusterFS works, and yet so little understood outside of that context, xattrs have become the subject of quite a few questions which I’ll try to answer. First, though, let’s just review what you can do with xattrs. At the most basic level, an xattr consists of a string-valued key and a string or binary value – usually on the order of a few bytes up to a few dozen. There are operations to get/set xattrs, and to list them. This alone is sufficient to support all sorts of functionality. For example, SELinux security contexts and POSIX ACLs are both stored as xattrs, with the underlying filesystems not needing to know anything about their interpretation. In fact, I was just dealing with some issues around these kinds of xattrs today . . . but that’s a story for another time.
The sneaky bit here is that the act of getting or setting xattrs on a file can trigger any kind of action at the server where it lives, with the potential to pass information both in (via setxattr) and out (via getxattr). That amounts to a form of RPC which components at any level in the system can use without requiring special support from any of the other components in between, and this trick is used extensively throughout GlusterFS. For example, the rebalancing code uses a magic xattr call to trigger recalculation of the “layouts” that determine which files get placed on which servers (more about this in a minute). The “quick-read” translator uses a magic xattr call to simulate an open/read/close sequence – saving two out of three round trips for small files. There are several others, but I’m going to concentrate on just two: the trusted.glusterfs.dht xattr used by the DHT (distribution) translator, and the trusted.afr.* xattrs used by the AFR (replication) translator.
The way that DHT works is via consistent hashing, in which file names are hashed and the hashes looked up in a table where each range of hashes is assigned exactly one “brick” (volume on some server). This assignment is done on each directory when it’s created. Directories must exist on all bricks, with each copy having a distinct trusted.glusterfs.dht xattr describing what range of hash values it’s responsible for. This xattr contains the following (all 32-bit values):
- The count of ranges assigned to the brick (for this directory). This is always one currently, and other values simply won’t work.
- A format designator for the following ranges. Currently zero, but this time it’s not even checked so it doesn’t matter.
- For each range, a starting and ending hash value. Note that there’s no way in this scheme to specify a zero-length range, nor can ranges “wrap around” from 0xffffffff to 0.
When a directory is looked up, therefore, all the code needs to do is collect these xattrs and combine the ranges they contain into a table. There’s also code to look for gaps and overlaps, which seem to have been quite a problem lately. It doesn’t take long to see that there are some serious scalability issues with this approach, such as the requirement for directories to exist on every brick or the need to recalculate xattrs on every brick whenever new bricks are added or removed. I have to address these issues, but for now the scheme works pretty well.
The most complicated usage of xattrs is not in DHT but in AFR. Here, the key is the trusted.afr.* xattrs, where the * can be the name of any brick in the replica set other than the one where the xattr appears. Huh? Well, let’s say you have an AFR volume consisting of subvolumes test1-client-0 (on server0) and test1-client-1 (on server1). A file on test1-client-0 might therefore have an xattr called trusted.afr.test1-client-1. The reason for this is that the purpose of AFR is to recover from failures. Therefore, the state of an operation can’t just be recorded in the same place where a failure can wipe out both the operation and the record of it. Instead, operations are done one place and recorded everywhere else. (Yes, this is wasteful when there are more than two replicas; that’s another thing I plan to address some day). The way this information is stored is as “pending operation counts” with xattrs recording counts (each 32-bit) for three different kinds of operations:
- Data operations – mostly writes but also e.g. truncates
- Metadata operations – e.g. chmod/chown/chgrp, and xattrs (yes, this gets recursive)
- Namespace operations – create, delete, rename, etc.
Whenever a modifcation is made to the filesystem, the counters are updated everywhere first. In fact, GlusterFS defines a few extra xattr operations (e.g. atomic increments) just to support AFR. Once all of the counters have been incremented, the actual operation is sent to all replicas. As each node completes the operation, the counters everywhere else are decremented once more. Ultimately, all of the counters should go back to zero. If a node X crashes in the middle, or is unavailable to begin with, then every other replica’s counter for X will remain non-zero. This state can easily be recognized the next time the counters are fetched and compared – experienced GlusterFS users probably know that a stat() call will do this. The exact relationships between all of the counters will usually indicate which brick is out of date, so that “self-heal” can go in the right direction. The most fearsome case, it should be apparent, is when the xattrs for the same file/directory on two bricks have non-zero counters for each other. This is the infamous “split brain” case, which can be difficult or even impossible for the system to resolve automatically.
Those are the two most visible uses of xattrs in GlusterFS but, as I said before, there are others. For example, trusted.gfid is a sort of generation number used to detect duplication of inode numbers (because DHT’s local-to-global mapping function is prone to such duplication whenever the server set changes). My personal favorite is trusted.glusterfs.test which appears (with the value “working”) in the root directory of every brick. This is used as a “probe” to determine whether xattrs are supported, but then never cleaned up even after the probe has yielded its result. The result of all this xattr use and abuse is, of course, a confusing plethora of xattrs attached to everything in a GlusterFS volume. That’s why it’s so important when saving/restoring to use a method that handles xattrs properly. Hopefully, I’ve managed to show how crucial these little “extra” bits of information are and perhaps given people some ideas for how to spot or fix problems related to their use.