GlusterFS Algorithms: Distribution

A lot of people seem to be curious about how GlusterFS works, not just in the sense of effects but in terms of internal algorithms etc. as well. Heres an example from this morning. The documentation at this level really is kind of sparse, so I might as well start filling some of the gaps. Today Ill talk about DHT, which is the real core of how GlusterFS aggregates capacity and performance across multiple servers. Its responsibility is to place each file on exactly one of its subvolumes unlike either replication (which places copies on all of its subvolumes) or striping (which places pieces onto all of its subvolumes). Its a routing function, not splitting or copying.

The basic method used in DHT is consistent hashing. Each subvolume (brick) is assigned a range within a 32-bit hash space, covering the entire range with no holes or overlaps. Then each file is also assigned a value in that same space, by hashing its name. Exactly one brick will have an assigned range including the files hash value, and so the file should be on that brick. However, there are many cases where that wont be the case, such as when the set of bricks (and therefore the range assignment of ranges) has changed since the file was created, or when a brick is nearly full. Much of the complexity in DHT involves these special cases, which well discuss in a moment. First, though, its worth making a couple more observations about the basic algorithm.

  • The assignment of hash ranges to bricks is determined by extended attributes stored on directories (heres a description of those data structures). This means the distribution is directory-specific. You could well distribute files differently e.g. across different sets of bricks in different directories if you know what youre doing, but its quite unsafe. Firstly its unsafe because youd really better know what youre doing. Secondly its unsafe because theres no management support for this, so the next time you do a rebalance (more about that later) it will happily stomp on your carefully hand-crafted xattrs. In the fairly near future, I hope to add a feature to recognize hand-set xattrs as such and leave them alone. In the more distant future, there might be management support for assigning bricks to various pools or classes of storage, and defining per-directory placement policies in terms of those.
  • Consistent hashing is usually thought of as hashing around a circle, but in GlusterFS its more linear. Theres no need to wrap around at zero, because theres always a break (between one bricks range and anothers) at zero.
  • If a brick is missing, there will be a hole in the hash space. Even worse, if hash ranges are reassigned while a brick is offline, some of the new ranges might overlap with the (now out of date) range stored on that brick, creating a bit of confusion about where files should be. GlusterFS tries really hard to avoid these problems, but it also checks aggressively to make sure nothing slips through. If you ever see messages in your logs about holes or overlaps, that means the checking code is doing its job.

So, those are the basics. How about those special cases? Its probably easiest to look at the read path first, where were trying to find a file that we expect to be there. Heres the sequence of operations.

  1. Make sure we have the hash-range assignments (the layout) for each bricks copy of the parent directory. This information is cached, so well usually have it already.
  2. Hash the file name and look up the corresponding brick in the layout.
  3. Send a LOOKUP request to that brick, specifying the file path.
  4. If the LOOKUP comes back positive, we found the file and were done.
  5. Otherwise, re-send the LOOKUP to all bricks to see who really has the file.
  6. If nobody gives a positive reply, the file really isnt there and were done again.
  7. Go back to the brick where the file should have been, and create a link file (described below) pointing to the real location.
  8. Return the LOOKUP result to the caller.

Whats a link file, then? Have you ever looked on one of your bricks and seen zero-length files with weird permissions (sticky bit set)? Those are link files. If you look closer, youll also see that they have trusted.dht.linkfile xattrs with brick names in them. Thats how we avoid the broadcast mentioned above. On subsequent lookups, if we find a link file we just follow it to the real brick. Considering that we only go through this lookup procedure once per file per client anyway (location information is cached), the cost of guessing wrong is therefore pretty minimal. I once implemented a scheme where we do an exponentially expanding search instead of an immediate broadcast, hoping to achieve a better balance of lookup latency vs. network traffic, but in the end it just didnt seem to make a difference so the extra complexity wasnt worth it. Now, lets look at the file-creation path.

  1. Assume weve already done a lookup, so we already have the layout information cached and we know the file doesnt already exist anywhere.
  2. Hash the file name and look up the corresponding brick in the layout.
  3. If that brick is full, choose another brick (doesnt really matter how) that isnt instead.
  4. Send a CREATE request to the chosen brick for that file.
  5. If we diverted because of a full brick, go back and add a link file to the brick chosen by pure hashing. The next client will almost certainly need it.

This brings us to rebalancing, which is one of the key challenges and therefore one of the most interesting research areas IMO in this kind of system. The first thing to know about GlusterFS rebalancing is that its not automatic. If you add a new brick, even new files wont be put on it until you do the fix-layout part of rebalance, and old files wont be put on it until you do the migrate-data part. What do these do?

  • Fix-layout just walks the directory tree recalculating and modifying the trusted.glusterfs.dht xattrs to reflect the new list of bricks. It does this in a pretty simple way, assigning exactly one range of length MAXINT/nbricks to each brick in turn starting at zero.
  • Migrate-data is much more costly. For each file, it calculates where the file should be according to the new layout. Then, if the file is not already there, it moves the file by copying and renaming over the original. Theres some tricky code to make sure this is transparent and atomic and so forth, but thats the algorithm.

In my personal opinion, there are problemsenhancement opportunities in both of these areas. Lets take these in reverse order. Migrate-data is slow. What it should do is run in parallel on all of the bricks, with each brick either pushing data that is currently local but needs to be elsewhere or pulling data thats the other way around. What it does instead is run on one node, potentially moving files for which it is neither source nor destination. This is a big deal, because it causes rebalance to take days when it should take hours or weeks when it should take days, on larger installations. The amount of I/O involved is also why you dont necessarily want this to be an automatic process.

While the migrate-data issue is at the level of mechanics and implementation, the fix-layout issue is at more of a conceptual level. To put it simply, when we add a new brick we should reallocate approximately 1/new_brick_count hash values. Because our layout calculations are naive, we will usually reallocate much more than that exacerbating the migrate-data problem because reallocated hash values correspond to moved files. Time for a picture.

The outer ring represents the state with just three bricks hash value zero at the top, split into three equal ranges. The inner ring represents the state after adding a fourth brick. Any place where the inner and outer rings are different colors represents a range that has been reassigned from one brick to another implying a migration of data likewise. If you look carefully, youll see that were moving half of the data when it should be only a quarter 8% blue to orange, 17% orange to yellow, and 25% yellow to green. What could we do thats better? Not much, if we stay within the limitation of a single brick claiming a single range, but there really doesnt need to be such a limitation. Instead, we could borrow from Dynamo and assign multiple virtual node IDs for brick four, giving it a total of 25% drawn equally from bricks one through three. (If you look at this as a clock, thats one hour each at three, seven, and eleven oclock). Taking this approach too far can cause layout tables to get unreasonably large, so sometimes it makes more sense to forego this optimization or even reverse it by combining/swapping ranges even though that will cause unnecessary data migration. Done sensibly, this can keep the table size under control without resorting to the scale-limiting hash bucket approach of some Dynamo derivatives. There are even more sophisticated ways to address this problem, but thats a whole different post and this should be enough to convey the general problem space.

Thats how DHT works today, and some thoughts about how it might evolve in the future. The next article will focus on replication.