Yesterday, I wrote about why there’s a need for CloudFS. Today, I’m going to write about how CloudFS attempts to satisfy that need. As before, there’s a “filesystem” part and a “cloud” part, which I’ll address in that order. Let’s start with the idea that writing a whole new cloud filesystem from scratch would be a mistake. It takes a lot of time and resources to develop, then it’s another long struggle to have it accepted by users. Instead, let’s start by considering what an existing filesystem would need to be like for us to use it as a base for further cloud-specific enhancements.
- It must support basic filesystem functionality, as outlined in the last post.
- It must be open source.
- It must be mature enough that users will at least consider trusting their data to it.
- It must be shared, i.e. accessible over a network.
- It must be horizontally scalable. Cloud economics are based on maximizing the ratio between users and resource pools, so the bigger we can make the pools the more everyone will benefit. Since this is a general-purpose and not an HPC-specific environment, we need to consider metadata scalability (ops/second) as well as data scalability (MB/second), and we should be very wary of “solutions” that centralize any kind of operation in one place (or, for that matter, distribute it too broadly).
- It must provide decent data protection, even on commodity hardware with only internal storage. I’m not even going to get into the technical merits of shared storage vs. “shared nothing” because the “fact on the ground” is that the people building large clouds prefer the latter regardless of what we might think.
- It should ideally support RDMA as well as TCP/IP. This is not strictly necessary, but is highly useful for a server-private “back-end” network and is becoming more and more relevant for “front end” networks as higher-performance cloud environments gain traction.
There are really only three production-level distributed/parallel filesystems that come close to these requirements: Lustre, PVFS2, and GlusterFS. There’s also pNFS, which is a protocol rather than a ready-to-use implementation and addresses neither metadata scaling nor data protection, and Ceph, which is in many ways more advanced than any of the others but is still too young to meet the “trust it in production” standard for most people. Then there are dozens of other projects, most of which either don’t provide basic filesystem behavior, aren’t sufficiently mature, and/or don’t provide real horizontal scalability. Some day one of those might well emerge and distinguish itself, but there’s work to be done now.
As tempting as it might be to go into detail about the precise reasons why I consider various alternatives unsuitable, I’m going to refrain. The key point is that GlusterFS is the only one that can indisputably claim to meet all of the requirements above. It does so at a cost, though. It is almost certainly the slowest of those I’ve mentioned, per node on equivalent hardware, at least as such things are traditionally measured. This is not primarily the result of it being based on FUSE, which is often maligned – without a shred of supporting evidence – as the source of every problem in filesystems based on it. The performance difference is minimal unless you’re using replication, which is necessary to provide data protection within the filesystem itself instead of falling back on the “just use RAID” and “just install failover software (even though we can’t tell you how)” excuses used by some of the others. Every one of these options has at least one major “deficiency” making it an imperfect fit for CloudFS, so the question becomes which problems you’d rather be left with when it comes time to deploy your filesystem as shared infrastructure. When you already have horizontal scalability, low per-node performance is a startlingly easy problem to work around.
The other main advantage of GlusterFS, compared to the others, is extensibility. Precisely because it’s FUSE-based, and because it has a modular “translator” interface besides, extensions of the kind we need are much easier to write than they would be if the new code had to be intertwined with old code and debugged in the kernel. Many of the libraries we’ll need, such as those for encryption, are trivially available in user space but would require a major LKML fight to use from the kernel. No, thanks. The advantages of higher development “velocity” are hard to overstate, and are already apparent in how far GlusterFS has come relative to its competitors since about two years ago (when I tried it and it was awful). If somebody decides later on that they do want to wring every last bit of performance out of CloudFS by moving the implementation into the kernel, that will still be easier after the algorithms and protocols have been thoroughly debugged. I’ve been developing kernel code by starting in user space for a decade and a half, and don’t intend to stop now.
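To make the translator idea a bit more concrete, here is a deliberately stripped-down sketch of what a stackable layer can look like. To be clear, this is not the real GlusterFS translator API (which is asynchronous, callback-based, and considerably richer); every name below is invented purely for illustration. The point is simply that each feature lives in its own small layer that wraps the one below it, entirely in user space, which is what makes this kind of extension cheap to write and debug.

```c
/* Hypothetical sketch of a stackable "translator" layer, loosely modeled on
 * the idea behind GlusterFS translators. Names and signatures are
 * illustrative only; this is not the actual GlusterFS API. */
#include <stddef.h>
#include <sys/types.h>

struct xlayer;                          /* one layer in the stack */

struct xlayer_ops {
    /* each layer implements only the operations it cares about and passes
     * everything else straight through to the layer below */
    ssize_t (*read)(struct xlayer *self, int fd, void *buf,
                    size_t len, off_t off);
    ssize_t (*write)(struct xlayer *self, int fd, const void *buf,
                     size_t len, off_t off);
};

struct xlayer {
    const struct xlayer_ops *ops;
    struct xlayer *next;                /* the subvolume this layer wraps */
    void *private;                      /* per-layer state: keys, maps, counters */
};

/* A trivial pass-through layer. A real feature layer (encryption, ID
 * mapping, usage accounting) would do its work around these delegations. */
static ssize_t passthru_read(struct xlayer *self, int fd, void *buf,
                             size_t len, off_t off)
{
    return self->next->ops->read(self->next, fd, buf, len, off);
}

static ssize_t passthru_write(struct xlayer *self, int fd, const void *buf,
                              size_t len, off_t off)
{
    return self->next->ops->write(self->next, fd, buf, len, off);
}

const struct xlayer_ops passthru_ops = { passthru_read, passthru_write };
```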
At this point, we’re ready to ask: what features are still missing? If we have decent scalability and decent data protection and so on, what must we still add before we can deploy the result as a shared resource in a cloud environment? It turns out that most of these features have to do with protection – with making it safe to attempt such a deployment.
- We need strong authentication, not so much for its own sake but because the cloud user’s identity is an essential input to some of the other features we need.
- We need encryption. Specifically, we need encryption to protect not only against other users seeing our data but also against the much more likely “insider threat” of the cloud provider doing so. This rules out the easy approaches that involve giving the provider either symmetric or private keys, no matter how securely such a thing might be done. Volume-level encryption is out too, both for that reason and because users share volumes, so there’s no single key to use. The only encryption that really works here is at the filesystem level and on the client side.
- We need namespace isolation – e.g. separate subdirectories for each cloud user.
- We need UID/GID isolation. Since we’re using a filesystem interface, each request is going to be tagged with a user ID and one or more group IDs, which serve as the basis for access control. However, my UID=50 is not the same as your UID=50, nor is it the same as the provider’s UID=50, so my UID=50 needs to be stored as some unique UID within CloudFS on the server side and then converted back later (see the sketch after this list).
- We need to be able to set and enforce a quota, to prevent users from consuming more space than they’re entitled to.
- We need to track usage precisely, including both space consumed and data transferred in/out, on a per-user basis for billing purposes.
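To illustrate the UID/GID isolation point above, here is a rough sketch of one way such a mapping could work, using a simple range-based scheme. Everything here is an assumption made for the example, not a description of how CloudFS actually maps IDs; the point is only that each tenant’s local IDs get remapped into a disjoint range on the way in and mapped back on the way out.

```c
/* Hypothetical per-tenant UID mapping using disjoint ranges.
 * The scheme, the range size, and the function names are all illustrative
 * assumptions, not CloudFS code. */
#include <stdio.h>
#include <sys/types.h>

#define IDS_PER_TENANT 100000U   /* assumed size of each tenant's ID range */

/* tenant-local UID -> globally unique server-side UID */
static uid_t uid_to_server(unsigned tenant, uid_t local_uid)
{
    return (uid_t)(tenant * IDS_PER_TENANT + local_uid);
}

/* server-side UID -> tenant-local UID, for results going back to the client */
static uid_t uid_to_client(unsigned tenant, uid_t server_uid)
{
    return (uid_t)(server_uid - tenant * IDS_PER_TENANT);
}

int main(void)
{
    /* my UID=50 and your UID=50 land on different server-side IDs */
    printf("tenant 3, uid 50 -> server uid %u\n",
           (unsigned)uid_to_server(3, 50));
    printf("tenant 7, uid 50 -> server uid %u\n",
           (unsigned)uid_to_server(7, 50));
    printf("server uid %u -> tenant 3 uid %u\n",
           (unsigned)uid_to_server(3, 50),
           (unsigned)uid_to_client(3, uid_to_server(3, 50)));
    return 0;
}
```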
This is where the translator architecture really comes into its own. Every one of these features can be implemented as a separate translator, although in some cases it might make sense to combine several functions into one translator. For example, namespace and UID/GID isolation are similar enough that they can be done together, as are quota enforcement and usage tracking. What we end up with is a system offering each user similar performance and similar security to what they’d have with a privately deployed filesystem using volume-level encryption on top of virtual block devices, but with only one actual filesystem that the provider needs to manage.
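To close with a concrete example of the kind of combination mentioned above, here is a rough sketch of quota enforcement and usage accounting sharing one spot on the I/O path. Again, this is illustrative only: the structure and function names are invented, and real accounting would have to handle overwrites, truncates, and persistence. It does show why the two features fit naturally in a single translator, though: they want to hook exactly the same operations.

```c
/* Hypothetical sketch of per-tenant quota enforcement plus usage accounting
 * in one layer. Illustrative only; not CloudFS or GlusterFS code. */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct tenant_usage {
    uint64_t bytes_stored;   /* space consumed, for quota and capacity billing */
    uint64_t bytes_in;       /* data transferred in, for billing */
    uint64_t bytes_out;      /* data transferred out, for billing */
    uint64_t quota_bytes;    /* per-tenant limit set by the provider */
};

/* Write path: enforce the quota first, then account for the transfer.
 * (Simplified: treats every write as new data, ignoring overwrites.) */
static int account_write(struct tenant_usage *t, size_t len)
{
    if (t->bytes_stored + len > t->quota_bytes)
        return -EDQUOT;      /* over quota, reject before writing anything */
    t->bytes_stored += len;
    t->bytes_in += len;
    return 0;
}

/* Read path: reads never count against the quota, but do count as egress. */
static void account_read(struct tenant_usage *t, size_t len)
{
    t->bytes_out += len;
}
```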