CloudFS: Why?

As the name implies, CloudFS is a file system for the cloud. What does that mean? First, it means that it’s a filesystem, with the behaviors that people – and programs – expect of filesystems, and not some completely different set of behaviors characteristic of a database or blob store or something else. Here are some examples:

  • You access data in a filesystem by mounting it and issuing a familiar set of open/close/read/write calls so that every language and library and program under the sun that can use a filesystem can use this one.
  • Files are arranged into directories which can be nested arbitrarily, without requiring the user to establish and follow some separate convention on top of a single-level hierarchy.
  • The data model is a byte stream – not blocks, not records or rows – in which reads and writes can be done at any offset for any length.
  • Performance and consistency for small writes in large files are not reduced to near zero by doing read/modify/write on whole files.
  • Files and directories have owners, permissions, and other information associated with them besides their contents.

You might notice some things that are missing – e.g. locks or atomic cross-directory rename. That’s because most applications don’t rely on these features, often because they’re supported poorly or inconsistently by existing filesystems, and they’re things that anyone working specifically in the cloud should try to avoid. That last part, in turn, is because some features are impossible (or at least nearly so) to implement acceptable in the cloud. If nobody would be satisfied with the result anyway then trying is a waste of time – time that can be better spent implementing other features that really will be needed.

Many of the things that need a filesystem are not whole applications being developed from the ground up. using every possible filesystem feature. They’re libraries and frameworks that are used by other applications, and they just need basic filesystem functionality as I’ve outlined above. If you’ve constructed your application out of a dozen such pieces, the very last thing you want to do or should be doing is to dive into each and every one of these alien bits of code to make them use some other kind of data store instead. If you can build your entire application from the ground up to use something else that’s a better fit for your needs than a traditional filesystem, then that’s great. More power to you. Meanwhile, I believe that a much larger number of programmers would be better served by having a plain old-fashioned filesystem . . . albeit one that’s based on the latest technology.

OK, so much for the filesystem part. What about the cloud part? Aren’t there already filesystems you can use in the cloud? There sure are. In fact, I do exactly that myself all the time. It’s a fine thing. However, there are a few problems with doing things this way. One is that you have to manage the servers and their configuration yourself. Not only is that just one more burden as you’re trying to do Something Else, but it doesn’t allow you to take advantage of shared-service economies. Part of the cloud value proposition is supposed to be that when you aggregate resources across many users their individually unpredictable growth or bursts balance out (James Hamilton’s “non-correlated peaks” idea). The resulting aggregate predictability (plus the ability to amortize the cost of hiring real experts) allows things like capacity provisioning and monitoring to be done more efficiently in one place than if they had to be done separately by each user. Also, when you run heavy I/O within your computer resources you’re using the wrong tool for the job. It’s much better to run physical I/O on machines provisioned and tuned for that kind of thing than to run virtual I/O on machines provisioned and tuned for something else entirely. For all of these reasons, having the provider set up a filesystem as a permanent, shared resource is preferable to having each user set up their own . . . so long as it’s done in a way that preserves the users’ and providers’ needs. On the user side, you want to share between all of a user’s machines but you don’t want users reading each others’ data (let alone writing it). On the provider side, you need features such as quotas and accurate billing so users can actually be charged for what they use.

This brings us to the “why” of CloudFS: because sometimes people need a filesystem, because anything you put in the cloud as a shared service needs to meet certain requirements, and because none of the filesystems that might otherwise fit people’s needs meet those requirements. Without getting too deep into the details of exactly what features CloudFS providers to fill this gap – that will be my next post – I think it’s safe to say a big gap that needs filling.