User Space Filesystems

Apparently Linus has made another of his grand pronouncements, on a subject relevant to this project (thanks to Pete Zaitcev for bringing it to my attention).

“People who think that userspace filesystems are realistic for anything but toys are just misguided.”

I beg to differ, on the basis that many people are deploying user-space filesystems in production to good effect, and that by definition means they’re not toys. Besides the obvious example of GlusterFS, PVFS2 is almost entirely in user space and it has been used to solve some very serious problems on some seriously large systems for years. Everything Linus has worked on is a toy compared to this. There are several other examples, but that one should be sufficient.

So where does Linus’s dismissive attitude come from? Only he can say, of course, but I’ve seen the same attitude from many kernel hackers and in many cases I do know where it comes from. A lot of people who have focused their attention on the minutiae of what’s going on inside processors and memory and interrupt controllers tend to lose track of things that might happen past the edge of the motherboard. This is a constant annoyance to people who work on external networking or storage, and the problem is particularly acute with distributed systems that involve both. Sure there are inefficiencies in moving I/O out to user space, but those can be positively dwarfed by inefficiencies that occur between systems. A kernel implementation of a bad distributed algorithm is most emphatically not going to beat a user-space implementation of a better one. When you’re already dealing with the constraints of a high-performance distributed system, having to deal with the additional constraints of working in the kernel might actually slow you down. It’s not that it can’t be done; it’s just not the best way to address that class of problems.
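
To put rough numbers on that argument, here is a back-of-envelope sketch in Python. Everything in it is an assumption made up for illustration, not a measurement of any real system: the function name, the 5000 MB/s memory-copy rate, the 125 MB/s network rate (roughly gigabit Ethernet), and the idea that a "worse" distributed algorithm sends each byte over the wire twice. The model is deliberately crude: it treats copies and network transfers as serial stages and adds up their per-megabyte times.

```python
def throughput_mb_s(copies, network_passes, copy_rate=5000.0, net_rate=125.0):
    """Crude serial model: time to move 1 MB is the sum of per-stage times.

    copies          -- memory copies per byte (hypothetical count)
    network_passes  -- times each byte crosses the network (hypothetical count)
    copy_rate       -- assumed memory copy bandwidth in MB/s
    net_rate        -- assumed network bandwidth in MB/s
    """
    seconds_per_mb = copies / copy_rate + network_passes / net_rate
    return 1.0 / seconds_per_mb


# Same algorithm, in-kernel (one copy) versus user space (two extra copies).
kernel_good = throughput_mb_s(copies=1, network_passes=1)
user_good = throughput_mb_s(copies=3, network_passes=1)

# In-kernel, but with an algorithm that moves every byte over the wire twice.
kernel_bad = throughput_mb_s(copies=1, network_passes=2)

# Fraction of throughput lost to the extra user-space copies alone.
copy_penalty = (kernel_good - user_good) / kernel_good

print(f"kernel, good algorithm: {kernel_good:.1f} MB/s")
print(f"user,   good algorithm: {user_good:.1f} MB/s")
print(f"kernel, bad algorithm:  {kernel_bad:.1f} MB/s")
print(f"penalty from extra copies: {copy_penalty:.1%}")
```

Under these assumed rates the extra copies cost a few percent, while the extra network pass costs about half the throughput. Different assumptions will shift the numbers, and a faster interconnect narrows the gap, so treat this as a way of framing the trade-off rather than a result.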

The inefficiency of moving I/O out to user space is also somewhat self-inflicted. A lot of that inefficiency has to do with data copies, but let’s consider the possibility that there might be fewer such copies if there were better ways for user-space code to specify actions on buffers that it can’t actually access directly. We actually implemented some of these at Revivio, and they worked. Why aren’t such things part of the mainline kernel? Because the gatekeepers don’t want them to be. Linus’s hatred of microkernels and anything like them is old and well known. Many other kernel developers have similar attitudes. If they think a feature only has one significant use case, and it’s a use case they oppose for other reasons, are they going to be supportive of work to provide that feature? Of course not. They’re going to reject it as needless bloat and complexity, which shouldn’t be allowed to affect the streamlined code paths that exist to do things the way they think things should be done. There’s not actually anything wrong with that, but it does mean that when they claim that user-space filesystems will incur unnecessary overhead they’re not expressing an essential truth about user-space filesystems. They’re expressing a truth about their support of user-space filesystems in Linux, which is quite different.

A lot of user-space filesystems, perhaps even a majority, really are toys. Then again, is anybody using kernel-based exofs or omfs more seriously than Argonne is using PVFS? If you make something easier to do, more people will do it. Not all of those people will be as skilled as those who would have done it The Hard Way. FUSE has definitely made it easier to write filesystems, and a lot of tyros have made toys with it, but it’s also possible for serious people to make serious filesystems with it. Remember, a lot of people once thought Linux and the machines it ran on were toys. Many still are, even literally. I always thought that broadening the community and encouraging experimentation were supposed to be good things, without which Linux itself wouldn’t have succeeded. Apparently I’m misguided.

Note: some of the comments have been promoted into a follow-up post
 

15 Responses

  1. JR says:

    Link to said grand pronouncement?

  2. SBN says:

    Thanks for the article. People in Linus’s position have a responsibility to avoid making such judgements. A new technology requires significant effort to optimize, and it will get derailed by an “It will never work because Linus said so” attitude.

  3. There already is an FLOSS OS that implements user-space and is older than Linux. It’s called Minix. http://www.minix3.org/

  4. James Burnash says:

    Nice thoughtful post about the challenges and realities of distributed file systems. I do like the fact that you can respectfully disagree while putting forth your own arguments – we need more of this.

    I’ve used GlusterFS and PVFS for different scenarios, and (within their own limits and outside of bugs) they have performed quite admirably to task.

  5. Jeff Darcy says:

    Shawn: Yes, there is Minix, but it’s a dangerous thing to mention in this context ;) Besides, Minix is a pedagogical OS, which is a wonderful thing but for current purposes almost the same as a toy. I think a better microkernel example would be L4, which is clearly being used for non-toy projects and is demonstrably superior to Linux in terms of both performance and security.

  6. [...] Darcy, of Red Hat and CloudFS fame, wrote a wonderful response, which you should read first before continuing [...]

  7. Manhong Dai says:

    I have been using glusterfs for 3 years. It connects a 32-node cluster to five 12T storage nodes. It is simple, scalable and very stable.

    We tried Lustre before glusterfs. It is good, but not what we want. Here are the reasons:
    1, It is tied to a special kernel, which limits the storage’s usability. We want to use the storage for HPC, storage, backup, etc.
    2, It requires some dedicated meta nodes; however, we always want to stretch each dollar.
    3, It is now owned by Oracle.

    We also waited for the pNFS, but my life is too short.

    Other than user-space file systems, if somebody knows an open source or free distributed kernel-space file system that can handle millions of small files well for HPC and file service, please let us know.

    In my opinion, GlusterFS is the best distributed filesystem as of today. No matter how much user-space file systems are discriminated against, they will keep growing. Eventually they will put Linux on a bigger stage.

  8. P.B.Shelley says:

    How about admitting that with FUSE, data has to be copied to kernel, then your user space component, and then back to kernel to write to disk? If you claim that this overhead can result in better performance, please PROVE it, instead of citing how many successful user space filesystems you have out there. All of them can perform better if they live in the kernel!

  9. Jeff Darcy says:

    You’re missing the point on several levels, P.B. I already mentioned the issue of extra data copies, but also made the point that it doesn’t relegate all user-space file systems to toys. Let’s see how many reasons there might be for that.

    (1) The copies you mention are artifacts of the Linux FUSE implementation, and are not inherent to user-space file systems in general. Other systems do this more intelligently. PVFS2 does it more intelligently *on Linux*. With RDMA, communication could be direct from the application to application, without even the overhead of going through the kernel. FUSE itself could be more efficient if resistance from the kernel grognards and their sycophants could be overcome. Even if one could make the case that filesystems based on FUSE as it exists today are all toys, Linus’s statement *as he made it* would still be untrue.

    (2) The copies don’t matter in many environments, especially in distributed systems. If your system is network, memory, or I/O bound anyway – whether that’s because of provisioning or algorithms – then the copies are only consuming otherwise-idle CPU cycles. This is especially true since most systems sold today are way out of balance in favor of CPU/memory over network or disk I/O anyway.

    (3) There’s an important distinction between latency and throughput. The FUSE overheads mostly affect latency. If latency is your chief concern, then you probably shouldn’t be using any kind of distributed file system regardless of whether it’s in kernel or user space. If throughput is your chief concern, which is the more common case, you need a system that allows you to aggregate the power of many servers without hitting algorithmic limits. Such systems are hard enough to debug already, without the added difficulty of putting them in the kernel prematurely. I’m not against putting code in the kernel *when all of the algorithms are settled*, but projects can go well beyond “toy” status well before that.

    (4) There are concerns besides performance. There are bazillions of libraries that one can use easily from user space. Many of them can not and should not ever be reimplemented in the kernel simply because that would bloat the kernel beyond belief. In some cases there would be other serious implications, such as a kernel-ported security library losing its certification in the process.

    (5) Results from actual field usage trump synthetic micro-benchmarks any day, and either trumps empty theorizing like yours. If Argonne and Pandora and dozens of others can use PVFS2 and GlusterFS and HDFS for serious work, then they’re not toys. The point is already proven. End of story.

  10. Mace Moneta says:

    From a performance perspective, I agree with Linus. From an ease of use and utility perspective, FUSE is great.

    I use sshfs even though I have NFS configured, because of the easy ad hoc functionality when needed. I use encfs because I only need to encrypt one small directory of personal information, and don’t need the system-wide overhead.

    Every tool has its place and purpose.

  11. Jitesh says:

    Linus agrees with you!

    Line from his email
    “fuse works fine if the thing being exported is some random low-use interface to a fundamentally slow device.”

    He agrees that if the bottlenecks are somewhere else, FUSE is fine, which is sort of the point you are trying to make. Peace :-)

  12. Jeff Darcy says:

    I don’t think “low-use interface to a fundamentally slow device” is a very good description of the I/O system attached to Intrepid, and I don’t see anything else in that email about bottlenecks being somewhere else, so I’m not sure how you managed to arrive at that interpretation.

  13. Jitesh says:

    True, but do you really expect anyone to give a statement that is universally and politically true? Should he think about cloud filesystems and distributed filesystems when making a simple point?

    The point being that if the underlying infrastructure/device/whatever-term-you-put-in-here is fundamentally slow, then no one cares about time spent in extra-buffer copies anyway and FUSE is actually *not* useless.

    Eventually it depends on how you choose to interpret what he said. To me: the context of the email didn’t include Cloud or distributed FS, so interpreting the statement in that context is just wrong and over-reacting. His claim is very true for local filesystems and that is the area it was intended for.

  14. [...] as many have pointed out, calling all such systems “toys” isn’t completely fair. But then it [...]