Here are his reasons for generally avoiding ZFS from what I consider most important to least.
- The kernel team may break it at any time, and won't care if they do.
- It doesn't seem to be well-maintained.
- Performance is not that great compared to the alternatives.
- Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)
The last commit is from 3 hours ago: https://github.com/zfsonlinux/zfs/commits/master. They have dozens of commits per month. The last minor release, 0.8, brought significant improvements (my favorite: FS-level encryption).
Or maybe this refers to the (initial) 5.0 kernel incompatibility? That wasn't the ZFS dev team's fault.
> Performance is not that great compared to the alternatives.
There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).
> The kernel team may break it at any time, and won't care if they do.
That's true; however, the amount of breakage is no different from any other out-of-tree module, and it's unlikely to happen with a patch release of a working kernel (in fact, it happened with the 5.0 release).
> Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)
"Using" it won't open to lawsuits; ZFS has a CDDL license, which is a free and open-source software license.
The problem is (taking Ubuntu as representative) shipping the compiled module along with the kernel, which is an entirely different matter.
Google's Java implementation wasn't GPL-licensed, so neither its implementation nor its interface could have been covered by OpenJDK being GPLv2. I think RMS wouldn't sit idly by either if someone took GCC and forked it under the Apache license.
> There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).
Note that they don't mean "it's unstable," just "there are significant improvements between versions." Most importantly:
> The filesystem disk format is stable; this means it is not expected to change unless there are very strong reasons to do so. If there is a format change, filesystems which implement the previous disk format will continue to be mountable and usable by newer kernels.
...and only _new features_ are expected to stabilise:
> As with all software, newly added features may need a few releases to stabilize.
So overall, at least as far as their own claims go, this is not "heavy development" as in "don't use."
Some features such as RAID5 were still firmly in "don't use if you value your data" territory last I looked. So it is important to be informed as to what can be used and what might be more dangerous with btrfs.
Keep in mind that RAID5 isn’t feasible with multi-TB disks (the probability of failed blocks when rebuilding the array is far too high). That said, RAID6 also suffers the same write-hole problem with Btrfs. Personally I choose RAIDZ2 instead.
> Keep in mind that RAID5 isn’t feasible with multi-TB disks (the probability of failed blocks when rebuilding the array is far too high).
What makes you say that? I've seen plenty of people make this claim based on URE rates, but I've also not seen any evidence that it is a real problem for a 3-4 drive setup. Modern drives are specced at 1 URE per 10^15 bits read (or better), so less than 1 URE in 125 TB read. Even if a rebuild did fail, you could just start over from a backup. Sure, if the array is mission critical and you have the money, use something with more redundancy, but I don't think RAID5 is infeasible in general.
Last time I checked (a few years ago, I must say), a 10^15 URE rate was only for enterprise-grade drives and not for consumer-level ones, where most drives have a 10^14 URE rate. Which means your rebuild is almost guaranteed to fail on a large-ish RAID setup.
So yeah, RAID is still feasible with multi-TB disks if you have the money to buy disks with the appropriate reliability. For the common folk, RAID is effectively dead with today's disk sizes.
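For concreteness, here's a back-of-the-envelope sketch of that claim (a minimal Python calculation; it assumes UREs are independent and that the spec-sheet rate is the real-world rate, which it usually isn't, so treat the numbers as illustrative only):

    # Back-of-the-envelope: probability of hitting at least one URE while
    # rebuilding a RAID5 array, i.e. while reading every surviving disk in full.
    # The 4x10TB array below is an assumed example, not anyone's real setup.

    def rebuild_ure_probability(surviving_disks, disk_tb, ure_per_bit):
        bits_read = surviving_disks * disk_tb * 1e12 * 8  # TB -> bits
        return 1 - (1 - ure_per_bit) ** bits_read

    # 4-disk RAID5 of 10TB drives: a rebuild reads the 3 surviving disks.
    for rate, label in [(1e-14, "consumer spec (1 per 10^14 bits)"),
                        (1e-15, "NAS/enterprise spec (1 per 10^15 bits)")]:
        p = rebuild_ure_probability(3, 10, rate)
        print(f"{label}: P(>=1 URE during rebuild) ~= {p:.0%}")

Under those assumptions the rebuild hits a URE roughly 90% of the time at the 10^14 spec, but only around 20% of the time at 10^15 - which is basically the two sides of this sub-thread.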
Theoretically, if you have a good RAID5, without serious write-hole and similar issues, then it is strictly better than no RAID and worse than RAID6 and RAID1.
* All localized errors are correctable, unless they overlap on different disks, or result in drive ejection. This precisely fixes the UREs of non-RAID drives.
* If a complete drive fails, then you have a chance of losing some data from the UREs / localized errors. This is approximately the same as if you used no RAID.
As for URE incidence rate - people use multi-TB drives without RAID, yet data loss does not seem prevalent. I'd say it depends .. a lot.
If you use a crappy RAID5, that ejects a drive on a drive partial/transient/read failure, then yes, it's bad, even worse than no RAID.
That being said, I have no idea whether a good RAID5 implementation is available, one that is well interfaced or integrated into filesystem.
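To make the "localized errors are correctable" bullet concrete, here's a toy sketch of the single-parity reconstruction RAID5 relies on (illustrative Python, not any real implementation): the parity block is the XOR of the data blocks in a stripe, so any one missing or unreadable block can be recomputed from the others.

    # Toy single-parity (RAID5-style) reconstruction: parity = XOR of the
    # data blocks in a stripe, so any one lost block can be rebuilt.

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three disks
    parity = xor_blocks(stripe)            # parity block on the fourth disk

    # Disk 1 returns a URE for its block: rebuild it from the survivors + parity.
    rebuilt = xor_blocks([stripe[0], stripe[2], parity])
    assert rebuilt == stripe[1]

That's also why two overlapping errors in the same stripe are fatal: a single parity block only carries enough information to rebuild one missing block.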
I have a couple of Seagate IronWolf drives that are rated at 1 URE per 10^15 bits read and, sure, depending on the capacity you want (basically 8 TB and smaller desktop drives are super cheap), they do cost up to 40% more than their Barracuda cousins, but we're still well within the realm of cheap SATA storage.
Manufacturer-specified URE rates are extremely conservative. If UREs were really that common, you'd notice transient errors during ZFS scrubs, which are effectively a "rebuild" that doesn't rebuild anything.
Feasible is different than possible, and carries a strong connotation of being suitable/able to be done successfully. Many things are possible, many of those things are not feasible.
Btrfs has many more problems than dataloss with RAID5.
It has terrible performance problems under many typical usage scenarios. This is a direct consequence of the choice of core on-disk data structures. There's no workaround without a complete redesign.
It can become unbalanced and cease functioning entirely. Some workloads can trigger this in a matter of hours. Unheard of for any other filesystem.
It suffers from critical dataloss bugs in setups other than RAID5. They have solved a number of these, but when reliability is its key selling point many of us have concerns that there is still a high chance that many still exist, particularly in poorly-exercised codepaths which are run in rare circumstances such as when critical faults occur.
There are differing opinions on BTRFS's suitability in production: on one hand it's the default filesystem of SUSE; on the other, Red Hat has deprecated BTRFS support because they see it as not being production-ready and don't expect it to become production-ready in the near future. They also feel that the legacy Linux filesystems have added features to compete.
But then, your personal requirements/use cases might not be the same as Facebook's. (And this does not only apply to Btrfs[1]/ZFS, it also applies to GlusterFS, use of specific hardware, ...)
[1] which I used for nearly two years on a small desktop machine on a daily basis; ended up with (minor?) errors on the file system that could not be repaired and decided to switch to ZFS. No regrets, nor similar errors since.
Kroger (and their subsidiaries like QFC, Fred Meyer, Fry's Marketplace, etc), Walmart, Safeway (and Albertsons/Randalls) all use Suse with BTRFS for their point of sale systems.
Synology uses standard linux md (for btrfs too). Even SHR (Synology Hybrid RAID) is just different partitions on the drive allocated to different volumes, so you can use mixed-capacity drives effectively.
Right, instead of BTRFS RAID5/6, they use Linux md raid, but I believe they have custom patches to BTRFS to "punch through" information from md, so that when BTRFS has a checksum mismatch it can use the md raid mirror disk for repair.
It's quite usable, but of course, do not trust it with your unique unbacked-up data yet.
I use it as a main FS for a desktop workstation and I'm pretty happy with it. Waiting impatiently for EC to be implemented for efficient pooling of multiple devices.
Regarding caching: "Bcachefs allows you to specify disks (or groups thereof) to be used for three categories of I/O: foreground, background, and promote. Foreground devices accept writes, whose data is copied to background devices asynchronously, and the hot subset of which is copied to the promote devices for performance."
To my knowledge, caching layers are supported, but they require some setup and don't have much documentation right now.
If all you need is a simple root FS that is CoW and checksummed, bcachefs works pretty well in my experience. I've been using it productively as a root and home FS for about two years or so.
why would you want to embed raid5/6 in the filesystem layer? Linux has battle-tested mdraid for this, I'm not going to trust a new filesystem's own implementation over it.
Same for encryption, there are already existing crypto layers both on the block and filesystem (as an overlay) level.
Because the FS can be deeply integrated with the RAID implementation. With a normal RAID, if the data at some address is different between the two disks, there's no way for the fs to tell which is correct, because the RAID code essentially just picks one, it can't even see the other. With ZFS for example, there is a checksum stored with the data, so when you read, zfs will check the data on both and pick the correct one. It will also overwrite the incorrect version with the correct one, and log the error.
It's the same kind of story with encryption, if its built in you can do things like incremental backups of an encrypted drive, without ever decrypting it on the target.
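For anyone who hasn't seen it spelled out, here's a minimal sketch of checksum-driven self-healing on a two-way mirror (illustrative Python only, not ZFS internals; the CRC stands in for whatever checksum the filesystem keeps in its block pointers):

    # Sketch of checksum-driven repair on a two-way mirror. The filesystem
    # stores a checksum alongside each block pointer; a plain RAID1 layer has
    # no such checksum, so it cannot tell which of two differing copies is good.
    import zlib

    class Mirror:
        def __init__(self, disk_a, disk_b):
            self.disks = [disk_a, disk_b]      # dicts: block number -> bytes

        def read(self, block_no, expected_crc):
            bad = []
            for i, disk in enumerate(self.disks):
                data = disk[block_no]
                if zlib.crc32(data) == expected_crc:
                    for j in bad:              # heal any copy that failed the check
                        self.disks[j][block_no] = data
                    return data
                bad.append(i)
            raise IOError("all copies failed checksum: unrecoverable block")

    good = b"hello world"
    disk_a = {7: b"hellX world"}               # silent corruption on disk A
    disk_b = {7: good}
    m = Mirror(disk_a, disk_b)
    assert m.read(7, zlib.crc32(good)) == good # falls back to B, repairs A
    assert disk_a[7] == good

Note that the sketch only reads the second copy when the first one fails its checksum, so the common-case read load is the same as a plain mirror.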
> when you read, zfs will check the data on both and pick the correct one.
Are you sure about that? Always reading both doubles read I/O, and benchmarks show no such effect.
> there's no way for the fs to tell which is correct
This is not an immutable fact that precludes keeping the RAID implementation separate. If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID). I've seen it detect and fix bad blocks, many times. It works, even with separate layers.
This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH. Understanding other complex subsystems is hard, and it's more fun to write new code instead.
> If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID).
This is all great, and I assume it works great. But it is in no way generalizable to all the filesystems Linux has to support (at least at the moment). I could only see this working in a few specific instances with a particular set of FS setups. Even more complicating is the fact that most RAIDs are hardware-based, so just using ioctls to pull individual blocks wouldn't work for many (all?) drivers. Convincing everyone to switch over to software RAID would take a lot of effort.
There is a legitimate need for these types of tools in the sub-PB, non-clustered storage arena. If you're working on a sufficiently large storage system, these tools and techniques are probably par for the course. That said, I definitely have lost 100GB of data from a multi-PB storage system on a Top 500 HPC system due to bit rot. (One bad byte in a compressed data file left the data after the bad byte unrecoverable.) This would not have happened on ZFS.
ZFS was/is a good effort to bring this functionality lower down the storage hierarchy. And it worked because it had knowledge about all of the storage layers. Checksumming files/chunks helps best if you know about the file system and which files are still present. And it only makes a difference if you can access the lower level storage devices to identify and fix problems.
> it is no way generalizable to all the filesystems Linux has to support
Why not? If it's a standard LVM API then it's far more general than sucking everything into one filesystem like ZFS did. Much of this block-mapping interface already exists, though I'm not sure whether it covers this specific use case.
> This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH.
At the time that ZFS was written (early 2000s) and released to the public (2006), this was not a thing and the idea was somewhat novel / 'controversial'. Jeff Bonwick, ZFS co-creator, lays out their thinking:
I debated some of this with Bonwick (and Cantrill who really had no business being involved but he's pernicious that way) at the time. That blog post is, frankly, a bit misleading. The storage "stack" isn't really a stack. It's a DAG. Multiple kinds of devices, multiple filesystems plus raw block users (yes they still exist and sometimes even have reason to), multiple kinds of functionality in between. An LVM API allows some of this to have M users above and N providers below, for M+N total connections instead of M*N. To borrow Bonwick's own condescending turn of phrase, that's math. The "telescoping" he mentions works fine when your storage stack really is a stack, which might have made sense in a not-so-open Sun context, but in the broader world where multiple options are available at every level it's still bad engineering.
> ... but in the broader world where multiple options are available at every level it's still bad engineering.
When Sun added ZFS to Solaris, they did not get rid of UFS and/or SVM, nor prevent Veritas from being installed. When FreeBSD added ZFS, they did not get rid of UFS or GEOM either.
If an admin wanted or wants (or needs) to use the 'old' way of doing things they can.
Heh. I was wondering if you were following (perhaps participating in) this thread. "Pernicious" was perhaps a meaner word than I meant. How about "ubiquitous"?
The fact that traditionally RAID, LVM, etc. are not part of the filesystem is just an accident of history. It's just that no one wanted to rewrite their single disk filesystems now that they needed to support multiple disks. And the fact that administering storage is so uniquely hard is a direct result of that.
However it happened, modularity is still a good thing. It allows multiple filesystems (and other things that aren't quite filesystems) to take advantage of the same functionality, even concurrently, instead of each reinventing a slightly different and likely inferior wheel. It should not be abandoned lightly. Is "modularity bad" really the hill you want to defend?
> However it happened, modularity is still a good thing.
It may be a good thing, and it may not. Linux has a bajillion file systems, some more useful than others, and that is unique in some ways.
Solaris and other enterprise-y Unixes at the time only had one. Even the BSDs generally only have a few that they run on instead of ext2/3/4, XFS, ReiserFS (remember when that was going to take over?), btrfs, bcachefs, etc, etc, etc.
At most, a company may have purchased a license for Veritas:
By rolling everything together, you get ACID writes, atomic space-efficient low-overhead snapshots, storage pools, etc. All this just by removing one layer of indirection and doing some telescoping:
It's not "modularity bad", but that to achieve the same result someone would have had to write/expand a layer-to-layer API to achieve the same results, and no one did. Also, as a first-order estimate of complexity: how many lines of code (LoC) are there in mdraid/LVM/ext4 versus ZFS (or UFS+SVM on Solaris).
Other than esoteric high performance use cases, I'm not really sure why you would really need a plethora of filesystems. And the list of them that can be actually trusted is very short.
I'd like to agree, but I don't think the exceptions are all that esoteric. Like most people I'd consider XFS to be the default choice on Linux. It's a solid choice all around, and also has some features like project quota and realtime that others don't. OTOH, even in this thread there's plenty of sentiment around btrfs and bcachefs because of their own unique features (e.g. snapshots). Log-structured filesystems still have a lot of promise to do better on NVM, though that promise has been achingly slow to materialize. Most importantly, having generic functionality implemented in a generic subsystem instead of in a specific filesystem allows multiple approaches to be developed and compared on a level playing field, which is better for innovation overall. Glomming everything together stifles innovation on any specific piece, as network/peripheral-bus vendors discovered to their chagrin long ago.
> With a normal RAID, if the data at some address is different between the two disks, there's no way for the fs to tell which is correct, because the RAID code essentially just picks one, it can't even see the other.
That's a problem only with RAID1, only when copies=2 (granted, the most common case) and only when the underlying device cannot report which sector has gone bad.
> why would you want to embed raid5/6 in the filesystem layer?
There are valid reasons, most having to do with filesystem usage and optimization. Off the top of my head:
- more efficient re-syncs after failure (don't need to re-sync every block, only the blocks that were in use on the failed disk; rough numbers sketched after this list)
- can reconstruct data not only on disk self-reporting, but also on filesystem metadata errors (CRC errors, inconsistent dentries)
- different RAID profiles for different parts of the filesystem (think: parity raid for large files, raid10 for database files, no raid for tmp, N raid1 copies for filesystem metadata)
and for filesystem encryption:
- CBC ciphers have a common weakness: the block size is constant. If you use FS-object encryption instead of whole-FS encryption, the block size, offset and even the encryption keys can be varied across the disk.
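Some quick arithmetic behind the first bullet, just to show the order of magnitude (a hedged Python sketch; the disk size, utilisation and resync throughput are all assumed numbers, not measurements):

    # Resync time for a replaced 10TB disk: a block-level RAID has to copy
    # every block, a filesystem-aware resync only copies live data.
    DISK_BYTES = 10e12
    THROUGHPUT = 150e6          # bytes/s sustained resync rate (assumed)
    USED_FRACTION = 0.30        # filesystem is 30% full (assumed)

    for label, to_copy in [("block-level (copies everything)", DISK_BYTES),
                           ("fs-aware (copies only live blocks)", DISK_BYTES * USED_FRACTION)]:
        hours = to_copy / THROUGHPUT / 3600
        print(f"{label}: ~{hours:.1f} hours")

Roughly 18 hours versus under 6 in that scenario, and the window in which a second failure can hurt you shrinks accordingly.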
I think even calling volume management a "layer", as though traditional storage was designed from first principles, is a mistake.
Volume management is just a hack. We had all of these single-disk filesystems, but single disks were too small. So volume management was invented to present the illusion (in other words, lie) that they were still on single disks.
If you replace "disk" with "DIMM", it's immediately obvious that volume management is ridiculous. When you add a DIMM to a machine, it just works. There's no volume management for DIMMs.
Indeed there is no volume management for RAM. You have to reboot to rebuild the memory layout! RAM is higher in the caching hierarchy and can be rebuilt at smaller cost. You can't resize RAM while keeping data because nobody bothered to introduce volume management for RAM.
Storage is at the bottom of the caching hierarchy where people get inventive to avoid rebuilding. Rebuilding would be really costly there. Hence we use volume management to spare us the cost of rebuilding.
RAM also tends to have uniform performance. Which is not true for disk storage. So while you don't usually want to control data placement in RAM, you very much want to control what data goes on what disk. So the analogy confuses concepts rather than illuminating commonalities.
One of my old co-workers said that one of the most impressive things he's seen in his career was a traveling IBM tech demo in the back of a semi truck where they would physically remove memory, CPUs, and disks from the machine without impacting the live computation being executed apart from making it slower, and then adding those resources back to the machine and watching them get recognized and utilized again.
> why would you want to embed raid5/6 in the filesystem layer?
One of the creators of ZFS, Jeff Bonwick, explained it in 2007:
> While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.
It's not about ZFS. It's about CoW filesystems in general; since they offer functionalities beyond the FS layer, they are both filesystems and logical volume managers.
RAIDZ is part of the VDEV (Virtual Device) layer. Layered on top of this is the ZIO (ZFS I/O layer). Together, these form the SPA (Storage Pool Allocator).
On top of this layer we have the ARC, L2ARC and ZIL. (Adaptive Replacement Caches and ZFS Intent Log).
Then on top of this layer we have the DMU (Data Management Unit), and then on top of that we have the DSL (Dataset and Snapshot Layer). Together, the SPA and DSL layers implement the Meta-Object Set layer, which in turn provides the Object Set layer. These implement the primitives for building a filesystem and the various file types it can store (directories, files, symlinks, devices etc.) along with the ZPL and ZAP layers (ZFS POSIX Layer and ZFS Attribute Processor), which hook into the VFS.
ZFS isn't just a filesystem. It contains as many, if not more, levels of layering than any RAID and volume management setup composed of separate parts like mdraid+LVM or similar, but much better integrated with each other.
It can also store stuff that isn't a filesystem. ZVOLs are fixed size storage presented as block devices. You could potentially write additional storage facilities yourself as extensions, e.g. an object storage layer.
> We've wasted enough effort over obscure licensing minutia.
Which was precisely Sun/Oracle's goal when they released ZFS under the purposefully GPL-incompatible CDDL. Sun was hoping to make OpenSolaris the next Linux whilst ensuring that no code from OpenSolaris could be moved back to Linux. I can't think of another plausible reason why they would write a new open source license for their open source operating system and make such a license incompatible with the GPL.
> Some people argue that Sun (or the Sun engineer) as creator of the license made the CDDL intentionally GPL incompatible.[13] According to Danese Cooper one of the reasons for basing the CDDL on the Mozilla license was that the Mozilla license is GPL-incompatible. Cooper stated, at the 6th annual Debian conference, that the engineers who had written the Solaris kernel requested that the license of OpenSolaris be GPL-incompatible.[18]
> Mozilla was selected partially because it is GPL incompatible. That was part of the design when they released OpenSolaris. ... the engineers who wrote Solaris ... had some biases about how it should be released, and you have to respect that.
> Simon Phipps (Sun's Chief Open Source Officer at the time), who had introduced Cooper as "the one who actually wrote the CDDL",[19] did not immediately comment, but later in the same video, he says, referring back to the license issue, "I actually disagree with Danese to some degree",[20] while describing the strong preference among the engineers who wrote the code for a BSD-like license, which was in conflict with Sun's preference for something copyleft, and that waiting for legal clearance to release some parts of the code under the then unreleased GNU GPL v3 would have taken several years, and would probably also have involved mass resignations from engineers (unhappy with either the delay, the GPL, or both—this is not clear from the video). Later, in September 2006, Phipps rejected Cooper's assertion in even stronger terms.[21]
So of the available licenses at the time, Engineering wanted BSD and Legal wanted GPLv3, so the compromise was CDDL.
Not at all really. Danese Cooper says that Cantrill is not a reliable witness and one can say he also has an agenda to distort the facts in this way [1].
> Simon Phipps (Sun's Chief Open Source Officer at the time), who had introduced Cooper as "the one who actually wrote the CDDL",[19] did not immediately comment, but later in the same video, he says, referring back to the license issue, "I actually disagree with Danese to some degree",[20] while describing the strong preference among the engineers who wrote the code for a BSD-like license, which was in conflict with Sun's preference for something copyleft, and that waiting for legal clearance to release some parts of the code under the then unreleased GNU GPL v3 would have taken several years, and would probably also have involved mass resignations from engineers (unhappy with either the delay, the GPL, or both—this is not clear from the video). Later, in September 2006, Phipps rejected Cooper's assertion in even stronger terms.[21]
Danese Cooper, one of the people at Sun who helped create the CDDL, responded in the comment section of that very video:
Lovely except it really was decided to explicitly make OpenSolaris incompatible with GPL. That was one of the design points of the CDDL. I was in that room, Bryan and you were not, but I know its fun to re-write history to suit your current politics. I pleaded with Sun to use a BSD family license or the GPL itself and they would consider neither because that would have allowed D-Trace to end up in Linux. You can claim otherwise all you want...this was the truth in 2005.
This needs to be more widely known. Sun was never as open or innovative as its engineer/advertisers claim, and the revisionism is irksome. I saw what they had copied from earlier competitors like Apollo and then claimed as their own ideas. I saw the protocol fingerprinting their clients used to make non-Sun servers appear slower than they really were. They did some really good things, and they did some really awful things, but to hear proponents talk it was all sunshine and roses except for a few misguided execs. Nope. It was all up and down the organization.
The thing is - it was a time of pirates. In an environment defined by the ruthlessness of characters like Gates, Jobs, and Ellison, they were among the best-behaved of the bunch. Hence the reputation for being nice: they were markedly nicer than the hive of scum and villainy that the sector was at the time. And they did some interesting things that arguably changed the landscape (Java etc), even if they failed to fully capitalize on them.
(In many ways, it still is a time of pirates, we just moved a bit higher in the stack...)
> In an environment ... they were among the best-behaved
I wouldn't say McNealy was that different than any of those, though others like Joy and Bechtolsheim had a more salutary influence. To the extent that there was any overall difference, it seemed small. Working on protocol interop with DEC products and Sun products was no different at all. Sun went less-commodity with SPARC and SBus, they got in bed with AT&T to make their version of UNIX seem more standard than competitors' even though it was more "unique" in many ways, there were the licensing games, etc. Better than Oracle, yeah, but I wouldn't go too much further than that.
Just to be clear, I'm not saying they weren't innovative. I'm saying they weren't as innovative as they claim. Apollo, Masscomp, Pyramid, Sequent, Encore, Stellar, Ardent, Elxsi, Cydrome, and others were also innovating plenty during Sun's heyday, as were DEC and even HP. To hear ex-Sun engimarketers talk, you'd think they were the only ones. Reality is that they were in the mix. Their fleetingly greater success had more to do with making some smart (or lucky?) strategic choices than with any overall level of innovation or quality, and mistaking one for the other is a large part of why that success didn't last.
Java was pretty innovative. The world's most advanced virtual machine, a JIT that often outperforms C in long-running server scenarios, and the foundation of probably 95% of enterprise software.
ANDF had already done (or at least tried to do) the "write once, run anywhere" thing. The JVM followed in the footsteps of similar longstanding efforts at UCSD, IBM and elsewhere. There was some innovation, but "world's most advanced virtual machine" took thousands of people (many of them not at Sun) decades to achieve. Sun's contribution was primarily in popularizing these ideas. Technically, it was just one more step on an established path.
Sure plenty of the ideas in Java were invented before, standing on the shoulders of giants and all that. The JIT came from Self, the Object system from Smalltalk, but Java was the first implementation that put all those together into a coherent platform.
Yeah, it's hard to understand this without context. Sun saw D-Trace and ZFS as the differentiators of Solaris from Linux, a massive competitive advantage that they simply could not (and would not) relinquish. Opensourcing was a tactical move, they were not going to give away their crown jewels with it.
The whole open-source steer by Sun was a very disingenuous strategy, forced by the changed landscape in order to try and salvage some semblance of relevance. Most people saw right through it, which is why Sun ended up as it did shortly thereafter: broke, acquired, and dismantled.
I don't think something that is the subject of an ongoing multi-billion-dollar lawsuit can rightly be called "obscure licensing minutia." It is high-profile and its actual effects have proven pretty significant.
It's not just licensing. ZFS has some deep-rooted flaws that can only be solved by block pointer rewrite, something that has an ETA of "maybe eventually".
You can't make a copy-on-write copy of a file. You can't deduplicate existing files, or existing snapshots. You can't defragment. You can't remove devices from a pool.
That last one is likely to get some kind of hacky workaround. But nobody wants to do the invasive changes necessary for actual BPR to enable that entire list.
Wow. As a casual user - someone who at one point was trying to choose between RAID, LVM and ZFS for an old NAS - some of those limitations of ZFS seem pretty basic. I would have taken it for granted that I could remove a device from a pool or defragment.
> There are no (stable) alternatives. BTRFS certainly not, as it's "under heavy development"¹ (since... forever).
Unless you are living in 2012 on a RHEL/CentOS 6/7 machine, btrfs has been stable for a long time now. I have been using btrfs as the sole filesystem on my laptop in standard mode, on my desktop as RAID0, and on my NAS as RAID1 for more than two years. I have experienced absolutely zero data loss. In fact, btrfs recovered my laptop and desktop from broken package updates many times.
You might have had some issues when you tried btrfs on distros like RHEL that did not backport the patches to their stable versions because they don't support btrfs commercially. Try something like openSUSE, which backports btrfs patches to its stable versions, or use something like Arch.
> That's true; however, the amount of breakage is no different from any other out-of-tree module, and it's unlikely to happen with a patch release of a working kernel (in fact, it happened with the 5.0 release).
This is a filesystem we are talking about. In no circumstances will any self-respecting sysadmin use a file system that has even a small chance of breaking with a system update.
I also used btrfs not too long ago in RAID1. I had a disk failure and voila, the array would be read-only from now on and I would have to recreate it from scratch and copy data over. I even utilized the different data recovery methods (at some point the array would not be mountable no matter what) and in the end that resulted in around 5% of the data being corrupt. I won't rule out my own stupidity in the recovery steps, but after this and the two other times when my RAID1 array went read-only _again_ I just can't trust btrfs for anything other than single device DUP mode operation.
Meanwhile ZFS has survived disk failures, removing 2 disks from an 8 disk RAIDZ3 array and then putting them back, random SATA interface connection issues that were resolved by reseating the HDD, and will probably survive anything else that I throw at it.
A former employer was threatened by Oracle because some downloads for the (only free for noncommercial use) VirtualBox Extension Pack came from an IP block owned by the organization. Home users are probably safe, but Oracle's harassment engine has incredible reach.
My employer straight up banned the use of VirtualBox entirely _just in case_. They'd rather pay for VMWare Fusion licenses than deal with any potential crap from Oracle.
Anecdotal, but VirtualBox has always been a bit flaky for me.
VMWare Fusion, on the other hand, powers the desktop environment I've used as a daily work machine for the last 6 months, and I've had absolutely zero problems other than trackpad scrolling getting emulated as mouse wheel events (making pixel-perfect scroll impossible).
Despite that one annoyance, it's definitely worth paying for if you're using it for any serious or professional purpose.
> This is throwing the baby along with the bathwater.
It might be, but let's just say that Oracle aren't big fans of $WORK, and our founders are big fans of them. Thus our legal department are rather tetchy about anything that could give them even the slightest chance of doing anything.
> What requires "commercial considerations" is the extension pack.
And our legal department are nervous about that being installed, even by accident, so they prefer to minimise the possibility.
Well ... that sounds initially unreasonable, but then if I think about it a bit more I'm not sure how you'd actually enforce a non-commercial use only license without some basic heuristic like "companies are commercial".
Is the expectation here that firms offering software under non-commercial-use-is-free licenses just run it entirely on the honour system? And isn't it true that many firms use unlicensed software, hence the need for audits?
IIRC VirtualBox offers to download the Extension Pack without stating it's not free for commercial use. There isn't even a link to the EULA in the download dialog as far as I can tell (from Google Images, at least). Conversely, VirtualBox itself is free for commercial use. Feels more like a honeypot than license auditing.
They can also apply stronger heuristics, like popping up a dialogue box if the computer is centrally-managed (e.g.: Mac MDM, Windows domain, Windows Pro/Enterprise, etc.).
You're thinking of the Guest Additions which is part of the base Virtualbox package and free for commercial use.
The (commercially licensed) Extensions pack provide "Support for USB 2.0 and USB 3.0 devices, VirtualBox RDP, disk encryption, NVMe and PXE boot for Intel cards"[1] and some other functionality e.g. webcam passthrough [2]. There may be additional functionality enabled by the Extension pack I cannot find at a glance, but those are the main things.
A tad offtopic, but on my 2017 Macbook Pro the "pack" was called VMWare Fusion.
With my MBP as host and Ubuntu as guest, I found that VirtualBox (with and without guest extensions installed) had a lot of graphical performance issues that Fusion did not.
They harass universities about it too. Which is ludicrous, because universities often have residence halls, and people who live there often download VirtualBox extensions.
It does, but it's not 100% clear if administrative employees of universities count as educational. Sure, if you are teaching a class with it, go for it; but running a VM in it for the university accounting office is not as clear.
Education might not be the same as research in this license's terms. And there are even software vendors picking nits about writing a thesis being either research or education, depending on their mood and purse fill level...
Linus is distributing the kernel, a very different beast from using a kernel module. I can't imagine Oracle targeting someone for using ZFS on Linux without first establishing that the distribution of ZFS on Linux is illegal.
Re-reading my comment in daylight, I realize I got one detail almost exactly wrong: we were always <= 2 developers, but it seems everyone understood the point anyway - we were tiny, but not too tiny for Oracle's licensing department.
Can you expand? I'm no expert - use linux daily but have always just used distro default file system. Linus' reasons for not integrating seems pretty sensible to me. Oracle certainly has form on the litigation front.
Linus' reasons for not integrating ZFS are absolutely valid and it's no doubt that ZFS can never be included in the mainline kernel. There's absolutely no debate there.
However the person he is replying to was not actually asking to have ZFS included in the mainline kernel. As noted above, that could never happen, and I believe that Linus is only bringing it up to deflect from the real issue. What they were actually asking is for Linux to revert a change that was made for no other reason than to hinder the use of ZFS.
Linux includes a system which restricts what APIs are available to each module based on the license of the module. GPL modules get the full set of APIs whereas non-GPL modules get a reduced set. This is done strictly for political reasons and has no known legal basis as far as I'm aware.
Not too long ago a change was made to reduce the visibility of a certain API required by ZFS so only GPL modules could use it. It's not clear why the change was made, but it was certainly not to improve the functionality of the kernel in any way. So the only plausible explanation to me is that it was done just to hinder the use of ZFS with Linux, which has been a hot political issue for some time now.
If I remember correctly, the reasoning for the GPL module stuff was/is that if kernel modules integrate deeply with the kernel, they fall under the GPL. So the GPL flag is basically a guideline for what kernel developers believe is safe to use from non-GPL-compatible modules.
But from what I can see, marking the "save SIMD registers" function as GPL is a blatant lie by a kernel developer that wanted to spite certain modules.
Saving and restoring registers is an astoundingly generic function. If you list all the kernel exports and sort by how much they make your work derivative, it should be near the very bottom.
You are not supposed to use FP/SSE in kernel mode.
It was always frowned upon:
> In other words: it's still very much a special case, and if the question was "can I just use FP in the kernel" then the answer is still a resounding NO, since other architectures may not support it AT ALL.
> Linus Torvalds, 2003
and these specific functions that were marked as GPL were already deprecated for well over a decade.
> You are not supposed to use FP/SSE in kernel mode.
> It was always frowned upon
Whether it's frowned upon is a completely different issue from whether it intertwines your data so deeply with the kernel that it makes your code a derivative work subject to the GPL license. Which it doesn't.
> if the question was "can I just use FP in the kernel" then the answer is still a resounding NO, since other architectures may not support it AT ALL.
It's not actually using floating point, it's using faster instructions for integer math, and it has a perfectly viable fallback for architectures that don't have those instructions. But why use the slower version when there's no real reason to?
> and these specific functions, that were marked as GPL were already deprecated for well over a decade.
But the GPL export is still there, isn't it? It's not that functionality is being removed, it's that functionality is being shifted to only have a GPL export with no license-based justification for doing so.
So what meets the criteria of being a "special case" and what doesn't? One of the examples that Linus gives is RAID checksumming. How come RAID checksumming is a special case but ZFS checksumming isn't? I don't think it has anything to do with the nature of the usage, the only problem is that the user is ZFS.
RAID checksumming is in the kernel, and when Linus says jump, the RAID folks ask back how high.
He is not going to beg people outside the kernel for permission to change something that may break their module. On the contrary, they must live with any breakage that is thrown at them.
Again, that symbol was deprecated for well over a decade. How long does it take to be allowed to remove it?
Sometimes in life we do things even though we are not explicitly obligated to do them. Nobody is asking for ZFS to get explicitly maintained support in the Linux kernel. They are simply asking for this one small inconsequential change to be reverted just this one time, since it would literally be no harm to the kernel developers to do so, and it would provide substantial benefits to any user wanting to use ZFS. Furthermore the amount of time that kernel developers have spent arguing in favour of this change has been significantly greater than the time it would have taken to just revert it.
> Again, that symbol was deprecated for well over a decade.
But not the GPL equivalent of the symbol. That symbol is not deprecated.
This is the commonly recited argument but I don't believe it was ever proven to be legally necessary. Furthermore, even if it was, it's not clear what level of integration is "too deep". So in practice, it's just a way for kernel developers to add political restrictions as they see fit.
Proven legally necessary, as in, a court ever telling anyone in that situation to stop doing it. Or even to start doing it in the first place. There's just no legal justification behind it whatsoever.
One can interpret this as something legally significant, or an embarrassing private anecdote, or nothing substantial at all, maybe even just talk. However, I'd give them the benefit of the doubt. Not least since they could be the ones up against Oracle's legal dept...
What he is referring to is the use of the GPL export restriction to strong-arm companies into releasing their code as GPL. It's nothing to do with a legal requirement, he is just an open source licensing hardhead. See: https://lwn.net/Articles/603145/
> This being "Oracle," and its litigious nature, how can you truly be aware or sure?
The functionality I'm describing has absolutely nothing to do with ZFS or Oracle in any way. If you really think the reach of Oracle is so great, then why not block all Oracle code from ever running on the OS? That seems to me to be just as justified as this change.
This wasn't a case of "purposely hindering"; rather, the ZFS module broke because of some kernel changes. The kernel is careful to never break userspace and never break its own merged modules. But if you're a third-party module then you're on your own. The kernel developers can't be responsible for maintaining compatibility with your stuff.
The changes conveniently accomplished nothing except for breaking ZFS. Furthermore, just because they don't officially support ZFS doesn't mean they must stonewall all the users who desire the improved compatibility. Reverting this small change would not be a declaration that ZFS is officially supported.
> - Performance is not that great compared to the alternatives.
CoW filesystems do trade performance for data safety. Or did you mean there are other _stable/production_ CoW filesystems with better performance? If so, please do point them out!
Isn't this a problem for any over-provisioned storage pool? You can avoid that if you want by not over-provisioning and by checking the space consumed by CoW snapshots. Also, what does ZFS do if you run out of blocks?
I have actually managed to run out of blocks on XFS on a thin LV and it's an interesting experience. XFS always survived just fine, but some files basically vanished. Looks like mostly those that were open and being written to at exhaustion time, like for example a mariadb database backing store. Files that were just sitting there were perfectly fine as far as I could tell.
Still, you definitely should never put data on a volume where a pool can be exhausted, without a backup as I don't think there is really a bulletproof way for a filesystem to handle that happening suddenly.
> Isn't this a problem for any over-provisioned storage pool?
ZFS doesn't over-provision anything by default. The only case I'm aware of where you can over-provision with ZFS is when you explicitly choose to thin provision zvols (virtual block devices with a fixed size). This can't be done with regular file systems which grow as needed, though you can reserve space for them.
File systems do handle running out of space (for a loose definition of handle) but they never expect the underlying block device to run out of space, which is what happens with over-provisioning. That's a problem common to any volume manager that allows you to over provision.
Can't you over-provision even just by creating too many snapshots? Even if you never make the filesystems bigger than the backing pool, the snapshots will allocate some blocks from the pool and over time, boom.
Snapshots can't cause over-provisioning, not for file systems. If I mutate my data and keep snapshots forever, eventually my pool will run out of free space. But that's not a problem of over-provisioning, that's just running out of space.
With ZFS, if I take a snapshot and then delete 10GB of data my file system will appear to have shrunk by 10GB. If I compare the output of df before and after deleting the data, df will tell me that "size" and "used" have decreased by 10GB while "available" remained constant. Once the snapshot is deleted that 10GB will be made available again and the "size" and "available" columns in df will increase. It avoids over-provisioning by never promising more available space than it can guarantee you're able to write.
I think you're trying to relate ZFS too much to how LVM works, where LVM is just a volume manager that exposes virtual devices. The analogue to thin provisioned LVM volumes is thin-provisioned zvols, not regular ZFS file systems. I can choose to use ZFS in place of LVM as a volume manager with XFS as my file system. Over-provisioned zvols+XFS will have functionally equivalent problems as over-provisioned LVM+XFS.
ZFS doesn't work this way. The free blocks in the ZFS pool are available to all datasets (filesystems). The datasets themselves don't take up any space up front until you add data to them. Snapshots don't take up any space initially. They only take up space when the original dataset is modified, and altered blocks are moved onto a "deadlist". Since the modification allocates new blocks, if the pool runs out of space it will simply return ENOSPC at some point. There's no possibility of over-provisioning.
ZFS has quotas and reservations. The former is a maximum allocation for a dataset. The latter is a minimum guaranteed allocation. Neither actually allocate blocks from the pool. These don't relate in any comparable way to how LVM works. They are just numbers to check when allocating blocks.
LVM thin pools had (maybe still have - I haven't used them recently) another issue though, where running out of metadata space caused the volumes in the thinpool to become corrupt and unreadable.
ZFS does overprovision all filesystems in a zpool by default. Create 10 new filesystems and 'df' will now display 10x the space of the parent fs. A full fs is handled differently than your volume manager running out of blocks. But the normal case is overprovisioning.
That's not really overprovisioning. That's just a factor of the space belonging to a zpool, but 'df' not really having a sensible way of representing that.
That is not over-provisioning, it's just that 'df' doesn't have the concept of pooled storage. With pools it's possible for different file systems to share their "available" space. BTRFS also has its own problems with output when using df and getting strange results.
If I have a 10GB pool and I create 10 empty file systems, the sizes reported in df will sum to 100GB. It's not quite a lie either, because each of those 10 file systems does in fact have 10GB of space available; I could write 10GB to any one of them. If I write 1GB to one of those file systems, the "size" and "available" spaces for the other nine will all shrink despite not having a single byte of data written to them.
With ZFS and df the "size" column is really only measuring the maximum possible size (at this point in time, assuming nothing else is written) so it isn't very meaningful, but the "used" and "available" columns do measure something useful.
In my example the sum of possible future allocations for ZFS is still only 10GB total. Each of the ten file systems, considered individually, does truthfully have 10GB available to it before any data is written. The difference is that with over-provisioning (like LVM+XFS), if I write 10GB of data to one file system the others will still report 10GB of free space, but with ZFS or BTRFS they'll report 0GB available, so I can never actually attempt to allocate 100GB of data.
You could build a pool-aware version of DF that reflects this, by grouping file systems in a pool together and reporting that the pool has 10GB available. But frankly there's not enough benefit to doing that because people with storage pools already understand summing up all the available space from df's output is not meaningful. Tools like zpool list and BTRFS's df equivalent already correctly report the total free space in the pool.
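A tiny model of that df behaviour, in case the pooled accounting is hard to picture (illustrative Python only; it models the shared-pool "size"/"avail" arithmetic, not snapshots, reservations or compression):

    # Toy model of df output for filesystems sharing one 10GB pool: each
    # filesystem's "size" is its own used space plus whatever is left in the
    # pool, so no filesystem ever promises more than can really be written.
    POOL_GB = 10.0
    used = {f"fs{i}": 0.0 for i in range(10)}  # ten empty filesystems

    def df(used):
        free = POOL_GB - sum(used.values())
        for name, u in sorted(used.items()):
            print(f"{name}: size={u + free:4.1f}G used={u:4.1f}G avail={free:4.1f}G")

    df(used)           # every fs reports size=10.0G, avail=10.0G
    used["fs0"] = 1.0  # write 1GB to fs0
    df(used)           # fs0 keeps size=10.0G; the other nine shrink to 9.0G

Summing the "size" column across the ten lines gives 100G, but the pool itself never offers more than 10G of actual free space, which is the distinction being drawn above.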
>- Using it opens you up to the threat of lawsuits from Oracle. Given history, this is a real threat. (This is one that should be high for Linus but not for me - there is no conceivable reason that Oracle would want to threaten me with a lawsuit.)
No. Distributing it (i.e. a precompiled distro with ZFS) will. You are free to run any software on your machine as you so desire.