COW Block Storage and Why Keystone Prefers ZFS
Copy-on-write filesystems represent a fundamental shift in how storage systems handle data modification. For infrastructure requiring reliability and data integrity, ZFS has become the preferred choice.
Copy-on-Write Explained
Traditional filesystems modify data in place. When you edit a file:
- Locate the block on disk
- Overwrite the existing data
- Hope nothing goes wrong mid-write
If power fails during the overwrite, you have corrupted data. Journaling filesystems add recovery logs, but the fundamental approach remains: mutate in place.
Copy-on-write (COW) takes a different approach:
- Write new data to a fresh location
- Update the pointer to reference the new location
- The old data remains until explicitly freed
The write is atomic—you either have the old data or the new data, never a corrupted mix.
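The same pattern can be sketched at the application level: write the new version to a fresh location, then atomically swap the pointer. A minimal shell analogy (paths are hypothetical; the rename is only atomic within a single filesystem):
# Write the new version next to the old one...
printf 'new contents\n' > /etc/myapp/config.tmp
# ...then swap the "pointer" with an atomic rename
mv /etc/myapp/config.tmp /etc/myapp/config
# A reader sees either the old config or the new one, never a torn write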
ZFS Architecture
ZFS combines a volume manager and filesystem into an integrated stack.
Zpools
Storage pools aggregate physical devices:
# Create a mirrored pool
zpool create tank mirror /dev/sda /dev/sdb
Pools can span multiple devices with various redundancy levels (mirror, RAIDZ1, RAIDZ2, RAIDZ3).
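For example, a RAIDZ2 layout might be created like this (pool and device names are placeholders, assuming four unused disks):
# RAIDZ2: double parity, survives the loss of any two devices
zpool create tank2 raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd
# Inspect layout and health
zpool status tank2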
Datasets
Datasets are like filesystems within a pool:
zfs create tank/home
zfs create tank/var
zfs create tank/docker
Each dataset can have its own properties (compression, quotas, snapshots).
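For example, properties can be set and read per dataset (values are illustrative):
# Enable LZ4 compression and cap the dataset at 50G
zfs set compression=lz4 tank/home
zfs set quota=50G tank/home
# Confirm the effective values
zfs get compression,quota tank/home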
Checksums
Every block is checksummed. ZFS validates data on every read. If a checksum fails and you have redundancy (mirror or RAIDZ), ZFS automatically repairs from a good copy.
Silent data corruption ("bit rot") is detected and corrected. Traditional filesystems wouldn't know the data was wrong.
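A rough sketch of how this looks in practice (reusing tank/home from above; sha256 is just one of the available algorithms):
# The default checksum (fletcher4) can be swapped for a stronger hash per dataset
zfs get checksum tank/home
zfs set checksum=sha256 tank/home
# Per-device READ/WRITE/CKSUM error counters appear in the pool status
zpool status tank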
Self-Healing
With redundancy, ZFS repairs detected corruption automatically. Regular "scrubs" proactively verify all data:
zpool scrub tank
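Scrub progress and any repaired errors are reported in the pool status; scheduling scrubs regularly (for example monthly) is standard practice:
# Shows scrub state, completion time, and per-device error counters
zpool status tank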
Why ZFS for Keystone
Keystone infrastructure prioritizes data integrity and operational reliability:
Data Integrity is Paramount
Because every block is checksummed, silent corruption is detected on read and, with redundancy, repaired rather than quietly handed to applications. For databases and critical state, this isn't optional.
Native Snapshots
Snapshots are instantaneous and space-efficient (only storing differences):
zfs snapshot tank/data@before-upgrade
# Do risky operation
# If it fails:
zfs rollback tank/data@before-upgrade
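Existing snapshots can be listed to see what is available to roll back to:
# List snapshots of tank/data, oldest first
zfs list -d 1 -t snapshot -s creation tank/data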
Send/Receive for Replication
Snapshots can be sent to remote systems:
zfs send tank/data@snap1 | ssh remote zfs recv backup/data
# Incremental:
zfs send -i @snap1 tank/data@snap2 | ssh remote zfs recv backup/data
This enables efficient backup and disaster recovery.
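A minimal periodic replication sketch built on the same commands (tank/data, backup/data, and the host name remote come from the example above; it assumes at least one earlier snapshot already exists on both sides and that tank/data has no child datasets):
# Most recent existing snapshot becomes the incremental base
PREV=$(zfs list -H -d 1 -t snapshot -o name -s creation tank/data | tail -n 1)
# Take today's snapshot and send only the delta
SNAP="tank/data@$(date +%F)"
zfs snapshot "$SNAP"
# recv -F discards stray changes on the target before applying the stream
zfs send -i "$PREV" "$SNAP" | ssh remote zfs recv -F backup/data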
Compression
Transparent compression saves space without application changes:
zfs set compression=lz4 tank/data
LZ4 is fast enough that compression often improves performance (less data to write to disk).
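The realized savings are visible per dataset:
# compressratio reports the achieved compression ratio
zfs get compressratio tank/data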
Deduplication
For specific workloads (like VM images with similar base layers), deduplication eliminates redundant blocks. Use judiciously—it requires significant RAM.
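A sketch of enabling it on one dataset (the dataset name is hypothetical):
# Turn on dedup only where duplicate blocks are actually expected
zfs set dedup=on tank/vm-images
# The pool-wide DEDUP ratio shows up in zpool list
zpool list tank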
ZFS vs Alternatives
Btrfs
Similar features (COW, snapshots, checksums), different maturity. Btrfs has had more stability issues historically, though it continues to improve. RAID5/6 equivalents are less proven than ZFS RAIDZ.
ext4 + LVM
The traditional approach: an ext4 filesystem on LVM volumes. It works and is well tested, but:
- No checksums
- Snapshots require LVM complexity
- No self-healing
XFS
Excellent performance for large files. But:
- No checksums
- No native snapshots
- Designed for a different era
Practical Considerations
Memory
ZFS uses RAM for its ARC (Adaptive Replacement Cache). Plan for 1GB base plus 1GB per TB of storage as a starting point. More is better for performance.
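Assuming OpenZFS on Linux, the ARC can be inspected and capped at runtime (the 8 GiB cap is only an example):
# Summarize ARC size and hit rates (arc_summary ships with the OpenZFS tools)
arc_summary
# Cap the ARC at 8 GiB (value in bytes); persist via a module parameter if needed
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max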
Licensing
ZFS is CDDL-licensed, which is widely considered incompatible with the Linux kernel's GPLv2. This means ZFS can't be shipped as part of the mainline kernel; in practice, you install it as an out-of-tree kernel module (OpenZFS). This is a legal/political concern more than a technical one.
NixOS Support
NixOS has excellent ZFS support:
boot.supportedFilesystems = [ "zfs" ];
boot.zfs.forceImportRoot = false;
networking.hostId = "abcd1234"; # Required for ZFS
This gives declarative pool and dataset management, automatic kernel module handling, and integration with the NixOS boot process.
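Scrubs and snapshots can be scheduled declaratively as well; a sketch (verify the option names against your NixOS release):
services.zfs.autoScrub.enable = true;    # periodic zpool scrub of all pools
services.zfs.autoSnapshot.enable = true; # rolling automatic snapshots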