COW Block Storage and Why Keystone Prefers ZFS

Copy-on-write filesystems represent a fundamental shift in how storage systems handle data modification. For infrastructure that requires reliability and data integrity, ZFS has become Keystone's preferred choice.

Copy-on-Write Explained

Traditional filesystems modify data in place. When you edit a file:

  1. Locate the block on disk
  2. Overwrite the existing data
  3. Hope nothing goes wrong mid-write

If power fails during step 2, you have corrupted data. Journaling filesystems add recovery logs, but the fundamental approach remains: mutate in place.

Copy-on-write (COW) takes a different approach:

  1. Write new data to a fresh location
  2. Update the pointer to reference the new location
  3. The old data remains until explicitly freed

The write is atomic—you either have the old data or the new data, never a corrupted mix.

ZFS Architecture

ZFS combines a volume manager and filesystem into an integrated stack.

Zpools

Storage pools aggregate physical devices:

# Create a mirrored pool
zpool create tank mirror /dev/sda /dev/sdb

Pools can span multiple devices with various redundancy levels (mirror, RAIDZ1, RAIDZ2, RAIDZ3).
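
As an illustration, RAIDZ2 keeps data available through the loss of any two devices in the vdev. The pool name and device paths below are placeholders; in production, stable /dev/disk/by-id paths are preferable to /dev/sdX names.

# Create a six-disk RAIDZ2 pool (any two disks can fail)
zpool create bulk raidz2 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh

# Inspect layout, capacity, and health
zpool status bulk
zpool list bulk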

Datasets

Datasets are like filesystems within a pool:

zfs create tank/home
zfs create tank/var
zfs create tank/docker

Each dataset can have its own properties (compression, quotas, snapshots).
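
For example, compression and quotas are set per dataset and inherited by children; the values here are illustrative:

# Per-dataset properties
zfs set compression=lz4 tank/docker
zfs set quota=100G tank/home

# Show effective values and where they are inherited from
zfs get compression,quota tank/home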

Checksums

Every block is checksummed. ZFS validates data on every read. If a checksum fails and you have redundancy (mirror or RAIDZ), ZFS automatically repairs from a good copy.

Silent data corruption ("bit rot") is detected and corrected. Traditional filesystems wouldn't know the data was wrong.
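
Detected errors surface as per-device counters; a quick way to check, using the pool from the earlier examples:

# The CKSUM column counts blocks that failed verification
zpool status -v tank

# The checksum algorithm is itself a dataset property (fletcher4 by default)
zfs get checksum tank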

Self-Healing

With redundancy, ZFS repairs detected corruption automatically. Regular "scrubs" proactively verify all data:

zpool scrub tank
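
A scrub runs in the background; progress and the result of the last scrub appear in the pool status. Scheduling scrubs regularly (weekly or monthly) is common practice; the cron entry below is one way to do it, and the zpool path may vary by distribution.

# Check scrub progress and the date of the last completed scrub
zpool status tank

# Example cron entry: scrub every Sunday at 02:00
0 2 * * 0 /usr/sbin/zpool scrub tank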

Why ZFS for Keystone

Keystone infrastructure prioritizes data integrity and operational reliability:

Data Integrity is Paramount

Because every block is checksummed, silent corruption is detected the moment the data is read and, with redundancy, repaired before it reaches the application. For databases and critical state, this isn't optional.

Native Snapshots

Snapshots are instantaneous and space-efficient (only storing differences):

zfs snapshot tank/data@before-upgrade
# Do risky operation
# If it fails:
zfs rollback tank/data@before-upgrade
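
Snapshots are listed like datasets, and the USED column shows only the space each one holds exclusively; once the upgrade is confirmed good, the snapshot can be removed:

# List snapshots under tank/data with their exclusive space usage
zfs list -t snapshot -r tank/data

# Clean up when no longer needed
zfs destroy tank/data@before-upgrade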

Send/Receive for Replication

Snapshots can be sent to remote systems:

zfs send tank/data@snap1 | ssh remote zfs recv backup/data
# Incremental:
zfs send -i @snap1 tank/data@snap2 | ssh remote zfs recv backup/data

This enables efficient backup and disaster recovery.
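
A commonly used variation is a recursive replication stream, which carries a dataset together with its snapshots and child datasets; the target names follow the example above.

# Replicate tank/data, its children, and their snapshots in one stream
# -F rolls the target back to its latest snapshot before applying the stream
zfs send -R tank/data@snap2 | ssh remote zfs recv -F backup/data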

Compression

Transparent compression saves space without application changes:

zfs set compression=lz4 tank/data

LZ4 is fast enough that compression often improves performance (less data to write to disk).
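
Compression applies only to data written after the property is set; the achieved ratio is visible as a read-only property:

# compressratio shows how well already-written data compressed
zfs get compression,compressratio tank/data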

Deduplication

For specific workloads (like VM images with similar base layers), deduplication eliminates redundant blocks. Use judiciously—it requires significant RAM.
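
A sketch of enabling it on a single dataset (the dataset name is hypothetical) and checking what it actually saves:

# Enable deduplication only where it pays off, e.g. a VM image dataset
zfs set dedup=on tank/vm-images

# Realized deduplication ratio for the whole pool
zpool get dedupratio tank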

ZFS vs Alternatives

Btrfs

Btrfs offers a similar feature set (COW, snapshots, checksums) but a different maturity record. It has had more stability issues historically, though it continues to improve, and its RAID5/6 modes are less proven than ZFS RAIDZ.

ext4 + LVM

The traditional approach: an ext4 filesystem on LVM volumes. It works and is well tested, but:

  • No checksums
  • Snapshots require LVM complexity (see the sketch after this list)
  • No self-healing
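
For contrast, a minimal sketch of the LVM snapshot workflow, assuming a volume group vg0 with a logical volume named data (both hypothetical): copy-on-write space must be reserved up front, and the snapshot is invalidated if that space fills up.

# Reserve 5G of copy-on-write space for the snapshot
lvcreate --snapshot --name data-snap --size 5G /dev/vg0/data

# The snapshot is a separate block device that still has to be mounted by hand
mount /dev/vg0/data-snap /mnt/snapshot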

XFS

Excellent performance for large files. But:

  • No checksums
  • No native snapshots
  • Designed for a different era, before end-to-end checksums and built-in snapshots were expected of a filesystem

Practical Considerations

Memory

ZFS uses RAM for its ARC (Adaptive Replacement Cache). Plan for 1GB base plus 1GB per TB of storage as a starting point. More is better for performance.
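
On Linux, the ARC ceiling is set through the zfs_arc_max module parameter (in bytes), and the current ARC size is exposed through kstats; the 8 GiB value below is only an example. On NixOS the same parameter can also be set declaratively, for instance via boot.kernelParams or boot.extraModprobeConfig.

# /etc/modprobe.d/zfs.conf -- cap the ARC at 8 GiB (8 * 1024^3 bytes)
options zfs zfs_arc_max=8589934592

# Current ARC size in bytes (OpenZFS on Linux)
awk '$1 == "size"' /proc/spl/kstat/zfs/arcstats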

Licensing

ZFS is CDDL-licensed, which is widely considered incompatible with the Linux kernel's GPLv2, so ZFS can't be distributed as part of the mainline kernel. In practice, you install it as an out-of-tree kernel module (OpenZFS). This is a legal/political concern more than a technical one.

NixOS Support

NixOS has excellent ZFS support:

boot.supportedFilesystems = [ "zfs" ];
boot.zfs.forceImportRoot = false;
networking.hostId = "abcd1234";  # Required for ZFS

Declarative pool and dataset management, automatic kernel module handling, and integration with the NixOS boot process.
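
A few options that commonly accompany this; the option names are as exposed by current NixOS modules, and the values are illustrative:

# Scrub all pools on a schedule
services.zfs.autoScrub.enable = true;

# Periodic automatic snapshots via zfs-auto-snapshot
services.zfs.autoSnapshot.enable = true;

# Import pools that are not required for boot
boot.zfs.extraPools = [ "tank" ];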