ZFS Tutorial Part 1

Introduction

ZFS is an advanced, open-source filesystem available on FreeBSD, illumos, Solaris, and Linux. ZFS is quite different from traditional filesystems: the best way to understand it is with hands-on experience. This series of tutorials gives you that experience, covering pool and filesystem management, fault tolerance, quotas, compression, snapshots, clones, and caching. We'll also put ZFS to work with jails on FreeBSD and containers on Linux.

In part 1 we look at ZFS pools: the foundation of ZFS. In part 2 we will look at ZFS filesystems in more detail.

Requirements

To follow this tutorial you need an OS with good ZFS support, such as FreeBSD 9+, illumos, or Ubuntu 16.04 LTS. If you're using another Linux distro then you should install zfsonlinux v0.6.4+. Solaris users will find that most of what is said here is relevant, but keep in mind that the Solaris implementation is a little different.

If you're running Ubuntu 16.04 LTS (Xenial Xerus) your kernel has ZFS support, but you still need to install the command-line tools:

$ sudo apt-get install zfsutils-linux

If you're not sure whether your system supports ZFS, just run the zfs command. If it works, you should be ready to go.
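
For example, listing datasets is a harmless check (the exact output depends on whether any pools already exist, and on Ubuntu you may need sudo):

$ zfs list
no datasets available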

NB. FreeBSD/PC-BSD 10.1 users on SSD should make sure they're on 10.1-RELEASE-p11 or greater to avoid a kernel panic. See FreeBSD-EN-15:07.zfs for details.

Privileges

You need root privileges to create or manage ZFS pools. If a command is shown with the # prompt then it needs to be run as root or with sudo. NB. On Ubuntu all ZFS commands need root privileges.

Disk Files

To allow you to experiment safely, we use files instead of disks: that way you can use any system, such as a laptop or a virtual machine, to build and break multi-disk configurations. In this tutorial we use files to represent five disks, so go ahead and create them now:

$ mkdir /tmp/zfstut
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/disk1
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/disk2
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/disk3
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/disk4
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/sparedisk
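
NB. If you're on Linux, GNU dd spells the block size with a capital M, so use bs=1M in these commands (the lowercase form above is the BSD convention), for example:

$ dd bs=1M count=256 if=/dev/zero of=/tmp/zfstut/disk1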

$ ls -lh /tmp/zfstut
total 3
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 disk1
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 disk2
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 disk3
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 disk4
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 sparedisk

ZFS Pools

All ZFS filesystems live in a pool and share its resources. A pool consists of one or more disks. The first step in using ZFS is to create a pool. ZFS pools are administered using the zpool command.

Before creating new pools you should check for existing pools on your system:

$ zpool list 
no pools available

On systems that already use ZFS you will see any existing pools, like this:

$ zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot  159G  1.29G   158G         -     0%     0%  1.00x  ONLINE  -

We'll give our pools distinctive names to avoid confusion with system pools. In subsequent output we'll only show the pools used in the tutorial.

Single Disk Pool

The simplest pool consists of a single disk. Make sure you've created your disk files in /tmp/zfstut (see above). Then create a pool using one of the following two commands:

If you're running as root run:

# zpool create herring /tmp/zfstut/disk1

If you're using sudo then run:

$ sudo zpool create herring /tmp/zfstut/disk1

From now on we'll show commands without sudo, so don't forget to add it if required.

List basic information about your pool:

$ zpool list herring
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
herring   240M    50K   240M         -     1%     0%  1.00x  ONLINE  -

You now have a working pool, complete with a ZFS filesystem mounted at /herring (we will learn about adjusting mount points in part 2).
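
You can confirm the filesystem and where it's mounted with the zfs command (we'll look at zfs list properly in part 2); the MOUNTPOINT column should show /herring:

$ zfs list herring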

Create an empty file in your new filesystem:

$ touch /herring/helloworld
$ ls /herring
helloworld

It's worth pausing here to think about how much happened with that one zpool create command. ZFS created a storage pool, created a filesystem on it, and mounted it for you. You didn't need to mess around with a RAID manager, create or format a filesystem, or set mount points: it's all done for you.

In part 2 we'll create multiple filesystems within one pool, but for now let's explore pools in more detail.

Create a large file in the herring filesystem:

$ dd bs=1m count=64 if=/dev/random of=/herring/foo
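
If you're following along on Linux, note that /dev/random may block long before producing 64M of data; substitute /dev/urandom (and bs=1M), here and in the later /dev/random commands:

$ dd bs=1M count=64 if=/dev/urandom of=/herring/foo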

$ ls -lh /herring
total 65609
-rw-r--r--  1 flux  wheel    64M Jan 23 17:55 foo
-rw-r--r--  1 flux  wheel     0B Jan 23 17:54 helloworld

$ zpool list herring
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
herring   240M  64.2M   176M         -    21%    26%  1.00x  ONLINE  -

The new file is using about a quarter of the pool capacity (indicated by the CAP value). If you run the list command before ZFS has finished writing to the disk you will see lower ALLOC and CAP values than shown; wait a few moments and try again.
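
If you'd rather not wait, forcing a sync should flush the pending writes before you re-run zpool list:

$ sync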

Now destroy your pool with zpool destroy:

# zpool destroy herring
$ zpool list herring
cannot open 'herring': no such pool

Note that zpool destroy doesn't ask for confirmation; you'll only receive a warning if files on the pool are in use. We'll see in a later tutorial how you can recover a pool you've accidentally destroyed.
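
If you're curious now, destroyed pools that haven't yet been overwritten show up when you ask zpool import to look for them:

# zpool import -D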

Mirrored Pool

A pool composed of a single disk doesn't offer any redundancy: if the disk fails our data is lost. One method of providing redundancy is to create a pool out of a mirrored pair of disks. This is analogous to RAID 1.

# zpool create trout mirror /tmp/zfstut/disk1 /tmp/zfstut/disk2
$ zpool list trout
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
trout   240M    50K   240M         -     1%     0%  1.00x  ONLINE  -

You can see more detail on a pool with the status command:

$ zpool status trout
  pool: trout
 state: ONLINE
  scan: none requested
config:

  NAME                   STATE     READ WRITE CKSUM
  trout                  ONLINE       0     0     0
    mirror-0             ONLINE       0     0     0
      /tmp/zfstut/disk1  ONLINE       0     0     0
      /tmp/zfstut/disk2  ONLINE       0     0     0

errors: No known data errors

We can see our pool contains one mirror of two disks. Let's create a file and see how ALLOC changes:

$ dd bs=1m count=64 if=/dev/random of=/trout/foo

$ zpool list trout
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
trout   240M  64.2M   176M         -    20%    26%  1.00x  ONLINE  -

As before, about a quarter of the pool has been used, but the data is now stored redundantly across both disks.

Mirror Resilience

If it isn't tested, it doesn't work. Let's put ZFS mirroring to the test by overwriting part of one disk with random data:

$ dd bs=1m seek=10 count=1 conv=notrunc if=/dev/random of=/tmp/zfstut/disk1

ZFS will spot the damaged data when we try to access it, but we can force an immediate check by scrubbing the pool:

# zpool scrub trout
$ zpool status trout
  pool: trout
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
  attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
  using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 1.12M in 0h0m with 0 errors on Sat Jan 23…
config:

  NAME                   STATE     READ WRITE CKSUM
  trout                  ONLINE       0     0     0
    mirror-0             ONLINE       0     0     0
      /tmp/zfstut/disk1  ONLINE       0     0     9
      /tmp/zfstut/disk2  ONLINE       0     0     0

errors: No known data errors

The disk file is fine; it's just some of its data that's damaged, so a clear should do the trick:

# zpool clear trout
$ zpool status trout
  pool: trout
 state: ONLINE
  scan: scrub repaired 1.12M in 0h0m with 0 errors on Sat Jan 23…
config:

  NAME                   STATE     READ WRITE CKSUM
  trout                  ONLINE       0     0     0
    mirror-0             ONLINE       0     0     0
      /tmp/zfstut/disk1  ONLINE       0     0     0
      /tmp/zfstut/disk2  ONLINE       0     0     0

errors: No known data errors

Our pool is back to a healthy state: the data was repaired using the other disk in the mirror.
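
For extra reassurance you can read the file back in full; it should complete without any new errors appearing in zpool status:

$ cat /trout/foo > /dev/null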

That's all very well if the data is corrupted, but what about a disk failure? We can simulate a whole disk failure by truncating the disk file and running another scrub:

$ echo > /tmp/zfstut/disk1

# zpool scrub trout
$ zpool status trout
  pool: trout
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
  invalid.  Sufficient replicas exist for the pool to continue
  functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Jan 23…
config:

   NAME                     STATE     READ WRITE CKSUM
   trout                    DEGRADED     0     0     0
     mirror-0               DEGRADED     0     0     0
       1905457780109944468  UNAVAIL      0     0     0  was /tmp/zfstut/disk1
       /tmp/zfstut/disk2    ONLINE       0     0     0

errors: No known data errors

The disk file we truncated is showing as unavailable, but no data errors are reported for the pool as a whole. We can still read and write to the pool:

$ dd bs=1m count=64 if=/dev/random of=/trout/bar 

$ ls -lh /trout
total 131165
-rw-r--r--  1 root  wheel    64M Jan 23 18:17 bar
-rw-r--r--  1 root  wheel    64M Jan 23 18:04 foo

To maintain redundancy we should replace the broken disk with another:

# zpool replace trout /tmp/zfstut/disk1 /tmp/zfstut/sparedisk

Check to see if our pool is healthy again:

$ zpool status trout
  pool: trout
 state: ONLINE
  scan: resilvered 128M in 0h0m with 0 errors on Sat Jan 23…
config:

  NAME                       STATE     READ WRITE CKSUM
  trout                      ONLINE       0     0     0
    mirror-0                 ONLINE       0     0     0
      /tmp/zfstut/sparedisk  ONLINE       0     0     0
      /tmp/zfstut/disk2      ONLINE       0     0     0

errors: No known data errors

If you are quick enough, or your device is slow enough, you may catch the resilvering in progress. Resilvering is analogous to remirroring in traditional RAID, but it only copies blocks that contain data: this can save many hours on large magnetic disks.

Adding to a Mirrored Pool

You can add disks to a pool without taking it offline. Let's double the size of our trout pool by adding a second mirror:

$ zpool list trout
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
trout   240M   128M   112M         -    40%    53%  1.00x  ONLINE  -

# zpool add trout mirror /tmp/zfstut/disk3 /tmp/zfstut/disk4

$ zpool list trout
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
trout   480M   128M   352M         -    20%    26%  1.00x  ONLINE  -

This happens almost instantly, and the filesystems within the pool remain available during the addition. Looking at the status now shows the pool consists of two mirrors:

$ zpool status trout
  pool: trout
 state: ONLINE
  scan: resilvered 128M in 0h0m with 0 errors on Sat Jan 23…
config:

  NAME                       STATE     READ WRITE CKSUM
  trout                      ONLINE       0     0     0
    mirror-0                 ONLINE       0     0     0
      /tmp/zfstut/sparedisk  ONLINE       0     0     0
      /tmp/zfstut/disk2      ONLINE       0     0     0
    mirror-1                 ONLINE       0     0     0
      /tmp/zfstut/disk3      ONLINE       0     0     0
      /tmp/zfstut/disk4      ONLINE       0     0     0

errors: No known data errors

We can examine the distribution of data across our two mirrors:

$ zpool iostat -v trout
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
trout                       128M   352M      4      5   480K   144K
  mirror                    128M   112M      4      5   480K   144K
    /tmp/zfstut/sparedisk      -      -      0     26    351  1.90M
    /tmp/zfstut/disk2          -      -      4      5   484K   151K
  mirror                     11K   240M      0      0      0    488
    /tmp/zfstut/disk3          -      -      0      1    509  49.8K
    /tmp/zfstut/disk4          -      -      0      1    509  49.8K
-------------------------  -----  -----  -----  -----  -----  -----

All the data is still on the first mirror and none is on the second. This makes sense, as the second pair of disks was added after the data was written, and ZFS doesn't move existing data around.

However, if we write some new data to the pool the new mirror will be used:

$ dd bs=1m count=128 if=/dev/random of=/trout/quuxx

$ zpool iostat -v trout
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
trout                       256M   224M      3      7   441K   260K
  mirror                    178M  62.4M      3      6   441K   182K
    /tmp/zfstut/sparedisk      -      -      0     18    159  1.19M
    /tmp/zfstut/disk2          -      -      3      6   444K   188K
  mirror                   78.8M   161M      0      9      0   614K
    /tmp/zfstut/disk3          -      -      0     10    185   633K
    /tmp/zfstut/disk4          -      -      0     10    185   633K
-------------------------  -----  -----  -----  -----  -----  -----

Note how more of the new data has been written to the new mirror than to the old: ZFS tries to make the best use of all the resources in the pool. As more writes occur, the mirrors will gradually move towards balance.

Finally we should destroy the trout pool and remove the disk files:

# zpool destroy trout
$ rm -r /tmp/zfstut
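
A final listing should confirm the trout pool is gone (any system pools will of course remain):

$ zpool list trout
cannot open 'trout': no such pool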

Conclusion

That's it for part 1. I hope it has given you a taste of the power of ZFS and a solid foundation in ZFS pools. In part 2 we will look at managing ZFS filesystems, including properties, quotas, and compression. ♆