FAQ — OpenZFS documentation (2024)

What is OpenZFS

OpenZFS is an outstanding storage platform that encompasses the functionality of traditional filesystems, volume managers, and more, with consistent reliability, functionality and performance across all distributions. Additional information about OpenZFS can be found in the OpenZFS Wikipedia article.

Hardware Requirements

Because ZFS was originally designed for Sun Solaris it was long considered a filesystem for large servers and for companies that could afford the best and most powerful hardware available. But since the porting of ZFS to numerous open source platforms (the BSDs, Illumos and Linux - under the umbrella organization “OpenZFS”), these requirements have been lowered.

The suggested hardware requirements are:

  • ECC memory. This isn’t really a requirement, but it’s highly recommended.

  • 8GB+ of memory for the best performance. It’s perfectly possible to run with 2GB or less (and people do), but you’ll need more if using deduplication.

Do I have to use ECC memory for ZFS?

Using ECC memory for OpenZFS is strongly recommended for enterprise environments where the strongest data integrity guarantees are required. Without ECC memory, rare random bit flips caused by cosmic rays or by faulty memory can go undetected. If this were to occur, OpenZFS (or any other filesystem) would write the damaged data to disk and be unable to automatically detect the corruption.

Unfortunately, ECC memory is not always supported by consumer grade hardware. And even when it is, ECC memory will be more expensive. For home users the additional safety brought by ECC memory might not justify the cost. It’s up to you to determine what level of protection your data requires.

Installation

OpenZFS is available for FreeBSD and all major Linux distributions. Refer to the getting started section of the wiki for links to installation instructions. If your distribution/OS isn’t listed you can always build OpenZFS from the latest official tarball.

Supported Architectures

OpenZFS is regularly compiled for the following architectures: aarch64, arm, ppc, ppc64, x86, x86_64.

Supported Linux Kernels

The notes for a given OpenZFS release will include a range of supported kernels. Point releases will be tagged as needed in order to support the stable kernel available from kernel.org. The oldest supported kernel is 2.6.32 due to its prominence in Enterprise Linux distributions.

32-bit vs 64-bit Systems

You are strongly encouraged to use a 64-bit kernel. OpenZFS will build for 32-bit systems but you may encounter stability problems.

ZFS was originally developed for the Solaris kernel which differs from some OpenZFS platforms in several significant ways. Perhaps most importantly for ZFS it is common practice in the Solaris kernel to make heavy use of the virtual address space. However, use of the virtual address space is strongly discouraged in the Linux kernel. This is particularly true on 32-bit architectures where the virtual address space is limited to 100M by default. Using the virtual address space on 64-bit Linux kernels is also discouraged but the address space is so much larger than physical memory that it is less of an issue.

If you are bumping up against the virtual memory limit on a 32-bit system you will see vmalloc allocation failures reported in your system logs. You can increase the virtual address size with the boot option vmalloc=512M.
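As a minimal sketch, on a distribution that uses GRUB 2 this boot option can be appended to the kernel command line in /etc/default/grub and the configuration regenerated (the existing command-line options and the exact regeneration command vary by distribution):

# /etc/default/grub — append vmalloc=512M to any options already present
GRUB_CMDLINE_LINUX="vmalloc=512M"

$ update-grub    # Debian/Ubuntu; other distributions use, for example, grub2-mkconfig -o /boot/grub2/grub.cfg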

However, even after making this change your system will likely not be entirely stable. Proper support for 32-bit systems is contingent upon the OpenZFS code being weaned off its dependence on virtual memory. This will take some time to do correctly but it is planned for OpenZFS. This change is also expected to improve how efficiently OpenZFS manages the ARC cache and allow for tighter integration with the standard Linux page cache.

Booting from ZFS

Booting from ZFS on Linux is possible and many people do it. There are excellent walkthroughs available for Debian, Ubuntu, and Gentoo.

On FreeBSD 13+ booting from ZFS is supported out of the box.

Selecting /dev/ names when creating a pool (Linux)

There are different /dev/ names that can be used when creating a ZFS pool. Each option has advantages and drawbacks; the right choice for your ZFS pool really depends on your requirements. For development and testing, /dev/sdX naming is quick and easy. A typical home server might prefer /dev/disk/by-id/ naming for simplicity and readability, while very large configurations with multiple controllers, enclosures, and switches will likely prefer /dev/disk/by-vdev naming for maximum control. But in the end, how you choose to identify your disks is up to you; a quick way to see which names are available on your system is shown after the list below.

  • /dev/sdX, /dev/hdX: Best for development/test pools

    • Summary: The top level /dev/ names are the default for consistency with other ZFS implementations. They are available under all Linux distributions and are commonly used. However, because they are not persistent they should only be used with ZFS for development/test pools.

    • Benefits: This method is easy for a quick test, the names are short, and they will be available on all Linux distributions.

    • Drawbacks: The names are not persistent and will change depending on what order the disks are detected in. Adding or removing hardware for your system can easily cause the names to change. You would then need to remove the zpool.cache file and re-import the pool using the new names.

    • Example: zpool create tank sda sdb

  • /dev/disk/by-id/: Best for small pools (less than 10 disks)

    • Summary: This directory contains disk identifiers with more human readable names. The disk identifier usually consists of the interface type, vendor name, model number, device serial number, and partition number. This approach is more user friendly because it simplifies identifying a specific disk.

    • Benefits: Nice for small systems with a single disk controller. Because the names are persistent and guaranteed not to change, it doesn’t matter how the disks are attached to the system. You can take them all out, randomly mix them up on the desk, put them back anywhere in the system and your pool will still be automatically imported correctly.

    • Drawbacks: Configuring redundancy groups based on physical location becomes difficult and error prone. Unreliable on many personal virtual machine setups because the software does not generate persistent unique names by default.

    • Example: zpool create tank scsi-SATA_Hitachi_HTS7220071201DP1D10DGG6HMRP

  • /dev/disk/by-path/: Good for large pools (greater than 10 disks)

    • Summary: This approach is to use device names which include the physical cable layout in the system, which means that a particular disk is tied to a specific location. The name describes the PCI bus number, as well as enclosure names and port numbers. This allows the most control when configuring a large pool.

    • Benefits: Encoding the storage topology in the name is not only helpful for locating a disk in large installations, but it also allows you to explicitly lay out your redundancy groups over multiple adapters or enclosures.

    • Drawbacks: These names are long, cumbersome, and difficult for a human to manage.

    • Example: zpool create tank pci-0000:00:1f.2-scsi-0:0:0:0 pci-0000:00:1f.2-scsi-1:0:0:0

  • /dev/disk/by-vdev/: Best for large pools (greater than 10 disks)

    • Summary: This approach provides administrative control over device naming using the configuration file /etc/zfs/vdev_id.conf. Names for disks in JBODs can be generated automatically to reflect their physical location by enclosure IDs and slot numbers. The names can also be manually assigned based on existing udev device links, including those in /dev/disk/by-path or /dev/disk/by-id. This allows you to pick your own unique meaningful names for the disks. These names will be displayed by all the zfs utilities, so they can be used to clarify the administration of a large complex pool. See the vdev_id and vdev_id.conf man pages for further details.

    • Benefits: The main benefit of this approach is that it allows you to choose meaningful human-readable names. Beyond that, the benefits depend on the naming method employed. If the names are derived from the physical path the benefits of /dev/disk/by-path are realized. On the other hand, aliasing the names based on drive identifiers or WWNs has the same benefits as using /dev/disk/by-id.

    • Drawbacks: This method relies on having a /etc/zfs/vdev_id.conf file properly configured for your system. To configure this file please refer to the section Setting up the /etc/zfs/vdev_id.conf file. As with benefits, the drawbacks of /dev/disk/by-id or /dev/disk/by-path may apply depending on the naming method employed.

    • Example: zpool create tank mirror A1 B1 mirror A2 B2

  • /dev/disk/by-uuid/: Not a great option

    • Summary: One might think from the use of “UUID” that this would be an ideal option - however, in practice, this ends up listing one device per pool ID, which is not very useful for importing pools with multiple disks.

  • /dev/disk/by-partuuid/, /dev/disk/by-partlabel/: Works only for existing partitions

    • Summary: A partition’s UUID is generated on its creation, so usage is limited.

    • Drawbacks: You can’t refer to a partition’s unique ID on an unpartitioned disk for zpool replace/add/attach, and you can’t find a failed disk easily without a mapping written down ahead of time.
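As mentioned above, a quick way to see which persistent names your system actually provides is to list the udev directories before creating the pool (a sketch; the directories and entries present depend on your hardware and udev rules):

$ ls -l /dev/disk/by-id/ /dev/disk/by-path/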

Setting up the /etc/zfs/vdev_id.conf file

In order to use /dev/disk/by-vdev/ naming the /etc/zfs/vdev_id.conf file must be configured. The format of this file is described in the vdev_id.conf man page. Several examples follow.

A non-multipath configuration with direct-attached SAS enclosures and an arbitrary slot re-mapping.

multipath     no
topology      sas_direct
phys_per_port 4

#       PCI_SLOT HBA PORT  CHANNEL NAME
channel 85:00.0  1         A
channel 85:00.0  0         B

#    Linux   Mapped
#    Slot    Slot
slot 0       2
slot 1       6
slot 2       0
slot 3       3
slot 4       5
slot 5       7
slot 6       4
slot 7       1

A SAS-switch topology. Note that the channel keyword takes only two arguments in this example.

topology      sas_switch

#       SWITCH PORT  CHANNEL NAME
channel 1            A
channel 2            B
channel 3            C
channel 4            D

A multipath configuration. Note that channel names have multiple definitions - one per physical path.

multipath yes

#       PCI_SLOT HBA PORT  CHANNEL NAME
channel 85:00.0  1         A
channel 85:00.0  0         B
channel 86:00.0  1         A
channel 86:00.0  0         B

A configuration using device link aliases.

#     by-vdev
#     name     fully qualified or base name of device link
alias d1       /dev/disk/by-id/wwn-0x5000c5002de3b9ca
alias d2       wwn-0x5000c5002def789e

After defining the new disk names run udevadm trigger to prompt udev to parse the configuration file. This will result in a new /dev/disk/by-vdev directory which is populated with symlinks to /dev/sdX names. Following the first example above, you could then create the new pool of mirrors with the following command:

$ zpool create tank \
    mirror A0 B0 mirror A1 B1 mirror A2 B2 mirror A3 B3 \
    mirror A4 B4 mirror A5 B5 mirror A6 B6 mirror A7 B7

$ zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            A0      ONLINE       0     0     0
            B0      ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            A1      ONLINE       0     0     0
            B1      ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            A2      ONLINE       0     0     0
            B2      ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            A3      ONLINE       0     0     0
            B3      ONLINE       0     0     0
          mirror-4  ONLINE       0     0     0
            A4      ONLINE       0     0     0
            B4      ONLINE       0     0     0
          mirror-5  ONLINE       0     0     0
            A5      ONLINE       0     0     0
            B5      ONLINE       0     0     0
          mirror-6  ONLINE       0     0     0
            A6      ONLINE       0     0     0
            B6      ONLINE       0     0     0
          mirror-7  ONLINE       0     0     0
            A7      ONLINE       0     0     0
            B7      ONLINE       0     0     0

errors: No known data errors

Changing /dev/ names on an existing pool

Changing the /dev/ names on an existing pool can be done by simply exporting the pool and re-importing it with the -d option to specify which new names should be used. For example, to use the custom names in /dev/disk/by-vdev:

$ zpool export tank
$ zpool import -d /dev/disk/by-vdev tank

The /etc/zfs/zpool.cache file

Whenever a pool is imported on the system it will be added to the /etc/zfs/zpool.cache file. This file stores pool configuration information, such as the device names and pool state. If this file exists when running the zpool import command then it will be used to determine the list of pools available for import. When a pool is not listed in the cache file it will need to be detected and imported using the zpool import -d /dev/disk/by-id command.
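For example (a sketch, assuming a pool named tank whose devices carry by-id names):

$ zpool import -d /dev/disk/by-id tank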

Generating a new /etc/zfs/zpool.cache file

The /etc/zfs/zpool.cache file will be automatically updated when your pool configuration is changed. However, if for some reason it becomes stale you can force the generation of a new /etc/zfs/zpool.cache file by setting the cachefile property on the pool.

$ zpool set cachefile=/etc/zfs/zpool.cache tank

Conversely the cache file can be disabled by setting cachefile=none. This is useful for failover configurations where the pool should always be explicitly imported by the failover software.

$ zpool set cachefile=none tank

Sending and Receiving Streams

hole_birth Bugs

The hole_birth feature has/had bugs, the result of which is that, if you do a zfs send -i (or -R, since it uses -i) from an affected dataset, the receiver will not see any checksum or other errors, but will not match the source.

ZoL versions 0.6.5.8 and 0.7.0-rc1 (and above) default to ignoring the faulty metadata which causes this issue on the sender side.
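As a hedged sketch, this sender-side workaround has been controlled by a zfs module parameter (named ignore_hole_birth in older ZoL releases; later releases use send_holes_without_birth_time) that can be inspected at runtime:

# A value of 1 means faulty hole_birth metadata is ignored on send
$ cat /sys/module/zfs/parameters/ignore_hole_birth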

For more details, see the hole_birth FAQ.

Sending Large Blocks

When sending incremental streams which contain large blocks (>128K) the --large-block flag must be specified. Inconsistent use of the flag between incremental sends can result in files being incorrectly zeroed when they are received. Raw encrypted send/recvs automatically imply the --large-block flag and are therefore unaffected.
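For example (a sketch; the pool, dataset, and snapshot names are hypothetical):

$ zfs send --large-block -i tank/data@snap1 tank/data@snap2 | zfs receive backup/data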

For more details, see issue 6224.

CEPH/ZFS

There is a lot of tuning that can be done that’s dependent on the workload that is being put on CEPH/ZFS, as well as some general guidelines. Some are as follows:

ZFS Configuration

The CEPH filestore back-end heavily relies on xattrs; for optimal performance all CEPH workloads will benefit from the following ZFS dataset parameters:

  • xattr=sa

  • dnodesize=auto

Beyond that, typically rbd/cephfs focused workloads benefit from a small recordsize (16K-128K), while objectstore/s3/rados focused workloads benefit from a large recordsize (128K-1M).
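For example, an OSD-backing dataset could be created with these parameters as follows (a sketch; the pool/dataset name and the 16K recordsize are illustrative assumptions for an rbd-style workload):

$ zfs create -o xattr=sa -o dnodesize=auto -o recordsize=16K tank/ceph-osd0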

CEPH Configuration (ceph.conf)

Additionally CEPH sets various values internally for handling xattrs based on the underlying filesystem. As CEPH only officially supports/detects XFS and BTRFS, for all other filesystems it falls back to rather limited “safe” values. On newer releases, the need for larger xattrs will prevent OSDs from even starting.

The officially recommended workaround (see here) has some severe downsides, and more specifically is geared toward filesystems with “limited” xattr support such as ext4.

ZFS does not have an internal limit on xattr length, so we can treat it similarly to how CEPH treats XFS. We can set overrides for 3 internal values to the same as those used with XFS (see here and here) and allow it to be used without the severe limitations of the “official” workaround.

[osd]
filestore_max_inline_xattrs = 10
filestore_max_inline_xattr_size = 65536
filestore_max_xattr_value_size = 65536

Other General Guidelines

  • Use a separate journal device. Do not colocate the CEPH journal on a ZFS dataset if at all possible; this will quickly lead to terrible fragmentation, not to mention terrible performance up front even before fragmentation (the CEPH journal does a dsync for every write).

  • Use a SLOG device, even with a separate CEPH journal device. For some workloads, skipping SLOG and setting logbias=throughput may be acceptable.

  • Use a high-quality SLOG/CEPH journal device. A consumer based SSD, or even NVMe, WILL NOT DO (Samsung 830, 840, 850, etc) for a variety of reasons. CEPH will kill them quickly, on top of the performance being quite low in this use. Generally recommended devices are [Intel DC S3610, S3700, S3710, P3600, P3700], or [Samsung SM853, SM863], or better.

  • If using a high quality SSD or NVMe device (as mentioned above), you CAN share SLOG and CEPH journal to good results on a single device. A ratio of 4 HDDs to 1 SSD (Intel DC S3710 200GB), with each SSD partitioned (remember to align!) to 4x10GB (for ZIL/SLOG) + 4x20GB (for CEPH journal), has been reported to work well.

Again - CEPH + ZFS will KILL a consumer based SSD VERY quickly. Even ignoring the lack of power-loss protection and endurance ratings, you will be very disappointed with the performance of a consumer based SSD under such a workload.

Performance Considerations

To achieve good performance with your pool there are some easy best practices you should follow.

  • Evenly balance your disks across controllers: Often the limiting factor for performance is not the disks but the controller. By balancing your disks evenly across controllers you can often improve throughput.

  • Create your pool using whole disks: When running zpool create use whole disk names. This will allow ZFS to automatically partition the disk to ensure correct alignment. It will also improve interoperability with other OpenZFS implementations which honor the wholedisk property.

  • Have enough memory: A minimum of 2GB of memory is recommended for ZFS. Additional memory is strongly recommended when the compression and deduplication features are enabled.

  • Improve performance by setting ashift=12: You may be able to improve performance for some workloads by setting ashift=12. This tuning can only be set when block devices are first added to a pool, such as when the pool is first created or when a new vdev is added to the pool. This tuning parameter can result in a decrease of capacity for RAIDZ configurations.

Advanced Format Disks

Advanced Format (AF) is a new disk format which natively uses a 4,096 byte, instead of 512 byte, sector size. To maintain compatibility with legacy systems many AF disks emulate a sector size of 512 bytes. By default, ZFS will automatically detect the sector size of the drive, but AF disks that emulate 512 byte sectors report the emulated size. This combination can result in poorly aligned disk accesses which will greatly degrade the pool performance.

Therefore, the ability to set the ashift property has been added to the zpool command. This allows users to explicitly assign the sector size when devices are first added to a pool (typically at pool creation time or adding a vdev to the pool). The ashift values range from 9 to 16 with the default value 0 meaning that zfs should auto-detect the sector size. This value is actually a bit shift value, so an ashift value for 512 bytes is 9 (2^9 = 512) while the ashift value for 4,096 bytes is 12 (2^12 = 4,096).

To force the pool to use 4,096 byte sectors at pool creation time, youmay run:

$ zpool create -o ashift=12 tank mirror sda sdb

To force the pool to use 4,096 byte sectors when adding a vdev to apool, you may run:

$ zpool add -o ashift=12 tank mirror sdc sdd
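To verify which ashift an existing pool is using, one option (a sketch, assuming a pool named tank) is to inspect the pool configuration with zdb:

$ zdb -C tank | grep ashift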

ZVOL used space larger than expected

Depending on the filesystem used on the zvol (e.g. ext4) and the usage (e.g. deletion and creation of many files) the used and referenced properties reported by the zvol may be larger than the “actual” space that is being used as reported by the consumer.

This can happen due to the way some filesystems work, in which they prefer to allocate files in new untouched blocks rather than the fragmented used blocks marked as free. This forces zfs to reference all blocks that the underlying filesystem has ever touched.

This is in itself not much of a problem, as when the used property reaches the configured volsize the underlying filesystem will start reusing blocks. But the problem arises if it is desired to snapshot the zvol, as the space referenced by the snapshots will contain the unused blocks.
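To compare these properties for a zvol (a sketch; the dataset name is hypothetical):

$ zfs get used,referenced,volsize tank/vol1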

This issue can be prevented by issuing a so-called trim (for example, with the fstrim command on Linux) to allow the kernel to tell zfs which blocks are unused.

Issuing a trim before a snapshot is taken will ensure a minimum snapshot size.

For Linux adding the discard option for the mounted ZVOL in /etc/fstab effectively enables the kernel to issue the trim commands continuously, without the need to execute fstrim on-demand.
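As a sketch (the zvol name, mountpoint, and filesystem are illustrative assumptions):

# One-off trim of a filesystem mounted on a zvol
$ fstrim /mnt/vol1

# /etc/fstab entry enabling continuous discard instead
/dev/zvol/tank/vol1  /mnt/vol1  ext4  defaults,discard  0  2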

Using a zvol for a swap device on Linux

You may use a zvol as a swap device but you’ll need to configure it appropriately.

CAUTION: for now, swap on zvol may lead to deadlock; in this case please send your logs here.

  • Set the volume block size to match your system’s page size. This tuning prevents ZFS from having to perform read-modify-write operations on a larger block while the system is already low on memory.

  • Set the logbias=throughput and sync=always properties. Data written to the volume will be flushed immediately to disk freeing up memory as quickly as possible.

  • Set primarycache=metadata to avoid keeping swap data in RAM via the ARC.

  • Disable automatic snapshots of the swap device.

$ zfs create -V 4G -b $(getconf PAGESIZE) \
    -o logbias=throughput -o sync=always \
    -o primarycache=metadata \
    -o com.sun:auto-snapshot=false rpool/swap
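You would then typically format and enable the device as swap (a sketch; the /dev/zvol path follows the usual Linux naming for the zvol created above):

$ mkswap -f /dev/zvol/rpool/swap
$ swapon /dev/zvol/rpool/swap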

Using ZFS on Xen Hypervisor or Xen Dom0 (Linux)

It is usually recommended to keep virtual machine storage and hypervisor pools quite separate, although a few people have managed to successfully deploy and run OpenZFS using the same machine configured as Dom0. There are a few caveats:

  • Set a fair amount of memory in grub.conf, dedicated to Dom0.

    • dom0_mem=16384M,max:16384M

  • Allocate no more than 30-40% of Dom0’s memory to ZFS in /etc/modprobe.d/zfs.conf.

    • options zfs zfs_arc_max=6442450944

  • Disable Xen’s auto-ballooning in /etc/xen/xl.conf (see the sketch after this list).

  • Watch out for any Xen bugs, such as this one related to ballooning.
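As a sketch of the auto-ballooning caveat above (the exact option syntax may vary between Xen releases):

# /etc/xen/xl.conf
autoballoon="off"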

udisks2 creating /dev/mapper/ entries for zvol (Linux)

To prevent udisks2 from creating /dev/mapper entries that must be manually removed or maintained during zvol remove / rename, create a udev rule such as /etc/udev/rules.d/80-udisks2-ignore-zfs.rules with the following contents:

ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_FS_TYPE}=="zfs_member", ENV{ID_PART_ENTRY_TYPE}=="6a898cc3-1dd2-11b2-99a6-080020736631", ENV{UDISKS_IGNORE}="1"

Licensing

License information can be found here.

Reporting a problem

You can open a new issue and search existing issues using the public issue tracker. The issue tracker is used to organize outstanding bug reports, feature requests, and other development tasks. Anyone may post comments after signing up for a GitHub account.

Please make sure that what you’re actually seeing is a bug and not a support issue. If in doubt, please ask on the mailing list first, and if you’re then asked to file an issue, do so.

When opening a new issue include this information at the top of the issue:

  • What distribution you’re using and the version.

  • What spl/zfs packages you’re using and the version.

  • Describe the problem you’re observing.

  • Describe how to reproduce the problem.

  • Include any warnings/errors/backtraces from the system logs.

When a new issue is opened it’s not uncommon for a developer to request additional information about the problem. In general, the more detail you share about a problem the quicker a developer can resolve it. For example, providing a simple test case is always exceptionally helpful. Be prepared to work with the developer looking into your bug in order to get it resolved. They may ask for information like the following (a sketch of commands for gathering much of it appears after this list):

  • Your pool configuration as reported by zdb or zpool status.

  • Your hardware configuration, such as

    • Number of CPUs.

    • Amount of memory.

    • Whether your system has ECC memory.

    • Whether it is running under a VMM/Hypervisor.

    • Kernel version.

    • Values of the spl/zfs module parameters.

  • Stack traces which may be logged to dmesg.
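As a hedged sketch, much of this information can be collected with standard commands (adjust the pool name and amount of log output for your system):

$ zpool status -v                       # pool configuration and errors
$ zdb -C                                # cached pool configuration
$ grep . /sys/module/zfs/parameters/*   # current zfs module parameter values
$ dmesg | tail -n 200                   # recent kernel messages / stack traces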

Does OpenZFS have a Code of Conduct?

Yes, the OpenZFS community has a code of conduct. See the Code of Conduct for details.
