
(1)

ZFS: Love Your Data

Neal H. Walfield

RMLL, 6 July 2015

(2)

ZFS Features

- Security
  - End-to-End consistency via checksums
  - Self Healing
  - Copy on Write Transactions
  - Additional copies of important data
  - Snapshots and Clones
  - Simple, Incremental Remote Replication
- Easier Administration
  - One shared pool rather than many statically-sized volumes
- Performance Improvements
  - Hierarchical Storage Management (HSM)
  - Pooled Architecture ⇒ shared IOPS
  - Developed for many-core systems
- Scalable
  - Pool Address Space: 2^128 bytes
  - O(1) operations
  - Fine-grained locks

(3)

Hard Drive Errors

(4)

Silent Data Corruption

- Data errors that are not caught by the hard drive
  - ⇒ A read returns different data from what was written
- On-disk data is protected by ECC
  - But ECC doesn't catch or correct all errors

(6)

Uncorrectable Errors

By Cory Doctorow, CC BY-SA 2.0

- Reported as BER (Bit Error Rate)
- According to data sheets:
  - Desktop: 1 corrupted bit per 10^14 bits (≈ 12 TB)
  - Enterprise: 1 corrupted bit per 10^15 bits (≈ 120 TB)
- Practice: 1 corrupted sector per 8 to 20 TB

Jeff Bonwick and Bill Moore, ZFS: The Last Word in File Systems, 2008
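
As a back-of-the-envelope check (not from the slides), the data-sheet BER does line up with the quoted capacity: one bad bit per 10^14 bits is roughly one error for every 12 TB read, since

\[
10^{14}\ \text{bits} \times \frac{1\ \text{byte}}{8\ \text{bits}} = 1.25 \times 10^{13}\ \text{bytes} \approx 12.5\ \text{TB}.
\]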

(7)

Types of Errors

By abdallahh, CC BY-SA 2.0

- Bit Rot
- Phantom writes
- Misdirected read / write
  - 1 per 10^8 to 10^9 I/Os
  - ⇒ 1 error per 50 to 500 GB (assuming 512-byte I/Os)
- DMA Parity Errors
- Software / Firmware Bugs
- Administration Errors

Ulrich Gräf: ZFS Internal Structures, OSDevCon, 2009

(8)

Errors are Occurring More Frequently!

By Ilya, CC BY-SA 2.0

- The error rate has remained constant
- Capacity has increased exponentially
- ⇒ many more errors per unit time!

(9)

Summary

- Uncorrectable errors are unavoidable
- More ECC is not going to solve the problem
- Solution:
  - Recognize errors
  - Correct errors

(10)

A Look at ZFS' Features

(11)

End to End Data Consistency

[Diagram: a block tree R → T1, T2 → L1, L2, L3, L4; each block pointer carries the checksum of the block it points to (072F, C0D7, 923F, E85F, 655D, 2FC6 in the figure)]

- Every block has a checksum
- The checksum is stored with the pointer
  - Not next to the data!
  - Protects against phantom writes, etc.
- The blocks form a Merkle tree
  - (The root is secured using multiple copies.)
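
Checksums are enabled by default for every dataset; as a small illustration (a sketch, with a hypothetical dataset name tank/data), the algorithm is a per-dataset property that can be inspected or changed:

$ zfs get checksum tank/data
$ sudo zfs set checksum=sha256 tank/data   # use SHA-256 instead of the default fletcher4

Changing the property only affects blocks written afterwards; existing blocks keep the checksum they were written with.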

(12)

Self Healing

[Diagram: a mirror vdev with two disks, D1 and D2]

- The checksum is verified when data is read
- On a mismatch, ZFS reads a different copy
- The bad copy is automatically corrected
- zpool scrub checks all of the data in the pool
  - Should be run once or twice a month (see the cron sketch below)
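
One common way to schedule regular scrubs (a sketch under assumptions: the pool name tank, the schedule, and the path /sbin/zpool are all system-dependent) is a cron entry:

# /etc/cron.d/zfs-scrub -- scrub the pool "tank" at 03:00 on the 1st and 15th of each month
0 3 1,15 * * root /sbin/zpool scrub tank

zpool scrub runs in the background; progress and any repairs show up in zpool status.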

(17)

Self Healing: Demo

$ dd if=/dev/zero of=disk1 count=1M bs=1k
$ dd if=/dev/zero of=disk2 count=1M bs=1k
$ sudo zpool create test mirror $(pwd)/disk1 $(pwd)/disk2
$ sudo zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        test                    ONLINE       0     0     0
          mirror-0              ONLINE       0     0     0
            /home/us/tmp/disk1  ONLINE       0     0     0
            /home/us/tmp/disk2  ONLINE       0     0     0

errors: No known data errors

Initialize a mirrored pool. Status: healthy.

(18)

Self Healing Demo: Die! Die! Die!

$ sudo cp /usr/bin/s* /test
$ sudo zpool export test
$ dd if=/dev/zero of=disk1 conv=notrunc bs=4k count=100k
$ sudo zpool import test -d .
$ sudo zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
        An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        test                    ONLINE       0     0     0
          mirror-0              ONLINE       0     0     0
            /home/us/tmp/disk1  ONLINE       0     0    18
            /home/us/tmp/disk2  ONLINE       0     0     0

errors: No known data errors

Copy some data, corrupt disk1, re-import: checksum errors are reported on disk1.

(19)

Self Healing: No application errors!

$ md5sum /test/* >/dev/null
$ sudo zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
        An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        test                    ONLINE       0     0     0
          mirror-0              ONLINE       0     0     0
            /home/us/tmp/disk1  ONLINE       0     0    47
            /home/us/tmp/disk2  ONLINE       0     0     0

errors: No known data errors

Reading the data back succeeds, but the checksum error count keeps growing.

(20)

Self Healing: Clean Up

$ sudo zpool scrub test
$ sudo zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
        An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 4.33M in 0h0m with 0 errors on Wed Jul 30 14:22:20 2014
config:

        NAME                    STATE     READ WRITE CKSUM
        test                    ONLINE       0     0     0
          mirror-0              ONLINE       0     0     0
            /home/us/tmp/disk1  ONLINE       0     0   209
            /home/us/tmp/disk2  ONLINE       0     0     0

errors: No known data errors

$ sudo zpool clear test

Scrub the disks: all errors repaired. Then clear the error counters.

(21)

Self Healing: Summary

- Self Healing is not possible with RAID
- The RAID / Volume Manager / FS layering is wrong!
  - The volume manager was a hack from the 80s!
  - Don't modify the FS to support multiple disks
  - Add a layer of indirection
  - The FS talks to the volume manager
  - The volume manager has the same interface as a block device

(22)

Copy on Write Transactions

[Diagram: block tree R → T1, T2 → L1–L4; modifying L2 allocates new blocks L2', T1' and a new root R' instead of overwriting the old ones]

- Live data is never overwritten
- To change a block, a new block is allocated
- The pointer is updated in the same manner
- ⇒ Data on disk is always consistent
- Unreferenced blocks are freed at the end of the transaction

(26)

Additional Copies of Important Data

- Metadata is saved multiple times
  - FS metadata: twice
  - Pool metadata: three times
  - Uberblock: four times per physical disk
- Metadata accounts for just 1% to 2% of the disk space
- It is also possible to store multiple copies of normal data
  - (Good for laptops; see the example below.)
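
Raising the number of copies of ordinary file data is a per-dataset property; a minimal sketch, assuming a hypothetical dataset tank/home:

$ sudo zfs set copies=2 tank/home   # every block written from now on is stored twice
$ zfs get copies tank/home

The setting only applies to data written after the change, and the extra copies consume the corresponding amount of extra space.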

(27)

Snapshots and Clones

[Diagram: block tree R → T1, T2 → L1–L4, with birth times 10, 20, 30 recorded in the block pointers]

- The birth time is the transaction number during which the block was allocated
- Blocks don't have reference counts

(28)

Snapshots and Clones

[Diagram: the same block tree with birth times 10, 20, 30]

- To delete a block:
  - Its birth time is compared with that of the most recent snapshot
  - Larger (born after the snapshot)? The block is unreferenced and can be freed.
  - Smaller (born before the snapshot)? The block is still referenced.
    - (It is added to the snapshot's so-called dead list.)

(29)

Snapshots and Clones

[Diagram: the same block tree with birth times 10, 20, 30]

- To delete a snapshot:
  - The birth times can again be used
  - See https://blogs.oracle.com/ahrens/entry/is_it_magic

(30)

Snapshot Demo

$ sudo touch /test/a
$ sudo zfs snapshot test@1
$ sudo rm /test/a
$ ls /test/.zfs/snapshot/1
a
$ ls /test/
$ sudo zfs diff test@1 test
M       /test/
-       /test/a

Create a snapshot, modify the data, browse the snapshot, diff.
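
Instead of copying individual files back out of /test/.zfs/snapshot/, the whole dataset can also be reverted (a sketch continuing the demo above; note that a rollback discards everything written after the snapshot):

$ sudo zfs rollback test@1
$ ls /test/
a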

(31)

Simple, Incremental Remote Replication

[Diagram: block tree with birth times 10, 20, 30]

- Incremental replication (spelled out below):
  - zfs send [base snapshot] snapshot | ssh zfs receive fs
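
Spelled out concretely (a sketch, not from the slides; the host "backup" and the dataset names are assumptions), one incremental replication run looks like this:

$ sudo zfs snapshot tank/data@tuesday
$ sudo zfs send -i tank/data@monday tank/data@tuesday | ssh backup zfs receive backup/data

Only blocks born after the base snapshot @monday are transferred.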

(32)

Simple, Incremental Remote Replication

[Diagram: block tree with birth times 10, 20, 30]

- Much faster than rsync
  - rsync needs to stat all files
  - When comparing the trees, ZFS can aggressively prune subtrees
- Replication often takes just a few seconds
  - Even with multi-TB data sets
  - ⇒ possible to replicate every minute

(33)

Encrypted Off-Site Backups

[Diagram: locally, an FS sits on LUKS on ZFS on the disks; zfs send replicates to a remote stack of LUKS on ZFS on disks]

- Yes, this works: use a LUKS container inside a ZFS dataset (sketch below)
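
One way to build this (a sketch under assumptions: the pool name tank, a 100 GB zvol, and the remote host and pool names; other layerings are possible):

$ sudo zfs create -V 100G tank/secure              # a zvol, exposed as /dev/zvol/tank/secure
$ sudo cryptsetup luksFormat /dev/zvol/tank/secure
$ sudo cryptsetup luksOpen /dev/zvol/tank/secure secure
$ sudo mkfs.ext4 /dev/mapper/secure
$ sudo mount /dev/mapper/secure /mnt/secure
$ sudo zfs snapshot tank/secure@backup1            # later: replicate only ciphertext off site
$ sudo zfs send tank/secure@backup1 | ssh remote zfs receive backup/secure

The remote side only ever stores the encrypted LUKS container, so the backup host does not need to be trusted with the key.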

(34)

Simpler Administration

[Diagram: left — filesystems FS1–FS4 each on its own statically sized volume over disks D1, D2; right — FS1–FS4 sharing one pool over D1, D2]

- A shared pool instead of many statically sized volumes
- Idea: administer the hard drives like RAM
  - ⇒ mostly nothing to do
  - Occasionally impose some quotas or reservations (example below)
- Bonus: shared IOPS!
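
Quotas and reservations are ordinary dataset properties; a minimal sketch with hypothetical dataset names:

$ sudo zfs create tank/home
$ sudo zfs create tank/vm
$ sudo zfs set quota=100G tank/home        # tank/home may never use more than 100 GB
$ sudo zfs set reservation=20G tank/vm     # 20 GB of the pool is guaranteed to tank/vm

Everything else keeps drawing from the shared pool as needed.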

(35)

Hierarchical Storage Management (HSM)

- SSD as a Read Cache
  - Write-through cache
  - ⇒ An error can influence performance, but not correctness
  - ⇒ An MLC SSD is sufficient
  - Automatically managed, like the page cache (see the command below)
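
Attaching the read cache (known as the L2ARC) is a single command; a sketch, assuming the pool is called tank and the SSD shows up as /dev/sdc:

$ sudo zpool add tank cache /dev/sdc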

(36)

Hierarchical Storage Management (HSM)

- SSD as a Write Cache (see the command below)
  - Absorbs synchronous writes
  - Latency goes from 4-6 ms to 50 µs (a factor of ~100)
  - 1 GB of space is sufficient
  - Very write-intense!
    - ⇒ Use an SLC or eMLC SSD
    - Or over-provision a huge MLC SSD
- If the SSD dies:
  - Data will only be lost if the server also crashes
  - The data is still buffered in memory
  - The SSD is only read after a system crash, to commit the outstanding transactions
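
Adding the dedicated log device (the SLOG, which holds the ZIL) is again one command; a sketch with assumed device names, mirrored so that losing a single SSD does not put the in-flight synchronous writes at risk:

$ sudo zpool add tank log mirror /dev/sdd /dev/sde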

(37)

ZFS Pool

[Diagram: a pool striping across two RAIDZ groups, one of disks D1–D3 and one of disks D4–D6]

- ZFS combines (stripes) hard drive groups (logical vdevs)
- Groups consist of multiple drives (physical vdevs)
  - Mirrors
  - RAIDZ
- Easy to grow an existing pool: just add another group (example below)
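
For example (a sketch; the pool name and disk names are assumptions), a pool built from one RAIDZ group can later be grown by striping in a second one:

$ sudo zpool create tank raidz /dev/sda /dev/sdb /dev/sdc
$ sudo zpool add tank raidz /dev/sdd /dev/sde /dev/sdf
$ sudo zpool list tank

New writes are spread across both groups; existing data is not rebalanced automatically.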

(38)

Hardware Tips: ECC

- Use ECC memory!
- ZFS trusts that the RAM is error-free!
- Google study: 4k errors per year per DIMM!

http://www.zdnet.com/blog/storage/dram-error-rates-nightmare-on-dimm-street/638

(39)

Hardware Tips: Diversity

Diversity inhibits correlated errors!

- Use hard drives from different manufacturers
- Connect drives from the same vdev to different controllers
- etc.

(40)

Why not BTRFS?

Russell Coker: Monthly BTRFS Status Reports

  "Since my blog post about BTRFS in March [1] not much has changed for me. Until yesterday I was using 3.13 kernels on all my systems and dealing with the occasional kmail index file corruption problem.

  Yesterday my main workstation ran out of disk space and went read-only. I started a BTRFS balance which didn't seem to be doing any good because most of the space was actually in use so I deleted a bunch of snapshots. Then my X session aborted (some problem with KDE or the X server I'll never know as logs couldn't be written to disk). I rebooted the system and had kernel threads go into infinite loops with repeated messages about a lack of response for 22 seconds (I should have photographed the screen).

  . . ."

http://etbe.coker.com.au/2014/04/26/btrfs-status-april-2014/

(41)

Thanks!

More information about ZFS:

- http://zfsonlinux.org/
- zfs-discuss@zfsonlinux.org
- "Install ZFS on Debian GNU/Linux" by Aaron Toponce:
  https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/

(42)

Copyright

Images (all are CC BY-SA 2.0):

- Cory Doctorow: https://flic.kr/p/bvNXHu
- abdallahh: https://flic.kr/p/6VHVze
- Ilya: https://flic.kr/p/7PUjGu

This presentation is Copyright 2015, by Neal H. Walfield. License: CC BY-SA 2.0.

L es Big Data ou méga-données n’ont pas de définition universelle, mais correspondent aux données dont le traitement dépasse les capacités des technologies courantes du fait de