
7.8 ADVANCED DISK MANAGEMENT: RAID AND LVM

The procedures we’ve discussed so far for adding new disks, filesystems, and swap areas are similar to those used on most UNIX systems. However, Linux has some additional tricks up its sleeve that many operating systems are still only dreaming about. Two distinct technologies, software RAID and logical volume management, give Linux disk management an additional layer of flexibility and reliability.

Hard disks fail frequently, and even with current backups, the consequences of a disk failure on a server can be disastrous. RAID, Redundant Array of Independent Disks, is a system that uses multiple hard drives to distribute or replicate data across several disks. RAID not only helps avoid data loss but also minimizes the downtime associated with a hardware failure (often to zero) and potentially increases performance as well. RAID systems can be implemented in hardware, but the Linux system implements all the necessary glue with software.

A second and equally useful tool called LVM (logical volume management) helps administrators efficiently allocate their available disk space among partitions. Imagine a world in which you don't know exactly how large a partition needs to be. Six months after creating a partition you discover that it is much too large, but a neighboring partition doesn't have enough space... Sound familiar? LVM allows space to be dynamically reallocated from the greedy partition to the needy partition.

Although these tools can be powerful when used individually, they are especially potent in combination. The sections below present a conceptual overview of both systems and an example that illustrates the detailed configuration.

Linux software RAID

We recently experienced a disk controller failure on an important production server. Although the data was replicated across several physical drives, a faulty hardware RAID controller destroyed the data on all disks. A lengthy and ugly tape restore process ensued, and it was more than two months before the server had completely recovered. The rebuilt server now relies on the kernel's software to manage its RAID environment, removing the possibility of another RAID controller failure.

RAID can do two basic things. First, it can improve performance by "striping" data across multiple drives, thus allowing several drives to work simultaneously to supply or absorb a single data stream. Second, it can duplicate or "mirror" data across multiple drives, decreasing the risk associated with a single failed disk. Linux RAID has some subtle differences from traditional RAID, but it is still logically divided into several levels:

•  Linear mode provides no data redundancy or performance increases. It simply concatenates the block addresses of multiple drives to create a single (and larger) virtual drive.

•  RAID level 0 is used strictly to increase performance. It uses two or more drives of equal size to decrease write and access times.

•  RAID level 1 is the first level to offer redundancy. Data is duplicated on two or more drives simultaneously. This mode mirrors the data but harms performance because the information must be written more than once.

•  RAID level 4 competes with (and consistently loses to) RAID level 5. It stripes data but dedicates a disk to parity information, thereby incurring wait times when writing to the parity disk. Unless you have a very good reason to use RAID 4, ignore it in preference to RAID 5.

•  RAID level 5 is the Xanadu of RAID. By striping both data and parity information, it creates a redundant architecture while simultaneously improving read and write times. RAID 5 requires at least three disk drives.

Software RAID has been built into the Linux kernel since version 2.0, but early versions were buggy and incomplete. We recommend avoiding implementations older than those in the 2.4 kernel.
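To make the levels above concrete, here is a rough sketch of how they translate into mdadm invocations. The device names (/dev/md1, /dev/md2, and the underlying partitions) are hypothetical and are not part of the example system configured later in this section:

#mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
#mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/sdg1 /dev/sdh1

The first command mirrors two partitions (RAID 1); the second stripes two partitions for speed (RAID 0) and provides no redundancy at all.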

Logical volume management

LVM is an optional subsystem that defines a sort of supercharged version of disk partitioning. It allows you to group individual disks into "volume groups." The aggregate capacity of a volume group can then be allocated to logical volumes, which are accessed as regular block devices. Logical volume management lets you do the following:

•  Use and allocate disk storage more efficiently
•  Move logical volumes among different physical devices
•  Grow and shrink logical volume sizes on the fly
•  Take "snapshots" of whole filesystems
•  Replace on-line drives without interrupting service

The components of a logical volume can be put together in various ways. Concatenation keeps each device's physical blocks together and lines the devices up one after another. Striping interleaves the components so that adjacent virtual blocks are actually spread over multiple physical disks. By reducing single-disk bottlenecks, striping can often provide higher bandwidth and lower latency.
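For example, once a volume group exists, lvcreate can request striping explicitly and can also create snapshots. In the sketch below, the volume group name vg0 and the sizes are hypothetical; -i sets the number of stripes and -I the stripe size in kilobytes:

#lvcreate -i 2 -I 64 -L 20G -n stripedvol vg0
#lvcreate -s -L 1G -n stripedvol-snap /dev/vg0/stripedvol

The second command takes a snapshot of the new volume; the snapshot needs only enough space to hold the blocks that change while it exists.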

An example configuration with LVM and RAID

Our previous example illustrated the configuration of a basic disk. In this section we walk through a setup procedure that includes both RAID and LVM. This kind of setup is especially useful for production servers.

Our objective is to create a RAID 5 array out of three empty disks. On top of this RAID array, we define two LVM partitions, web1 and web2. This structure gives us several advantages over a traditional system:

•  RAID 5 confers redundancy; if one of the disks fails, our data remains intact. Unlike RAID 4, it doesn't matter which disk fails!

•  Thanks to LVM, the partitions are resizable. When an enthusiastic webmaster fills up web2, we can easily steal some extra space from web1.

•  More disk space could eventually be needed on both partitions. The design allows additional disks to be added to the RAID 5 array. Once this has been done, the existing LVM groups can be extended to include the additional space, all without recreating any partitions (a sketch of this growth path follows this list).
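A rough sketch of that growth path, using the array and volume group we are about to build (/dev/md0 and LVM1) and a hypothetical fourth partition /dev/sde1 prepared like the others: mdadm restripes the array onto the new disk, and pvresize then makes the added capacity visible to LVM. (Reshaping a RAID 5 array in place requires a reasonably recent kernel and mdadm.)

#mdadm /dev/md0 -a /dev/sde1
#mdadm --grow /dev/md0 --raid-devices=4
#pvresize /dev/md0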

After showing the initial configuration, we describe how to handle a failed disk and show how to resize an LVM partition.

On our example system, we have four equally sized SCSI disks:

#fdisk -l

Disk /dev/sda: 18.2 GB, 18210036736 bytes
255 heads, 63 sectors/track, 2213 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14         144     1052257+  82  Linux swap
/dev/sda3             145        2213    16619242+  8e  Linux LVM


Disk /dev/sdb: 18.2 GB, 18210036736 bytes
255 heads, 63 sectors/track, 2213 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

Disk /dev/sdc: 18.2 GB, 18210036736 bytes
255 heads, 63 sectors/track, 2213 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

Disk /dev/sdd: 18.2 GB, 18210036736 bytes
255 heads, 63 sectors/track, 2213 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

The first SCSI disk, /dev/sda, contains our system partitions. The other three (sdb, sdc, and sdd) have no partition tables.

To begin, we create the partitions on each of our SCSI disks. Since the disks are identical, we execute the same set of commands for each.

#fdisk /dev/sdb
...
Command (m for help): new
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-2213, default 1): <Enter>
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-2213, default 2213): <Enter>
Using default value 2213
Command (m for help): type
Selected partition 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)
Command (m for help): write
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.

After writing the partition labels for the other two disks, it’s time to get our hands dirty and build the RAID array. Most modern distributions use the single mdadm command for RAID management. Previous versions of RHEL used the raidtools suite, but since mdadm is both more powerful and easier to use than raidtools, that’s what we demonstrate here.

The following command builds a RAID 5 array from our three SCSI partitions:

#mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1

mdadm: array /dev/md0 started.

While the array is being built, the file /proc/mdstat shows progress information:

#cat /proc/mdstat
Personalities : [raid5]
md0 : active raid5 sdb1[3] sdc1[1] sdd1[2]
      35566336 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [====>...]  recovery = 22.4% (3999616/17783168) finish=5.1min speed=44800K/sec

unused devices: <none>

This file always reflects the current state of the kernel's RAID system. It is especially useful to keep an eye on this file after adding a new disk or replacing a faulty drive. (watch cat /proc/mdstat is a handy idiom.)

Once assembly of the array is complete, we see a notification message in the /var/log/messages file:

RAID5 conf printout:
 --- rd:3 wd:3 fd:0
 disk 0, o:1, dev:sdb1
 disk 1, o:1, dev:sdc1
 disk 2, o:1, dev:sdd1

The initial creation command also serves to “activate” the array (make it available for use), but on subsequent reboots we need to activate the array as a separate step, usually out of a startup script. RHEL, Fedora, and SUSE all include sample startup scripts for RAID.

mdadm does not technically require a configuration file, although it will use a configuration file if one is supplied (typically, /etc/mdadm.conf). We strongly recommend the use of a configuration file. It documents the RAID configuration in a standard way, thus giving administrators an obvious place to look for information when problems occur. The alternative to the use of a configuration file is to specify the configuration on the command line each time the array is activated.

mdadm --detail --scan dumps the current RAID setup into a configuration file. Unfortunately, the configuration it prints is not quite complete. The following commands build a complete configuration file for our example setup:

#echo DEVICE /dev/sdb1 /dev/sdc1 /dev/sdd1 > /etc/mdadm.conf

#mdadm --detail --scan >> /etc/mdadm.conf

#cat /etc/mdadm.conf

DEVICE /dev/sdb1 /dev/sdc1 /dev/sdd1

ARRAY /dev/md0 level=raid5 num-devices=3 UUID=21158de1:faaa0dfb:841d3b41:76e93a16
   devices=/dev/sdb1,/dev/sdc1,/dev/sdd1


mdadm can now read this file at startup or shutdown to easily manage the array. To enable the array at startup by using the freshly created /etc/mdadm.conf, we would execute

#mdadm -As /dev/md0

To stop the array manually, we would use the command

#mdadm -S /dev/md0

We’ve now assembled our three hard disks into a single logical RAID disk. Now it’s time to define logical volume groups on which we can create expandable (and shrinkable) filesystems. LVM configuration proceeds in a few distinct phases:

•  Creating (defining, really) and initializing physical volumes
•  Adding the physical volumes to a volume group
•  Creating logical volumes on the volume group

The LVM2 suite of tools addresses all of these tasks and facilitates later management of the volumes. man lvm is a good introduction to the system and its tools.

In LVM terminology, the physical volumes are the “things” that are aggregated to form storage pools (“volume groups”). “Physical volume” is a somewhat misleading term, however, because the physical volumes need not have a direct correspondence to physical devices. They can be disks, but they can also be disk partitions or (as in this example) high-level RAID objects that have their own underlying structure.

LVM commands start with letters that make it clear at which level of abstraction they operate: pv commands manipulate physical volumes, vg commands manipulate volume groups, and lv commands manipulate logical volumes.
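For example, the summary commands at each level follow the same naming pattern:

#pvdisplay
#vgdisplay
#lvdisplay

Each prints the attributes of the corresponding objects (physical volumes, volume groups, and logical volumes, respectively).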

Older versions of LVM required you to run the vgscan command as an initial step, but this is no longer necessary. Instead, you start by directly initializing each physical device with pvcreate. For this example, we use the /dev/md0 RAID 5 device we just created.

#pvcreate /dev/md0

Physical volume "/dev/md0" successfully created

This operation destroys all data on the device or partition, so we were exceedingly careful! Although we're using only a single physical device in this example, LVM allows us to add multiple devices of different types to a single volume group.
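For instance, if a second physical volume existed (say, a hypothetical /dev/md1 that had also been through pvcreate), vgextend could later fold it into the same volume group:

#vgextend LVM1 /dev/md1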

Our physical device is now ready to be added to a volume group:

#vgcreate LVM1 /dev/md0

Volume group "LVM1" successfully created

To step back and examine our handiwork, we use the vgdisplay command:

#vgdisplay LVM1
  --- Volume group ---
  VG Name               LVM1
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               33.92 GB
  PE Size               4.00 MB
  Total PE              8683
  Alloc PE / Size       0 / 0
  Free PE / Size        8683 / 33.92 GB
  VG UUID               nhkzzN-KHmY-BfV5-6F6Y-3LF8-dpd5-JM5lMp

The last steps are to create logical volumes within the LVM1 volume group and build filesystems on them. We make both of the logical volumes 10GB in size:

#lvcreate -L 10G -n web1 LVM1
Logical volume "web1" created
#lvcreate -L 10G -n web2 LVM1
Logical volume "web2" created

Now that we’ve created two logical volumes, web1 and web2, in our LVM1 volume group, we can create and mount our filesystems.

#mke2fs -j /dev/LVM1/web1 ...

#mke2fs -j /dev/LVM1/web2 ...

#mkdir /web1 /web2

#mount /dev/LVM1/web1 /web1

#mount /dev/LVM1/web2 /web2

The filesystems are finally ready for use. We add the new filesystems to /etc/fstab and reboot the system to ensure that everything comes up successfully.
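The fstab entries might look something like the following; the mount options and fsck ordering shown here are just typical defaults rather than values taken from the example system:

/dev/LVM1/web1    /web1    ext3    defaults    1 2
/dev/LVM1/web2    /web2    ext3    defaults    1 2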

Dealing with a failed disk

Our nicely architected system looks pretty now, but because of the multiple layers at which the system is operating, things can get ugly in a hurry. When a hard drive fails or a partition is corrupted (or simply fills up), it’s essential that you know how to repair it quickly and easily. You use the same tools as for the initial configuration above to maintain the system and recover from problems.

Consider the case of a failed hard disk. Because RAID 5 provides some data redundancy, the RAID 5 array we constructed in the previous sections will happily continue to function in the event of a disk crash; users will not necessarily be aware of any problems. You'll need to pay close attention to the system logs to catch the problem early (or have a program that does this for you; see page 220).
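mdadm itself can do this kind of watching. Run in monitor mode, it polls the arrays and sends email when a component fails; the address below is a placeholder:

#mdadm --monitor --scan --mail=admin@example.com --daemonise

On RHEL and Fedora, the mdmonitor service provides the same functionality, typically reading a MAILADDR line from /etc/mdadm.conf.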

mdadm offers a handy option that simulates a failed disk:

#mdadm /dev/md0 -f /dev/sdc1

mdadm: set /dev/sdc1 faulty in /dev/md0

#tail /var/log/messages
May 30 16:14:55 harp kernel: raid5: Disk failure on sdc, disabling device.
    Operation continuing on 2 devices
kernel: RAID5 conf printout:
kernel:  --- rd:3 wd:2 fd:1
kernel:  disk 0, o:1, dev:sdb1
kernel:  disk 1, o:0, dev:sdc1
kernel:  disk 2, o:1, dev:sdd1
kernel: RAID5 conf printout:
kernel:  --- rd:3 wd:2 fd:1
kernel:  disk 0, o:1, dev:sdb1
kernel:  disk 2, o:1, dev:sdd1

As shown here, the system log /var/log/messages contains information about the (simulated) failure as soon as it occurs. Similar information is available from the RAID status file /proc/mdstat. At this point, the administrator should take the following actions:

•  Remove the disk from the RAID array.
•  Schedule downtime and shut down the computer (if necessary).
•  Replace the physical drive.
•  Add the new drive to the array.

To remove the drive from the RAID configuration, use mdadm:

#mdadm /dev/md0 -r /dev/sdc1
mdadm: hot removed /dev/sdc1

Once the disk has been logically removed, you can replace the drive. Hot-swappable drive hardware lets you make the change without turning off the system or rebooting.

If your RAID components are raw disks, you should replace them with an identical drive only. Partition-based components can be replaced with any partition of similar size, although for bandwidth matching it's best if the drive hardware is similar. (If your RAID configuration is built on top of partitions, you must run fdisk to define the partitions appropriately before adding the replacement disk to the array.)
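A common way to replicate the partition table onto a replacement drive is to dump it from a surviving member with sfdisk; here /dev/sdb is the surviving disk and /dev/sdc is assumed to be the blank replacement:

#sfdisk -d /dev/sdb | sfdisk /dev/sdc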

In our example, the failure is just a simulation, so we can add the drive back to the array without replacing any hardware:

#mdadm /dev/md0 -a /dev/sdc1
mdadm: hot added /dev/sdc1

Linux rebuilds the array and reflects the progress, as always, in /proc/mdstat.

Reallocating storage space

Even more common than disk crashes are cases in which users or log files fill up partitions. We have experienced everything from servers used for personal MP3 storage to a department full of email packrats.

Suppose that in our example, /web1 has grown more than we predicted and is in need of more space. Resizing LVM partitions involves just a few short steps, although the exact commands depend on the filesystem in use; the following example is for an ext3 filesystem. The steps are:

•  Examine the current LVM configuration
•  Resize the partitions with lvextend and ext2online
•  Verify the changes

Fortunately, we left some extra space in our volume group to grow /web1 with, so we do not have to scavenge space from another volume. We use vgdisplay to see the space available on the volume group and df to determine how to reallocate it:

#vgdisplay LVM1
  --- Volume group ---
  VG Name               LVM1
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               33.92 GB
  PE Size               4.00 MB
  Total PE              8683
  Alloc PE / Size       5120 / 20.00 GB
  Free PE / Size        3563 / 13.92 GB
  VG UUID               nhkzzN-KHmY-BfV5-6F6Y-3LF8-dpd5-JM5lMp

#df -h /web1
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/LVM1-web1  9.9G  7.1G  2.3G  76% /web1

These commands show 13.92GB free in the volume group and 76% usage of /web1. We'll add 10GB to /web1.

First we use lvextend to add space to the logical volume, then ext2online to resize the filesystem structures to encompass the additional space.
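A sketch of those two commands, assuming the 10GB figure chosen above; on systems that lack ext2online, resize2fs can perform the same on-line growth on sufficiently recent kernels:

#lvextend -L+10G /dev/LVM1/web1
#ext2online /dev/LVM1/web1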
