Unit OS8: File System Unit OS8: File System

(1)

Unit OS8: File System Unit OS8: File System

8.4. NTFS Recovery Support

(2)

Copyright Notice Copyright Notice

These materials are part of the

These materials are part of the Windows Operating Windows Operating System Internals Curriculum Development Kit,

System Internals Curriculum Development Kit, developed by David A. Solomon and Mark E.

developed by David A. Solomon and Mark E.

Russinovich with Andreas Polze Russinovich with Andreas Polze

Microsoft has licensed these materials from David Microsoft has licensed these materials from David Solomon Expert Seminars, Inc. for distribution to Solomon Expert Seminars, Inc. for distribution to academic organizations solely for use in academic academic organizations solely for use in academic environments (and not for commercial use)

environments (and not for commercial use)

(3)

Roadmap for Section 8.4 Roadmap for Section 8.4

The Evolution of File Systems The Evolution of File Systems

Recoverable File System Recoverable File System

Log File Service Operation Log File Service Operation

NTFS Recovery Procedures NTFS Recovery Procedures

Fault-Tolerance Support Fault-Tolerance Support

Volume Management - Volume Management -

Striped and Spanned Volumes

(4)

NTFS Recovery Support NTFS Recovery Support

Transaction-based logging scheme Transaction-based logging scheme Fast, even for large disks

Fast, even for large disks

Recovery is limited to file system data Recovery is limited to file system data

Use transaction processing like SQL server for user data Use transaction processing like SQL server for user data

Tradeoff: performance versus fully fault-tolerant file system Tradeoff: performance versus fully fault-tolerant file system

Design options for file I/O & caching:

Careful write

Careful write : VAX/VMS fs, other proprietary OS fs : VAX/VMS fs, other proprietary OS fs Lazy write

Lazy write : most UNIX fs, OS/2 HPFS : most UNIX fs, OS/2 HPFS

(5)

Careful Write File Systems Careful Write File Systems

OS crash/power loss may corrupt file system OS crash/power loss may corrupt file system Careful write file system orders write operations:

Careful write file system orders write operations:

System crash will produce predictable, non

System crash will produce predictable, non--critical inconsistenciescritical inconsistencies

Update to disk is broken in sub operations:

Sub operations are written serially Sub operations are written serially

Allocating disk space: first write bits in bitmap indicating usage;

then allocate space on disk then allocate space on disk

I/O requests are serialized:

Allocation of disk space by one process has to be completed before Allocation of disk space by one process has to be completed before another process may create a file

another process may create a file

No interleaving sub operations of the two I/O requests No interleaving sub operations of the two I/O requests

Crash: volume stays usable; no need to run repair utility

(6)

Lazy Write File Systems Lazy Write File Systems

Careful f

Careful f ile ile s s ystem ystem write sacrifices speed for safety write sacrifices speed for safety

Lazy write improves performance by write back caching Lazy write improves performance by write back caching

Modifications are written to the cache;

Cache flush is an optimized background activity Cache flush is an optimized background activity

Less disk writes; buffer can be modified multiple times Less disk writes; buffer can be modified multiple times before being written to disk

before being written to disk

File system can return to caller before op. is completed File system can return to caller before op. is completed Inconsistent intermediate states on volume are ignored Inconsistent intermediate states on volume are ignored Greater risk / user inconvenience if system fails

Greater risk / user inconvenience if system fails

(7)

Recoverable File System Recoverable File System

(Journaling File System) (Journaling File System)

Safety of careful write fs / performance of lazy write fs Safety of careful write fs / performance of lazy write fs Log file + fast recovery procedure

Log file + fast recovery procedure

Log file imposes some overhead Log file imposes some overhead

Optimization over lazy write: distance between cache flushes Optimization over lazy write: distance between cache flushes increased

increased

NTFS supports

NTFS supports cache write-through cache write-through and and cache flushing cache flushing triggered by applications

triggered by applications

No extra disk I/O to update fs data structures necessary:

all changes to fs structure are recorded in log file which can be all changes to fs structure are recorded in log file which can be written in a single operation

written in a single operation

In the future, NTFS may support logging for user files (hooks in In the future, NTFS may support logging for user files (hooks in place)

place)

(8)

Log File Service (LFS) Log File Service (LFS)

LFS is designed to provide logging to multiple LFS is designed to provide logging to multiple

kernel components (clients) kernel components (clients)

Currently used only by NTFS Currently used only by NTFS

Cache manager

I/O manager

NTFS driver

Access the mapped file or flush the cache Flush the

log file

Write/flush the log file

Log file

service Log the transaction Write the Volume updates

(9)

Log File Regions Log File Regions

NTFS calls LFS to read/write restart area NTFS calls LFS to read/write restart area

Context info: location of logging area to be used for recovery Context info: location of logging area to be used for recovery LFS maintains 2nd copy of restart area

LFS maintains 2nd copy of restart area Logging area: circularly reused

Logging area: circularly reused

LFS uses logical sequence numbers (LSNs) to identify log records LFS uses logical sequence numbers (LSNs) to identify log records

NTFS never reads/writes transactions to log file directly NTFS never reads/writes transactions to log file directly During recovery:

During recovery:

NTFS calls LFS to read forward; recorded transactions are redone NTFS calls LFS to read forward; recorded transactions are redone NTFS calls LFS to read backward; undo all incompletely logged NTFS calls LFS to read backward; undo all incompletely logged transactions

transactions

LFS restart area „infinite“ logging area

Copy 1 Copy 2 Log records

(10)

Operation of the LFS/NTFS Operation of the LFS/NTFS

1. 1. NTFS calls LFS to record in (cached) log file any NTFS calls LFS to record in (cached) log file any transactions that will modify volume structure

transactions that will modify volume structure

2. 2. NTFS modifies the volume (also in the cache) NTFS modifies the volume (also in the cache)

3. 3. Cache manager calls LFS to flush log file to disk Cache manager calls LFS to flush log file to disk

(LFS implements flushing by calling cache manager back, telling which (LFS implements flushing by calling cache manager back, telling which page to flush)

page to flush)

4. 4. After cash manager flushes log file, it flushes volume After cash manager flushes log file, it flushes volume changes

changes ->

-> Transactions of unsuccessful modifications can be Transactions of unsuccessful modifications can be retrieved from log file and un-/redone

retrieved from log file and un-/redone

Recovery begins automatically the first time a volume is

(11)

Log Record Types Log Record Types

Update records

(series of ...)(series of ...)

Most common; each record contains:

Redo information

Redo information: how to reapply on subop. of a committed trans.: how to reapply on subop. of a committed trans.

Undo information

Undo information: how to reverse a partially logged sub operation: how to reverse a partially logged sub operation

Last record commits the transaction

(not shown here)(not shown here)

Log file records

... T1_a T1b T1c

Redo: Allocate/initialize an MFT file record Undo: Deallocate the file record

Redo: Add the filename to the index

Undo: Remove the filename from the index

Redo: Set bits 3-9 in the bitmap Undo: Clear bits 3-9 in the bitmap

(12)

Log Records (contd.) Log Records (contd.)

Physical vs. logical description of redo/undo actions:

Delete file „a.dat“ vs. Delete byte range on disk Delete file „a.dat“ vs. Delete byte range on disk

NTFS writes update records with physical descriptions NTFS writes update records with physical descriptions

NTFS w

NTFS wrrites update records (usually several) for:ites update records (usually several) for:

Creating a file Creating a file Deleting a file Deleting a file Extending a file Extending a file Truncating a file Truncating a file

Setting file information Setting file information Renaming a file

Renaming a file

Changing security applied to a file Changing security applied to a file

Redo/undo ops. must be idempotent Redo/undo ops. must be idempotent

(might be applied twice) (might be applied twice)

(13)

Checkpoint Records Checkpoint Records

NTFS periodically writes a checkpoint record NTFS periodically writes a checkpoint record

Describes, what processing would be necessary to recover a Describes, what processing would be necessary to recover a

volume if a crash would occur immediately volume if a crash would occur immediately

How far back in the log file must NTFS go to begin recovery How far back in the log file must NTFS go to begin recovery

LSN of checkpoint record is stored in restart area LSN of checkpoint record is stored in restart area

Log file records

... LSN 2058 LSN 2059 LSN 2060

Checkpoint record NTFS restart

(14)

Log File Full Log File Full

LFS presents log file to NTFS as is it were infinitely large LFS presents log file to NTFS as is it were infinitely large

Writing checkpoint records usually frees up space Writing checkpoint records usually frees up space LFS tracks several numbers:

LFS tracks several numbers:

Available log space Available log space

Amount of space needed to write an incoming log record and to undo the Amount of space needed to write an incoming log record and to undo the write

write

Amount of space needed to roll back all active (no committed) transactions, Amount of space needed to roll back all active (no committed) transactions, should that be necessary

should that be necessary

Insufficient space: „Log file full“ error & NTFS exception Insufficient space: „Log file full“ error & NTFS exception

NTFS prevents further transactions on files (block creation/deletion) NTFS prevents further transactions on files (block creation/deletion) Active transactions are completed or receive „log file full“ exception Active transactions are completed or receive „log file full“ exception NTFS calls cache manager to flush unwritten data

NTFS calls cache manager to flush unwritten data

If data is written, NTFS marks log file „empty“; resets beginning of log file If data is written, NTFS marks log file „empty“; resets beginning of log file

(15)

Recovery - Principles Recovery - Principles

NTFS performs automatic recovery NTFS performs automatic recovery

Recovery depends on two NTFS in-memory tables:

Transaction table

Transaction table: keeps track of active transactions (not completed): keeps track of active transactions (not completed) (sub operations of these transactions must be removed from disk) (sub operations of these transactions must be removed from disk) Dirty page table

Dirty page table: records which pages in cache contain : records which pages in cache contain

modifications to file system structure that have not yet been written modifications to file system structure that have not yet been written to disk

to disk

NTFS writes checkpoint every 5 sec.

Includes copy of transaction table and dirty page table Includes copy of transaction table and dirty page table

Checkpoint includes LSNs of the log records containing the tables Checkpoint includes LSNs of the log records containing the tables

Dirty page table

Update record

Transaction table

Checkpoint record

Update record

Update record Analysis pass

(16)

Recovery - Passes Recovery - Passes

1. 1.

Analysis passAnalysis pass

•

NTFS scans forward in log file from beginning of last checkpointNTFS scans forward in log file from beginning of last checkpoint

•

Updates transaction/dirty page tables it copied in memoryUpdates transaction/dirty page tables it copied in memory

•

NTFS scans tables for oldest update record of a non-NTFS scans tables for oldest update record of a non-commitcommittted trans.ed trans.

2. 2.

Redo passRedo pass

•

NTFS looks for „page update“ records which contain volume NTFS looks for „page update“ records which contain volume modification that might not have been flushed to disk

modification that might not have been flushed to disk

•

NTFS redoes these updates in the cache until it reaches end of log fileNTFS redoes these updates in the cache until it reaches end of log file

•

Cache manager „lazy writer thread“ begins to flush cache to diskCache manager „lazy writer thread“ begins to flush cache to disk

3. 3.

Undo passUndo pass

•

Roll back any transactions that weren't‘t committed when system failedRoll back any transactions that weren't‘t committed when system failed

•

After undo pass – volume is at consistent stateAfter undo pass – volume is at consistent state

•

Write empty LFS restart area; no recovery is needed if system fails nowWrite empty LFS restart area; no recovery is needed if system fails now

(17)

Undo Pass - Example Undo Pass - Example

Transaction 1 was commited before power failure Transaction 1 was commited before power failure Transaction 2 was still active

Transaction 2 was still active

NTFS must log undo operations in log file!

Power might fail again during recovery;

NTFS would have to redo its undo operations NTFS would have to redo its undo operations LSN

4044

LSN 4045

LSN 4046

LSN 4047

LSN 4048

LSN 4049

„Transaction committed“ record

Redo: Allocate/Initialize an MFT file record Undo: Deallocate the file record

Redo: Add the filename to the index

Undo: Remove the filename from the index

Redo: Set bits 3-9 in the bitmap Undo: Clear bits 3-9 in the bitmap

Power failure

(18)

NTFS Recovery - Conclusions NTFS Recovery - Conclusions

Recovery will return volume to some preexisting consistent state (not Recovery will return volume to some preexisting consistent state (not necessarily state before crash)

necessarily state before crash)

Lazy commit algorithm: log file is not immediately flushed when a Lazy commit algorithm: log file is not immediately flushed when a

„transaction committed“ record is written

„transaction committed“ record is written LFS batches records;

LFS batches records;

Flush when cache manager calls or check pointing record is written Flush when cache manager calls or check pointing record is written (once every 5 sec)

(once every 5 sec)

Several parallel transactions might have been active before crash Several parallel transactions might have been active before crash NTFS uses log file mechanisms for error handling

NTFS uses log file mechanisms for error handling Most I/O errors are not file system errors

Most I/O errors are not file system errors

NTFS might create MFT record and detect that disk is full when NTFS might create MFT record and detect that disk is full when allocating space for a file in the bitmap

allocating space for a file in the bitmap

NTFS uses log info to undo changes and returns „disk full“ error to caller NTFS uses log info to undo changes and returns „disk full“ error to caller

(19)

Fault Tolerance Support Fault Tolerance Support

NTFS‘ capabilities are enhanced by the fault-tolerant NTFS‘ capabilities are enhanced by the fault-tolerant volume managers

volume managers FtDisk/DMIO FtDisk /DMIO

Lies above hard disk drivers in the I/O system‘s layered driver Lies above hard disk drivers in the I/O system‘s layered driver

scheme scheme

FtDisk – for basic disks FtDisk – for basic disks

DMIO – for dynamic disks DMIO – for dynamic disks

Volume management capabilities Volume management capabilities : :

Redundant data storage Redundant data storage

Dynamic data recovery from bad sectors on SCSI disks Dynamic data recovery from bad sectors on SCSI disks

NTFS itself implements bad-sector recovery for NTFS itself implements bad-sector recovery for non-SCSI disks

non-SCSI disks

(20)

Volume Management Features – Volume Management Features –

Spanned Volumes Spanned Volumes

Spanned

Spanned Volume Volume s s : :

single logical volume composed of a maximum of 32 areas of free space single logical volume composed of a maximum of 32 areas of free space

on one or more disks on one or more disks

NTFS volume sets can be dynamically increased in size NTFS volume sets can be dynamically increased in size

(only bitmap file which stores allocation status needs to be extended) (only bitmap file which stores allocation status needs to be extended)

FtDisk

FtDisk/DMIO/DMIO hide physical configuration of disks from file system hide physical configuration of disks from file system

C:

(100 MB)

E:

(100 MB) D:

(100 MB)

Volume set D:

occupies half of two disks

(21)

Stripe

Stripe d Volumes d Volumes

Series of partitions, one partition per disk (of same size) Series of partitions, one partition per disk (of same size) Combined into a single logical volume

Combined into a single logical volume FtDisk

FtDisk /DMIO /DMIO optimize data storage and retrieval times optimize data storage and retrieval times

Stripes are narrow: 64KB Stripes are narrow: 64KB

Data tends to be distributed evenly among disks Data tends to be distributed evenly among disks

Multiple pending read/write ops. will operate on different disks Multiple pending read/write ops. will operate on different disks Latency for disk I/O is often reduced (parallel seek operations) Latency for disk I/O is often reduced (parallel seek operations)

(150 MB) (150 MB) (150 MB)

1 2

4

3

(22)

Fault Tolerant Volumes Fault Tolerant Volumes

FtDisk

FtDisk /DMIO /DMIO implement redundant storage schemes implement redundant storage schemes

Mirror sets Mirror sets

Stripe sets with parity Stripe sets with parity

Sector sparing Sector sparing

Tool Tool s: s : Windows Disk Management MMC snap-in Windows Disk Management MMC snap-in

• Mirrored Volumes Mirrored Volumes : :

– Contents of a partition on one disk are duplicated on another diskContents of a partition on one disk are duplicated on another disk – FtDiskFtDisk/DMIO write same data to both locations/DMIO write same data to both locations

C:

(mirror)

(23)

RAID-5 Volumes RAID-5 Volumes

Fault tolerant version of a regular stripe set Fault tolerant version of a regular stripe set Parity: logical sum (XOR)

Parity: logical sum (XOR)

Parity info is distributed evenly over available disks Parity info is distributed evenly over available disks FtDisk

FtDisk /DMIO reconstruct missing data by using XOR op. /DMIO reconstruct missing data by using XOR op.

parity

(24)

Bad Cluster Recovery Bad Cluster Recovery

Sector sparing is supported by FtDisk

Sector sparing is supported by FtDisk /DMIO /DMIO

Dynamic copying of recovered data to spare sectors Dynamic copying of recovered data to spare sectors Without intervention from file system / user

Without intervention from file system / user Works for certain SCSI disks

Works for certain SCSI disks FtDisk

FtDisk/DMIO/DMIO return bad sector warning to NTFS return bad sector warning to NTFS

Sector re

Sector re - - mapping is supported by NTFS mapping is supported by NTFS

NTFS will not reuse bad clusters NTFS will not reuse bad clusters

NTFS copies data recovered by FtDisk

NTFS copies data recovered by FtDisk/DMIO into a new cluster/DMIO into a new cluster

NTFS cannot recover data from bad sector without help NTFS cannot recover data from bad sector without help from FtDisk

from FtDisk /DMIO /DMIO

NTFS will never write to bad sector (re

NTFS will never write to bad sector (re--map before write)map before write)

(25)

Bad-cluster re-mapping Bad-cluster re-mapping

NTFS filename

Standard info Security desc. Data

DataData

User file

Startin Startin g VCN

g VCN StartinStartin g LCN

g LCN Number of Number of clusters clusters 00 13551355 22 22 10491049 11 33 15881588 44

VCN 0 1

LCN 1355 1356 DataData

VCN 3 4 5 6

DataData VCN 2 NTFS filename

Standard info Security desc. Data

Startin Startin g VCN

g VCN StartinStartin g LCN

g LCN Number of Number of clusters clusters 00 13571357 11 BadBad

VCN 0

LCN 1357

(26)