Unit OS8: File System
8.4. NTFS Recovery Support
Copyright Notice
© 2000-2005 David A. Solomon and Mark Russinovich
These materials are part of the Windows Operating System Internals Curriculum Development Kit, developed by David A. Solomon and Mark E. Russinovich with Andreas Polze
Microsoft has licensed these materials from David Solomon Expert Seminars, Inc. for distribution to academic organizations solely for use in academic environments (and not for commercial use)
Roadmap for Section 8.4
  The Evolution of File Systems
  Recoverable File System
  Log File Service Operation
  NTFS Recovery Procedures
  Fault-Tolerance Support
  Volume Management - Striped and Spanned Volumes
NTFS Recovery Support
Transaction-based logging scheme
Fast, even for large disks
Recovery is limited to file system data
  For user data, use transaction processing such as SQL Server
Tradeoff: performance versus a fully fault-tolerant file system
Design options for file I/O and caching:
  Careful write: VAX/VMS file system, other proprietary OS file systems
  Lazy write: most UNIX file systems, OS/2 HPFS
Careful Write File Systems
An OS crash or power loss may corrupt the file system
A careful write file system orders its write operations:
  A system crash will produce predictable, non-critical inconsistencies
Each update to disk is broken into sub-operations:
  Sub-operations are written serially
  Allocating disk space: first write the bits in the bitmap indicating usage; then allocate the space on disk
I/O requests are serialized:
  Allocation of disk space by one process has to be completed before another process may create a file
  No interleaving of the sub-operations of the two I/O requests
After a crash the volume stays usable; no need to run a repair utility
Lazy Write File Systems
A careful write file system sacrifices speed for safety
Lazy write improves performance through write-back caching
  Modifications are written to the cache;
  Cache flush is an optimized background activity
  Fewer disk writes; a buffer can be modified multiple times before being written to disk
  File system can return to the caller before the operation is completed
  Inconsistent intermediate states on the volume are ignored
Greater risk and user inconvenience if the system fails
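
The write-back behaviour described above can be illustrated with a small sketch. This is a toy model, not how any real cache manager is implemented; the class name `WriteBackCache` and its fields are hypothetical. It only shows why a buffer modified several times costs a single disk write, and why unflushed data is lost if the system fails before the flush.

```python
class WriteBackCache:
    """Toy write-back cache: writes land in memory, flush() persists them."""

    def __init__(self):
        self.disk = {}        # block number -> bytes persisted on "disk"
        self.dirty = {}       # block number -> bytes modified but not yet flushed
        self.disk_writes = 0  # number of physical writes performed

    def write(self, block, data):
        # Returns immediately; only the cached copy is modified.
        self.dirty[block] = data

    def read(self, block):
        # Prefer the cached (newest) copy, fall back to the disk copy.
        return self.dirty.get(block, self.disk.get(block))

    def flush(self):
        # A background activity in a real system; called explicitly here.
        for block, data in sorted(self.dirty.items()):
            self.disk[block] = data
            self.disk_writes += 1
        self.dirty.clear()


cache = WriteBackCache()
for version in range(3):
    cache.write(7, f"v{version}".encode())  # same block modified three times
cache.flush()
print(cache.disk_writes)  # 1
```

Everything in `dirty` at the moment of a crash never reaches `disk`, which is the risk the slide refers to.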
Recoverable File System (Journaling File System)
Safety of a careful write file system with the performance of a lazy write file system
Log file + fast recovery procedure
  Log file imposes some overhead
  Optimization over lazy write: the interval between cache flushes is increased
NTFS supports cache write-through and cache flushing triggered by applications
  No extra disk I/O is necessary to update file system data structures: all changes to the file system structure are recorded in the log file, which can be written in a single operation
In the future, NTFS may support logging for user files (hooks in place)
Log File Service (LFS)
LFS is designed to provide logging to multiple kernel components (clients)
Currently used only by NTFS
[Figure: interaction of the log file service, NTFS driver, I/O manager and cache manager. Arrow labels: Log the transaction; Write the volume updates; Write/flush the log file; Flush the log file; Access the mapped file or flush the cache]
Log File Regions
NTFS calls LFS to read/write the restart area
  Context info: location of the logging area to be used for recovery
  LFS maintains a second copy of the restart area
Logging area: circularly reused
  LFS uses logical sequence numbers (LSNs) to identify log records
NTFS never reads/writes transactions to the log file directly
During recovery:
  NTFS calls LFS to read forward; recorded transactions are redone
  NTFS calls LFS to read backward; all incompletely logged transactions are undone
[Figure: log file layout. LFS restart area (Copy 1, Copy 2) followed by the "infinite" logging area (log records)]
Operation of the LFS/NTFS
1. NTFS calls LFS to record in the (cached) log file any transactions that will modify the volume structure
2. NTFS modifies the volume (also in the cache)
3. Cache manager calls LFS to flush the log file to disk
   (LFS implements flushing by calling the cache manager back, telling it which page to flush)
4. After the cache manager flushes the log file, it flushes the volume changes
-> Transactions of unsuccessful modifications can be retrieved from the log file and undone or redone
Recovery begins automatically the first time a volume is
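
The ordering in steps 1 to 4 is the essential point: a volume change may reach disk only after the log record describing it. A minimal sketch of that ordering follows, assuming a toy in-memory model; `ToyVolume` and all of its attributes are hypothetical names and do not correspond to NTFS or LFS interfaces.

```python
class ToyVolume:
    """Toy model of log-before-data ordering (not the NTFS implementation)."""

    def __init__(self):
        self.log_cache = []  # log records still only in the cache
        self.log_disk = []   # log records that reached disk
        self.vol_cache = {}  # cached volume changes
        self.vol_disk = {}   # volume structures on disk

    def update(self, key, value):
        # Steps 1 and 2: log the change first, then modify the cached volume.
        self.log_cache.append((key, value))
        self.vol_cache[key] = value

    def flush(self):
        # Step 3: the log file is flushed to disk first.
        self.log_disk.extend(self.log_cache)
        self.log_cache.clear()
        # Step 4: only then are the volume changes flushed.
        self.vol_disk.update(self.vol_cache)
        self.vol_cache.clear()

    def every_disk_change_is_logged(self):
        logged = dict(self.log_disk)
        return all(logged.get(k) == v for k, v in self.vol_disk.items())


vol = ToyVolume()
vol.update("bitmap", "bits 3-9 set")
vol.flush()
print(vol.every_disk_change_is_logged())  # True
```

If a crash interrupts `flush()` between the two steps, the log on disk describes changes the volume does not yet contain, which is exactly the case the redo pass handles.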
Log Record Types
Update records (series of ...)
  Most common; each record contains:
    Redo information: how to reapply one sub-operation of a committed transaction
    Undo information: how to reverse a partially logged sub-operation
  The last record commits the transaction (not shown here)
[Figure: log file records ... T1a, T1b, T1c]
  T1a  Redo: Allocate/initialize an MFT file record   Undo: Deallocate the file record
  T1b  Redo: Add the filename to the index            Undo: Remove the filename from the index
  T1c  Redo: Set bits 3-9 in the bitmap               Undo: Clear bits 3-9 in the bitmap
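
One way to picture these records is as a small data structure. The sketch below is an assumption for illustration only: the class names, field names and LSN values are hypothetical, and the redo/undo fields hold the descriptive text from the figure rather than the physical descriptions NTFS actually writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRecord:
    lsn: int          # hypothetical sequence number
    transaction: str
    redo: str         # how to reapply the sub-operation
    undo: str         # how to reverse the sub-operation


@dataclass(frozen=True)
class CommitRecord:
    lsn: int
    transaction: str


log = [
    UpdateRecord(1, "T1", "Allocate/initialize an MFT file record",
                 "Deallocate the file record"),
    UpdateRecord(2, "T1", "Add the filename to the index",
                 "Remove the filename from the index"),
    UpdateRecord(3, "T1", "Set bits 3-9 in the bitmap",
                 "Clear bits 3-9 in the bitmap"),
    CommitRecord(4, "T1"),  # the last record commits the transaction
]


def is_committed(records, transaction):
    """A transaction counts as committed only if its commit record was logged."""
    return any(isinstance(r, CommitRecord) and r.transaction == transaction
               for r in records)


print(is_committed(log, "T1"))      # True
print(is_committed(log[:3], "T1"))  # False
```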
Log Records (contd.)
Physical vs. logical description of redo/undo actions:
  Delete file "a.dat" vs. delete byte range on disk
  NTFS writes update records with physical descriptions
NTFS writes update records (usually several) for:
  Creating a file
  Deleting a file
  Extending a file
  Truncating a file
  Setting file information
  Renaming a file
  Changing security applied to a file
Redo/undo operations must be idempotent (might be applied twice)
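
Idempotence is why physical descriptions such as "set bits 3-9" are a good fit: applying them a second time changes nothing. A short sketch, using a Python integer as a stand-in for the allocation bitmap (the function names are hypothetical), with a toggle operation included as a counterexample:

```python
def _mask(first, last):
    # Bit mask covering bits first..last inclusive.
    return ((1 << (last - first + 1)) - 1) << first


def set_bits(bitmap, first, last):
    """Redo-style operation: idempotent."""
    return bitmap | _mask(first, last)


def clear_bits(bitmap, first, last):
    """Undo-style operation: idempotent."""
    return bitmap & ~_mask(first, last)


def toggle_bits(bitmap, first, last):
    """Counterexample: NOT idempotent, a second application reverts the first."""
    return bitmap ^ _mask(first, last)


once = set_bits(0, 3, 9)
twice = set_bits(once, 3, 9)
print(bin(once))                                     # 0b1111111000
print(once == twice)                                 # True
print(toggle_bits(toggle_bits(0, 3, 9), 3, 9) == 0)  # True: toggling twice undoes itself
```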
Checkpoint Records
NTFS periodically writes a checkpoint record
  Describes what processing would be necessary to recover a volume if a crash occurred immediately
  How far back in the log file must NTFS go to begin recovery
LSN of the checkpoint record is stored in the restart area
[Figure: log file records ... LSN 2058, LSN 2059, LSN 2060. Labels: Checkpoint record, NTFS restart]
Log File Full
LFS presents the log file to NTFS as if it were infinitely large
  Writing checkpoint records usually frees up space
LFS tracks several numbers:
  Available log space
  Amount of space needed to write an incoming log record and to undo the write
  Amount of space needed to roll back all active (not committed) transactions, should that be necessary
Insufficient space: "Log file full" error and NTFS exception
  NTFS prevents further transactions on files (blocks creation/deletion)
  Active transactions are completed or receive the "log file full" exception
  NTFS calls the cache manager to flush unwritten data
  Once the data is written, NTFS marks the log file "empty" and resets the beginning of the log file
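
The three numbers LFS tracks suggest a simple admission check before a record is accepted. The sketch below is an assumption about how such a check could look, not the LFS algorithm; the function name, parameter names and byte counts are all hypothetical.

```python
def can_log(available, record_size, undo_size, rollback_reserve):
    """Return True if the record fits together with the space reserved for
    undoing it and for rolling back all active transactions."""
    return record_size + undo_size + rollback_reserve <= available


print(can_log(available=4096, record_size=512, undo_size=512, rollback_reserve=2048))  # True
print(can_log(available=4096, record_size=512, undo_size=512, rollback_reserve=3584))  # False
```

In the second call the record alone would fit, but accepting it would leave too little room to roll back, which corresponds to the "log file full" condition.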
Recovery - Principles
NTFS performs automatic recovery
Recovery depends on two NTFS in-memory tables:
  Transaction table: keeps track of active transactions (not completed); the sub-operations of these transactions must be removed from disk
  Dirty page table: records which pages in the cache contain modifications to the file system structure that have not yet been written to disk
NTFS writes a checkpoint every 5 sec.
  Includes a copy of the transaction table and the dirty page table
  Checkpoint includes the LSNs of the log records containing the tables
[Figure: log file records in order: dirty page table, update record, transaction table, checkpoint record, update record, update record. Label: Analysis pass]
Recovery - Passes
1. Analysis pass
   NTFS scans forward in the log file from the beginning of the last checkpoint
   Updates the transaction and dirty page tables it copied into memory
   NTFS scans the tables for the oldest update record of a non-committed transaction
2. Redo pass
   NTFS looks for "page update" records that contain volume modifications that might not have been flushed to disk
   NTFS redoes these updates in the cache until it reaches the end of the log file
   Cache manager "lazy writer thread" begins to flush the cache to disk
3. Undo pass
   Roll back any transactions that weren't committed when the system failed
   After the undo pass the volume is in a consistent state
   Write an empty LFS restart area; no recovery is needed if the system fails now
Undo Pass - Example
Transaction 1 was committed before the power failure
Transaction 2 was still active
NTFS must log undo operations in the log file!
  Power might fail again during recovery;
  NTFS would have to redo its undo operations
[Figure: log file records LSN 4044 through LSN 4049, including a "Transaction committed" record, followed by the power failure. Record contents shown:]
  Redo: Allocate/initialize an MFT file record   Undo: Deallocate the file record
  Redo: Add the filename to the index            Undo: Remove the filename from the index
  Redo: Set bits 3-9 in the bitmap               Undo: Clear bits 3-9 in the bitmap
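
The analysis, redo and undo passes can be sketched on a toy log. This is a deliberately simplified model under stated assumptions: it scans the whole log instead of starting at a checkpoint, it keeps no dirty page table, and it does not log its own undo operations as NTFS must. The record layout, keys and transaction names are hypothetical and do not reproduce the figure's LSNs.

```python
def recover(log, volume):
    """Replay a toy log against a toy volume dict; returns the rolled-back transactions."""
    # Analysis pass: find transactions that have updates but no commit record.
    committed = {r["txn"] for r in log if r["type"] == "commit"}
    active = {r["txn"] for r in log if r["type"] == "update"} - committed

    # Redo pass: reapply every logged update in forward order (idempotent).
    for r in log:
        if r["type"] == "update":
            volume[r["key"]] = r["new"]

    # Undo pass: reverse the updates of uncommitted transactions, newest first.
    for r in reversed(log):
        if r["type"] == "update" and r["txn"] in active:
            volume[r["key"]] = r["old"]
    return active


log = [
    {"type": "update", "txn": "T1", "key": "mft[20]", "old": None, "new": "file record"},
    {"type": "update", "txn": "T2", "key": "mft[21]", "old": None, "new": "file record"},
    {"type": "update", "txn": "T1", "key": "index", "old": "", "new": "a.dat"},
    {"type": "commit", "txn": "T1"},
    {"type": "update", "txn": "T2", "key": "bitmap[3-9]", "old": 0, "new": 1},
]

# State at the crash: only one of T2's changes had reached disk.
volume = {"mft[21]": "file record"}
rolled_back = recover(log, volume)
print(rolled_back)      # {'T2'}
print(volume["index"])  # a.dat
```

Running `recover` a second time on the same volume leaves it unchanged, which is the property that makes a crash during recovery survivable.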
NTFS Recovery - Conclusions
Recovery will return the volume to some preexisting consistent state (not necessarily the state before the crash)
Lazy commit algorithm: the log file is not immediately flushed when a "transaction committed" record is written
  LFS batches records;
  Flush when the cache manager calls or a checkpoint record is written (once every 5 sec)
  Several parallel transactions might have been active before the crash
NTFS uses log file mechanisms for error handling
  Most I/O errors are not file system errors
  NTFS might create an MFT record and then detect that the disk is full when allocating space for the file in the bitmap
  NTFS uses the log info to undo the changes and returns a "disk full" error to the caller
Fault Tolerance Support
NTFS's capabilities are enhanced by the fault-tolerant volume managers FtDisk/DMIO
  They lie above the hard disk drivers in the I/O system's layered driver scheme
  FtDisk: for basic disks
  DMIO: for dynamic disks
Volume management capabilities:
  Redundant data storage
  Dynamic data recovery from bad sectors on SCSI disks
NTFS itself implements bad-sector recovery for non-SCSI disks
Volume Management Features - Spanned Volumes
Spanned volumes:
  Single logical volume composed of a maximum of 32 areas of free space on one or more disks
  NTFS volume sets can be dynamically increased in size (only the bitmap file, which stores allocation status, needs to be extended)
  FtDisk/DMIO hide the physical configuration of disks from the file system
[Figure: two disks with partitions C: (100 MB), E: (100 MB), D: (100 MB) and D: (100 MB). Volume set D: occupies half of two disks]
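
Hiding the physical configuration amounts to translating an offset within the logical volume into a location within one of its areas. A minimal sketch, assuming the areas are simply concatenated in order; the function name and the disk labels are hypothetical, and the sizes follow the 100 MB areas of volume set D: in the figure.

```python
MB = 1024 * 1024


def locate_in_spanned_volume(offset, extents):
    """extents: list of (disk_name, size_in_bytes) in volume order.
    Returns (disk_name, offset_within_that_area)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    for disk, size in extents:
        if offset < size:
            return disk, offset
        offset -= size  # skip past this area and continue in the next one
    raise ValueError("offset beyond end of volume")


volume_set_d = [("disk 1", 100 * MB), ("disk 2", 100 * MB)]
print(locate_in_spanned_volume(150 * MB, volume_set_d))  # ('disk 2', 52428800)
```

Extending the volume only appends another entry to the list, which is consistent with the slide's remark that little more than the bitmap file needs to grow.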
Striped Volumes
Series of partitions, one partition per disk (of the same size)
Combined into a single logical volume
FtDisk/DMIO optimize data storage and retrieval times
  Stripes are narrow: 64 KB
  Data tends to be distributed evenly among disks
  Multiple pending read/write operations will operate on different disks
  Latency for disk I/O is often reduced (parallel seek operations)
[Figure: three partitions of 150 MB each, with labels 1 to 4]
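
The even distribution follows from how a volume offset is mapped to a disk. The sketch below assumes plain round-robin placement of 64 KB stripes, which is the conventional scheme for striping but is not spelled out on the slide; the function name is hypothetical.

```python
STRIPE_SIZE = 64 * 1024  # 64 KB, as stated on the slide


def locate_in_striped_volume(offset, disk_count, stripe_size=STRIPE_SIZE):
    """Return (disk_index, offset_on_that_disk) assuming round-robin stripes."""
    stripe_index, within_stripe = divmod(offset, stripe_size)
    disk = stripe_index % disk_count             # stripes rotate across the disks
    stripe_on_disk = stripe_index // disk_count  # how many stripes precede it on that disk
    return disk, stripe_on_disk * stripe_size + within_stripe


print(locate_in_striped_volume(0, 3))           # (0, 0)
print(locate_in_striped_volume(64 * 1024, 3))   # (1, 0)
print(locate_in_striped_volume(200 * 1024, 3))  # (0, 73728)
```

Consecutive 64 KB ranges land on different disks, so several pending requests can seek in parallel.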
Fault Tolerant Volumes
FtDisk/DMIO implement redundant storage schemes
  Mirror sets
  Stripe sets with parity
  Sector sparing
Tools: Windows Disk Management MMC snap-in
Mirrored volumes:
  Contents of a partition on one disk are duplicated on another disk
  FtDisk/DMIO write the same data to both locations
[Figure: C: and C: (mirror)]
RAID-5 Volumes
Fault-tolerant version of a regular stripe set
  Parity: logical sum (XOR)
  Parity info is distributed evenly over the available disks
FtDisk/DMIO reconstruct missing data by using the XOR operation
[Figure: stripe set with parity]
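
The reconstruction works because XOR is its own inverse: XORing the parity with the surviving blocks yields the missing one. A minimal sketch on two data blocks and one parity block; the helper name and the sample bytes are hypothetical, and all blocks are assumed to have equal length.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


d1 = b"NTFS"
d2 = b"data"
parity = xor_blocks(d1, d2)         # written to the parity stripe

# The disk holding d2 fails: rebuild it from the survivors.
recovered = xor_blocks(d1, parity)
print(recovered)                    # b'data'
```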
Bad Cluster Recovery
Sector sparing is supported by FtDisk/DMIO
  Dynamic copying of recovered data to spare sectors
  Without intervention from the file system or user
  Works for certain SCSI disks
  FtDisk/DMIO return a bad-sector warning to NTFS
Sector re-mapping is supported by NTFS
  NTFS will not reuse bad clusters
  NTFS copies data recovered by FtDisk/DMIO into a new cluster
NTFS cannot recover data from a bad sector without help from FtDisk/DMIO
NTFS will never write to a bad sector (re-map before write)
Bad-cluster re-mapping
[Figure: MFT records of a user file and of the bad-cluster file, each with NTFS filename, Standard info, Security desc. and Data attributes]
User file:
  Starting VCN   Starting LCN   Number of clusters
  0              1355           2
  2              1049           1
  3              1588           4
  Data runs shown: VCN 0-1 at LCN 1355-1356; VCN 2; VCN 3-6
Bad-cluster file:
  Starting VCN   Starting LCN   Number of clusters
  0              1357           1
  Data run shown: VCN 0 at LCN 1357 (Bad)
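
The run tables in the figure can be read as a lookup from a virtual cluster number (VCN) to a logical cluster number (LCN). The sketch below uses the run values from the figure; the function name and the tuple layout are hypothetical and are not the on-disk NTFS encoding.

```python
def vcn_to_lcn(vcn, runs):
    """runs: list of (starting_vcn, starting_lcn, cluster_count)."""
    for start_vcn, start_lcn, count in runs:
        if start_vcn <= vcn < start_vcn + count:
            return start_lcn + (vcn - start_vcn)
    raise KeyError(f"VCN {vcn} is not mapped")


user_file_runs = [(0, 1355, 2), (2, 1049, 1), (3, 1588, 4)]
bad_cluster_file_runs = [(0, 1357, 1)]

print(vcn_to_lcn(1, user_file_runs))         # 1356
print(vcn_to_lcn(2, user_file_runs))         # 1049
print(vcn_to_lcn(0, bad_cluster_file_runs))  # 1357
```

The user file's VCN 2 now resolves to LCN 1049, while LCN 1357 belongs to the bad-cluster file and so is never handed out again.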