Unit OS3: Concurrency
3.2. Windows Trap Dispatching, Interrupts, Synchronization
Copyright Notice
© 2000-2005 David A. Solomon and Mark Russinovich
These materials are part of the Windows Operating System Internals Curriculum Development Kit, developed by David A. Solomon and Mark E. Russinovich with Andreas Polze
Microsoft has licensed these materials from David Solomon Expert Seminars, Inc. for distribution to academic organizations solely for use in academic environments (and not for commercial use)
Roadmap for Section 3.2.
Trap and Interrupt dispatching
IRQL levels & Interrupt Precedence
Spinlocks and Kernel Synchronization
Executive Synchronization
Processes and Threads
What is a process?
  Represents an instance of a running program
    You create a process to run a program
    Starting an application creates a process
  Process defined by:
    Address space
    Resources (e.g. open handles)
    Security profile (token)
What is a thread?
  An execution context within a process
  Unit of scheduling (threads run, processes don’t run)
  All threads in a process share the same per-process address space
    Services provided so that threads can synchronize access to shared resources (critical sections, mutexes, events, semaphores)
  All threads in the system are scheduled as peers to all others, without regard to their “parent” process
System calls
  Primary argument to CreateProcess is image file name (or command line)
  Primary argument to CreateThread is a function entry point address
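The two thread properties above (a function entry point as the primary creation argument, and one address space shared by all threads of a process) can be illustrated with a small sketch. It uses Python's `threading` module rather than the Windows API, so `worker`, `run_workers`, and `counter` are illustrative names only, and `threading.Lock` merely plays the role of a mutex or critical section.

```python
import threading

counter = {"value": 0}          # shared data in the process's single address space
lock = threading.Lock()         # stands in for a mutex / critical section


def worker(iterations):
    """Thread entry point: the function handed over when the thread is created."""
    for _ in range(iterations):
        with lock:              # serialize access to the shared counter
            counter["value"] += 1


def run_workers(thread_count=4, iterations=1000):
    """Start thread_count threads on the same entry point and wait for all of them."""
    counter["value"] = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

Every thread updates the same `counter` object because they all run in one address space; the lock is what keeps the updates from interleaving.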
[Figure: threads inside a per-process address space, alongside the systemwide address space]
Kernel Mode Versus User Mode
A processor state
  Controls access to memory
    Each memory page is tagged to show the required mode for reading and for writing
      Protects the system from the users
      Protects the user (process) from themselves
      System is not protected from system
    Code regions are tagged “no write in any mode”
  Controls ability to execute privileged instructions
  A Windows abstraction
    Intel: Ring 0, Ring 3
Control flow (i.e., a thread) can change from user to kernel mode and back
  Does not affect scheduling
  Thread context includes info about execution mode (along with registers, etc.)
PerfMon counters: “Privileged Time” and “User Time”
  4 levels of granularity: thread, process, processor, system
Getting Into Kernel Mode
Code is run in kernel mode for one of three reasons:
1. Requests from user mode
  Via the system service dispatch mechanism
  Kernel-mode code runs in the context of the requesting thread
2. Interrupts from external devices
  Windows interrupt dispatcher invokes the interrupt service routine
  ISR runs in the context of the interrupted thread (so-called “arbitrary thread context”)
  ISR often requests the execution of a “DPC routine,” which also runs in kernel mode
  Time not charged to interrupted thread
3. Dedicated kernel-mode system threads
  Some threads in the system stay in kernel mode at all times (mostly in the “System” process)
  Scheduled, preempted, etc., like any other threads
Trap dispatching
Trap: processor’s mechanism to capture executing thread
  Switch from user to kernel mode
  Interrupts – asynchronous
  Exceptions – synchronous
[Figure: trap handlers]
  Interrupt → Interrupt dispatcher → Interrupt service routines
  System service call → System service dispatcher → System services
  HW exceptions, SW exceptions → Exception dispatcher → Exception handlers
  Virtual address exceptions → Virtual memory manager’s pager
Interrupt Dispatching
[Figure: an interrupt arrives while user- or kernel-mode code is running; the interrupt dispatch routine and the interrupt service routine run in kernel mode]
Interrupt dispatch routine
  Disable interrupts
  Record machine state (trap frame) to allow resume
  Mask equal- and lower-IRQL interrupts
  Find and call appropriate ISR
  Dismiss interrupt
  Restore machine state (including mode and enabled interrupts)
Interrupt service routine
  Tell the device to stop interrupting
  Interrogate device state, start next operation on device, etc.
  Request a DPC
  Return to caller
Note, no thread or process context switch!
Interrupt Precedence via IRQLs (x86)
IRQL = Interrupt Request Level
  The “precedence” of the interrupt with respect to other interrupts
  Different interrupt sources have different IRQLs
  Not the same as IRQ
IRQL is also a state of the processor
  Servicing an interrupt raises processor IRQL to that interrupt’s IRQL
    This masks subsequent interrupts at equal and lower IRQLs
User mode is limited to IRQL 0
No waits or page faults at IRQL >= DISPATCH_LEVEL

IRQL  Name                         Category
31    High                         Hardware interrupts
30    Power fail
29    Interprocessor Interrupt
28    Clock
      Profile & Synch (Srv 2003)
      . . .
      Device 1
2     Dispatch/DPC                 Deferrable software interrupts
1     APC
0     Passive/Low                  Normal thread execution
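The masking rule on this slide (servicing an interrupt raises the processor IRQL, which holds back interrupts at equal and lower IRQLs) can be sketched as a toy model. This is not kernel code: `Processor`, its methods, and the `nested` parameter are hypothetical names, and the only real values used are the x86 IRQL numbers listed on the slide.

```python
# x86 IRQL values taken from the slide.
PASSIVE_LEVEL, APC_LEVEL, DISPATCH_LEVEL = 0, 1, 2
CLOCK_LEVEL, IPI_LEVEL, HIGH_LEVEL = 28, 29, 31


class Processor:
    """Toy model of one CPU's current IRQL and its masked (pending) interrupts."""

    def __init__(self):
        self.irql = PASSIVE_LEVEL
        self.pending = []       # IRQLs of interrupts held back by masking
        self.log = []           # IRQLs in the order they were serviced

    def can_wait(self):
        """Waits and page faults are only allowed below DISPATCH_LEVEL."""
        return self.irql < DISPATCH_LEVEL

    def interrupt(self, level, nested=()):
        """Deliver an interrupt; `nested` are interrupts arriving while its ISR runs."""
        if level <= self.irql:
            self.pending.append(level)      # masked: equal or lower IRQL
            return False
        previous = self.irql
        self.irql = level                   # raise to the interrupt's IRQL
        self.log.append(level)
        for inner in nested:                # arrivals during the ISR
            self.interrupt(inner)
        self.irql = previous                # lower IRQL once the ISR is done
        self._drain()
        return True

    def _drain(self):
        """Service pending interrupts that are no longer masked, highest first."""
        while True:
            ready = [lvl for lvl in self.pending if lvl > self.irql]
            if not ready:
                return
            nxt = max(ready)
            self.pending.remove(nxt)
            self.interrupt(nxt)
```

For example, `Processor().interrupt(5, nested=[28, 3, 5])` services the clock interrupt (28) in the middle of the level-5 ISR, while the level-3 and the second level-5 interrupts stay pending until the IRQL drops again.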
Interrupt processing
Interrupt dispatch table (IDT)
  Links to interrupt service routines
x86:
  Interrupt controller interrupts processor (single line)
  Processor queries for interrupt vector; uses vector as index to IDT
After ISR execution, IRQL is lowered to initial level
Interrupt object
Allows device drivers to register ISRs for their devices
  Contains dispatch code (initial handler)
  Dispatch code calls ISR with interrupt object as parameter (HW cannot pass parameters to ISR)
Connecting/disconnecting interrupt objects:
  Dynamic association between ISR and IDT entry
  Loadable device drivers (kernel modules)
  Turn on/off ISR
Interrupt objects can synchronize access to ISR data
  Multiple instances of an ISR may be active simultaneously (MP machine)
  Multiple ISRs may be connected with the same IRQL
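A minimal sketch of the interrupt-object idea follows: the dispatch table maps a vector to connected interrupt objects, and the dispatch code passes the interrupt object to the ISR because hardware cannot pass parameters. The class and method names (`InterruptObject`, `InterruptDispatchTable`, `connect`, `disconnect`, `deliver`) are assumptions made for illustration and are not the kernel's own interfaces.

```python
class InterruptObject:
    """Hypothetical stand-in for a kernel interrupt object."""

    def __init__(self, isr, context):
        self.isr = isr              # driver-supplied interrupt service routine
        self.context = context      # per-device data the ISR needs

    def dispatch(self):
        """Dispatch code: call the ISR, passing the interrupt object itself."""
        return self.isr(self)


class InterruptDispatchTable:
    """Maps an interrupt vector to the interrupt objects connected to it."""

    def __init__(self):
        self.entries = {}

    def connect(self, vector, interrupt_object):
        """Associate an ISR (via its interrupt object) with an IDT entry."""
        self.entries.setdefault(vector, []).append(interrupt_object)

    def disconnect(self, vector, interrupt_object):
        """Remove the association, e.g. when a driver unloads."""
        self.entries[vector].remove(interrupt_object)

    def deliver(self, vector):
        """Call every ISR connected to the vector and collect the results."""
        return [obj.dispatch() for obj in self.entries.get(vector, [])]
```

A driver would build an `InterruptObject` around its ISR and per-device state, `connect` it at load time, and `disconnect` it when unloading, which is the dynamic association the slide describes.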
Predefined IRQLs
High
  Used when halting the system (via KeBugCheck())
Power fail
  Originated in the NT design document, but has never been used
Inter-processor interrupt
  Used to request action from other processor (dispatching a thread, updating a processor’s TLB, system shutdown, system crash)
Clock
  Used to update system’s clock, allocation of CPU time to threads
Profile
  Used for kernel profiling (see Kernel profiler – Kernprof.exe, Res Kit)
Predefined IRQLs (contd.)
Device
  Used to prioritize device interrupts
DPC/dispatch and APC
  Software interrupts that kernel and device drivers generate
Passive
  No interrupt level at all, normal thread execution
IRQLs on 64-bit Systems

IRQL  x64                             IA64
15    High/Profile                    High/Profile/Power
14    Interprocessor Interrupt/Power  Interprocessor Interrupt
13    Clock                           Clock
12    Synch (Srv 2003)                Synch (MP only)
      Device n                        Device n
      . . .                           . . .
4     . . .                           Device 1
3     Device 1                        Correctable Machine Check
2     Dispatch/DPC                    Dispatch/DPC & Synch (UP only)
1     APC                             APC
0     Passive/Low                     Passive/Low
Interrupt Prioritization & Delivery
IRQLs are determined as follows:
  x86 UP systems: IRQL = 27 - IRQ
  x86 MP systems: bucketized (random)
  x64 & IA64 systems: IRQL = IDT vector number / 16
On MP systems, which processor is chosen to deliver an interrupt?
  By default, any processor can receive an interrupt from any device
    Can be configured with IntFilter utility in Resource Kit
  On x86 and x64 systems, the IOAPIC (I/O advanced programmable interrupt controller) is programmed to interrupt the processor running at the lowest IRQL
  On IA64 systems, the SAPIC (streamlined advanced programmable interrupt controller) is configured to interrupt one processor for each interrupt source
    Processors are assigned round robin for each interrupt vector
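The two IRQL formulas and the round-robin assignment on this slide are simple enough to write down directly. The sketch below assumes that "/ 16" means integer division and that round robin simply cycles through processor numbers in vector order; the function names are illustrative.

```python
def irql_x86_up(irq):
    """x86 uniprocessor rule from the slide: IRQL = 27 - IRQ."""
    return 27 - irq


def irql_from_vector(vector):
    """x64 / IA64 rule from the slide: IRQL = IDT vector number / 16.

    Integer division is an assumption made here.
    """
    return vector // 16


def assign_round_robin(vectors, cpu_count):
    """Assign each interrupt vector to a processor in turn (assumed ordering)."""
    return {vector: index % cpu_count for index, vector in enumerate(vectors)}
```

Under these assumptions IRQ 0 maps to IRQL 27 on an x86 uniprocessor system, and vectors 0x40 through 0x4F all share IRQL 4 on x64.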
Software interrupts
Initiating thread dispatching
  DPCs allow for scheduling actions when kernel is deep within many layers of code
  Delayed scheduling decision, one DPC queue per processor
Handling timer expiration
Asynchronous execution of a procedure in context of a particular thread
Support for asynchronous I/O operations
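The deferral idea behind DPCs can be sketched as follows: an ISR running at device IRQL only queues work on its processor's DPC queue, and the queue is drained at DISPATCH_LEVEL before the processor returns to the interrupted code. `Cpu`, `queue_dpc`, and `run_isr` are hypothetical names for this model; the real kernel logic is more involved.

```python
from collections import deque

PASSIVE_LEVEL, DISPATCH_LEVEL = 0, 2


class Cpu:
    """Toy CPU with its own DPC queue (one queue per processor)."""

    def __init__(self):
        self.irql = PASSIVE_LEVEL
        self.dpc_queue = deque()
        self.trace = []

    def queue_dpc(self, routine):
        """Request deferred work; the routine does not run yet."""
        self.dpc_queue.append(routine)

    def run_isr(self, device_irql, isr):
        """Run an ISR at device IRQL, then drain DPCs on the way back down."""
        previous = self.irql
        self.irql = device_irql
        isr(self)
        if previous < DISPATCH_LEVEL:
            self.irql = DISPATCH_LEVEL          # DPCs run at DISPATCH_LEVEL
            while self.dpc_queue:
                self.dpc_queue.popleft()(self)
        self.irql = previous
```

The ISR stays short because the longer processing is moved into the queued routine, which runs at a lower IRQL where device interrupts are no longer masked.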
Flow of Interrupts
[Figure: a peripheral device controller signals the CPU interrupt controller; the CPU uses its interrupt service table (entries 0 to n) to reach the interrupt object, which holds the ISR address, spin lock, and dispatch code]
KiInterruptDispatch:
  Raise IRQL
  Grab Spinlock
  Driver ISR: read from device, acknowledge interrupt, request DPC
  Drop Spinlock
  Lower IRQL
IRQL range: 0 to 31
Synchronization on SMP Systems
Synchronization on MP systems uses spinlocks to coordinate among the processors
Spinlock acquisition and release routines implement a one-owner-at-a-time algorithm
  A spinlock is either free, or is considered to be owned by a CPU
  Analogous to using Windows API mutexes from user mode
A spinlock is just a data cell in memory
  Accessed with a test-and-modify operation that is atomic across all processors
  KSPIN_LOCK is an opaque data type, typedef’d as a ULONG
  To implement synchronization, a single bit is sufficient
Kernel Synchronization
A spinlock is a locking primitive associated with a global data structure, such as the DPC queue
[Figure: Processor A and Processor B both contend for the spinlock that guards the DPC queue; each runs the sequence below, and removing the DPC is the critical section]
do
    acquire_spinlock(DPC)
until (SUCCESS)
begin
    remove DPC from queue
end
release_spinlock(DPC)
Spinlocks in Action
CPU 1
  Try to acquire spinlock:
    Test, set, WAS CLEAR (got the spinlock!)
  Begin updating data that’s protected by the spinlock
  (done with update)
  Release the spinlock:
    Clear the spinlock bit
CPU 2
  Try to acquire spinlock:
    Test, set, was set, loop
    Test, set, was set, loop
    Test, set, was set, loop
    Test, set, was set, loop
    Test, set, WAS CLEAR (got the spinlock!)
  Begin updating data
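The test-set-loop sequence traced on this slide can be modelled in a few lines. Python has no atomic test-and-set instruction, so in this sketch a private `threading.Lock` stands in for the hardware's atomicity; `SpinLock` and `hammer` are illustrative names, and the real kernel routines differ.

```python
import threading


class SpinLock:
    """Model of a one-bit spinlock built on an atomic test-and-set."""

    def __init__(self):
        self._bit = 0
        self._atomic = threading.Lock()     # stands in for hardware atomicity

    def _test_and_set(self):
        """Atomically return the old bit and set it to 1."""
        with self._atomic:
            old = self._bit
            self._bit = 1
            return old

    def acquire(self):
        while self._test_and_set():         # "was set": loop and try again
            pass
        # "was clear": the caller now owns the spinlock

    def release(self):
        with self._atomic:
            self._bit = 0                   # clear the spinlock bit


def hammer(lock, thread_count=4, iterations=500):
    """Have several threads update shared data under the spinlock."""
    total = [0]

    def work():
        for _ in range(iterations):
            lock.acquire()
            total[0] += 1                   # data protected by the spinlock
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Each failed `_test_and_set` corresponds to one "Test, set, was set, loop" line on the slide; the first call that returns 0 is the "WAS CLEAR" case.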
Queued Spinlocks
Problem: Checking status of spinlock via test-and-set operation creates bus contention
Queued spinlocks maintain queue of waiting processors
First processor acquires lock; other processors wait on processor-local flag
  Thus, busy-wait loop requires no access to the memory bus
When releasing lock, the flag of the first processor in the queue is modified
  Exactly one processor is being signaled
  Pre-determined wait order
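The hand-over behaviour of a queued spinlock can be sketched as a deterministic, single-threaded model: waiters line up in arrival order, each watches only its own flag, and a release sets exactly one flag. The class and attribute names below are assumptions for illustration and do not reflect the kernel's actual data layout.

```python
from collections import deque


class QueuedSpinLock:
    """Deterministic model: waiters queue up and each watches only its own flag."""

    def __init__(self):
        self.owner = None
        self.waiters = deque()      # processors in arrival order
        self.local_flag = {}        # per-processor flag, True once the lock is handed over

    def acquire(self, cpu):
        """Return True if cpu got the lock at once, False if it must spin locally."""
        self.local_flag[cpu] = False
        if self.owner is None:
            self.owner = cpu
            self.local_flag[cpu] = True
            return True
        self.waiters.append(cpu)
        return False

    def release(self, cpu):
        """Hand the lock to the first queued processor by setting only its flag."""
        assert self.owner == cpu, "only the owner may release"
        self.local_flag[cpu] = False
        if self.waiters:
            self.owner = self.waiters.popleft()
            self.local_flag[self.owner] = True
        else:
            self.owner = None
        return self.owner
```

Because a waiting processor only reads its own `local_flag` entry, the model reflects why the busy-wait loop does not need to touch a shared location, and why the wait order is fixed by arrival.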
SMP Scalability Improvements
Windows 2000: queued spinlocks
  !qlocks in Kernel Debugger
XP/2003:
  Minimized lock contention for hot locks (PFN or Page Frame Database lock)
  Some locks completely eliminated
    Charging nonpaged/paged pool quotas, allocating and mapping system page table entries, charging commitment of pages, allocating/mapping physical memory through AWE functions
  New, more efficient locking mechanism (pushlocks)
    Doesn’t use spinlocks when no contention
    Used for object manager and address windowing extensions (AWE) related locks
Server 2003:
  More spinlocks eliminated (context swap, system space, commit)
  Further reduction of use of spinlocks & length they are held
  Scheduling database now per-CPU
    Allows thread state transitions in parallel
Waiting
Flexible wait calls
  Wait for one or multiple objects in one call
  Wait for multiple can wait for “any” one or “all” at once
    “All”: all objects must be in the signalled state concurrently to resolve the wait
  All wait calls include optional timeout argument
  Waiting threads consume no CPU time
Waitable objects include:
  Events (may be auto-reset or manual reset; may be set or “pulsed”)
  Mutexes (“mutual exclusion”, one-at-a-time)
  Semaphores (n-at-a-time)
  Timers
  Processes and Threads (signalled upon exit or terminate)
  Directories (change notification)
No guaranteed ordering of wait resolution
  If multiple threads are waiting for an object, and only one thread is released (e.g. it’s a mutex or auto-reset event), which thread gets released is unpredictable
  Typical order of wait resolution is FIFO; however, APC delivery may change this order
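The difference between auto-reset and manual-reset events, and between "any" and "all" waits, can be sketched with a small model. `Event` and `wait_resolved` here are illustrative stand-ins and not the Windows API; the FIFO release order is only the typical case the slide mentions, not a guarantee.

```python
from collections import deque


class Event:
    """Toy event object: auto-reset releases one waiter, manual-reset releases all."""

    def __init__(self, auto_reset):
        self.auto_reset = auto_reset
        self.signaled = False
        self.waiters = deque()

    def wait(self, thread):
        """Return True if the wait resolves at once, otherwise queue the thread."""
        if self.signaled:
            if self.auto_reset:
                self.signaled = False       # consumed by this waiter
            return True
        self.waiters.append(thread)
        return False

    def set(self):
        """Signal the event and return the threads released by it."""
        if self.auto_reset:
            if self.waiters:
                return [self.waiters.popleft()]     # FIFO here; real order is not guaranteed
            self.signaled = True
            return []
        self.signaled = True
        released = list(self.waiters)
        self.waiters.clear()
        return released


def wait_resolved(objects, wait_all):
    """'All' needs every object signaled at the same moment; 'any' needs just one."""
    states = [obj.signaled for obj in objects]
    return all(states) if wait_all else any(states)
```

Setting an auto-reset event with two waiters releases only one of them and leaves the event nonsignaled, whereas a manual-reset event releases both and stays signaled.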
Executive Synchronization
Waiting on Dispatcher Objects – outside the kernel
[Figure: interaction with thread scheduling. Thread states shown: Initialized, Ready, Standby, Running, Waiting, Transition, Terminated]
  Create and initialize thread object → Initialized
  Thread waits on an object handle → Waiting
  Wait is complete; set object to signaled state → Ready
Interactions between Synchronization and Thread Dispatching
1. User mode thread waits on an event object’s handle
2. Kernel changes thread’s scheduling state from ready to waiting and adds thread to wait-list
3. Another thread sets the event
4. Kernel wakes up waiting threads; variable priority threads get priority boost
5. Dispatcher re-schedules new thread – it may preempt a running thread that has lower priority, and issues a software interrupt to initiate the context switch
6. If no processor can be preempted, the dispatcher places the ready thread in the dispatcher ready queue to be scheduled later
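Steps 3 to 6 can be sketched for a single processor as follows. The function name `set_event` and the size of the boost are assumptions (the slide does not say how large the boost is); the sketch relies only on the variable priority range being 1 to 15, with a boost never raising a thread above 15.

```python
def set_event(waiters, running_priority, boost=1):
    """Model steps 3-6 for one CPU.

    waiters: list of (name, priority) threads waiting on the event.
    Returns (preempting_thread_or_None, ready_queue).
    """
    woken = []
    for name, priority in waiters:
        # Only variable-priority threads (1-15) are boosted; the boost size is assumed.
        if 1 <= priority <= 15:
            priority = min(15, priority + boost)
        woken.append((name, priority))
    # Highest priority first; threads of equal priority keep their wait order.
    woken.sort(key=lambda thread: thread[1], reverse=True)
    preempting = None
    if woken and woken[0][1] > running_priority:
        preempting = woken.pop(0)           # step 5: preempt the lower-priority runner
    return preempting, woken                # step 6: the rest go to the ready queue
```

With a running thread at priority 10, a woken thread at base priority 8 is boosted to 9 and is placed in the ready queue, while a woken real-time thread at priority 24 preempts without any boost.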
What signals an object?
For each dispatcher object: system events and resulting state change, and effect of signaled state on waiting threads
Mutex (kernel mode)
  nonsignaled → signaled: Owning thread releases mutex
  signaled → nonsignaled: Resumed thread acquires mutex
  Effect: Kernel resumes one waiting thread
Mutex (exported to user mode)
  nonsignaled → signaled: Owning thread or other thread releases mutex
  signaled → nonsignaled: Resumed thread acquires mutex
  Effect: Kernel resumes one waiting thread
Semaphore
  nonsignaled → signaled: One thread releases the semaphore, freeing a resource
  signaled → nonsignaled: A thread acquires the semaphore. More resources are not available
  Effect: Kernel resumes one or more waiting threads
What signals an object? (contd.)
For each dispatcher object: system events and resulting state change, and effect of signaled state on waiting threads
Event
  nonsignaled → signaled: A thread sets the event
  signaled → nonsignaled: Kernel resumes one or more threads
  Effect: Kernel resumes one or more waiting threads
Event pair
  nonsignaled → signaled: Dedicated thread sets one event in the event pair
  signaled → nonsignaled: Kernel resumes the other dedicated thread
  Effect: Kernel resumes waiting dedicated thread
Timer
  nonsignaled → signaled: Timer expires
  signaled → nonsignaled: A thread (re)initializes the timer
  Effect: Kernel resumes all waiting threads
What signals an object? (contd.)
For each dispatcher object: system events and resulting state change, and effect of signaled state on waiting threads
File
  nonsignaled → signaled: IO operation completes
  signaled → nonsignaled: Thread initiates wait on an IO port
  Effect: Kernel resumes waiting dedicated thread
Process
  nonsignaled → signaled: Process terminates
  signaled → nonsignaled: A process reinitializes the process object
  Effect: Kernel resumes all waiting threads
Thread
  nonsignaled → signaled: Thread terminates
  signaled → nonsignaled: A thread reinitializes the thread object
  Effect: Kernel resumes all waiting threads
Further Reading
Mark E. Russinovich and David A. Solomon, Microsoft Windows Internals, 4th Edition, Microsoft Press, 2004.
Chapter 3 - System Mechanisms
  Trap Dispatching (from pp. 85)
  Synchronization (from pp. 149)
  Kernel Event Tracing (from pp. 175)
Source Code References
Windows Research Kernel sources
  \base\ntos\ke\i386 (similar files for amd64)
    Trap.asm, Trapc.c – Trap dispatcher
    Spinlock.asm – Spinlocks
    Clockint.asm – Clock Interrupt Handler
    Int.asm, Intobj.c, Intsup.asm – Interrupt Processing
  \base\ntos\ke
    eventobj.c – Event object
    mutntobj.c – Mutex object
    semphobj.c – Semaphore object
    timerobj.c, timersup.c – Timers
    wait.c, waitsup.c – Wait support
  \base\ntos\inc\ke.h – structure/type definitions