Beyond One Core
Multi-Cores
Evolution in Frequency Has Stalled
Source: ISCA 2009 Keynote of Kathy Yelick
Multi-Cores
Private and/or shared caches
Bus → Network (on chip = NoC)
[Figure: cores with their caches connected to memory through a network or bus]
Example: ARM Cortex A9
Source: ARM
Example: Intel Nehalem
Network on Chip
Various topologies: mesh, torus,…
Static or dynamic routing (wormhole)
Source: Eyal Friedman, 2008
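As an illustration of static routing on a mesh, the sketch below uses dimension-ordered (XY) routing: a packet first travels along X, then along Y. The slides do not name a specific algorithm, so XY routing and the function name xy_route are assumptions chosen for the example.

```python
def xy_route(src, dst):
    """Return the (x, y) routers visited from src to dst on a 2D mesh.

    Static XY routing (an assumed example scheme): the path depends only
    on source and destination, never on network load.
    """
    x, y = src
    path = [(x, y)]
    # First move along the X dimension until the column matches.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then move along the Y dimension until the row matches.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

A dynamic scheme would instead pick the next hop according to congestion, at the cost of more complex routers.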
Cache Coherence
Keep data coherent among caches
[Figure sequence: several cores, each with a private cache, connected to memory through a network; memory initially holds A=1]
1. A first core issues Load A: its cache receives A=1
2. A second core issues Load A: its cache receives A=1
3. A third core issues Load A: its cache receives A=1
4. One of these cores writes A=3 in its cache: the other two caches still hold A=1
5. The two stale copies are marked Invalid; memory still holds A=1
6. Memory is updated to A=3
Cache Coherence Protocol: MESI
M (Modified): no other cache has block in M, E or S state, value different from memory
E (Exclusive): no other cache has block in M, E or S, value same as memory
S (Shared): other caches sharing block
I (Invalid): cache block invalid
Source: Mehmet Senvar
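The MESI states can be exercised with a small simulation. This is a minimal sketch for a single memory block A, assuming a write-invalidate policy and write-back on a remote read; the class names CacheLine and CoherentSystem are hypothetical and not taken from the slides.

```python
class CacheLine:
    def __init__(self):
        self.state = "I"      # one of "M", "E", "S", "I"
        self.value = None


class CoherentSystem:
    """Hypothetical model: several private caches sharing one memory word."""

    def __init__(self, num_caches, initial_value):
        self.memory = initial_value
        self.caches = [CacheLine() for _ in range(num_caches)]

    def states(self):
        return [line.state for line in self.caches]

    def load(self, core):
        line = self.caches[core]
        if line.state == "I":                      # read miss
            holders = [o for o in self.caches
                       if o is not line and o.state != "I"]
            for other in holders:
                if other.state == "M":
                    self.memory = other.value      # owner writes back first
                other.state = "S"
            line.value = self.memory
            line.state = "S" if holders else "E"
        return line.value

    def store(self, core, value):
        line = self.caches[core]
        for other in self.caches:
            if other is not line and other.state != "I":
                if other.state == "M":
                    self.memory = other.value      # write back before invalidating
                other.state, other.value = "I", None
        line.state, line.value = "M", value        # memory is now stale
```

Loading A on three cores and then storing A=3 on one of them leaves that cache in M and the other copies in I, which matches the invalidation step of the example above the protocol definition.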
Parallelization
Decompose program into independent tasks
But tasks usually share data and synchronize
Pthreads
Example
Source: Charles Leiserson
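The slide names Pthreads (a C API); the sketch below illustrates the same idea with Python's standard threading module instead. Each task works on its own chunk independently, and synchronization is needed only when updating the shared total. The function name parallel_sum is hypothetical.

```python
import threading


def parallel_sum(values, num_threads=4):
    """Sum a list with several threads sharing one accumulator."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)       # independent work: no shared data touched
        with lock:                 # synchronize only for the shared update
            total += partial

    # Decompose the input into one chunk per thread.
    chunks = [values[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                   # wait for all tasks before reading total
    return total
```

Without the lock, two threads could read the same old value of total and one update would be lost.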
System
Hardware System & Operating System
Hardware system = processor + I/O devices
Operating system = software for:
• providing a hardware abstraction to the user
• managing all hardware resources
Main hardware resources:
• Processor: which process to execute? for how long?
• I/O (disk, network, keyboard, screen, …): how to communicate with the processor?
• Memory: where to place data and programs for users? how to allocate memory to users?
Input/Output
Device controller hides internal operations from the system
Main commands: read and write
Transfers by bytes or bursts
Special I/O registers and buffers
Registers and buffers are either memory-mapped or hardware elements
[Figure: processor, keyboard controller and screen controller connected by a bus]
Drivers
Hardware specifics of a device should have no impact on the O/S
O/S is shielded from device specifics through drivers
Drivers provide an abstract view of a category of devices
Drivers are hardware-specific software components added to the O/S
O/S provides a standard interface to the drivers of each category
[Figure: operating system (upper layer of O/S) above controllers 1 to 4, each attached to its device 1 to 4]
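The driver idea can be sketched as an abstract interface that the upper O/S layer programs against. Everything here is hypothetical (the CharDeviceDriver interface, the LoopbackDriver device, the os_echo routine); it only illustrates that the upper layer never sees device specifics.

```python
from abc import ABC, abstractmethod


class CharDeviceDriver(ABC):
    """Hypothetical standard interface for one category of devices."""

    @abstractmethod
    def read(self):
        """Return the next byte from the device."""

    @abstractmethod
    def write(self, value):
        """Send one byte to the device."""


class LoopbackDriver(CharDeviceDriver):
    """Hardware-specific part: here, a fake device that echoes its input."""

    def __init__(self):
        self._buffer = []

    def read(self):
        return self._buffer.pop(0)

    def write(self, value):
        self._buffer.append(value & 0xFF)


def os_echo(driver, data):
    """Upper O/S layer: uses only the standard interface."""
    for byte in data:
        driver.write(byte)
    return bytes(driver.read() for _ in data)
```

Supporting a new device in this sketch means adding another subclass; os_echo stays unchanged.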
Using an I/O Device
Permanent polling:
1. user calls a specific polling routine
2. I/O registers are read (or modified)
3. processor checks whether data has arrived in the buffer (or has been consumed)
4. new data is collected (or sent)
Processor must probe the device frequently
Processor and I/O devices work on different time scales
I/O Programming
Polling:
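• KBDR: code of keystroke
• KBSR: memory-mapped input status register
KBSR[15] = 1 if keystroke and KBDR not yet read
KBSR[15] = 0 if KBDR read
START   LDI   R1, @KBSR    ; read keyboard status register
        BRzp  START        ; bit 15 clear: no new keystroke, keep polling
        LDI   R0, @KBDR    ; read keystroke code
        BR    NEXT         ; continue with the rest of the program
@KBSR   .FILL xF400        ; address of KBSR
@KBDR   .FILL xF401        ; address of KBDR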
• CRTDR: ASCII code of character to display
• CRTSR: memory-mapped output register
CRTSR[15] = 1 if character not yet used
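The keyboard polling loop can be mirrored in a few lines of Python. This is only a sketch: the Keyboard class is a hypothetical software model of the KBSR/KBDR pair, and max_polls is an added bound so the example terminates when no key is pressed.

```python
READY = 0x8000  # bit 15 of the status register


class Keyboard:
    """Hypothetical model of a keyboard controller with KBSR and KBDR."""

    def __init__(self):
        self.kbsr = 0
        self.kbdr = 0

    def press(self, char):
        # Device side: store the keystroke code and raise bit 15.
        self.kbdr = ord(char)
        self.kbsr |= READY

    def read_data(self):
        # Reading KBDR clears bit 15 of KBSR.
        self.kbsr &= ~READY
        return self.kbdr


def poll_char(keyboard, max_polls=1000):
    """Busy-wait on KBSR[15], then read KBDR; None if nothing arrived."""
    for _ in range(max_polls):
        if keyboard.kbsr & READY:
            return chr(keyboard.read_data())
    return None
```

Every iteration of the loop is processor time spent only on checking the status bit, which is the cost of polling.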
Using an I/O Device
Interrupts
1. user calls a specific routine
2. I/O registers are modified
3. processor can execute other tasks
4. when data arrives (read), controller signals the processor
5. processor interrupts its current task and processes the data
I/O Programming
Interrupts
Processor must allow the interrupt (it can refuse)
Interrupt controller sends the interrupt routine address
Upon interrupt: context backup, PC set to interrupt routine, context restored at the end
DEBUT   LDI   R0, @KBDR    ; interrupt routine: read keystroke code
        RTI                ; return from interrupt
@KBDR   .FILL xF401        ; address of KBDR
Managing Several I/O Devices
Simultaneous interrupts are possible
Each device signals its interrupt request through a dedicated bus line
Interrupt controller manages interrupt priority
Interrupt controller signals authorization to devices
[Figure: devices 1 to 4 connected by interrupt lines to the interrupt controller, which sends an address to the processor]
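Priority management by the interrupt controller can be sketched as follows. The class InterruptController and the convention that a lower level number means higher priority are assumptions made for this example; the slides do not specify the ordering.

```python
class InterruptController:
    """Hypothetical model: one request line and one routine address per level."""

    def __init__(self, routine_addresses):
        self.routine_addresses = routine_addresses  # level -> routine address
        self.pending = set()

    def request(self, level):
        # A device raises its dedicated interrupt line.
        self.pending.add(level)

    def acknowledge(self):
        """Pick the highest-priority pending request (lowest level, by assumption).

        Returns (level, routine address), or None if nothing is pending.
        """
        if not self.pending:
            return None
        level = min(self.pending)
        self.pending.remove(level)
        return level, self.routine_addresses[level]
```

With simultaneous requests on levels 4 and 1, the first acknowledge serves level 1 and the second serves level 4.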
Example
Plug and Play on a PC:
• Initially, each device (card) had an interrupt level (0 to 7) and a fixed I/O register address (e.g., keyboard: 0x60 to 0x64)
  ⇒ possible conflicts among devices
  ⇒ later, possibility to change interrupt level and register addresses
• Plug and Play:
  devices have programmable interrupt levels and register addresses
  upon start-up, the system collects all devices and their desired interrupts
  it assigns interrupts to devices
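The start-up assignment step can be sketched as a simple greedy allocation. This is an illustrative assumption, not how any real Plug and Play firmware is documented to work: each device gets its first preferred level that is still free, else the lowest free level. The function name assign_interrupts is hypothetical.

```python
def assign_interrupts(requests, levels=range(8)):
    """Assign a distinct interrupt level to each device.

    requests maps a device name to its list of preferred levels.
    """
    free = set(levels)
    assignment = {}
    for device, preferred in requests.items():
        # Take the first preferred level that nobody else holds.
        choice = next((lvl for lvl in preferred if lvl in free), None)
        if choice is None:
            if not free:
                raise RuntimeError("no interrupt level left for " + device)
            choice = min(free)      # fall back to any free level
        free.remove(choice)
        assignment[device] = choice
    return assignment
```

Because levels are programmable, two devices asking for the same level no longer conflict: the second one is simply moved.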
Simplifying and Protecting Access to I/O Devices
Similar to a subroutine call, but with special instructions: TRAP/RTI
Prevents the user from directly accessing I/O device registers
[Figure: TRAP instruction format, with the trapvect8 field used as an index into the system call table]
System call table:
Index   Subroutine address
0       0x3000
1       0x3100
2       0x4F01
3       0x2A0B
...     ...
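The TRAP lookup can be sketched as a table indexed by trapvect8. The table values are the ones shown in the figure; the 16-bit encoding (opcode 1111 in the top four bits, trapvect8 in the low eight bits) is assumed from the LC-3 instruction set that the assembly examples appear to use, and the function name execute_trap is hypothetical.

```python
# System call table from the figure: index -> subroutine address.
SYSCALL_TABLE = {0: 0x3000, 1: 0x3100, 2: 0x4F01, 3: 0x2A0B}

TRAP_OPCODE = 0b1111  # assumed LC-3 encoding


def execute_trap(instruction, pc, table=SYSCALL_TABLE):
    """Return (new PC, saved return address) for a TRAP instruction."""
    if (instruction >> 12) != TRAP_OPCODE:
        raise ValueError("not a TRAP instruction")
    trapvect8 = instruction & 0xFF          # index into the system call table
    if trapvect8 not in table:
        raise ValueError("no system call at index %d" % trapvect8)
    # The user program never names the routine address itself:
    # it only supplies an index, and the table decides where to jump.
    return table[trapvect8], pc
```

The saved return address is what the matching return instruction uses to resume the user program.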
DMA
Direct Memory Access (DMA):
1. processor sends the starting target address, the number of bytes and the device address to the DMA controller
2. DMA controller starts the communication between device and memory
3. once the transfer is completed, the processor is notified
4. either a new block is transferred or the transfer stops
Processor is not required for I/O transfers
Example application:
• DVD viewing on a PC
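The DMA steps can be sketched in software. This is a toy model under stated assumptions: dma_transfer plays the role of the DMA controller, memory is a bytearray, and the on_complete callback stands in for the notification sent to the processor; all names are hypothetical.

```python
def dma_transfer(device_buffer, memory, start_address, num_bytes, on_complete):
    """Copy num_bytes from the device into memory, then notify the processor."""
    # Step 2: the controller moves the data; the processor is not involved.
    for i in range(num_bytes):
        memory[start_address + i] = device_buffer[i]
    # Step 3: the processor is told that the transfer is finished.
    on_complete(num_bytes)


def read_block(device_buffer, memory, start_address, num_bytes):
    """Step 1 as seen from the processor: program the transfer, get notified."""
    notifications = []
    dma_transfer(device_buffer, memory, start_address, num_bytes,
                 notifications.append)
    return notifications
```

In real hardware the copy runs concurrently with the processor; this sequential sketch only shows who does what.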
Putting It All Together: PC Organization
Example
I/O registers for disk controller:
• disk address (sector)
• memory address
• number of sectors
• read / write
[Figure: processor and disk controller (with its buffer) connected by a bus]
Example
Disk divided into cylinders, tracks, sectors
Controller provides an abstract view of the disk to the O/S: N contiguous sectors
"Read k sectors starting at sector i and send to address A" triggers:
• i converted into cylinder, head, sector
• head moved to track
• head waits for the sector to pass beneath it
• information sent as a sequence of bytes
[Figure: disk with storage surfaces and read/write heads]
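The conversion of a linear sector number i into cylinder, head and sector can be sketched as below. The geometry (4 heads, 32 sectors per track) is a made-up example, and all three coordinates are counted from 0 here, which is a simplifying assumption.

```python
def to_chs(i, heads=4, sectors_per_track=32):
    """Convert linear sector number i into (cylinder, head, sector)."""
    sectors_per_cylinder = heads * sectors_per_track
    cylinder = i // sectors_per_cylinder
    head = (i // sectors_per_track) % heads
    sector = i % sectors_per_track
    return cylinder, head, sector


def chs_for_request(i, k, heads=4, sectors_per_track=32):
    """Coordinates of the k sectors starting at sector i."""
    return [to_chs(n, heads, sectors_per_track) for n in range(i, i + k)]
```

Consecutive sectors stay on the same track until the track is exhausted, so a contiguous read needs few head movements.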
Example
Screen/Visualization
Contains a memory area (VRAM) accessed by the processor via the controller
1 pixel is associated with 1 to 24 bits (bitmap):
• 1 bit: black and white
• 24 bits: 3x8 bits for RGB (Red Green Blue)
• 8/16 bits: index into color palette
Example: displaying graphics
• no graphics accelerator:
  processor computes the image bitmap and sends it to the controller
• with graphics accelerator:
  program sends an abstract command (draw rectangle) through an API (e.g., DirectX)
  graphics card converts the command into a bitmap and stores the image in VRAM
  RAMDAC (Random Access Memory Digital-to-Analog Converter) transfers the image to the screen
[Figure: processor, graphics card with screen controller and video RAM, screen]
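The pixel formats translate directly into bit manipulation and size arithmetic. In this sketch the byte order (red in the most significant byte) and the 640x480 resolution used in the checks are assumptions; the function names are hypothetical.

```python
def pack_rgb(red, green, blue):
    """Pack three 8-bit components into one 24-bit pixel (red in the top byte)."""
    return (red << 16) | (green << 8) | blue


def unpack_rgb(pixel):
    """Recover the three 8-bit components from a 24-bit pixel."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF


def bitmap_bytes(width, height, bits_per_pixel):
    """VRAM bytes needed for one frame, rounded up to a whole byte."""
    return (width * height * bits_per_pixel + 7) // 8


def palette_lookup(palette, index):
    """8/16-bit modes store an index; the palette holds the actual color."""
    return palette[index]
```

A 24-bit frame needs 24 times the memory of a black-and-white one at the same resolution, which is why palette modes were used when VRAM was scarce.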
Example
Increasing role of graphics card:
• Initially: all in CPU
• Then: 2D in graphics processor, 3D in CPU
• Then: 3D in graphics processor
• Now: graphics processor can do general computations (GPGPU: General-Purpose computing on Graphics Processing Units)
GPU computational power > CPU
CPU manufacturers add graphics extensions, develop GPU-like architectures
PC architecture may revolve around GPU rather than CPU? Intel Larrabee vs. NVIDIA Fermi
[Figure: graphics card with video BIOS, GPU (fonts, graphics primitives...), video RAM and RAMDAC driving the screen]
A GPU Example
NVIDIA Fermi
16 multiprocessors
Each multiprocessor contains 32 cores
512 cores in total
40nm, 3 billion transistors
Fermi Multiprocessor
Evolved from GPU shaders
Warp = bundle of 32 threads
Simple cores
Shared register file
Shared L/S (load/store) units
Shared L1 cache
SFUs (Special Function Units): complex math
Source: Eyal Friedman, 2008
Multithreading in Fermi
Dual-Issue but pipelines independent (no dependences)
48 warps per multiprocessor
48 x 32 = 1536 threads per multiprocessor
512 threads executing each cycle on chip (one per core)
No delay for thread switching
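The thread counts follow from the figures given for Fermi. The short computation below uses only numbers from the slides (16 multiprocessors, 32 cores each, 48 warps of 32 threads); the chip-wide resident thread count is derived here and does not appear in the slides.

```python
MULTIPROCESSORS = 16
CORES_PER_MULTIPROCESSOR = 32
WARPS_PER_MULTIPROCESSOR = 48
THREADS_PER_WARP = 32

# One thread runs on each core every cycle.
threads_executing_per_cycle = MULTIPROCESSORS * CORES_PER_MULTIPROCESSOR

# Threads kept resident so that another warp is always ready to run.
resident_threads_per_multiprocessor = WARPS_PER_MULTIPROCESSOR * THREADS_PER_WARP

# Derived value (not in the slides): resident threads over the whole chip.
resident_threads_on_chip = MULTIPROCESSORS * resident_threads_per_multiprocessor

print(threads_executing_per_cycle,
      resident_threads_per_multiprocessor,
      resident_threads_on_chip)
```

Keeping 48 resident warps for 32 cores is what allows a stalled warp to be replaced by a ready one without delay.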
Memory Hierarchy
User-managed memory location
ECC (Error Correction Code) everywhere: registers, caches, memory
Configurable local memory:
• can be partitioned
• shared memory
• or small L1 caches
[Figure: host with access to global memory and constant memory; grid with blocks (0, 0) and (1, 0), each with its shared memory and threads (0, 0) and (1, 0) with their registers]
Programming Fermi
C/C++
Close to CPU programming
Compile into abstract representation (PTX)
Parallel & memory extensions
Memory
Simple DRAM Component
Few pins → multiplexed address
Capacitors → periodic refresh (≈ ms)
SIMM (Single In-line Memory Module) 4 MB: 4 chips of 1M x 8 bits; 32 bits per node
SDRAM
Page mode
• Keep row address
• Select column address
⇒ Faster access to same-row bits
• Output latch to store row → faster column address change
SDRAM:
• CAS & RAS synchronized with CPU clock
• Register indicates # bits to read
[Figure: page mode, EDO and SDRAM]
Virtual Memory
Words: 32 or 64 bits
Address space: 4 GB or 16 exabytes (1 exabyte = 2^60 bytes)
Physical memory size << address space size
Virtual memory:
• Give the illusion that physical memory size = address space size
• Physical memory ≈ cache of virtual memory
• Managed by the operating system
[Figure: memory hierarchy with core, memory (≈ 2^27 bytes, access ≈ ns) and disk (≈ 2^35 bytes, access ≈ ms)]
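The address space sizes can be checked with a short computation, assuming byte addressing and binary units (1 GB = 2^30 bytes, 1 exabyte = 2^60 bytes). The memory and disk sizes are the orders of magnitude given in the memory hierarchy figure.

```python
GB = 2 ** 30
EXABYTE = 2 ** 60

address_space_32 = 2 ** 32      # bytes addressable with 32-bit addresses
address_space_64 = 2 ** 64      # bytes addressable with 64-bit addresses

memory_size = 2 ** 27           # order of magnitude from the figure
disk_size = 2 ** 35             # order of magnitude from the figure

print(address_space_32 // GB, "GB")
print(address_space_64 // EXABYTE, "exabytes")
# How many times larger the 32-bit address space is than physical memory.
print(address_space_32 // memory_size)
```

Even the 32-bit address space is larger than the physical memory in this example, which is the gap virtual memory has to hide.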
Memory Management Unit (MMU)
Processor uses virtual addresses
Data at virtual address A_v is in physical location A_p ⇒ convert A_v to A_p
• conversion done by MMU
[Figure: the processor issues a 32-bit virtual address; the MMU converts it into a physical address sent to memory]
Pages
Pages: 512 B to 64 KB blocks
Disk fetch: more efficient with blocks
Byte address in page: same for physical and virtual addresses
[Figure: 32-bit word, 4 KB pages. Virtual address 11100100101100101110001100101100 = virtual page address (MSB) followed by byte address in page; physical address 10011001001110001100101100 = physical page address followed by the same byte address in page]
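With 4 KB pages, the low 12 bits of an address are the byte address in the page and the remaining high bits are the page address. The sketch below splits and rebuilds the two bit patterns shown in the figure; the function names are hypothetical.

```python
OFFSET_BITS = 12                      # 4 KB pages: 2**12 bytes per page
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def split_address(address):
    """Return (page address, byte address in page)."""
    return address >> OFFSET_BITS, address & OFFSET_MASK


def join_address(page, offset):
    """Rebuild an address from a page address and a byte address in page."""
    return (page << OFFSET_BITS) | offset


virtual = int("11100100101100101110001100101100", 2)
physical = int("10011001001110001100101100", 2)

virtual_page, offset = split_address(virtual)
physical_page, physical_offset = split_address(physical)
```

Only the page part changes during translation; the byte address in the page is copied unchanged.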
Address Translation
Page-based translation
Page tables (in memory)
Page table:
• indexed by virtual address
• hashed to physical address
[Figure: the page part (MSB) of the processor request (virtual address) goes through the page table to give the page part of the physical address; the LSBs are unchanged]
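Page-based translation can be sketched with a dictionary standing in for the page table. This is a simplified model under stated assumptions: 4 KB pages, a single-level table, and a hypothetical PageFault exception raised when the page has no entry; the example mapping reuses the page addresses from the Pages figure.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


class PageFault(Exception):
    """Raised when the virtual page has no entry in the page table."""


def translate(page_table, virtual_address):
    """Translate a virtual address using a virtual page -> physical page table."""
    virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
    if virtual_page not in page_table:
        raise PageFault(hex(virtual_address))
    # Only the page part is replaced; the offset is kept as is.
    return page_table[virtual_page] * PAGE_SIZE + offset


example_table = {0xE4B2E: 0x264E}
```

On a page fault, the operating system would fetch the page from disk, update the table and retry the access.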
Page Table Entry
Physical/virtual address translation
But also:
• process protection (illegal reads/writes among processes)
• information for page replacement
• page availability (in memory/not in memory)
• information for I/Os
Page table entry size ≈ 32 bits
[Figure: page table entry fields: physical page address, in memory, protection (r w x), modified, used, I/O]
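A page table entry can be sketched as a 32-bit word with flag bits and a physical page address. The bit positions below are entirely hypothetical (the slides give the fields, not their layout), as are the function names.

```python
# Hypothetical bit layout: flags in the low bits, page address from bit 12.
R, W, X, PRESENT, MODIFIED, USED, IO = (1 << i for i in range(7))
PAGE_SHIFT = 12


def make_pte(physical_page, flags):
    """Build an entry from a physical page address and a set of flags."""
    return (physical_page << PAGE_SHIFT) | flags


def pte_page(pte):
    """Extract the physical page address."""
    return pte >> PAGE_SHIFT


def check_access(pte, needed):
    """Check availability and protection, then update the bookkeeping bits.

    needed is a combination of R, W and X. Returns the updated entry.
    """
    if not (pte & PRESENT):
        raise LookupError("page fault: page not in memory")
    if (pte & needed) != needed:
        raise PermissionError("protection violation")
    pte |= USED                      # information for page replacement
    if needed & W:
        pte |= MODIFIED              # page must be written back before eviction
    return pte
```

The used and modified bits are what a replacement policy consults when choosing a page to evict.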
Segmentation
Different and complementary approach
A "user" view
Segments correspond to user-known categories:
• stack
• heap
• program
• …
Segments grow dynamically, can be limited
Segment pointer/offset
Can be combined with virtual memory (x86)
[Figure: Alpha virtual address space, showing the stack segment and the heap segment]
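Segment pointer/offset addressing can be sketched as a base plus a bounds-checked offset. The segment table contents and the names SegmentationFault and segment_address are hypothetical; real architectures differ in how limits and growth direction are encoded.

```python
class SegmentationFault(Exception):
    """Raised when an offset falls outside its segment."""


def segment_address(segments, name, offset):
    """Return base + offset for the named segment, checking the limit."""
    base, limit = segments[name]
    if not 0 <= offset < limit:
        raise SegmentationFault("%s offset %d out of range" % (name, offset))
    return base + offset


def grow_segment(segments, name, extra, max_limit):
    """Grow a segment dynamically, up to an imposed maximum size."""
    base, limit = segments[name]
    segments[name] = (base, min(limit + extra, max_limit))


# Hypothetical layout: name -> (base address, current size in bytes).
example_segments = {
    "program": (0x0000, 0x1000),
    "heap": (0x4000, 0x0800),
    "stack": (0x8000, 0x0400),
}
```

The resulting address can itself be a virtual address that then goes through paging, which is how the two mechanisms combine.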