
Academic year: 2022


Beyond One Core

Multi-Cores

Evolution in Frequency Has Stalled

Source: ISCA 2009 keynote by Kathy Yelick


Multi-Cores

Private and/or shared caches

Bus → Network (on chip = NoC)

[Figure: four cores, each with its own cache, connected to memory through a network or bus]

Example: ARM Cortex A9

Source: ARM

Example: Intel Nehalem


Network on Chip

Various topologies: mesh, torus,…

Static or dynamic routing (wormhole)

Source: Eyal Friedman, 2008
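
To make the difference between topologies concrete, here is a minimal sketch that counts hops between two nodes on a 2D mesh and on a torus. It assumes dimension-order (XY) routing, one common static scheme, on a k x k grid; the function names and the grid shape are illustrative and do not come from the slides.

```python
def mesh_hops(src, dst):
    """Hops on a 2D mesh with XY routing: travel along x, then along y."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])


def torus_hops(src, dst, k):
    """Hops on a k x k torus: each dimension may also wrap around."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return min(dx, k - dx) + min(dy, k - dy)
```

On a 4 x 4 grid, going from corner (0, 0) to corner (3, 3) takes 6 hops on the mesh but only 2 on the torus, thanks to the wrap-around links.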

Cache Coherence

Example with four cores, each with a private cache, connected by a network; memory initially holds A=1:

1. Core 1 executes Load A: the block is fetched and cache 1 holds A=1.
2. Core 3 executes Load A: cache 3 also holds A=1.
3. Core 4 executes Load A: caches 1, 3 and 4 all hold A=1.
4. Core 3 writes A=3: cache 3 holds A=3 while caches 1 and 4, and memory, still hold A=1.
5. The copies in caches 1 and 4 are marked Invalid; only cache 3 holds the current value A=3.
6. The new value reaches memory, which now holds A=3.

Keep data coherent among caches

Cache Coherence Protocol: MESI

M (Modified): no other cache has block in M, E or S state, value different from memory

E (Exclusive): no other cache has block in M, E or S, value same as memory

S (Shared): other caches sharing block

I (Invalid): cache block invalid

Source: Mehmet Senvar
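
The four states can be exercised with a toy model. The sketch below tracks a single memory block across several caches and applies simplified MESI transitions on loads and stores; the class name, the method names and the write-back-on-read policy are assumptions made for illustration, not a description of any real protocol implementation.

```python
class MesiSystem:
    """Toy MESI model for a single memory block shared by several caches."""

    def __init__(self, n_caches, memory_value):
        self.memory = memory_value
        self.state = ["I"] * n_caches      # one MESI state per cache
        self.value = [None] * n_caches     # cached copy of the block

    def _flush_modified(self):
        # A cache holding the block in M writes it back before sharing it.
        for i, s in enumerate(self.state):
            if s == "M":
                self.memory = self.value[i]
                self.state[i] = "S"

    def load(self, cache):
        if self.state[cache] == "I":
            self._flush_modified()
            others = [i for i, s in enumerate(self.state) if s != "I"]
            for i in others:
                self.state[i] = "S"        # E -> S when another cache reads
            self.state[cache] = "S" if others else "E"
            self.value[cache] = self.memory
        return self.value[cache]

    def store(self, cache, new_value):
        if self.state[cache] == "I":
            self._flush_modified()
        for i in range(len(self.state)):
            if i != cache:
                self.state[i] = "I"        # invalidate all other copies
                self.value[i] = None
        self.state[cache] = "M"
        self.value[cache] = new_value
```

With four caches and A=1 in memory, a load by one cache yields state E, further loads move every holder to S, and a store moves the writer to M while all other copies become I.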


Parallelization

Decompose program into independent tasks

But tasks usually share data and synchronize

Pthreads

Example

Source: Charles Leiserson
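
As a rough illustration of tasks that are mostly independent but still share data, the sketch below sums a list with several threads. Python's `threading` module stands in for Pthreads here: creating and joining threads and taking a lock play the roles of `pthread_create`, `pthread_join` and `pthread_mutex_lock`. The function name and the way the work is split are assumptions.

```python
import threading


def parallel_sum(values, n_threads=4):
    """Sum `values` with several threads that update one shared total."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)          # independent work, no sharing
        with lock:                    # synchronization on the shared data
            total += partial

    chunks = [values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # wait for every task to finish
    return total
```

Each thread does its own partial sum without sharing anything; only the final update of `total` needs the lock.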


System

Hardware System & Operating System

Hardware system = processor + I/O devices

Operating system = software for:
• providing a hardware abstraction to the user
• managing all hardware resources

Main hardware resources:
• Processor: which process to execute? For how long?
• I/O (disk, network, keyboard, screen, …): how to communicate with the processor?
• Memory: where to place users' data and programs? How to allocate memory to users?

Input/Output

Device controller hides internal operations from the system

Main commands: read and write

Transfers by bytes or bursts

Special I/O registers and buffers

Registers and buffers are either memory-mapped or hardware elements

[Figure: processor, keyboard controller and screen controller connected by a bus]

Drivers

Hardware specifics of a device should have no impact on the O/S

O/S shielded from device specifics through drivers

Drivers provide an abstract view of a category of devices

Drivers are hardware-specific software components added to the O/S

O/S provides a standard interface to the drivers of each category

[Figure: operating system with its upper layer above controllers 1 to 4, each attached to its device 1 to 4]

Using an I/O Device

Permanent polling:

1. user calls specific polling routine
2. I/O registers are read (or modified)
3. processor checks whether data has arrived in the buffer (or has been used)
4. new data collected (or sent)

Processor is frequently tied up probing the device

Processor and I/O devices work on different time scales

I/O Programming

Polling:

• KBDR: code of keystroke

• KBSR: memory-mapped input register

KBSR[15] = 1 if keystroke and KBDR not yet read

KBSR[15] = 0 if KBDR read

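START   LDI  R1, @KBSR    ; read keyboard status register
        BRzp START        ; KBSR[15] = 0: no new keystroke, keep polling
        LDI  R0, @KBDR    ; read keystroke code
        BR   NEXT
@KBSR   .FILL xF400       ; address of KBSR
@KBDR   .FILL xF401       ; address of KBDR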

• CRTDR: ASCII code of character to display

• CRTSR: memory-mapped output register

CRTSR[15] = 1 if character not yet used
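
The keyboard polling loop can be mimicked in a few lines. The sketch below models a memory-mapped keyboard with the KBSR and KBDR addresses given on the slide and busy-waits on KBSR[15], as the START loop does; the `Keyboard` class, the `press` method and the `max_probes` limit are hypothetical helpers added so that the example terminates.

```python
KBSR, KBDR = 0xF400, 0xF401        # register addresses from the slide


class Keyboard:
    """Hypothetical memory-mapped keyboard controller."""

    def __init__(self):
        self.regs = {KBSR: 0, KBDR: 0}

    def press(self, char):
        self.regs[KBDR] = ord(char)
        self.regs[KBSR] |= 0x8000          # KBSR[15] = 1: new keystroke

    def read(self, addr):
        value = self.regs[addr]
        if addr == KBDR:
            self.regs[KBSR] &= 0x7FFF      # reading KBDR clears KBSR[15]
        return value


def poll_key(kbd, max_probes=1000):
    """Busy-wait on KBSR[15], then read KBDR."""
    for _ in range(max_probes):
        if kbd.read(KBSR) & 0x8000:
            return chr(kbd.read(KBDR))
    return None                            # gave up: no keystroke arrived
```

Every probe that finds KBSR[15] = 0 is wasted processor time, which is the cost that interrupts are meant to avoid.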


Using an I/O Device

Interrupts

1. call a specific routine
2. I/O registers are modified
3. processor can execute other tasks
4. when data arrive (read), controller signals processor
5. processor interrupts its task, processes data

I/O Programming

Interrupts

Processor must allow interrupt (can refuse)

Interrupt controller sends interrupt routine address

Upon interrupt: context backup, PC set to interrupt routine, context restored at end of routine

DEBUT   LDI  R0, @KBDR    ; interrupt routine: read keystroke code
        RTI               ; return from interrupt
@KBDR   .FILL xF401       ; address of KBDR
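
The sequence "context backup, jump to routine, restore context" can be sketched with a toy processor model. Everything named here (`Processor`, `make_keyboard_routine`, the 0x3000 task address) is a hypothetical stand-in chosen for illustration.

```python
class Processor:
    """Toy processor: a program counter, registers and an interrupt flag."""

    def __init__(self):
        self.pc = 0x3000                  # hypothetical address of the user task
        self.regs = {"R0": 0}
        self.interrupts_enabled = True

    def interrupt(self, routine_address, routine):
        """Serve one interrupt request; return False if it is refused."""
        if not self.interrupts_enabled:
            return False
        saved_pc, saved_regs = self.pc, dict(self.regs)   # context backup
        self.pc = routine_address                          # PC to interrupt routine
        routine(self)                                      # e.g. LDI R0, @KBDR
        self.pc, self.regs = saved_pc, saved_regs          # RTI: restore context
        return True


def make_keyboard_routine(kbdr_value, buffer):
    """Build a routine that reads the data register into a buffer."""
    def routine(cpu):
        cpu.regs["R0"] = kbdr_value       # read the data register
        buffer.append(cpu.regs["R0"])     # hand the data over before returning
    return routine
```

Because the context is restored on return, the routine has to store the keystroke somewhere other than the interrupted task's registers; the buffer plays that role here.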

Managing Several I/O Devices

Simultaneous interrupts possible

Device signals its interrupt request through dedicated bus line

Interrupt controller manages interrupt priority

Interrupt controller signals authorization to devices

[Figure: devices 1 to 4 connected to the interrupt controller by interrupt lines; the interrupt controller sends an address to the processor]
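
One way to picture priority management is a controller that queues simultaneous requests and grants the most urgent one first. The sketch below assumes that a lower line number means a higher priority and that each line has one routine address; both are illustrative assumptions, as are the class and method names.

```python
import heapq


class InterruptController:
    """Queues simultaneous requests and grants them by priority."""

    def __init__(self, routine_addresses):
        self.routine_addresses = routine_addresses   # line -> routine address
        self.pending = []

    def request(self, line):
        # Assumption: lower line number = higher priority.
        heapq.heappush(self.pending, line)

    def grant_next(self):
        """Return (line, routine address) of the request to serve, or None."""
        if not self.pending:
            return None
        line = heapq.heappop(self.pending)
        return line, self.routine_addresses[line]
```

If devices 3, 1 and 2 raise requests at the same time, the controller serves them in the order 1, 2, 3 and passes the matching routine address to the processor each time.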

Example

Plug And Play on a PC:

Initially, each device (card) had an interrupt level (0 to 7) and a fixed I/O register address (e.g., keyboard: 0x60 to 0x64).

Possible conflicts among devices

Later, possibility to change interrupt level and register addresses

Plug And Play:
• devices have programmable interrupt levels and register addresses
• upon start-up, system collects all devices and desired interrupts
• it assigns interrupts to devices
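
The assignment step can be sketched as a simple first-fit allocation: each device lists the interrupt levels it could use and receives the first one still free. This is a deliberately naive illustration under assumed names (`assign_interrupts`, the device names in the test); actual Plug and Play resource arbitration is more involved.

```python
def assign_interrupts(requests, levels=range(8)):
    """Give each device its first acceptable free interrupt level.

    `requests` maps a device name to the levels it can use, in order of
    preference. Returns a dict device -> level; raises if no level is free.
    """
    free = set(levels)
    assignment = {}
    for device, wanted in requests.items():
        for level in wanted:
            if level in free:
                free.remove(level)
                assignment[device] = level
                break
        else:
            raise RuntimeError(f"no free interrupt level for {device}")
    return assignment
```

Two devices that both prefer level 5 no longer conflict: the second one falls back to its next acceptable level.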

Simplifying and Protecting Access to I/O Devices

Similar to a subroutine call but with special instructions: TRAP/RTI

Prevent user from directly accessing I/O device registers

TRAP instruction: bits 0 to 7 hold trapvect8, bits 8 to 11 are 0

System call table:

Index   Subroutine address
0       0x3000
1       0x3100
2       0x4F01
3       0x2A0B
...     ...
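
The dispatch through the system call table can be sketched as a lookup on the low 8 bits of the instruction. The four table entries are the ones visible on the slide; the `trap` function, the 16-bit instruction word 0xF002 used in the usage note (an LC-3 style encoding) and the returned tuple are assumptions for illustration.

```python
# Entries visible on the slide; the rest of the table is elided there.
SYSTEM_CALL_TABLE = {0: 0x3000, 1: 0x3100, 2: 0x4F01, 3: 0x2A0B}


def trap(instruction, pc):
    """Emulate TRAP: return (new PC, saved return address)."""
    trapvect8 = instruction & 0xFF            # low 8 bits select the table entry
    return SYSTEM_CALL_TABLE[trapvect8], pc   # RTI would later jump back to pc
```

For instance, `trap(0xF002, 0x3005)` selects index 2 and yields the routine at 0x4F01 together with the return address 0x3005. User code never needs to know the device register addresses, only the index.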

DMA

Direct Memory Access (DMA):

1. processor sends starting target address, number of bytes, device address to DMA controller

2. DMA controller starts communication between device and memory

3. once transfer completed, processor is notified

4. either a new block is transferred or the transfer stops

Processor is not required for I/O transfers

Example application:

• DVD viewing on a PC
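
The steps above can be sketched with a toy DMA controller that copies bytes from a device into memory while the processor is free to do something else. The class, its methods and the use of a Python iterator as the "device" are all assumptions made for the example.

```python
class DmaController:
    """Toy DMA controller copying bytes from a device into memory."""

    def __init__(self, memory):
        self.memory = memory
        self.remaining = 0
        self.done = False

    def start(self, target_address, n_bytes, device):
        """Step 1: the processor programs the transfer, then is free again."""
        self.target = target_address
        self.remaining = n_bytes
        self.device = device
        self.done = False

    def step(self):
        """Steps 2-3: move one byte without the processor; flag completion."""
        if self.remaining > 0:
            self.memory[self.target] = next(self.device)
            self.target += 1
            self.remaining -= 1
        if self.remaining == 0:
            self.done = True      # a real controller would signal the processor here
```

The processor only appears in `start`; every `step` happens between the device and memory.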


Putting It All Together: PC Organization

Example

I/O registers for disk controller:
• disk address (sector)
• memory address
• number of sectors
• read / write

[Figure: processor and disk controller, with its buffer, connected by a bus]

Example

Disk divided into cylinders, tracks, sectors

Controller provides abstract view of disk to O/S: N contiguous sectors

"Read k sectors starting at sector i and send to address A" triggers:

• i converted into cylinder, head, sector
• head moved to track
• head waits for sector to rotate beneath it
• information sent as a sequence of bytes

[Figure: disk with storage surfaces and read/write heads]
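
The first step, converting the linear sector number i into a position on the disk, can be sketched as two integer divisions. The sketch assumes a fixed geometry (same number of sectors on every track) and counts sectors from 0; the function name and parameters are illustrative.

```python
def sector_to_chs(i, heads, sectors_per_track):
    """Convert linear sector number i into (cylinder, head, sector)."""
    cylinder, rest = divmod(i, heads * sectors_per_track)
    head, sector = divmod(rest, sectors_per_track)
    return cylinder, head, sector       # sector counted from 0 here
```

With 4 heads and 32 sectors per track, sector 1000 maps to cylinder 7, head 3, sector 8.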


Example

Screen/Visualization

Contains a memory area (VRAM) accessed by processor via controller

1 pixel associated with 1 to 24 bits (bitmap)
• 1 bit: black and white
• 24 bits: 3x8 bits for RGB (Red Green Blue)
• 8/16 bits: index into color palette

Example: displaying graphics
• no graphics accelerator: processor computes image bitmap and sends it to controller
• with graphics accelerator:
  program sends abstract command (draw rectangle) through an API (e.g., DirectX)
  graphics card converts command into bitmap, stores image in VRAM
  RAMDAC (Random Access Memory Digital Analog Converter) transfers image to screen

[Figure: processor, graphics card with video RAM, screen controller]
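
The pixel depths translate directly into VRAM requirements. The sketch below computes the size of one full-screen bitmap and packs a 24-bit RGB pixel; the function names, the 640 x 480 resolution used in the usage note and the red-green-blue byte order are assumptions.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """VRAM needed for one full-screen bitmap, rounded up to whole bytes."""
    return (width * height * bits_per_pixel + 7) // 8


def pack_rgb(red, green, blue):
    """24-bit pixel: 3 x 8 bits for red, green and blue."""
    return (red << 16) | (green << 8) | blue
```

A 640 x 480 screen needs 38,400 bytes at 1 bit per pixel and 921,600 bytes at 24 bits per pixel, which is why palette indices of 8 or 16 bits were an attractive middle ground.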

Example

Increasing role of graphics card:

Initially: all in CPU

Then: 2D in graphics processor, 3D in CPU

Then: 3D in graphics processor

Now: graphics processor can do general computations (GPGPU: General-Purpose computing on Graphics Processing Units)

GPU computational power > CPU

CPU manufacturers add graphics extensions, develop GPU-like architectures

PC architecture may revolve around GPU rather than CPU? Intel Larrabee vs. NVIDIA Fermi

[Figure: graphics card with GPU, video RAM, video BIOS (fonts, graphics primitives...) and RAMDAC, with output to screen]

A GPU Example


NVIDIA Fermi

16 multiprocessors

Each multiprocessor contains 32 cores

512 cores in total

40nm, 3 billion transistors

[Figure: multiprocessor]

Fermi Multiprocessor

Evolved from GPU shaders

Warp = bundle of 32 threads

Simple cores

Shared register file

Shared L/S (load/store) units

Shared L1 cache

SFUs (special function units): complex math

Source: Eyal Friedman, 2008

Multithreading in Fermi

Dual-Issue but pipelines independent (no dependences)

48 warps per multiprocessor

48 x 32 = 1536 threads per multiprocessor

512 threads each cycle

No delay for thread switching
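
The figures on these slides fit together with a few multiplications. The sketch below reads 48 x 32 as the number of resident threads per multiprocessor (an interpretation of the slide) and derives the chip-wide totals; the constant names are illustrative.

```python
MULTIPROCESSORS = 16
CORES_PER_MULTIPROCESSOR = 32
WARP_SIZE = 32                     # threads per warp
WARPS_PER_MULTIPROCESSOR = 48

# One thread per core and per cycle across the whole chip.
total_cores = MULTIPROCESSORS * CORES_PER_MULTIPROCESSOR

# Threads kept resident so that another warp is always ready to run.
resident_threads_per_mp = WARPS_PER_MULTIPROCESSOR * WARP_SIZE
resident_threads_on_chip = MULTIPROCESSORS * resident_threads_per_mp
```

Far more threads are resident than can execute in one cycle, which is what lets the hardware switch to a ready warp whenever another one stalls.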

Memory Hierarchy

User-managed memory location

ECC (Error Correction Code) everywhere: registers, caches, memory

Configurable local memory: can be partitioned into shared memory or small L1 caches

[Figure: the host accesses global memory and constant memory; a grid contains blocks (0, 0) and (1, 0), each with its own shared memory and threads (0, 0) and (1, 0), each thread with its own registers]

Programming Fermi

C/C++

Close to CPU programming

Compile into abstract representation (PTX)

Parallel & Memory extensions

Memory


Simple DRAM Component

Few pins → multiplex address

Capacitors → periodic refresh (on the order of ms)

SIMM (Single In-line Memory Module) 4 MB: 4 chips 1 Mbit x 8; 32 bits per node

SDRAM

Page mode
• Keep row address
• Select column address
⇒ Faster access to same-row bits
• Output latch to store row
→ faster column address change

SDRAM:
• CAS & RAS synchronized with CPU clock
• Register indicates # bits to read

[Figure: page mode, EDO and SDRAM]
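
The benefit of keeping the row address can be illustrated with a small cycle count. The sketch below charges a row cost only when an access leaves the currently open row; the function name and the costs of 3 cycles per row activation and 1 cycle per column select are made-up values, not real DRAM timings.

```python
def access_cycles(addresses, columns_per_row, row_cycles=3, column_cycles=1):
    """Count cycles when the DRAM keeps the last row open (page mode)."""
    open_row, cycles = None, 0
    for address in addresses:
        row = address // columns_per_row
        if row != open_row:
            cycles += row_cycles      # new row address must be sent (RAS)
            open_row = row
        cycles += column_cycles       # column select (CAS)
    return cycles
```

Four consecutive addresses in the same row cost 7 cycles with these values, while four accesses alternating between two rows cost 16.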

Virtual Memory

Words: 32 or 64 bits.

Address space: 4 GB (2^32 bytes) or 16 exabytes (2^64 bytes; 1 exabyte = 2^60 bytes).

Physical memory size << address space size

Virtual memory:
• gives the illusion that physical memory size = address space size
• physical memory = cache of virtual memory
• managed by operating system

[Figure: memory hierarchy: core, memory (≈ 2^27 bytes, access time ≈ ns), disk (≈ 2^35 bytes, access time ≈ ms)]

Memory Management Unit (MMU)

Processor uses virtual addresses

Data at virtual address A_v is in physical location A_p

Convert A_v to A_p

Conversion done by MMU

[Figure: processor → MMU → memory; 32-bit virtual address 11100100101100101110001100101100; physical addresses 10011001001110, 10011001001111, ...]

Pages

Pages: 512B-64KB blocks

Disk fetch: more efficient with blocks

Byte address in page: same for physical and virtual addresses

Example (32-bit word, 4KB page, MSB on the left):

Virtual address:  11100100101100101110 | 001100101100   (virtual page address | byte address in page)
Physical address: 10011001001110 | 001100101100   (physical page address | byte address in page)
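
The split between page address and byte address in page is a shift and a mask. The sketch below applies it to the two bit strings shown on the slide, assuming 4 KB pages (12 offset bits); the function and variable names are illustrative.

```python
PAGE_BITS = 12                      # 4 KB page = 2**12 bytes


def split_address(address):
    """Return (page number, byte offset in page)."""
    return address >> PAGE_BITS, address & ((1 << PAGE_BITS) - 1)


virtual = int("11100100101100101110001100101100", 2)    # from the slide
physical = int("10011001001110001100101100", 2)         # from the slide
```

Both addresses share the offset 0x32C; only the page part differs (virtual page 0xE4B2E, physical page 0x264E).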

Address Translation

Page-based translation

Page tables (in memory)

Page table:
• indexed by virtual address
• hashed to physical address

[Figure: the processor request (virtual address) is split into a virtual page address and low-order bits (LSB); the page table maps the virtual page address to a physical page address]
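
A page-table lookup can be sketched with a dictionary from virtual page numbers to physical page numbers. The sketch assumes 4 KB pages and a single-level table; `translate` and `PageFault` are hypothetical names, and a real MMU does this in hardware.

```python
PAGE_BITS = 12   # assumes 4 KB pages


class PageFault(Exception):
    """Raised when the virtual page has no entry in the page table."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical one through a page table."""
    virtual_page = virtual_address >> PAGE_BITS
    offset = virtual_address & ((1 << PAGE_BITS) - 1)
    if virtual_page not in page_table:
        raise PageFault(hex(virtual_page))     # page not in physical memory
    return (page_table[virtual_page] << PAGE_BITS) | offset
```

With the entry 0xE4B2E → 0x264E, the virtual address 0xE4B2E32C translates to 0x264E32C: the page part is replaced and the low-order bits pass through unchanged.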


Page Table Entry

Physical/Virtual address translation

But also:
• process protection (illegal reads/writes among processes)
• information for page replacement
• page availability (in memory/not in memory)
• information for I/Os

Page table entry size ≈ 32 bits

[Figure: page table entry fields: protection (r w x), physical page address, in memory, modified, used, I/O]
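
The extra fields of a page table entry can be pictured as flags checked and updated on every access. The sketch below is a simplified model: the field names follow the slide, while the `access` function, its return strings and the default flag values are assumptions (the I/O field is left out).

```python
from dataclasses import dataclass


@dataclass
class PageTableEntry:
    physical_page: int
    in_memory: bool = True
    readable: bool = True
    writable: bool = False
    executable: bool = False
    modified: bool = False     # set on write; page must be saved before eviction
    used: bool = False         # set on any access; helps page replacement


def access(entry, write=False):
    """Check one access against the entry and update its bookkeeping bits."""
    if not entry.in_memory:
        return "page fault"
    if (write and not entry.writable) or (not write and not entry.readable):
        return "protection fault"
    entry.used = True
    entry.modified = entry.modified or write
    return "ok"
```

A write to a read-only page is rejected without touching the bookkeeping bits, and a page that is not in memory triggers a page fault before any protection check.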

Segmentation

Different and complementary approach

A "user" view

Segments correspond to user-known categories:

• stack

• heap

• program

• …

Segments grow dynamically, can be limited

Segment pointer/offset

Can be combined with virtual memory (x86)

[Figure: virtual address space of the Alpha, with stack segment and heap segment]
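
Segment pointer/offset addressing can be sketched as a base-plus-offset computation with a limit check. The segment names, base addresses and limits in the usage note are made-up values, and the function name is illustrative.

```python
def segment_translate(segments, name, offset):
    """Translate (segment, offset) using per-segment base and limit."""
    base, limit = segments[name]
    if not 0 <= offset < limit:
        raise IndexError(f"offset {offset} outside segment {name}")
    return base + offset
```

With `segments = {"program": (0x0000, 0x1000), "heap": (0x4000, 0x800)}`, the pair ("heap", 0x10) maps to 0x4010, and an offset at or beyond the limit is rejected. Growing a segment amounts to raising its limit.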
