How to teach a billion transistor chip a new trick

(1)

How to teach a billion transistor chip a new trick

Dr. ir. Bart Kienhuis

University Leiden, LIACS Computer Systems Group http://www.liacs.nl

ICT-Kenniscongres 10 April 2006

(2)

Outline

Focus is programming ICs for Stream based Applications

FPGAs

Stream Based Applications

Multi-media

Imaging

Bio-informatics

Classical DSP

Increasing demand for high- performance, embedded compute performance

We know how to build billion transistor FPGAs, but cannot program efficiently

FPGA

Billion Transistors

Applications (C / Matlab / Java)

programming

?

(3)

3

Stream Based Applications

Imaging

HDTV 1080x720 @ 100 Hz

77 Million samples per seconds

Suppose 100 operations per sample

8 Billion Operations per second Classical DSP

LOFAR One antenna

Sampling rate 200*2 = 400 Million Samples per second

100 Operations per second (First stage)

40 Billion Operations per second

There are 10.000 Antennas expected = 40*10^13 operations per second

Next beamforming is performed, statistics obtained, but on decimated signal

Operation 10 – 10.00

Operations per Sample

10 – 1000 Million Samples per second Stream

Sample

Real-Time

(4)

Compute Requirements

Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate Trend: ferocious appetite for more embedded compute power

0 500 1000 1500 2000 2500

2000 2001 2002 2003 2004 2005 2006

Billion MAC/s

HDTV MPEG4

Video over IP

3G Wireless/

WCDMA

Future

Broadband Standards

Voice

over IP ^General

Purpose DSP/CPU Market

Requirements

Increasing Gap 2.5G

?

(5)

5

Intrinsic Computational Efficiency (ICE)

E. Roza, System-on-chip: what are the limits?, IEE Electronics

Communication Engineering Journal, vol13, No6, Dec 2001, pp 249-255.

Computational efficiency [MOPS/W]

i386SX

i486DX P5 68040 microsparc

Super sparc 601604

Ultra

sparc P6

604e 21164a Turbosparc 604e

21364

7400

10⁶

10⁵ 10⁴ 10³ 10² 10¹ 10⁰

2 1 0.5 0.25 0.13 0.07

Feature size [µm]

Microprocessors Intrinsic Computational

Efficiency of Silicon

Playing field

(6)

Xilinx Virtex II Pro

Heterogeneous Architecture

Multi CPUs

Distributed Memory Blocks

Programmable Logic (IP cores)

Commercially Available Platform

Virtex-4 is released and available

Virtex-5 is on the drawing board

FPGAs are becoming more important in Embedded

System Design

Flexible since they can easily be re-programmed

More functionality at lower cost

replacing a larger part of the ASICs market

Reconfigurable logic

Heterogeneous Architecture

PowerPCs

(7)

8 4 16 words x 1 bit memory

[A,B,C,D]

F

A 4-input lookup table

(LUT) can implement any function of 4 inputs.

For example, a 1-bit adder needs 2 LUTs:

A⊕B⊕C_i A.B.C_i

A B

Ci

Co

S

Building an FPGA: Logic First

0 1011

0 1010

1 0010

1 0001

F Address

(ABCD)

(8)

4

FF

CE RST

16 words x 1 bit memory M

Carry M

M

Din WE Cin

Cout

Make LUT RAM a user resource.

Fast carry ripple to neighbor.

Arithmetic, Distributed RAM

(9)

10 4

4

4 4

40

Group logic cells to reduce overhead.

Add H, V routing channels with

switchboxes.

Add input, output MUXing between logic and routing.

Add Interconnect

(10)

4 4

40

4 4

40

4 4

40

4 4

40

Build an Array

(11)

12

FPGA Technology

Virtex-4 has already over a billion transistors!

FPGA can take most

advantage of Moore’s law

More an more logic

available on the same die

CPUs in programmable logic

Hardcore =

Softcore =

(12)

13

1/91 1/92 1/93 1/94 1/95 1/96 1/97 1/98 1/99 1/00 1/01 1/02 1/03 1/04 1

10 100 1000

Capacity Speed

Price Virtex-II Pro

Spartan-3 Virtex-II Pro Memory

Bandwidth

Year

FPGA, the last 10 years

(13)

14

FPGA Programming

FPGA is flexible in Hardware and Software…

How to teach a billion transistor chip a new trick

Solution Industry doesn’t work

Programming means:

Write “C” programs for the Micro processors

Write “VHDL” programs for the IP cores and interconnection network

“C” programs and “VHDL”

programs need to work together

High performance

Error proof

Nightmare to debug

Programming is currently done manually with lots of engineers

“C”

VHDL

Virtex-II Pro FPGA

(14)

Productivity Lines of Code/Man Month

10 100 1,000 10,000 100,000 1,000,000 10,000,000 100,000,000

1981 1985 1989 1993 1997 2001 2005 2009

Logic transistors per chip (K)

10 100 1,000 10,000 100,000 1,000,000 10,000,000

Logic Tr./Chip

58% / Yr. compound complexity growth rate

Loc/MM

8-10% / Yr. compound productivity growth rate

Productivity gap

Software Productivity

We know how to build Billion transistor chips, but do not know how to program them efficiently

(15)

16

Research at LIACS (COMPAAN)

Kahn Process Network

Matlab/C/Java

Compaan

FPGA

_{Laura /}

Espam

for j = 1:1:N, [x(j)] = S1( );

end

for i = 1:1:K, [y(i)] = S2( );

end

for j = 1:1:N, for i = 1:1:K,

[y(i), x(j)] = func(y(i), x(j) );

end end

for i = 1:1:K,

[Out(i)] = Sink( y( I ) );

end

F1 F2

S2 ^F3 ^F4

Sink

S1

How ICT Research at LIACS tackles the Productivity Gap (Leiden University, STW PROGRESS)

Parameterized Nested Loop Programs

(16)

Toolbox

Toolbox to change the number of processes

Unrolling (More Parallelism)

Merging (Less Parallelism)

Better use of Resources

Re-timing

C-Slowing

Exploration

Track Technology

changes

Parallelism^Less Partitioning

More Parallelism

P1 P2

P3

Application

(17)

18

Results: Motion JPEG (1)

for k = 1:1:NumFrames, for j = 1:1:VNumBlocks,

for i = 1:1:HNumBlocks, [ Block ] = VideoInMain();

[ Block ] = DCT( Block );

[ Block ] = Q( Block );

[ Packets ] = VLE( Block );

[ ] = VideoOut( Packets );

end end end

Generated Process Network

DATE04, ‘System Design using Kahn Process Networks: The Compaan/Laura Approach’

Block( j , i )

(18)

19

Motion JPEG (2)

From Matlab to FPGA implementation

Automatically derived a multi processor solution

Free choice between Microprocessor (MicroBlaze) or Hardware (IPCore)

Automatically generated

“C” program for all Microprocessors

VHDL program for interconnections and IP Core integration

Correct by Construction ND_2

DCT 400 ND_1

VideoIn

ND_2 DCT

ND_3 Q

ND_4 VLE

ND_5 VideoOut

10837 135036 68972 54210 2795

FIFO

Difference: 337

- Manual: 3 months - Compaan: 3 weeks

(19)

20

Results: QR Case study

for k = 1:1:21, for j = 1:1:7,

[r(j,j), rr(j,j), a, b, d(k)] =

bcell( r(j,j), rr(j,j),x(k,j), d(k));

for i = j+1:1:7,

[r(j,i), x(k,i)] = icell( r(j,i), x(k,i), a, b );

end end end

0 200 400 600 800 1000 1200 1400 1600 1800 2000

QR QR

Skew

QR Skew

+2st

QR Skew

+4st

QR Skew

+6st

QR Skew

+7st

QR Skew +10st

MFLOPS x28

QR decomposition Beam former/Radar

FPL04, ‘Increasing pipelined IP core utilization in Process Networks using Exploration’

- Manual: 1 year - Compaan: 3 days

Exploration

- 60MFlops out-of-the-box

- 2 GFlops, with Compaan/Laura

(20)

Multidisciplinary Work

Mathematics

Electrical Engineering

Mathematical Modeling (Polyhedral Mathematics) Efficient Solvers (Parametric Integer Programming)

Processor Design FPGA Technology

Electronics Design Automation

Models of Computations Compiler Technology Software Engineering

Compaan Project: 6 years, 6 PhDs, 2 Postdocs

Computer Science

(21)

22

Conclusions

Stream Based Applications require Heterogeneous Multiprocessor architecture

Stream of 10 – 1000 million samples per second

10 – 100 Giga operations per second

Latest FPGAs are Heterogeneous Multiprocessor architectures

Over a Billion Transistor and still counting

Problem is that we know how to build FPGAs, but programming them is very hard

Programming/Compiler Technology is a strategic ICT component to reduce Productivity Gap

Showed the Compaan Project at LIACS, which provides a solution for stream based applications.

(22)

Thanks to

Prof. Ed Deprettere

Dr. Todor Stefanov

Dr. Sven Verdoolaege

Alexandru Turjan

Claudiu Zissulescu

Hristo Nikolov

Sjoerd Meijer

Jerome Lemaitre

Bin Jiang

Wei Zhong

For more information http://www.liacs.nl/~kienhuis