How to teach a billion transistor chip a new trick
Dr. ir. Bart Kienhuis
University Leiden, LIACS Computer Systems Group http://www.liacs.nl
ICT-Kenniscongres 10 April 2006
Outline
Focus is programming ICs for Stream based Applications
FPGAs
Stream Based Applications
Multi-media
Imaging
Bio-informatics
Classical DSP
Increasing demand for high- performance, embedded compute performance
We know how to build billion transistor FPGAs, but cannot program efficiently
FPGA
Billion Transistors
Applications (C / Matlab / Java)
programming
?
3
Stream Based Applications
Imaging
HDTV 1080x720 @ 100 Hz
77 Million samples per seconds
Suppose 100 operations per sample
8 Billion Operations per second Classical DSP
LOFAR One antenna
Sampling rate 200*2 = 400 Million Samples per second
100 Operations per second (First stage)
40 Billion Operations per second
There are 10.000 Antennas expected = 40*10^13 operations per second
Next beamforming is performed, statistics obtained, but on decimated signal
Operation 10 – 10.00
Operations per Sample
10 – 1000 Million Samples per second Stream
Sample
Real-Time
Compute Requirements
Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate Trend: ferocious appetite for more embedded compute power
0 500 1000 1500 2000 2500
2000 2001 2002 2003 2004 2005 2006
Billion MAC/s
HDTV MPEG4
Video over IP
3G Wireless/
WCDMA
Future
Broadband Standards
Voice
over IP General
Purpose DSP/CPU Market
Requirements
Increasing Gap 2.5G
?
5
Intrinsic Computational Efficiency (ICE)
E. Roza, System-on-chip: what are the limits?, IEE Electronics
Communication Engineering Journal, vol13, No6, Dec 2001, pp 249-255.
Computational efficiency [MOPS/W]
i386SX
i486DX P5 68040 microsparc
Super sparc 601604
Ultra
sparc P6
604e 21164a Turbosparc 604e
21364
7400
106
105 104 103 102 101 100
2 1 0.5 0.25 0.13 0.07
Feature size [µm]
Microprocessors Intrinsic Computational
Efficiency of Silicon
Playing field
Xilinx Virtex II Pro
Heterogeneous Architecture
Multi CPUs
Distributed Memory Blocks
Programmable Logic (IP cores)
Commercially Available Platform
Virtex-4 is released and available
Virtex-5 is on the drawing board
FPGAs are becoming more important in Embedded
System Design
Flexible since they can easily be re-programmed
More functionality at lower cost
replacing a larger part of the ASICs market
Reconfigurable logic
Heterogeneous Architecture
PowerPCs
8 4 16 words x 1 bit memory
[A,B,C,D]
F
A 4-input lookup table
(LUT) can implement any function of 4 inputs.
For example, a 1-bit adder needs 2 LUTs:
A⊕B⊕Ci A.B.Ci
A B
Ci
Co
S
Building an FPGA: Logic First
0 1011
0 1010
1 0010
1 0001
F Address
(ABCD)
4
FF
CE RST
16 words x 1 bit memory M
Carry M
M
M
Din WE Cin
Cout
Make LUT RAM a user resource.
Fast carry ripple to neighbor.
Arithmetic, Distributed RAM
10 4
4
4 4
4 4
4 4
40
Group logic cells to reduce overhead.
Add H, V routing channels with
switchboxes.
Add input, output MUXing between logic and routing.
Add Interconnect
4 4
4 4
4 4
4 4
40
4 4
4 4
4 4
4 4
40
4 4
4 4
4 4
4 4
40
4 4
4 4
4 4
4 4
40
Build an Array
12
FPGA Technology
Virtex-4 has already over a billion transistors!
FPGA can take most
advantage of Moore’s law
More an more logic
available on the same die
CPUs in programmable logic
Hardcore =
Softcore =
13
1/91 1/92 1/93 1/94 1/95 1/96 1/97 1/98 1/99 1/00 1/01 1/02 1/03 1/04 1
10 100 1000
Capacity Speed
Price Virtex-II Pro
Spartan-3 Virtex-II Pro Memory
Bandwidth
Year
FPGA, the last 10 years
14
FPGA Programming
FPGA is flexible in Hardware and Software…
How to teach a billion transistor chip a new trick
Solution Industry doesn’t work
Programming means:
Write “C” programs for the Micro processors
Write “VHDL” programs for the IP cores and interconnection network
“C” programs and “VHDL”
programs need to work together
High performance
Error proof
Nightmare to debug
Programming is currently done manually with lots of engineers
“C”
“C”
“C”
“C”
VHDL
Virtex-II Pro FPGA
Productivity Lines of Code/Man Month
10 100 1,000 10,000 100,000 1,000,000 10,000,000 100,000,000
1981 1985 1989 1993 1997 2001 2005 2009
Logic transistors per chip (K)
10 100 1,000 10,000 100,000 1,000,000 10,000,000
Logic Tr./Chip
58% / Yr. compound complexity growth rate
Loc/MM
8-10% / Yr. compound productivity growth rate
Productivity gap
Software Productivity
We know how to build Billion transistor chips, but do not know how to program them efficiently
16
Research at LIACS (COMPAAN)
Kahn Process Network
Matlab/C/Java
Compaan
FPGA
Laura /Espam
for j = 1:1:N, [x(j)] = S1( );
end
for i = 1:1:K, [y(i)] = S2( );
end
for j = 1:1:N, for i = 1:1:K,
[y(i), x(j)] = func(y(i), x(j) );
end end
for i = 1:1:K,
[Out(i)] = Sink( y( I ) );
end
F1 F2
S2 F3 F4
Sink
S1
How ICT Research at LIACS tackles the Productivity Gap (Leiden University, STW PROGRESS)
Parameterized Nested Loop Programs
Toolbox
Toolbox to change the number of processes
Unrolling (More Parallelism)
Merging (Less Parallelism)
Better use of Resources
Re-timing
C-Slowing
Exploration
Track Technology
changes
ParallelismLess PartitioningMore Parallelism
P1 P2
P3
Application
18
Results: Motion JPEG (1)
for k = 1:1:NumFrames, for j = 1:1:VNumBlocks,
for i = 1:1:HNumBlocks, [ Block ] = VideoInMain();
[ Block ] = DCT( Block );
[ Block ] = Q( Block );
[ Packets ] = VLE( Block );
[ ] = VideoOut( Packets );
end end end
Generated Process Network
DATE04, ‘System Design using Kahn Process Networks: The Compaan/Laura Approach’
Block( j , i )
19
Motion JPEG (2)
From Matlab to FPGA implementation
Automatically derived a multi processor solution
Free choice between Microprocessor (MicroBlaze) or Hardware (IPCore)
Automatically generated
“C” program for all Microprocessors
VHDL program for interconnections and IP Core integration
Correct by Construction ND_2
DCT 400 ND_1
VideoIn
ND_2 DCT
ND_3 Q
ND_4 VLE
ND_5 VideoOut
10837 135036 68972 54210 2795
FIFO
Difference: 337
- Manual: 3 months - Compaan: 3 weeks
20
Results: QR Case study
for k = 1:1:21, for j = 1:1:7,
[r(j,j), rr(j,j), a, b, d(k)] =
bcell( r(j,j), rr(j,j),x(k,j), d(k));
for i = j+1:1:7,
[r(j,i), x(k,i)] = icell( r(j,i), x(k,i), a, b );
end end end
0 200 400 600 800 1000 1200 1400 1600 1800 2000
QR QR
Skew
QR Skew
+2st
QR Skew
+4st
QR Skew
+6st
QR Skew
+7st
QR Skew +10st
MFLOPS x28
QR decomposition Beam former/Radar
FPL04, ‘Increasing pipelined IP core utilization in Process Networks using Exploration’
- Manual: 1 year - Compaan: 3 days
Exploration
- 60MFlops out-of-the-box
- 2 GFlops, with Compaan/Laura
Multidisciplinary Work
Mathematics
Electrical Engineering
Mathematical Modeling (Polyhedral Mathematics) Efficient Solvers (Parametric Integer Programming)
Processor Design FPGA Technology
Electronics Design Automation
Models of Computations Compiler Technology Software Engineering
Compaan Project: 6 years, 6 PhDs, 2 Postdocs
Computer Science
22
Conclusions
Stream Based Applications require Heterogeneous Multiprocessor architecture
Stream of 10 – 1000 million samples per second
10 – 100 Giga operations per second
Latest FPGAs are Heterogeneous Multiprocessor architectures
Over a Billion Transistor and still counting
Problem is that we know how to build FPGAs, but programming them is very hard
Programming/Compiler Technology is a strategic ICT component to reduce Productivity Gap
Showed the Compaan Project at LIACS, which provides a solution for stream based applications.
Thanks to
Prof. Ed Deprettere
Dr. Todor Stefanov
Dr. Sven Verdoolaege
Alexandru Turjan
Claudiu Zissulescu
Hristo Nikolov
Sjoerd Meijer
Jerome Lemaitre
Bin Jiang
Wei Zhong
For more information http://www.liacs.nl/~kienhuis