System Design Using Kahn System Design Using Kahn
Process Networks:
Process Networks:
The Compaan/Laura Approach The Compaan/Laura Approach
Bart Kienhuis
Assistant Professor
LIACS, Leiden University
The Netherlands
DSP Performance Requirements DSP Performance Requirements
0 500 1000 1500 2000 2500
2000 2001 2002 2003 2004 2005 2006
Billion MAC/s
HDTV MPEG4
Video over IP
3G Wireless/
WCDMA
Future Broadband Standards
Voice over IP
General Purpose DSP/CPU Market
Requirements
Increasing
Gap 2.5G
Applications have a ferocious appetite for more programmable compute power
Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate
Embedded DSP Architectures Embedded DSP Architectures
Programmable Interconnect (NoC)
Programmable Interconnect (NoC)
IPcore IPcore RPU RPU Memory Memory
CPU CPU Micro Processor Micro Processor
Memory Memory
...
CPU: A simple Microprocessor
RPU: Reconfigurable Processing Unit IPcore; Dedicated Accelerator block NoC: Network on a Chip
Weakly coupled
Processing elements
Programming Problem Programming Problem
for j = 1:1:N,
[x(j)] = Source1( );
end
for i = 1:1:K,
[y(i)] = Source2( );
end
for j = 1:1:N, for i = 1:1:K,
[y(i), x(j)] = F( y(i), x(j) );
endend
for i = 1:1:K,
[Out(i)] = Sink( y( I ) );
end
Sequential
Application Specification
EASY to specify
DIFFICULT to map
Programmable Interconnect (NoC)
Programmable Interconnect (NoC)
IPcore IPcore RPU RPU Memory Memory
CPU CPU Micro Proce ssor Micro Proce ssor
Memory Memory
...
Programming
P1 P2
S1
Source
P3 P4
Sink
Parallel
Application Specification
EASY to map DIFFICULT to specify
Compaan
L
Application
au ra
Outline Outline
z The programming problem
z Kahn Process Networks
z System Design: Compaan/Laura Approach
z Case-study M-JPEG
z Conclusions
Embedded DSP Architectures Embedded DSP Architectures
• Distributed Control
• Distributed Memory To satisfy the computational requirements,
these architectures have to exploit:
Task-level Parallelism
Instruction Parallelism
Programmable Interconnect (NoC)
Programmable Interconnect (NoC)
RPU RPU Memory Memory
CPU CPU Micro Processor Micro Processor
...
Memory Memory
IPcore IPcore
QR Algorithm (smart antennas) QR Algorithm (smart antennas)
%parameter N 8 16;
%parameter K 100 1000;
for k = 1:1:K, for j = 1:1:N,
[ r(j,j), x(k,j), t ]=Vectorize( r(j,j), x(k,j) );
for i = j+1:1:N,
[ r(j,i), x(k,i), t]=Rotate( r(j,i), x(k,i), t );
end end end
Matlab Code (QR Algorithm)
Sequentially Ordered
Matrices are located in Big Global Memory
QR simple program: but keeps your CPU very busy
Solution Solution
z Change the model of computation in such a way that it better fits the model of
architecture.
z Make sure the data-type is of precisely the format that fits the architecture (e.g. Streams)
z What model of Computation would fit this description, when looking at Digital Signal Processing (DSP) applications, Imaging and Multi Media?
Kahn Process Networks
Kahn Process Network (KPN) Kahn Process Network (KPN)
z
Kahn Process Networks [Kahn 1974][Parks&Lee 95]
– Processes run autonomously – Communicate via
unbounded FIFOs
– Synchronize via blocking read
z
Process is either
– executing (execute) – communicating
(send/get)
z
Deterministic
z
Distributed Control
z
Distributed Memory
Fc A
Fa Fb
get execute send C get
execute send
send
get get
execute send
Fifo
C
B
Kahn Process Network (KPN) Kahn Process Network (KPN)
Fifo
Process A
Process C
Process B Fifo Fifo
Fifo FPGA B Fifo
CPU 1 FPGA A
•Autonomously operating Processes; no global schedule needed
•Blocking Read simple realize in Hardware
•Buffer Sizes of the FIFOs are quite often very small
The Compaan Tool Chain The Compaan Tool Chain
%parameter N 8 16;
%parameter K 100 1000;
for k = 1:1:K, for j = 1:1:N,
[r(j,j), x(k,j), t ]=Vectorize( r(j,j), x(k,j) );
for i = j+1:1:N,
[r(j,i), x(k,i), t]=Rotate( r(j,i), x(k,i), t );
end end end
Matlab Program
SAC MatParser
Matlab Application
outputR Rotate
Vectorize initialR
inputSamples
DgParser
PRDG
Source
P1 P2
S1
P3 P1
Process Network
Sink
Panda
Kahn Process Network
Polyhedral Reduce Dependence Graph (PRDG)
Data Dependency Analysis
Linearization
Data Dependency Analysis Data Dependency Analysis
j
1 2 3 4 5 N=6 1
2 4 3 5 N=6 for i= 1 : 1 : N,
for j= 1 : 1 : N,
[ a(i+j) ] = funcA( a(i+j) );
end end
i
i+j=6
a(i+1,j-1)
Ax >= b (polytope)
The for-next loops define an Iteration Domain
Polyhedral Reduced Dependence Graph Polyhedral Reduced Dependence Graph
%parameter N 8 16;
%parameter K 100 1000;
for k = 1:1:K, for j = 1:1:N,
[r(j,j), x(k,j), t ]=Vectorize( r(j,j),x(k,j) );
for i = j+1:1:N,
[ r(j,i), x(k,i), t]=Rotate( r(j,i), x(k,i), t );
end end end
Matlab Code
(QR Algorithm)Dependence Graph
k i
j Vec
Rot
Polyhedral Reduced Dependence Graph Polyhedral Reduced Dependence Graph
C A
B D
E
Polytope “C”
Polytope “D”
Linearization Linearization
z Linearization is the process of mapping high-order data-
structures (e.g., Matrices) on a 1-D stream
z We replaced the
indexing of the variable x(j,I) and x(n-1,m) by
relative put and get operation on a FIFO buffer (unboxing)
z Is this always possible?
for j = 1 :1 5, for i = j : 1 : 5,
[ x(j,I)]=F1();
end end
for j = 1 :1 5, for i = j : 1 : 5,
F2(x(n-1,m));
end end
Global Memory
for j = 1 :1 5, for i = j : 1 : 5,
[ out]=F1();
FIFO.put(out);
end end
for j = 1 :1 5, for i = j : 1 : 5,
in = FIFO.get();
F2(in);
end end
FIFO
Linearization
Producer Consumer
KPN Hand Off KPN Hand Off
Kahn Process Network
(b) (a)
(b)
(a) – rotate (b) – vectorize
Synthesizable VHDL
FPGA
Laura
Functional Simulation in Ptolemy II Ptolemy Actor
models in Java C++/YAPI
The Laura Tool The Laura Tool
KPN
Network of Virtual Processors
KPNtoArchicture
Mapping
Library of IP cores
Network of
Synthesizable Processors
Verilog VHDL SystemC
Platform dependent
Platform Independent
The Laura Tool The Laura Tool
IP2 OP1
IP1 OP2
Ch2
P2 Kahn Process Network
P3 P1
FIFO1 IP2 FIFO3
IP1
OP1
VP2
OP2 FIFO2Ch1 Ch3
KPNToArchitecture
Abstract Architecture
VP1 VP3
The Laura Tool The Laura Tool
DATA FLOW
Execution Unit IP Core
Read Unit
Controller
Write Unit
MUX DeM U X
FIFO
FIFO
FIFO
FIFO
Control Tables Control Tables
Structure of an individual processor
System Design Flow System Design Flow
z The Tools in action
– M-JPEG Example based on the original C-code of the Portable Video Research Group,
Stanford University.
– Simple Target Platform
• Common PC platform
• With FPGA board
Motion JPEG encoder Motion JPEG encoder
Sequence of
T frames
JPEG encoding
M-JPEG encoded video stream Video stream
(4:2:2 YUV format)
observed bitrate
dim V
dim H
Target Platform Target Platform
Memory Banks
Multiplexer
Address Control
Host Interface
Control Select
Virtex-II 2V6000 FPGA
ADM-XRCII board
Status
Control HW Design
Pentium IV Microprocessor
PCI bus
Control Address Data Data
Data
Data Address
Data
System Design Flow System Design Flow
KPN Application
In Matlab
HW/SW partitioning (Workload Analysis) Compaan Compiler
Compaan/Laura HW Compiler Hardware Processes
(Matlab)
Hardware Description VHDL
Memory Banks
Multiplexer
Address Control
Host Interface
Control Select
Virtex-II 2V6000 FPGA
ADM-XRCII board
Status
Control HW Design Pentium IV
Microprocessor PCI bus
Control Address Data Data Data
Data Address
SW Programming
DataHW Programming
Software Path Hardware Path
Software Processes (YAPI)
GCC/V++
SW Compiler
Object Code
M M - - JPEG Specification in Matlab JPEG Specification in Matlab
[ QTables, HuffTables, TablesInfo, EndOfFrame ] = P2_l_DefaultTables( );
fork = 1:1:NumFrames,
[ HeaderInfo ] = P1_l_VideoInInit( );
forj = 1:1:VNumBlocks, for i = 1:1:HNumBlocks,
[ Block( j ,i ) ] = P1_l_VideoInMain( );
end end
forj = 1:1:VNumBlocks, for i = 1:1:HNumBlocks,
[ Block( j , i ) ] = DCT( Block( j , i ) );
end end
forj = 1:1:VNumBlocks, for i = 1:1:HNumBlocks,
[ Block( j , i ) ] = Q( Block( j , i ), QTables );
[ Packets, StatisticsB ] = VLE( Block( j , i ), EndOfFrame, HuffTables );
[ BitRate, StatisticsF, EndOfFrame ] = CtrlF1( StatisticsB );
[ ] = VideoOut( HeaderInfo, TablesInfo, Packets );
end end
[ QTable, HuffTables, TablesInfo ] = P2_l_CtrlF2( BitRate, StatisticsF, QTables, HuffTables, TablesInfo );
end
Parameterized
%parameterNumFrames 1 1000;
%parameterVNumBlocks16 256;
%parameterHNumBlocks8 256;
Block( j , i )
Deriving a KPN Deriving a KPN
Application In Matlab
Compaan Compiler Functional
Verification
Ptolemy II PN Domain Workload
Analysis to do the
HW/SW
Partitioning YAPI/C++
Deriving a KPN
Deriving a KPN
The KPN of M
The KPN of M - - JPEG JPEG
VOut
CtrlF1
DCT Q P1
P2
Block Block Block VLE Packets
BitRate
QTabl e s
Hu ffT ab le s
Statist icsB
StatisticsF
EndOfFrame
TablesInfo struct Block {
int Y1[64]; /* block 8x8 pixels */
int Y2[64]; /* block 8x8 pixels */
int U[64]; /* block 8x8 pixels */
int V[64]; /* block 8x8 pixels */
};
HeaderInfo
Interface Code DCT Interface Code DCT
z The DCT process is selected to move to Hardware.
z Interface code is needed to run with the Software Processes
z Observe that ‘Blocks’
are being moved to the FPGA and from the
FPGA
void DCT::main() { // NumFrames = 100;
// VNumBlocks = 16;
// HNumBlocks = 8;
for ( int k=1; k <= NumFrames; k++ ) { for ( int j=1; j <= VNumBlocks; j++ ) {
for ( int i=1; i <= HNumBlocks; i++ ) { read( inPort, inBlock );
outBlock = DCT( inBlock );
write( outPort, outBlock );
}
}
}
}
Matlab of the DCT Process Matlab of the DCT Process
for k = 1:1:4, forj = 1:1:64,
[ Pixel( k , j ) ] = Source( inBlock );
end end
for k = 1:1:4, if k <= 2,
for j = 1:1:64,
[ Pixel( k , j ) ] =PreShift( Pixel( k , j ) );
end end
for j = 1:1:64,
[ Block ] = P_l_PixelsToBlock( Pixel( k , j ) );
end
[ Block ] = P_l_2D_dct( Block );
for j = 1:1:64,
[ Pixel( k , j ) ] = P_l_BlockToPixels( Block );
end end
for k = 1:1:4, for j = 1:1:64,
[ outBlock ] = Sink( Pixel( k , j ) );
end end
z Of the DCT process, we make a new Matlab
program
– This exposes more
parallelism at a finer lever.
– Automatic conversion from Blocks to Stream (Linearization)
z Compaan produces by default a process per function call.
– However, using the
Preamble ‘P_1’ we can
group processes.
KPN Sub network DCT KPN Sub network DCT
P
PreShift
Source Pixel Sink
Pixel
Pixel Pixel
inBlock outBlock
DCT
VOut
CtrlF1
DCT Q P1
P2
Block Block Block VLE Packets
HeaderInfo
BitRate
QTabl e s
Hu ffT ab le s
Statist icsB
StatisticsF
EndOfFrame
TablesInfo
Hierarchical Subnet of DCT
Programming the CPU Programming the CPU
M-JPEG specified In YAPI
C++ Compiler YAPI Executable
Pentium IV Pentium IV
YAPI Multithreading
Environment
Laura DCT Hardware Model Laura DCT Hardware Model
P
PreShift
Source
Pixel
SinkPixel
Pixel Pixel
inBlock outBlock
DCT
IP2 IP1
VP3
OP1FIFO1 FIFO2
FIFO3
FIFO4
VP2
VP1
(Source)
VP4
(Sink) (PreShift)
(P)
Sink/Source do the type Conversion
One-to-One Mapping
Abstract Hardware Model: Network of Virtual Processors
Laura DCT Hardware Model Laura DCT Hardware Model
z To get the functionality of the Virtual Processor, we integrated an IPcore.
z We have taken the Core (2D-DCT) from the Xilinx Webside.
z Make Processor specific for a platform
– Determine the Bit width – Determine the FIFO sizes – Take into account the
Clock
– Determine the Control tables for the switches
IP2 IP1
VP3
OP1FIFO1
FIFO3
FIFO4 (xilinx) (P)
0 1 MUX IP1
IP2
in_0 out_0
C
OP1
Control Table
Synch. Logic Synch. Logic
2D-DCT (Xilinx)
Control Unit Execute Unit
IP Core implementing
Read Unit Write Unit
Hw/ Hw/ Sw Sw Solution for M Solution for M - - JPEG JPEG
Memory Banks
Multiplexer
Address Control
Host Interface
Control Select
Virtex-II 2V6000 FPGA
ADM-XRCII board
Status
Control
PCI bus
Control Address Data Data
Data
Data Address
YAPI Multithreading
DataEnvironment Pentium IV Pentium IV
The way it is programmed; the CPU and FPGA run in parallel
Processing Time M
Processing Time M - - JPEG JPEG
Compaan Laura Other tools Manually Total M-JPEG -> KPN 00:00:22 -- -- 00:30:00 00:30:22 Software
Compilation -- -- 00:00:35 -- 00:00:35 DCT Subnet
Compilation 00:00:08 -- -- -- 00:00:08
Laura -- 00:00:07 -- 03:00:00 03:00:07
Synthesis to FPGA -- -- 00:13:10 -- 00:13:10
Overall 00:00:30 00:00:07 00:13:45 03:30:00 03:44:22
Device Utilization for the DCT Device Utilization for the DCT
FPGA resource Utilization % Number of MULT18x18s 8 out of 144 5%
Number of RAMB16s 4 out of 144 2%
Number of SLICEs 2367 out of 33792 7%
Number of BUFGMUXs 2 out of 16 12%
Virtex-II 2V6000 FPGA
(taking 4% of the FPGA)
Real Real - - time performance M time performance M - - JPEG JPEG
z Throughput of the system
– 10.5 CIF frame (128x128) per second
• Running Windows 2000
• Simple Compiler
• Simple Multithreading architecture z Required is 25 frames per second
– Communication FPGA/CPU is too slow (PCI) z However,
– 64 bit PCI
– Running at 66Mhz
– 4 times increase in performance
z Then 25 frames per second (128x128) not a problem
Exploration Exploration
P1 P2
S1
SinkS2
KPN_4
Generate
for j = 1:1:N,
[x(j)] = Source1( );
end
for i = 1:1:K,
[y(i)] = Source2( );
end
for j = 1:1:N, for i = 1:1:K,
[y(i), x(j)] =
F
( y(i), x(j) );end end
for i = 1:1:K,
[Out(i)] = Sink( y(i) );
end
P1 P2
P3 P4
S1
SinkS2
KPN_1
P1 P2
S1
SinkS2
KPN_3
P4
S1
SinkS2
P2 P3 P1
KPN_2
P
S1
Sink
S2
KPN_5
Map Explore
Programmable Interconnect (NoC)
Programmable Interconnect (NoC)
IPcoreIPcore RPURPU MemoryMemory
CPUCPU Micro Processor Micro Processor
Memory Memory
...
Alternative Application Instances
Unrolling/Unfolding Unrolling/Unfolding
%parameter N 100 1000;
%parameter K 8 48;
for j = 1:1:N, for i = 1:1:K,
[y(i), x(j)] =
F
(y(i), x(j));endend
F F F F
F F
F F
F F
F F
x(1) x(2) x(3) x(4)
y(1)
y(2)
y(3)
j
i F F F F
F F
F F
F F
F F
x(1) x(2) x(3) x(4)
y(1)
y(2)
y(3)
j i
Compaan
U = [ N, K ]
Difficult to derive
for j = 1:1:N,
if mod( j ,
if mod( j , 2 2 ) = 1, ) = 1,
for i = 1:1:K,
[y(i), x(j)] = F (y(i), x(j));
end
end end
if mod( j ,
if mod( j , 2 2 ) = 0, ) = 0,
for i = 1:1:K,
[y(i), x(j)] = F (y(i), x(j));
end
end end
end
MatTransform
Retiming/Skewing Retiming/Skewing
Skewing matrix
= =
= =
0 1 0 1 1 1
2222 2121
1212 1111
m m m m
m m m m
M M
F F F F
F F
F F
F F
F F
x(1) x(2) x(3) x(4)
y(1)
y(2)
y(3)
j i
for j = 2:1:N+K,
if mod( j ,
if mod( j , 2 2 ) = 1, ) = 1,
for i = max(1, j-N):1:min(j-1, K), [y(i), x(j-i)] = F (y(i), x(j-i));
end
end end
if mod( j ,
if mod( j , 2 2 ) = 0, ) = 0,
for i = max(1, j-N):1:min(j-1, K), [y(i), x(j-i)] = F (y(i), x(j-i));
end
end end F F F F end
F
F F
F F
F F
F
x(1) x(2) x(3) x(4)
y(1)
y(2)
y(3)
j
i F F F F
F
F F
F F
F F
F
x(1) x(2) x(3) x(4)
y(1)
y(2)
y(3)
j i
for j = 2:1:N+K,
for i = max(1, j-N):1:min(j-1, K), [y(i), x(j-i)] = F (y(i), x(j-i));
end end
Unfolding vector U = [ u
1, u
2] = [2, 1]
Compaan
Difficult to derive
%parameter N 100 1000;
%parameter K 8 48;
for j = 1:1:N, for i = 1:1:K,
[y(i), x(j)] =
F
(y(i), x(j));endend
Design Space Exploration Design Space Exploration
Retiming/Unrolling
Compaan Compiler
Initial Values of Parameters
Matlab Application
Matlab Code
New Values of Parameters Mapping
Laura/XFT Performance
Analysis Performance
Numbers Xilinx
Virtex-II Process
Network
Conclusions Conclusions
z To satisfy tomorrow’s applications, we will see
hierarchical multiprocessor systems with a number of CPUs, Memories, IPcores, and RPU.
z Programming these system will be difficult unless the MoC is changed to take Concurrency into account.
The key items will be
– Distributed Memory – Distributed Control
z Kahn Process Networks seem to be a very promising programming formalism for tomorrow’s HW/SW
codesign platforms
Conclusions Conclusions
z We showed proof-of-concept with a case in which we Convert M-JPEG into a KPN of which the processes are mapped either on hardware or software.
z In the M-JPEG case, the hardware and software were running concurrently, exploiting task-level parallelism
z Having good tools, we can start from (legacy) code in Matlab, C, or other imperative languages.
z The results are just the beginning. There is more to
achieve when more mature / commercial products are
used (RTOS, Compiler, Target Platform, Virtex Pro)
Publications Publications
z Bart Kienhuis, Edwin Rijpkema, and Ed F. Deprettere ``Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures.'', 8th International Workshop on
Hardware/Software Codesign (CODES'2000), May 3 -- 5 2000, San Diego, CA, USA.
z Alexandru Turjan, Bart Kienhuis, and Ed Deprettere
``A compile time based approach for solving out-of-order communication in Kahn Process
Networks'', in proceeding of IEEE 13th International Conference on Application-specific Systems, Architectures and Processors (ASAP'2002), San Jose, CA, USA, July 17-19, 2002
z Tim Harriss, Richard Walke, Bart Kienhuis, and Ed Deprettere
``Compilation from Matlab to Process Networks Realized in FPGA'', In journal on Design Automation of Embedded Systems, Kluwer, Vol 7, Issue 4, 2002
z Todor Stefanov, Bart Kienhuis, and Ed Deprettere
``Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances'', in proceeding of Tenth International Symposium on Hardware/Software Codesign CODES'2002, Stanley Hotel, Estes Park, Colorado, USA, May 6 -- 8, 2002
z Edwin Rijpkema, ``Modeling Task Level Parallelism in Piece-wise Regular Programs'',
PhD thesis, Leiden University, Leiden Institute of Advanced Computer Science (LIACS), The Netherlands, Sept 2002.
z Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, ``Solving out-of-order communication in Kahn Process Networks '', submitted for publication in Journal on VLSI Signal Processing-Systems for Signal, Image, and Video Technology. Kluwer Academic Publishers., 2003
z Claudiu Zissulescu , Todor Stefanov, Bart Kienhuis and Ed Deprettere, “Laura: Leiden Architecture Research and Exploration Tool”, submitted to The International Conference on Field Programmable Logic and Applications, September 1-3, 2003 Lisbon, Portugal
z Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan, Bart Kienhuis and Ed Deprettere, “System Design using Kahn Process Networks: The Compaan/Laura Approach”, submitted for review to ICCAD, November 9 –13, 2003, San Jose, CA, USA.