Billion MAC/s

(1)

System Design Using Kahn System Design Using Kahn

Process Networks:

The Compaan/Laura Approach The Compaan/Laura Approach

Bart Kienhuis

Assistant Professor

LIACS, Leiden University

The Netherlands

(2)

DSP Performance Requirements DSP Performance Requirements

0 500 1000 1500 2000 2500

2000 2001 2002 2003 2004 2005 2006

Billion MAC/s

HDTV MPEG4

Video over IP

3G Wireless/

WCDMA

Future Broadband Standards

Voice over IP

General Purpose DSP/CPU Market

Requirements

Increasing

Gap 2.5G

Applications have a ferocious appetite for more programmable compute power

Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate

(3)

Embedded DSP Architectures Embedded DSP Architectures

Programmable Interconnect (NoC)

IPcore IPcore RPU RPU Memory Memory

CPU CPU Micro Processor Micro Processor

Memory Memory

...

CPU: A simple Microprocessor

RPU: Reconfigurable Processing Unit IPcore; Dedicated Accelerator block NoC: Network on a Chip

Weakly coupled

Processing elements

(4)

Programming Problem Programming Problem

for j = 1:1:N,

[x(j)] = Source1( );

end

for i = 1:1:K,

[y(i)] = Source2( );

end

for j = 1:1:N, for i = 1:1:K,

[y(i), x(j)] = F( y(i), x(j) );

endend

for i = 1:1:K,

[Out(i)] = Sink( y( I ) );

end

Sequential

Application Specification

EASY to specify

DIFFICULT to map

Programmable Interconnect (NoC)

IPcore IPcore RPU RPU Memory Memory

CPU CPU Micro Proce ssor Micro Proce ssor

Memory Memory

...

Programming

P1 P2

S1

Source

P3 P4

Sink

Parallel

Application Specification

EASY to map DIFFICULT to specify

Compaan

L

Application

au ra

(5)

Outline Outline

z The programming problem

z Kahn Process Networks

z System Design: Compaan/Laura Approach

z Case-study M-JPEG

z Conclusions

(6)

Embedded DSP Architectures Embedded DSP Architectures

• Distributed Control

• Distributed Memory To satisfy the computational requirements,

these architectures have to exploit:

Task-level Parallelism

Instruction Parallelism

Programmable Interconnect (NoC)

RPU RPU Memory Memory

CPU CPU Micro Processor Micro Processor

...

Memory Memory

IPcore IPcore

(7)

QR Algorithm (smart antennas) QR Algorithm (smart antennas)

%parameter N 8 16;

%parameter K 100 1000;

for k = 1:1:K, for j = 1:1:N,

[ r(j,j), x(k,j), t ]=Vectorize( r(j,j), x(k,j) );

for i = j+1:1:N,

[ r(j,i), x(k,i), t]=Rotate( r(j,i), x(k,i), t );

end end end

Matlab Code (QR Algorithm)

Sequentially Ordered

Matrices are located in Big Global Memory

QR simple program: but keeps your CPU very busy

(8)

Solution Solution

z Change the model of computation in such a way that it better fits the model of

architecture.

z Make sure the data-type is of precisely the format that fits the architecture (e.g. Streams)

z What model of Computation would fit this description, when looking at Digital Signal Processing (DSP) applications, Imaging and Multi Media?

Kahn Process Networks

(9)

Kahn Process Network (KPN) Kahn Process Network (KPN)

z

Kahn Process Networks [Kahn 1974][Parks&Lee 95]

– Processes run autonomously – Communicate via

unbounded FIFOs

– Synchronize via blocking read

z

Process is either

– executing (execute) – communicating

(send/get)

z

Deterministic

z

Distributed Control

z

Distributed Memory

Fc A

Fa Fb

get execute send C get

execute send

send

get get

execute send

Fifo

C

B

(10)

Kahn Process Network (KPN) Kahn Process Network (KPN)

Fifo

Process A

Process C

Process B Fifo Fifo

Fifo ^{FPGA B} Fifo

CPU 1 FPGA A

•Autonomously operating Processes; no global schedule needed

•Blocking Read simple realize in Hardware

•Buffer Sizes of the FIFOs are quite often very small

(11)

The Compaan Tool Chain The Compaan Tool Chain

%parameter N 8 16;

%parameter K 100 1000;

for k = 1:1:K, for j = 1:1:N,

[r(j,j), x(k,j), t ]=Vectorize( r(j,j), x(k,j) );

for i = j+1:1:N,

[r(j,i), x(k,i), t]=Rotate( r(j,i), x(k,i), t );

end end end

Matlab Program

SAC MatParser

Matlab Application

outputR Rotate

Vectorize initialR

inputSamples

DgParser

PRDG

Source

P1 P2

S1

P3 P1

Process Network

Sink

Panda

Kahn Process Network

Polyhedral Reduce Dependence Graph (PRDG)

Data Dependency Analysis

Linearization

(12)

Data Dependency Analysis Data Dependency Analysis

j

1 2 3 4 5 N=6 1

2 4 3 5 N=6 for i= 1 : 1 : N,

for j= 1 : 1 : N,

[ a(i+j) ] = funcA( a(i+j) );

end end

i

i+j=6

a(i+1,j-1)

Ax >= b (polytope)

The for-next loops define an Iteration Domain

(13)

Polyhedral Reduced Dependence Graph Polyhedral Reduced Dependence Graph

%parameter N 8 16;

%parameter K 100 1000;

for k = 1:1:K, for j = 1:1:N,

[r(j,j), x(k,j), t ]=Vectorize( r(j,j),x(k,j) );

for i = j+1:1:N,

[ r(j,i), x(k,i), t]=Rotate( r(j,i), x(k,i), t );

end end end

Matlab Code

(QR Algorithm)

Dependence Graph

k i

j Vec

Rot

(14)

Polyhedral Reduced Dependence Graph Polyhedral Reduced Dependence Graph

C A

B D

E

Polytope “C”

Polytope “D”

(15)

Linearization Linearization

z Linearization is the process of mapping high-order data-

structures (e.g., Matrices) on a 1-D stream

z We replaced the

indexing of the variable x(j,I) and x(n-1,m) by

relative put and get operation on a FIFO buffer (unboxing)

z Is this always possible?

for j = 1 :1 5, for i = j : 1 : 5,

[ x(j,I)]=F1();

end end

for j = 1 :1 5, for i = j : 1 : 5,

F2(x(n-1,m));

end end

Global Memory

for j = 1 :1 5, for i = j : 1 : 5,

[ out]=F1();

FIFO.put(out);

end end

for j = 1 :1 5, for i = j : 1 : 5,

in = FIFO.get();

F2(in);

end end

FIFO

Linearization

Producer Consumer

(16)

KPN Hand Off KPN Hand Off

Kahn Process Network

(b) (a)

(b)

(a) – rotate (b) – vectorize

Synthesizable VHDL

FPGA

Laura

Functional Simulation in Ptolemy II Ptolemy Actor

models in Java C++/YAPI

(17)

The Laura Tool The Laura Tool

KPN

Network of Virtual Processors

KPNtoArchicture

Mapping

Library of IP cores

Network of

Synthesizable Processors

Verilog VHDL SystemC

Platform dependent

Platform Independent

(18)

The Laura Tool The Laura Tool

IP2 OP1

IP1 OP2

Ch2

P2 Kahn Process Network

P3 P1

FIFO1 ^IP2 FIFO3

IP1

OP1

VP2

OP2 FIFO2

Ch1 Ch3

KPNToArchitecture

Abstract Architecture

VP1 VP3

(19)

The Laura Tool The Laura Tool

DATA FLOW

Execution Unit IP Core

Read Unit

Controller

Write Unit

MUX DeM U X

FIFO

Control Tables Control Tables

Structure of an individual processor

(20)

System Design Flow System Design Flow

z The Tools in action

– M-JPEG Example based on the original C-code of the Portable Video Research Group,

Stanford University.

– Simple Target Platform

• Common PC platform

• With FPGA board

(21)

Motion JPEG encoder Motion JPEG encoder

Sequence of

T ^frames

JPEG encoding

M-JPEG encoded video stream Video stream

(4:2:2 YUV format)

observed bitrate

dim V

dim H

(22)

Target Platform Target Platform

Memory Banks

Multiplexer

Address Control

Host Interface

Control Select

Virtex-II 2V6000 FPGA

ADM-XRCII board

Status

Control HW Design

Pentium IV Microprocessor

PCI bus

Control Address Data Data

Data

Data Address

Data

(23)

System Design Flow System Design Flow

KPN Application

In Matlab

HW/SW partitioning (Workload Analysis) Compaan Compiler

Compaan/Laura HW Compiler Hardware Processes

(Matlab)

Hardware Description VHDL

Memory Banks

Multiplexer

Host Interface

Virtex-II 2V6000 FPGA

ADM-XRCII board

Status

Control HW Design Pentium IV

Microprocessor PCI bus

Control Address Data Data Data

SW Programming

Data

HW Programming

Software Path Hardware Path

Software Processes (YAPI)

GCC/V++

SW Compiler

Object Code

(24)

M M - - JPEG Specification in Matlab JPEG Specification in Matlab

[ QTables, HuffTables, TablesInfo, EndOfFrame ] = P2_l_DefaultTables( );

fork = 1:1:NumFrames,

[ HeaderInfo ] = P1_l_VideoInInit( );

forj = 1:1:VNumBlocks, for i = 1:1:HNumBlocks,

[ Block( j ,i ) ] = P1_l_VideoInMain( );

end end

[ Block( j , i ) ] = DCT( Block( j , i ) );

end end

[ Block( j , i ) ] = Q( Block( j , i ), QTables );

[ Packets, StatisticsB ] = VLE( Block( j , i ), EndOfFrame, HuffTables );

[ BitRate, StatisticsF, EndOfFrame ] = CtrlF1( StatisticsB );

[ ] = VideoOut( HeaderInfo, TablesInfo, Packets );

end end

[ QTable, HuffTables, TablesInfo ] = P2_l_CtrlF2( BitRate, StatisticsF, QTables, HuffTables, TablesInfo );

end

Parameterized

%parameterNumFrames 1 1000;

%parameterVNumBlocks16 256;

%parameterHNumBlocks8 256;

Block( j , i )

(25)

Deriving a KPN Deriving a KPN

Application In Matlab

Compaan Compiler _Functional

Verification

Ptolemy II PN Domain Workload

Analysis to do the

HW/SW

Partitioning YAPI/C++

(26)

Deriving a KPN

(27)

The KPN of M

The KPN of M - - JPEG JPEG

VOut

CtrlF1

DCT Q P1

P2

Block Block Block VLE Packets

BitRate

QTabl e s

Hu ffT ab le s

Statist icsB

StatisticsF

EndOfFrame

TablesInfo struct Block {

**int Y1[64]; /* block 8x8 pixels */**

**int Y2[64]; /* block 8x8 pixels */**

**int U[64]; /* block 8x8 pixels */**

**int V[64]; /* block 8x8 pixels */**

};

HeaderInfo

(28)

Interface Code DCT Interface Code DCT

z The DCT process is selected to move to Hardware.

z Interface code is needed to run with the Software Processes

z Observe that ‘Blocks’

are being moved to the FPGA and from the

FPGA

void DCT::main() { // NumFrames = 100;

// VNumBlocks = 16;

// HNumBlocks = 8;

for ( int k=1; k <= NumFrames; k++ ) { for ( int j=1; j <= VNumBlocks; j++ ) {

for ( int i=1; i <= HNumBlocks; i++ ) { read( inPort, inBlock );

outBlock = DCT( inBlock );

write( outPort, outBlock );

}

(29)

Matlab of the DCT Process Matlab of the DCT Process

for k = 1:1:4, forj = 1:1:64,

[ Pixel( k , j ) ] = Source( inBlock );

end end

for k = 1:1:4, if k <= 2,

for j = 1:1:64,

[ Pixel( k , j ) ] =PreShift( Pixel( k , j ) );

end end

for j = 1:1:64,

[ Block ] = P_l_PixelsToBlock( Pixel( k , j ) );

end

[ Block ] = P_l_2D_dct( Block );

for j = 1:1:64,

[ Pixel( k , j ) ] = P_l_BlockToPixels( Block );

end end

for k = 1:1:4, for j = 1:1:64,

[ outBlock ] = Sink( Pixel( k , j ) );

end end

z Of the DCT process, we make a new Matlab

program

– This exposes more

parallelism at a finer lever.

– Automatic conversion from Blocks to Stream (Linearization)

z Compaan produces by default a process per function call.

– However, using the

Preamble ‘P_1’ we can

group processes.

(30)

KPN Sub network DCT KPN Sub network DCT

P

PreShift

Source Pixel Sink

Pixel

Pixel Pixel

inBlock outBlock

DCT

VOut

CtrlF1

DCT Q P1

P2

Block Block Block VLE Packets

HeaderInfo

BitRate

QTabl e s

Hu ffT ab le s

Statist icsB

StatisticsF

EndOfFrame

TablesInfo

Hierarchical Subnet of DCT

(31)

Programming the CPU Programming the CPU

M-JPEG specified In YAPI

C++ Compiler YAPI Executable

Pentium IV Pentium IV

YAPI Multithreading

Environment

(32)

Laura DCT Hardware Model Laura DCT Hardware Model

P

PreShift

Source

Pixel

Sink

Pixel

Pixel Pixel

inBlock outBlock

DCT

IP2 IP1

VP3

OP1

FIFO1 FIFO2

FIFO3

FIFO4

VP2

VP1

(Source)

VP4

(Sink) (PreShift)

(P)

Sink/Source do the type Conversion

One-to-One Mapping

Abstract Hardware Model: Network of Virtual Processors

(33)

Laura DCT Hardware Model Laura DCT Hardware Model

z To get the functionality of the Virtual Processor, we integrated an IPcore.

z We have taken the Core (2D-DCT) from the Xilinx Webside.

z Make Processor specific for a platform

– Determine the Bit width – Determine the FIFO sizes – Take into account the

Clock

– Determine the Control tables for the switches

IP2 IP1

VP3

OP1

FIFO1

FIFO3

FIFO4 (xilinx) (P)

0 1 MUX IP1

IP2

in_0 out_0

C

OP1

Control Table

Synch. Logic Synch. Logic

2D-DCT (Xilinx)

Control Unit Execute Unit

IP Core implementing

Read Unit Write Unit

(34)

Hw/ Hw/ Sw Sw Solution for M Solution for M - - JPEG JPEG

Memory Banks

Multiplexer

Host Interface

Virtex-II 2V6000 FPGA

ADM-XRCII board

Status

Control

PCI bus

Control Address Data Data

Data

YAPI Multithreading

Data

Environment Pentium IV Pentium IV

The way it is programmed; the CPU and FPGA run in parallel

(35)

Processing Time M

Processing Time M - - JPEG JPEG

Compaan Laura Other tools Manually Total M-JPEG -> KPN 00:00:22 -- -- 00:30:00 00:30:22 Software

Compilation -- -- 00:00:35 -- 00:00:35 DCT Subnet

Compilation 00:00:08 -- -- -- 00:00:08

Laura -- 00:00:07 -- 03:00:00 03:00:07

Synthesis to FPGA -- -- 00:13:10 -- 00:13:10

Overall 00:00:30 00:00:07 00:13:45 03:30:00 03:44:22

(36)

Device Utilization for the DCT Device Utilization for the DCT

FPGA resource Utilization % Number of MULT18x18s 8 out of 144 5%

Number of RAMB16s 4 out of 144 2%

Number of SLICEs 2367 out of 33792 7%

Number of BUFGMUXs 2 out of 16 12%

Virtex-II 2V6000 FPGA

(taking 4% of the FPGA)

(37)

Real Real - - time performance M time performance M - - JPEG JPEG

z Throughput of the system

– 10.5 CIF frame (128x128) per second

• Running Windows 2000

• Simple Compiler

• Simple Multithreading architecture z Required is 25 frames per second

– Communication FPGA/CPU is too slow (PCI) z However,

– 64 bit PCI

– Running at 66Mhz

– 4 times increase in performance

z Then 25 frames per second (128x128) not a problem

(38)

Exploration Exploration

P1 P2

S1

^Sink

S2

KPN_4

Generate

for j = 1:1:N,

[x(j)] = Source1( );

end

for i = 1:1:K,

[y(i)] = Source2( );

end

for j = 1:1:N, for i = 1:1:K,

[y(i), x(j)] =

F

( y(i), x(j) );

end end

for i = 1:1:K,

[Out(i)] = Sink( y(i) );

end

P1 P2

P3 P4

S1

^Sink

S2

KPN_1

P1 P2

S1

^Sink

S2

KPN_3

P4

S1

^Sink

S2

P2 P3 P1

KPN_2

P

S1

Sink

S2

KPN_5

Map Explore

Programmable Interconnect (NoC)

IPcoreIPcore RPURPU MemoryMemory

CPUCPU Micro Processor Micro Processor

Memory Memory

...

Alternative Application Instances

(39)

Unrolling/Unfolding Unrolling/Unfolding

%parameter N 100 1000;

%parameter K 8 48;

for j = 1:1:N, for i = 1:1:K,

[y(i), x(j)] =

F

(y(i), x(j));

endend

F F F F

F F

x(1) x(2) x(3) x(4)

y(1)

y(2)

y(3)

j

i F F F F

F F

x(1) x(2) x(3) x(4)

y(1)

y(2)

y(3)

j i

Compaan

U = [ N, K ]

Difficult to derive

for j = 1:1:N,

if mod( j ,

if mod( j , 2 2 ) = 1, ) = 1,

for i = 1:1:K,

[y(i), x(j)] = F (y(i), x(j));

end

end end

if mod( j ,

if mod( j , 2 2 ) = 0, ) = 0,

for i = 1:1:K,

[y(i), x(j)] = F (y(i), x(j));

end

end end

end

MatTransform

(40)

Retiming/Skewing Retiming/Skewing

Skewing matrix

 

 

 

  

 



 

= =

  

 



 

  

 



 

= =

0 1 0 1 1 1

2222 2121

1212 1111

m m m m

M M

F F F F

F F

x(1) x(2) x(3) x(4)

y(1)

y(2)

y(3)

j i

for j = 2:1:N+K,

if mod( j ,

if mod( j , 2 2 ) = 1, ) = 1,

for i = max(1, j-N):1:min(j-1, K), [y(i), x(j-i)] = F (y(i), x(j-i));

end

end end

if mod( j ,

if mod( j , 2 2 ) = 0, ) = 0,

for i = max(1, j-N):1:min(j-1, K), [y(i), x(j-i)] = F (y(i), x(j-i));

end

end end F F F F end

F

F F

F

x(1) x(2) x(3) x(4)

y(1)

y(2)

y(3)

j

i F F F F

F

F F

F

x(1) x(2) x(3) x(4)

y(1)

y(2)

y(3)

j i

for j = 2:1:N+K,

for i = max(1, j-N):1:min(j-1, K), [y(i), x(j-i)] = F (y(i), x(j-i));

end end

Unfolding vector U = [ u

₁

, u

₂

] = [2, 1]

Compaan

Difficult to derive

%parameter N 100 1000;

%parameter K 8 48;

for j = 1:1:N, for i = 1:1:K,

[y(i), x(j)] =

F

(y(i), x(j));

endend

(41)

Design Space Exploration Design Space Exploration

Retiming/Unrolling

Compaan Compiler

Initial Values of Parameters

Matlab Application

Matlab Code

New Values of Parameters Mapping

Laura/XFT Performance

Analysis Performance

Numbers Xilinx

Virtex-II Process

Network

(42)

Conclusions Conclusions

z To satisfy tomorrow’s applications, we will see

hierarchical multiprocessor systems with a number of CPUs, Memories, IPcores, and RPU.

z Programming these system will be difficult unless the MoC is changed to take Concurrency into account.

The key items will be

– Distributed Memory – Distributed Control

z Kahn Process Networks seem to be a very promising programming formalism for tomorrow’s HW/SW

codesign platforms

(43)

Conclusions Conclusions

z We showed proof-of-concept with a case in which we Convert M-JPEG into a KPN of which the processes are mapped either on hardware or software.

z In the M-JPEG case, the hardware and software were running concurrently, exploiting task-level parallelism

z Having good tools, we can start from (legacy) code in Matlab, C, or other imperative languages.

z The results are just the beginning. There is more to

achieve when more mature / commercial products are

used (RTOS, Compiler, Target Platform, Virtex Pro)

(44)

Publications Publications

z Bart Kienhuis, Edwin Rijpkema, and Ed F. Deprettere ``Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures.'', 8th International Workshop on

Hardware/Software Codesign (CODES'2000), May 3 -- 5 2000, San Diego, CA, USA.

z Alexandru Turjan, Bart Kienhuis, and Ed Deprettere

``A compile time based approach for solving out-of-order communication in Kahn Process

Networks'', in proceeding of IEEE 13th International Conference on Application-specific Systems, Architectures and Processors (ASAP'2002), San Jose, CA, USA, July 17-19, 2002

z Tim Harriss, Richard Walke, Bart Kienhuis, and Ed Deprettere

``Compilation from Matlab to Process Networks Realized in FPGA'', In journal on Design Automation of Embedded Systems, Kluwer, Vol 7, Issue 4, 2002

z Todor Stefanov, Bart Kienhuis, and Ed Deprettere

``Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances'', in proceeding of Tenth International Symposium on Hardware/Software Codesign CODES'2002, Stanley Hotel, Estes Park, Colorado, USA, May 6 -- 8, 2002

z Edwin Rijpkema, ``Modeling Task Level Parallelism in Piece-wise Regular Programs'',

PhD thesis, Leiden University, Leiden Institute of Advanced Computer Science (LIACS), The Netherlands, Sept 2002.

z Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, ``Solving out-of-order communication in Kahn Process Networks '', submitted for publication in Journal on VLSI Signal Processing-Systems for Signal, Image, and Video Technology. Kluwer Academic Publishers., 2003

z Claudiu Zissulescu , Todor Stefanov, Bart Kienhuis and Ed Deprettere, “Laura: Leiden Architecture Research and Exploration Tool”, submitted to The International Conference on Field Programmable Logic and Applications, September 1-3, 2003 Lisbon, Portugal

z Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan, Bart Kienhuis and Ed Deprettere, “System Design using Kahn Process Networks: The Compaan/Laura Approach”, submitted for review to ICCAD, November 9 –13, 2003, San Jose, CA, USA.