Optimizing the on-chip communication architecture of low power Systems-on-Chip in Deep Sub-Micron technology


Université Libre de Bruxelles

Faculty of Applied Sciences, Academic Year 2006-2007


Supervisors: Prof. F. Robert and Prof. D. Verkest

Thesis presented by Anthony Leroy for the degree of Doctor of Applied Sciences


Acknowledgments

I would like to thank my supervisors, Prof. Frédéric Robert and Prof. Diederik Verkest, for providing me with a very interesting Ph.D. topic, for their guidance throughout my thesis and for setting up a collaboration between IMEC and the ULB. I also very much appreciated the autonomy and the trust they gave me throughout my thesis.

My deepest gratitude goes to Prof. Francky Catthoor for his strong and constant involvement in this Ph.D. thesis and for his very valuable advice.

I am very grateful to Prof. Pierre Mathys for hosting me in the BEAMS division at the ULB and for transmitting to me his passion for electronics during all the courses I have followed throughout my studies.

I also thank Prof. Marie-Ange Remiche for her guidance and for the possible future collaborations that we have identified for the follow-up of this research.

I would also like to thank Adelina Shickova, Andy Lambrechts, Guillermo Talavera and Praveen Raghavan for their contributions to our common work in the context of the Architecture Ph.D. Team experiments at IMEC.

Julien Picalausa helped me a lot in the implementation of a large part of the VHDL framework which has been exploited in this thesis. I am also grateful to Prof. Dragomir Milojevic for having allowed me to guide Julien’s work.

Special thanks go to my colleagues and friends at the ULB, Alexis, Antoine, Axel, Kim, Manu and Michel, for their moral support and for their advice.

My gratitude also goes to my technical supervisors at IMEC, Antonis Papanikolaou and Pol Marchal, for their precious advice.

My thanks are due to Jayaprakash Balachandran for his precious information about the physical issues in the interconnect.

I also thank Théodore Marescaux for all the nice chats about NoCs at IMEC and for possible future collaborations.

I thank the IMEC management for providing me with a very nice work environment at IMEC: Jean-Yves Mignolet, Serge Vernalde, Michel Eyckmans, Johan Vounckx, Wilfried Verachtert and Rudy Lauwereins.


This research was funded by the Fonds pour la formation à la Recherche dans l'Industrie et dans l'Agriculture (FRIA).

Last but not least, I am greatly indebted to Noémie for her patience and all the support she gave me during those four years, and to my family for supporting me throughout my studies.


Summary

This thesis deals with low-energy Systems-on-Chip such as those that will be used in next-generation portable equipment (handheld computers (PDAs), mobile phones). Since this equipment is battery powered, energy consumption is a critical problem.

These platforms will probably contain a dozen processor cores and a large amount of embedded memory. An optimized communication architecture will therefore be needed to interconnect them efficiently. Many communication architectures have been proposed in the literature: shared buses, bridged buses, segmented buses and, more recently, Networks-on-Chip (NoC).

However, with the exception of buses, the energy consumption of on-chip interconnection networks was largely ignored for a long time. Only very recently have the first studies appeared in this field.

This thesis presents:

• A complete analysis of the design space of on-chip communication architectures. Based on this design space and a detailed state of the art, previously unexplored techniques have been identified and investigated.

• The design of low-level and high-level simulation environments that allow different communication architectures to be compared in terms of energy consumption and area.

• The design and validation of an innovative on-chip communication architecture based on spatial multiplexing.

This last point aims to demonstrate that a network based on spatial division multiplexing (SDM) is an interesting alternative to classical networks, which are mainly based on time division multiplexing, in the very specific context of on-chip communication architectures.

We will demonstrate the validity of the proposed solution by means of high-level simulation campaigns for various types of traffic, as well as lower-level simulations. The study successively covers the design of SDM routers, of network interfaces and finally of a complete network. The advantages and drawbacks of such a technique will be discussed in detail.


Abstract

This thesis targets heterogeneous low-power Systems-on-Chip that will be used in future generations of hand-held devices (PDAs, mobile phones).

Those platforms will probably contain a dozen processing cores of various types and a considerable amount of on-chip memory.

An optimized communication architecture will be required to interconnect them efficiently. Many communication architectures have been proposed in the literature: shared buses, bridged buses, segmented buses and, more recently, Networks-on-Chip.

Because these devices are battery powered, the energy consumption of the platform is a critical issue. However, with the exception of buses, power consumption has been mostly neglected in interconnection networks. Only very recently have a few studies emerged in that domain.

This thesis presents:

• A complete characterization of the on-chip communication architecture design space. Based on this design space and a detailed state of the art, previously unexplored techniques have been identified and investigated.

• The design of generic on-chip communication architecture simulators at both high and low abstraction levels.

• The design and validation of an innovative communication architecture based on spatial multiplexing.

The last point aims to demonstrate that a network based on spatial multiplexing is an interesting alternative to classical networks, which are mainly based on time division multiplexing, in the specific context of on-chip communication architectures.

We will demonstrate the validity of the proposed solution based on high-level simulation campaigns for different types of traffic as well as low-level simulations. The study successively covers the design of the SDM routers, of the network interfaces and finally of a complete network. The advantages and drawbacks of such a technique will be discussed in detail.


Publications

This thesis presents the results of my research during the last four years.

Part of my research has been published in the following IEEE conference papers:

• A. Leroy, F. Catthoor, D. Verkest, F. Robert, "Spatial Division Multiplexing: a novel approach for guaranteed throughput on NoCs", Network-on-Chip workshop at the Design, Automation and Test in Europe Conference 2006.

• A. Leroy, P. Marchal, A. Shickova, F. Catthoor, D. Verkest, F. Robert, "Spatial Division Multiplexing: a novel approach for guaranteed throughput on NoCs", International Conference on Hardware/Software Codesign and System Synthesis (IEEE CODES-ISSS'05), New York, 19-21 September 2005.

• A. Lambrechts, A. Leroy, G. Talavera, A. Shickova, T. Vander Aa, M. Jayapala, B. Mei, F. Catthoor, D. Verkest, G. Deconinck, H. Corporaal, F. Robert and J. Carrabina Bordol, "Power breakdown analysis for a heterogeneous NoC platform running a video application", IEEE 16th International Conference on Application-specific Systems, Architectures and Processors (ASAP 2005), Samos, Greece, 23-25 July 2005.

• A. Lambrechts, T. Vander Aa, M. Jayapala, A. Leroy, G. Talavera, A. Shickova, F. Barat, B. Mei, F. Catthoor, D. Verkest, G. Deconinck, H. Corporaal, F. Robert and J. Carrabina Bordol, "Design style case study for compute nodes of a heterogeneous NoC platform", Real-Time Systems Symposium (RTSS'04), Lisbon, Portugal, 5-8 December 2004.

A patent application has been submitted to the US and European Patent Offices:

• A. Leroy, F. Catthoor, USA Patent Application No 11/487175, "Method for managing a plurality of virtual links shared on a communication line and network implementing said method", presented on 14.07.05 by Interuniversitair Microelektronica Centrum vzw.


List of Tables

1.1 Predicted evolution of the interconnect vs. gate parameters (ITRS)
1.2 Comparison of network domains
3.1 Theoretical characteristics of different network topologies
3.2 Summary of the characteristics of the major topologies
3.3 Comparison of switching techniques
4.1 Comparison of different bus architectures
4.2 Comparison of the different bus architectures proposed for the Core Connect architecture
4.3 Comparison of application layer decisions for various state-of-the-art NoCs
4.4 Comparison of transport layer decisions for various state-of-the-art NoCs
4.5 Comparison of network layer decisions for various state-of-the-art on-chip communication architectures
4.6 Comparison of link layer decisions for various state-of-the-art NoCs
4.7 Comparison of physical layer decisions for various state-of-the-art NoCs
4.8 Comparison of physical layer decisions for various state-of-the-art buses
5.1 Comparison of the step response for lumped and distributed RC models
5.2 Experimental set-up
6.1 Classification of N×N MIN switches
6.2 Virtual circuit bandwidth allocation
6.3 Power, area and delay estimations for router R6 implementing SDM
6.4 Power, area and delay estimations for router R6 implemented with TDM and SDM (130 nm technology (V_dd = 1.2 V), post-layout)
6.5 Comparison between best-effort packet-switched, TDM and SDM characteristics


List of Figures

1.1 The Intel Core Duo processor
1.2 The MPSoC Nexperia platform
1.3 The IMEC 3MF MPSoC platform
1.4 The Cell processor
1.5 A tile-based design of a System-on-Chip
1.6 Explosion of the number of cores for a 50K gates complexity
1.7 Wiring hierarchy (CMOS 7S) composed of 6 layers
1.8 Distribution of the interconnect net length
1.9 Evolution of interconnect hierarchy
1.10 Comparison between gate delay and local/global interconnect delay
1.11 Leakage current contributions
1.12 Leakage vs dynamic power consumption contribution
1.13 Effect of electromigration on the interconnect
1.14 A basic network-on-chip communication architecture based on a mesh topology
2.1 Mapping and scheduling of the communication task graph on a tile-based architecture
2.2 Differences between end-to-end network latency and end-to-end communication architecture latency
2.3 Global platform design flow
2.4 On-chip communication architecture design as a part of the global platform design
2.5 SIMD (Data Level Parallelism)
2.6 MIMD (Task Level Parallelism)
2.7 High-level view of the complete DTSE flow
3.1 Communication architecture
3.2 OSI communication protocol stacks implementation in the communication architecture
3.3 NoC design decisions for each OSI layer
3.4 Inter-process communication
3.5 Quality of Service Negotiation
3.6 Communication interface protocol
3.7 Connection-oriented vs. connection-less services
3.8 Topology graph
3.9 Aggregated and Bisection Bandwidth of some common topologies
3.10 Examples of common network topologies
3.11 Crossbar architecture
3.12 Batcher banyan switch
3.13 Full vs. Partial crossbar
3.14 From a shared bus to a segmented bus
3.15 From a segmented bus to a crossbar
3.16 Routing algorithm design decisions
3.17 Connection-oriented vs. connection-less services implementation
3.18 Pipelined bus transactions
3.19 Buffer position: input, output and virtual output queuing
3.20 Message units: packets, flits, phits
3.21 Store and Forward switching technique
3.22 Virtual Cut Through switching technique
3.23 Wormhole switching technique
3.24 Mad postman switching technique
3.25 Switched Virtual Circuit switching technique
3.26 Switching technique design space
3.27 Link-level decision
3.28 Intra/extra-core switches
3.29 Interstratal interconnect
3.30 3D System-on-Chip
3.31 Platform exploiting Wafer Level Packaging interconnect
3.32 Staggered repeaters
3.33 Decisions ordering for an area optimization
3.34 Decision ordering for an energy optimization
3.35 Decision ordering for an end-to-end network latency optimization
3.36 Decision ordering for an optimization of the network reliability
4.1 AMBA advanced on-chip bus architecture
4.2 AMBA hierarchical bus system
4.3 IBM CoreConnect architecture
4.4 Element Interface Bus (EIB)
4.5 MARBLE asynchronous bus architecture
4.6 SPIN communication architecture
4.7 RAW architecture
4.8 Octagon topology


4.9 Spidergon topology
4.10 The MANGO communication architecture
4.11 Æthereal network interface architecture
4.12 QNoC design methodology
4.13 ×pipesCompiler
4.14 Proteo hierarchical network architecture
4.15 Arteris NoC Design Flow
5.1 Network simulator framework
5.2 OMNeT++ simulation environment
5.3 Global view of a mesh instance of the OMNeT++ NoC model
5.4 Structure of a Network Interface
5.5 Router canonical architecture
5.6 OMNeT++ architecture of the router
5.7 Cache architecture
5.8 Transistor capacitance model
5.9 Transistor gate capacitance variations
5.10 Transistor gate capacitance
5.11 Model for transistors of width less than 10 µm
5.12 Folded transistor model for transistors of width exceeding 10 µm
5.13 FIFO buffer architecture
5.14 Four different interconnect models
5.15 Region where inductance becomes important
5.16 Return path of the current
5.17 Parallel plate capacitance
5.18 Inter-wire capacitance
5.19 Architecture of a matrix crossbar
5.20 Sotiriadis bus model
5.21 Sotiriadis bus receiver - driver model
5.22 Equivalent capacitive model of the bus
5.23 Simpler bus model
5.24 Comparison between Sotiriadis model and a standard bus model
5.25 TNT interconnection model
5.26 Memory and logic area contributions for various Intel Pentium generations
5.27 Interface between OMNeT++ and the energy/delay model
5.28 Structure of a generic router
5.29 Spatial pattern
5.30 Block diagram of the configurable router used
5.31 Structure of the VCT router
5.32 Structure of the SAF router
5.33 Structure of the WH router


6.1 Time Division Multiplexing (TDM) vs. Spatial Division Multiplexing (SDM)
6.2 Comparison of the network interface architectures for TDM (a) and SDM (b)
6.3 Architecture of a TDM router illustrating the content of the local Output Reservation Table (ORT)
6.4 Consecutive time slot (TS) reservation required at the regular time division multiplexing
6.5 Architecture of a P×P SDM router with 3 virtual circuits: A, B and C
6.6 Segment reservation with space division multiplexing
6.7 Non-consecutive segment reservation with space division multiplexing
6.8 Clos network
6.9 Banyan network
6.10 Evolution of the normalized power consumption of the Beneš switch and the crossbar for a fixed port width
6.11 Evolution of the area overhead for the Beneš switch and the crossbar for a fixed port width
6.12 Evolution of the normalized power consumption of the Beneš switch and the crossbar for a fixed port-width
6.13 Evolution of the area overhead of the Beneš switch and the crossbar for a fixed port-width
6.14 Evolution of the critical path delay for the Beneš switch and the crossbar for a fixed port-width
6.15 Recursive Beneš switch construction and a 4×4 switch instance
6.16 Atomic Beneš switch
6.17 Evolution of the average set-up time per virtual circuit and of the total set-up time for all circuits
6.18 Compared evolution of the SDM and TDM router area
6.19 Compared evolution of the SDM and TDM router power consumption
6.20 Compared evolution of the SDM and TDM router normalized power consumption
6.21 SDM network interface architectures
6.22 SDM network interface: area and power breakdown
6.23 SDM network interface area and power consumption (8 bits data width)
6.24 SDM network interface area and power consumption (32 bits data width)
6.25 SDM network interface area (variable processor and network port widths)
6.26 SDM network interface absolute power consumption (variable processor and network port widths)


6.27 SDM network interface normalized power consumption (variable processor and network port widths)
6.28 Layered control plane implementation
6.29 A 5×5 Beneš switch
6.30 Connections of the SDM network interfaces to the network routers
6.31 Video chain, with indication of bandwidths requirements
6.32 Logical view of the platform
6.33 Physical view of our platform (die size = 101 mm²)
6.34 Experimental tool flow for area and power measurement
6.35 Power Breakdown for the embedded platform
6.36 Power consumption of the different routers composing


List of Acronyms

3MF MultiMedia Multi-Format
ADRES Architecture for Dynamically Reconfigurable Embedded System
ASIC Application Specific Integrated Circuit
API Application Programming Interface
AVC Advanced Video Codec
BE Best Effort
BIST Built-In Self Test
CACTI Cache Access and Cycle Time Information
CDFG Control and Data Flow Graph
CDMA Code Division Multiple Access
DLP Data Level Parallelism
DSM Deep Sub Micron
DTSE Data Transfer and Storage Exploration
DVS Dynamic Voltage Scaling
FDMA Frequency Division Multiple Access
FIFO First-In-First-Out
GALS Globally Asynchronous Locally Synchronous
GT Guaranteed Throughput
ILP Instruction Level Parallelism
IP Intellectual Property
IPC Inter-Process Communication
ITRS International Technology Roadmap for Semiconductors
LSB Least Significant Bit
MIMD Multiple Instruction Multiple Data
MIN Multi-stage Interconnect Network
MISD Multiple Instruction Single Data
MPI Message Passing Interface
MPSoC Multi-Processor System-on-Chip
MSB Most Significant Bit
NB Non Blocking
NI Network Interface
NoC Network on Chip
OCP Open Core Protocol
OSI Open Systems Interconnection
P2P Point-to-Point
PDA Personal Digital Assistant
QoS Quality of Service
RNB Rearrangeable Non Blocking
RTL Register Transfer Level
SAF Store And Forward
SDM Spatial Division Multiplexing
SDMA Space Division Multiple Access
SIMD Single Instruction Multiple Data
SISD Single Instruction Single Data
SMT Simultaneous Multi-Threading
SNB Strictly Non Blocking
SoC System on Chip
SVC Switched Virtual Circuit
SVC Scalable Video Codec
TDM Time Division Multiplexing
TDMA Time Division Multiple Access
TLP Task Level Parallelism
VCT Virtual Cut Through
VLIW Very Long Instruction Word
XML eXtensible Markup Language


Introduction

The tremendous evolution of microelectronics in the past decade has led to the integration of complete complex systems on a single chip, a concept known as Systems-on-Chip (SoC).

An optimized on-chip communication architecture is required to interconnect the different platform components. The objective of this thesis is to optimize this communication architecture in the specific context of low-power SoCs.

SoC platforms that will be used in future generations of hand-held devices will have to satisfy many critical requirements: they will have to be energy efficient, cheap and reliable, and offer sufficient computing power for advanced multimedia and wireless applications. To satisfy all these requirements simultaneously, future SoCs will have to integrate various types of processor cores and data memory units, resulting in very heterogeneous platforms.

Main objectives of the thesis

In this thesis, we propose to investigate the following fundamental questions related to on-chip communication architecture design:

• What are the main constraints on communication architecture design in the context of future low-power Systems-on-Chip? What are the current solutions exploited by industry or proposed by research groups?

• How can on-chip communication architecture characteristics be measured? How can design alternatives be efficiently compared?

• Are there interesting network design alternatives that still remain unexplored? How do they perform compared to traditional solutions?

Original contributions

Existing studies generally present only a small part of the on-chip communication design space. We have put considerable effort into identifying as completely as possible this vast design space and the multitude of existing [...]

It also offers a global view of the overall on-chip network design space and thus facilitates a global research strategy.

Our detailed state-of-the-art survey is, to the best of our knowledge, the most complete to date. It has helped to enhance our research team's knowledge in the domain of on-chip communication architectures.

Based on our design space definition and on our detailed state of the art, we have identified as yet unexplored regions in the communication architecture design space, each of which is a potential future research topic.

We have developed generic on-chip communication architecture simulation environments at both high and low levels of abstraction.

Finally, we have designed and validated an innovative communication architecture based on spatial multiplexing as an alternative to current architectures, and we have shown that our solution can be better adapted to the on-chip context given the specific constraints of future low-power MPSoCs.

Thesis organization

The thesis is organized as follows:

Chapter 1 introduces the context of the thesis.

Chapter 2 presents the specific requirements for a communication architecture and the corresponding cost function.

Chapter 3 presents our analysis of the on-chip communication architecture design space.

Chapter 4 presents the state of the art in on-chip communication architecture design.

Chapter 5 then presents the design of two on-chip communication architecture simulators at different abstraction levels.

Chapter 6 presents our SDM-based solution in detail.

Finally, our conclusions and future work are presented.


Chapter 1

Context and Motivations

Abstract

Designing low-power embedded systems based on Systems-on-Chip is a very difficult activity. Many stringent constraints must be satisfied in terms of energy consumption, hard real-time behaviour, manufacturing cost, etc. The increasing importance of Deep Sub-Micron effects, which appear for technology nodes below 90 nm, imposes even more constraints on the design.

This chapter introduces the embedded systems context and its specific constraints. It also presents the problems that appear with Deep Sub-Micron technologies (static power consumption, reliability issues, run-time unpredictability). We identify global interconnect optimization as a very interesting challenge in the System-on-Chip context.

Solutions based on complex on-chip communication architectures have been proposed. We also explain why existing solutions from macro-network optimization cannot simply be re-applied to the on-chip domain and why careful optimization of the communication architecture is required.

1.1 Introduction

The microelectronics industry has two main core business activities: General Purpose computing and Embedded Systems.

This chapter introduces the context of this thesis and motivates our choice of the embedded systems context as its main target.

Section 1.2 distinguishes the embedded systems context from high-performance general purpose processor design. Section 1.3 describes future SoC platforms. Section 1.4 presents the main challenges imposed by Deep Sub-Micron effects. Section 1.5 gives a strong motivation for the optimization of the communication architecture. Finally, Section 1.6 concludes.

1.2 General Purpose vs. Embedded Systems

The evolution of microelectronics is allowing the number of transistors to double each generation. This phenomenon is described by the famous Moore's law, an empirical law that has driven the whole microelectronics industry for years [Moore65].

Today, in 2006, the Itanium 2 Montecito dual-core processor built in 90 nm technology already integrates 1.7 billion transistors on a massive 580 mm² die, with a power budget of 100 W at 1.6 GHz [Corp.06].

By the end of the decade, the International Technology Roadmap for Semiconductors (ITRS) [ITRS01] [ITRS03] predicts chips containing tens of billions of transistors, offering a tremendous potential of computation power.
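As a rough illustration of the orders of magnitude quoted above, the following Python sketch applies an idealised "doubling per generation" model to the Montecito transistor count and derives the corresponding transistor and power densities. The function names and the pure doubling model are assumptions introduced here for illustration only; they are not taken from the ITRS roadmap, whose projections are more detailed.

```python
# Illustrative arithmetic only: the idealised "doubling per generation" model
# is an assumption of this sketch, not an ITRS projection.

MONTECITO_TRANSISTORS = 1.7e9   # transistor count quoted in the text
MONTECITO_DIE_MM2 = 580.0       # die size quoted in the text (mm^2)
MONTECITO_POWER_W = 100.0       # power budget quoted in the text (W)


def project_transistors(start_count, generations):
    """Return the counts for generations 0..generations, doubling at each step."""
    return [start_count * 2 ** g for g in range(generations + 1)]


def generations_to_reach(start_count, target):
    """Return the smallest number of doublings after which the count reaches target."""
    count, doublings = start_count, 0
    while count < target:
        count *= 2
        doublings += 1
    return doublings


def transistor_density(transistors, die_mm2):
    """Average number of transistors per mm^2 of die area."""
    return transistors / die_mm2


def power_density(power_w, die_mm2):
    """Average power dissipated per mm^2 of die area (W/mm^2)."""
    return power_w / die_mm2


if __name__ == "__main__":
    print(project_transistors(MONTECITO_TRANSISTORS, 3))
    print(generations_to_reach(MONTECITO_TRANSISTORS, 1e10))
    print(transistor_density(MONTECITO_TRANSISTORS, MONTECITO_DIE_MM2))
    print(power_density(MONTECITO_POWER_W, MONTECITO_DIE_MM2))
```

Under this simplified model, three doublings of the quoted 1.7 billion transistors already exceed ten billion, which is consistent in order of magnitude with the roadmap figure cited above.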

The microelectronics industry has two main core business activities: one aims at exploiting the potential offered by billion-transistor chips to reach the highest possible performance from a general purpose architecture, while the other targets application-domain-specific architectures that have to deal with various difficult design constraints.

Note that the emerging ambient intelligence context can be considered as another class of embedded platforms. These platforms rely on ultra-low-power architectures (< 1 mW) that are massively distributed in the human environment and scavenge energy from their surroundings. This context is, however, not the target of this thesis.

General Purpose Architectures

General purpose architectures are typically processor architectures that target the highest possible raw performance and flexibility, other factors being less critical. This domain mainly concerns general purpose microprocessors used in desktop personal computers.

The main priority is to reach the highest performance, even if it implies a high cost in terms of power consumption or manufacturing. This industry is mainly driven by big companies like Intel™ and AMD™.

Current high-performance microprocessors are based on fully synchronous designs, the clock frequency being increased at each generation to improve performance [ITRS01]. However, since the distance reachable in one clock cycle decreases with each generation and is now approaching the dimensions of a chip, designers are facing the physical limits of fully synchronous designs. The solution is to move to new designs based on smaller synchronous resources interconnected asynchronously (this term is defined in Chapter 3), a concept known as Globally Asynchronous Locally Synchronous (GALS) design. This solution is universally accepted in principle but has not yet been adopted by industry.

The current trend in processor architecture consists of putting several identical general purpose microprocessor cores, like Pentium® processors, on a single chip, exploiting parallelism to increase performance rather than continuing to increase the clock frequency.

Dual-core processors are now becoming popular on the high-performance market. The processor industry has recently made a very sharp transition from trying to push the frequency of a single-core processor as high as possible to exploiting the parallelism offered by multi-core processors. The duel between AMD and Intel on multi-core processors has just begun. The battle on processor clock frequency has moved on, and the goal is now to provide as much thread-level parallelism as possible by integrating more processor cores inside the chip, possibly with some specialized cores.

One of the real challenges is in fact to provide compilers and technologies that allow software programmers to exploit the offered thread-level parallelism. Both Intel and AMD are actively looking for applications that could exploit the extra parallelism provided by their architectures.

Figure 1.1: The Intel Core Duo processor is composed of two processor cores clocked at 3 GHz @ 1.25 V; the die size is 435 mm² in 65 nm technology, with 8 metal layers (© Intel)

The competition is now moving to the number of processor cores that will be integrated on-chip. Both Intel and AMD are about to launch the production of quad-core processors by the beginning of 2007.

David Perlmutter - Intel Senior VP [Gru06]:

“Core scales and it will be scaling to the level we expect it to. That also applies to the upcoming generations - they all will come with the right scaling factors. But, of course, I would be lying if I said that it scales from here to eternity. In general, I believe that we will be able to do very well against what AMD will be able to do. I want everybody to go from a frequency world to a number-of-cores-world. But especially in the client space, we have to be very careful with overloading the market with a number of cores and see what is useful. I believe ’2’ is a good number. ’4’ will be an interesting number for the high-end. Will we see eight cores in the client in the next two years? If someone chooses to do that, engineering-wise that is possible. But I doubt this is something the market needs.”

Embedded Systems

The specificity of embedded systems is that they have to satisfy multiple conflicting constraints, which makes a good trade-off very difficult to reach:

• low power and/or energy consumption

• high computing/communication performance requirement

• stringent real-time constraints

• short time to market

• low cost

• high volume

• high reliability

Battery driven multi-media consumer devices such as advanced handheld devices (PDA, mobile phones,...) are requiring all those constraints to be satisfied. They are sold at extremely low prices, require high computing performance for a low energy budget of only hundreds of milliwatts and must meet strict deadlines.

Today, embedded systems are generally implemented as Systems-on-Chip. Systems-on-Chip integrate a whole complex system into a single chip: digital, analog, mixed-signal and radio-frequency functions can be mixed in the chip. SoCs are built using a platform-based design approach, developing hardware and software simultaneously at different levels of abstraction.


Systems-on-Chip are application-domain specific or application specific: they are optimized at design time for a particular set of applications or for one specific application.

SoCs are currently found in many consumer devices and industrial systems, for example:

• Very application-specific processors, such as network processors used in high-performance routers

• Battery-powered hand-held devices merging telephony and multimedia capabilities

• Digital televisions and set-top boxes based on sophisticated multiprocessors to perform real-time video and audio decoding

• Video games using several complex parallel processing machines to render gaming action in real time

In this context, the billions of transistors available per chip in the future will make it possible to integrate a multitude of components such as complex processor cores, large embedded memories and coarse-grained reconfigurable logic.

Most current SoCs are Multi-Processor SoCs (MPSoCs). They contain multiple instruction-set processors (CPUs) because complex systems-on-chip are difficult to design without making use of multiple CPUs.

Philips’ Nexperia platform [Philips06], presented in figure 1.2, is a good example of an early MPSoC platform. This MPSoC contains two cores: one general purpose RISC processor (MIPS) and a VLIW media processor (TriMedia). The platform also contains a variety of IP blocks depending on the targeted application (DSPs, UART, ...).

Other examples of comparable MPSoC platforms are provided by Texas Instruments’ OMAP microprocessor family [Instruments06] and the STMicroelectronics Nomadik multimedia platform [Microelectronics], which are both based on an ARM core supported by one or several DSPs.

An example of the current state of the art in SoCs is provided by the 3MF MPSoC platform developed at IMEC [IMEC06]. The 3MF platform will support the high-end audio and video compression standard MPEG-4 AVC (Advanced Video Coding) and the emerging SVC (Scalable Video Coding) standard, which makes it possible to optimize the user’s visual perception by dynamically adapting the video stream to the current communication and computing conditions of the platform. It will also support 3D-graphics standards. The challenge is thus to achieve a cost-efficient, low-power implementation of state-of-the-art SVC techniques on a flexible platform.

The 3MF platform is based on an MPSoC architecture containing multiple ADRES [Mei05, Mei03] processors (see figure 1.3). The architecture should dissipate less than 700 mW in maximum performance conditions (for a 90 nm technology at 1.0 V).

Figure 1.2: The MPSoC Nexperia platform contains two processor cores and a multitude of IP blocks (© Philips)

Current trends

The situation has evolved since the time when general purpose platforms were exclusively devoted to personal computers. Nowadays, general purpose processors are integrated in laptop personal computers and even in small consumer devices like Personal Digital Assistants (PDAs). The frontier between high performance computing processors and low power Systems-on-Chip has thus recently become blurrier.

With the explosion of portable computers, energy savings have become a major concern in high performance general purpose computing. Power consumption specifications are typically set at around 25 Watts for notebooks, around 65 Watts for desktop computers and 100 Watts or more for high-end desktop systems [STAR06].

High performance computing processors are also constrained in terms of power consumption because of heat dissipation problems.

Major actors in the general purpose computation domain are also planning to integrate specialized cores in their future multiprocessor general purpose chips.

On the other hand, embedded devices are becoming more and more performance-hungry.

Figure 1.3: The 3MF MPSoC platform developed by IMEC to support the MPEG-4 compression standard is based on a set of ADRES cores interconnected by a Network-on-Chip (NoC) (© IMEC)

The Cell processor platform [Kahle05] designed by Sony, Toshiba and IBM is an example of a high performance MPSoC. The platform will be used for the Playstation 3 game console, high definition television sets and computer servers. The configuration used for the Playstation 3 is based on one general purpose processor (IBM PowerPC) and 8 graphical co-processors (see figure 1.4).

1.3 The low power MPSoC Context

This thesis targets in particular the MPSoCs that will be used in future hand-held devices (PDAs, mobile phones). These devices will be used for various demanding applications such as advanced multimedia, wireless communication, cryptography and voice recognition. As we target battery-powered devices with large computational and data access requirements, energy will be by far the most constrained resource.

In particular, embedded multimedia applications are very demanding in terms of both performance and flexibility. Future SoCs will therefore probably be very heterogeneous, containing a mix of specialized and reconfigurable processing cores to ensure low energy consumption per task while maintaining flexibility [Zhong05].

Figure 1.4: The Cell processor platform (clocked at 3.2 GHz, die size 221 mm², 8 metal layers) (© IBM)

Platform-based design

Full custom-logic design is no longer feasible. As the complexity of today’s SoCs rises rapidly, full-custom design would require extremely long design cycles. The gap between design productivity and the potential offered by future silicon technologies widens with each generation.

Time-to-market is shrinking. The sooner a product reaches the market, the better its chances of commercial success.

Platform-based design is a solution to the complexity and time-to-market problems. A platform is a framework composed of hardware, middleware and software components. The middleware plays the role of an interface between the architecture and the software components. It typically consists of a real-time operating system and a set of hardware drivers. The platform can be reused and customized for different customers’ needs, thus considerably accelerating the design process.

The SoC hardware implementations can be based on architectures composed of a set of two-dimensional independent tiles (see figure 1.5). Those tiles can have various sizes and aspect ratios.

A tile is defined as an independent subsystem of the System-on-Chip architecture that can accomplish a high-level function. It can combine storage, computation and communication interfaces with the system environment. It can be a subset of processor cores (including cache memory) or rather big on-chip memories.

An Intellectual Property core (IP core) is a tile designed by another party that can be integrated into a platform. Major IP vendors like ARM license their IP cores to platform integrators.

Figure 1.5: A tile-based design of a System-on-Chip. The architecture is composed of a set of independent tiles (cores): a Digital Signal Processor (DSP), a General Purpose Processor (GPP), a Coarse Grain Architecture (CGA), an Application Specific Integrated Circuit (ASIC), RAM blocks and input/output interfaces (I/O intf)

A task (or process) is defined as a component of the application under a single thread of control that can be scheduled individually on a processor core.

A thread is defined as a component of the application which generally shares its address space with other threads.

Number of tiles

Predicting the number of tiles that will be present in future tile-based SoC platforms is often a source of confusion.

Many papers in the NoC literature consider platforms based on resources of 50K to 100K gates of complexity [Wielage02] [Kumar02] [Guerrier00]. As announced by Wielage et al., the number of tiles is going to explode in that case (see figure 1.6) and interconnecting them can become an important problem.

This prediction is generally based on the work of Sylvester et al. They proposed a methodology based on 50K gates modules to tackle global interconnect problems [Sylvester99]. This core size results from a certain interconnect power-performance trade-off [Sylvester98]. It is also motivated by the fact that such cores can be designed easily with the traditional physical design tools and that they scale efficiently for future technology nodes.

Figure 1.6: Explosion of the number of cores for a 50K gates complexity (© Philips)

However, this result is not directly applicable to the domain of low-power domain-specific SoCs. That work indeed assumes high-performance microprocessors operating at a frequency of about 10 GHz. It also considers a standard cell design and only takes processing into account, not on-chip storage, which represents most of the tile area as we will see.

For domain-specific low-power platforms, we will rather consider an architecture based on programmable and reconfigurable cores clocked at lower frequencies (500 MHz to 1 GHz).

Increasing the clock frequency decreases the maximum size of the iso-synchronous regions (i.e. the maximum distance between two components that can be synchronized with the same clock). A common rule of thumb for sizing iso-synchronous regions is that wire delay should not exceed 20% of the clock period to prevent clock skew problems [Khatri01].
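The 20% rule of thumb can be turned into a quick wire-delay budget. The following Python sketch is only illustrative: the function name `max_wire_delay_ps` and its `fraction` parameter are our own naming, and the three frequencies are simply the ones quoted in the text (500 MHz, 1 GHz and 10 GHz).

```python
def max_wire_delay_ps(clock_hz: float, fraction: float = 0.20) -> float:
    """Largest wire delay (in ps) tolerated inside one iso-synchronous region.

    `fraction` is the share of the clock period granted to the wire
    (20% in the rule of thumb quoted from [Khatri01]).
    """
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    period_ps = 1e12 / clock_hz  # clock period expressed in picoseconds
    return fraction * period_ps


if __name__ == "__main__":
    for clock_hz in (500e6, 1e9, 10e9):
        budget = max_wire_delay_ps(clock_hz)
        print(f"{clock_hz / 1e9:5.1f} GHz -> wire delay budget {budget:6.1f} ps")
```

At 1 GHz the budget is 200 ps, whereas at 10 GHz it shrinks to 20 ps, which illustrates why a higher clock frequency reduces the size of the region that a single clock can cover.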

In this thesis, we will assume that core sizes can vary from 100K gates up to 10M gates per tile, most of the area being dedicated to the memory hierarchy (mainly L1 and L2 caches) [Catthoor04]. This would result in a platform composed of 10 to 20 tiles for a 90 nm technology, and that number could possibly grow for more advanced technologies.

We assume that the tiles are based on multiple datapath processor cores with dozens of vectors per path. Clocked at 1 GHz, a tile composed of 10 datapaths of 24 FUs would thus offer about 250 Gops (10 × 24 × 1 GHz = 240 Gops). A 10-tile platform would thus provide about 2.5 Tops, which should be largely sufficient for future personal applications, assuming that those applications are able to exploit this large degree of computing parallelism. Performance in the order of tera-operations per second is indeed usually only required by very high performance scientific computing applications, which are not relevant in the context of embedded devices.
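The throughput estimate above is a simple product, sketched below in Python. The helper name `peak_ops_per_second` is ours, and the sketch assumes that every functional unit (FU) completes one operation per clock cycle, which is an idealization the text does not state explicitly.

```python
def peak_ops_per_second(datapaths: int, fus_per_datapath: int, clock_hz: float) -> float:
    """Peak throughput of one tile, assuming one operation per FU per cycle."""
    return datapaths * fus_per_datapath * clock_hz


# Figures taken from the text: 10 datapaths of 24 FUs at 1 GHz, 10 tiles.
tile = peak_ops_per_second(datapaths=10, fus_per_datapath=24, clock_hz=1e9)
platform = 10 * tile

print(f"one tile:  {tile / 1e9:.0f} Gops")
print(f"ten tiles: {platform / 1e12:.1f} Tops")
```

The exact products are 240 Gops per tile and 2.4 Tops for ten tiles, which correspond to the rounded figures of 250 Gops and 2.5 Tops quoted in the text.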


An efficient and scalable communication architecture is thus needed to interconnect the tiles. This thesis will describe the design of a general simulation environment that allows a fair comparison of different communication architectures in terms of energy consumption, for a given set of applications and a given performance.

1.4 Challenges of Deep Sub-Micron Technologies

The integration of more and more transistors is made possible by constantly reducing their size. A common unit for measuring transistor size relative to a given technology is the minimal distance separating two wires. This unit is typically known as minimum feature size.

As minimum feature size became smaller than 90 nm, microelectronics has entered a new era of design challenges called the Deep Sub-Micron era.

This era is mainly characterized by the surprising fact that for the first time, communication will become more critical than computation. Interconnect will become the dominating factor determining speed, noise and power.

Compared to transistors, interconnect has evolved very slowly since CMOS technology was introduced. In the past thirty years, wire delay has been reduced by a factor of 60, while transistors have improved by a factor of 1000 over the same period.

But so far, interconnect has never been a critical issue as transistors were dominating the delay [Horowitz99].

This time is over. In the coming years, interconnect will have to improve dramatically if the microelectronics industry wants to keep following Moore’s law [Meindl03] (see table 1.1).

Technology node                            1 µm     100 nm   35 nm
MOSFET switching delay (ps)                20       5        2.5
Interconnect RC response time, 1 mm (ps)   1        30       250
Interconnect/transistor delay ratio        0.05     6        100
MOSFET switching energy (fJ)               300      2        0.1
Interconnect switching energy (fJ)         400      10       3
Vdd (V)                                    5        1.0      0.5
I (A)                                      2.5      150      360
Wiring levels                              3        8-9      10

Table 1.1: Predicted evolution of the interconnect vs. gate parameters (ITRS): gate delay and switching energy shrink, while interconnect delay grows and interconnect switching energy shrinks much more slowly than gate switching energy. The power supply voltage shrinks while the current grows. The number of wiring levels also grows as technology evolves
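The trend in Table 1.1 can be checked with a few lines of Python. The sketch below only re-uses the delay and energy rows of the table; the variable names are ours, and the interconnect/transistor energy ratio is a derived quantity that does not appear as a row in the table.

```python
# Values copied from Table 1.1 (columns: 1 um, 100 nm, 35 nm).
nodes = ("1 um", "100 nm", "35 nm")
mosfet_delay_ps = (20.0, 5.0, 2.5)
wire_delay_ps = (1.0, 30.0, 250.0)    # RC response time of a 1 mm wire
mosfet_energy_fj = (300.0, 2.0, 0.1)
wire_energy_fj = (400.0, 10.0, 3.0)

# Interconnect/transistor ratios: the delay ratio reproduces the table row,
# the energy ratio is computed here for illustration only.
delay_ratio = [w / g for w, g in zip(wire_delay_ps, mosfet_delay_ps)]
energy_ratio = [w / g for w, g in zip(wire_energy_fj, mosfet_energy_fj)]

for node, d, e in zip(nodes, delay_ratio, energy_ratio):
    print(f"{node:>7}: delay ratio {d:7.2f}, energy ratio {e:6.2f}")
```

Both ratios grow from one technology node to the next, which is one way to see that communication becomes more critical than computation even though the absolute interconnect switching energy decreases.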


Gate delay and switching energy shrink as technology evolves. Interconnect delay, on the other hand, grows, and interconnect switching energy shrinks much more slowly than gate switching energy. The supply current reaches very high values (up to 360 A) at a very low power supply voltage (0.5 V) for the future 35 nm technology. The number of wiring levels continues to rise with each technology node.

Nowadays, interconnect is indeed based on a complex hierarchical structure (see figure 1.7). Up to 9 wire levels are used in current designs and complexity will grow even more for future generations [ITRS01].

Figure 1.7: Wiring hierarchy (CMOS 7S) composed of 6 layers (© IBM): global (layers 5 and 6), semi-global (layers 3 and 4) and local (layers 1 and 2)

The interconnect hierarchy allows designers to use the most appropriate wires at each layer. The layering is based on wire length and functionality.

The bottom layer is dedicated to local wires, used for intra-tile interconnect within large blocks of gates. Their lengths are typically smaller than 100 µm. They represent about 90% of the total number of wires (see figure 1.8).

Intermediate wires are used to interconnect contiguous large blocks of gates with each other within one tile. Their length can reach 500 µm. They represent about 10% of the total number of wires.

The top layer is dedicated to global wires which are used to interconnect blocks of the size of a whole tile. They represent less than 1% of the total number of wires.

The global wires have a higher aspect ratio than local wires (height larger than width). Global wires are thus often called “fat wires”, as opposed to the local “flat wires”, which have a low aspect ratio.


Figure 1.8: Distribution of the interconnect net length [Kang99]

As we will see in this section, Deep Sub-Micron technologies bring many challenges, mainly affecting the global wire interconnect:

• increasing complexity of the wiring hierarchy

• increasing interconnect delay

• increasing energy consumption

• decreasing interconnect reliability

• increasing complexity of the testing

1.4.1 Complexity of the interconnect hierarchy

The number of metallization levels will certainly keep growing in future DSM designs as interconnect becomes more and more heterogeneous. It will be a challenge for silicon designers to deal with the complexity of such an interconnect hierarchy [Theis00].

1.4.2 Interconnect and communication architectures

Two different hierarchies related to communication co-exist in the platform.

On the physical plane, the interconnect hierarchy consists of different layers:

• local interconnect

• intermediate interconnect

• global interconnect

On the logical plane, the on-chip communication architecture hierarchy also consists of different layers:

• local on-chip communication architecture targeting intra-tile communications

• global on-chip communication architecture targeting inter-tile communications

The interconnect hierarchy is the hardware support exploited to implement the communication architecture hierarchy.

Global on-chip communication architectures are usually implemented with the global and intermediate interconnect layers.

1.4.3 Interconnect delay

As technology scales down, local and intermediate wires become shorter on average. Together with the introduction of new materials, this should lead to a dramatic improvement in local and intermediate wire delay.

Figure 1.9: Evolution of interconnect hierarchy: the number of interconnect layers is growing rapidly (© The Electrochemical Society)

However, many other phenomena are deteriorating this wire RC delay improvement: some affect the wire resistance (skin effect, effective resistivity, inelastic scattering at the boundaries, process variations...) while others affect the wire capacitance (cross-talk, fringing capacitance,...).

The major effect comes from the reduction of wire pitches in order to achieve higher wiring densities. This affects the RC delay because the wire cross-section decreases and thus its resistance increases. To tackle this problem, designers scale wire height more slowly than width, resulting in taller and thinner wires characterized by a higher aspect ratio. Because the cross-section is larger than it would be under uniform scaling, the wire resistance improves. The net impact on the RC delay results from a trade-off between the improved wire resistance and a larger inter-wire capacitance.

Transistor gates also become smaller, leading to lower transistor energy consumption and delay. The ratio between gate delay and local wire delay remains about the same. Local and intermediate interconnect will therefore not be a critical issue for future technologies.

While local wire lengths scale with the technology, global wire lengths do not. The length of the longest global wires remains about the same as technology scales. It could even increase as silicon dies could become larger. So, the relative contribution of the global interconnect to the power consumption and delay increases considerably (see figure 1.10).

The situation for intermediate wires is in between those two extremes.

A solution consists of increasing the height and width of the global wires and the thickness of the surrounding insulators by a factor $\lambda$, resulting in the fat wires described in the previous section. This keeps the capacitance close to its original value while multiplying the wire resistance by a factor $1/\lambda^2$, and thus also the RC delay.
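The effect of this scaling can be illustrated with a first-order model. The Python sketch below is an assumption-laden illustration, not the model used in this thesis: the `Wire` class and its field names are hypothetical, the capacitance is a plain parallel-plate estimate that ignores fringing fields, and the resistivity, permittivity and dimensions are placeholder values.

```python
from dataclasses import dataclass, replace

EPS0 = 8.854e-12  # vacuum permittivity in F/m


@dataclass(frozen=True)
class Wire:
    """Rectangular wire surrounded by insulator (all lengths in metres)."""
    length: float
    width: float
    height: float
    side_gap: float              # insulator thickness to the lateral neighbours
    vert_gap: float              # insulator thickness to the layers above/below
    resistivity: float = 1.7e-8  # ohm*m, roughly bulk copper (assumed value)
    eps_r: float = 3.0           # insulator relative permittivity (assumed value)

    def resistance(self) -> float:
        return self.resistivity * self.length / (self.width * self.height)

    def capacitance(self) -> float:
        # Two lateral plates and two vertical plates; fringing is ignored.
        lateral = 2 * EPS0 * self.eps_r * self.height * self.length / self.side_gap
        vertical = 2 * EPS0 * self.eps_r * self.width * self.length / self.vert_gap
        return lateral + vertical

    def rc(self) -> float:
        return self.resistance() * self.capacitance()

    def fattened(self, lam: float) -> "Wire":
        """Scale width, height and insulator thicknesses by lam; keep the length."""
        return replace(
            self,
            width=lam * self.width,
            height=lam * self.height,
            side_gap=lam * self.side_gap,
            vert_gap=lam * self.vert_gap,
        )


# Placeholder geometry for a 5 mm global wire.
thin = Wire(length=5e-3, width=0.2e-6, height=0.4e-6, side_gap=0.2e-6, vert_gap=0.4e-6)
fat = thin.fattened(2.0)

print("R ratio :", fat.resistance() / thin.resistance())
print("C ratio :", fat.capacitance() / thin.capacitance())
print("RC ratio:", fat.rc() / thin.rc())
```

In this simplified model, each capacitance term depends on the ratio of a conductor dimension to an insulator thickness, so it is unchanged by the scaling, while the resistance falls with the cross-section. With $\lambda = 2$ the resistance and the RC product both drop to a quarter of their original values.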

The heterogeneity in wire aspect ratio and size is likely to grow in future design generations.

Figure 1.10: Comparison between gate delay and local/global interconnect delay: local wire and gate delays are scaling down while the relative contribution of the global wire delay is increasing with technology [ITRS01]


1.4.4 Energy consumption

Energy consumption is a critical issue for hand-held devices as battery life is very limited. Moreover, heat removal becomes a critical problem in DSM technology as heat dissipation systems are already reaching their limits today.

The power consumption of a CMOS circuit has two components: dynamic and static.

$$P_{tot} = P_{dyn} + P_{stat} \tag{1.1}$$

Dynamic power consumption

Dynamic power consumption occurs when a gate is switching from one state to another. The switching energy corresponds to the amount of energy required to charge the equivalent capacitance of the wiring and the gates connected to its output. Also, a reduced amount of energy is spent when the complementary transistors are simultaneously conducting during the switching transient. This energy is referred to as short-circuit energy and can be neglected compared to the switching energy.

$$P_{dyn} = P_{short\,circuit} + P_{switching} \tag{1.2}$$

Static power consumption

Static power consumption is due to the existence of small leakage currents $I_{off}$ that flow through transistors in cut-off mode [Butts00].

$$P_{stat} = V_{DD}\, I_{off} \tag{1.3}$$

CMOS technology was initially chosen precisely because its static power consumption was very low compared to other technologies. However, as the industry moves to Deep Sub-Micron technologies, the leakage power consumption of CMOS circuits can no longer be neglected [Keshavarzi97].

Leakage current arises from complex mechanisms. It can basically be viewed as a sum of several current contributions [K. Roy03], [Keshavarzi97] (see figure 1.11). Not all leakage currents correspond to pure static energy consumption: currents $I_1$ and $I_3$ occur in both the on and off transistor states.

Reverse biased pn junction current: Junction leakage current ($I_1$) appears in the reverse-biased pn junction formed by the drain and the diffused p-well. It is a function of the junction area and the doping concentration. Its contribution to off-state leakage current is minor.


Figure 1.11: Leakage current contributions: junction leakage current ($I_1$), sub-threshold leakage current ($I_2$), gate-oxide tunneling current ($I_3$), hot-carrier injection current ($I_4$), gate induced drain leakage current ($I_5$), punchthrough leakage current ($I_6$) [K. Roy03]

Sub-threshold leakage current: Sub-threshold leakage corresponds to the current flowing from the drain to the source of a transistor ($I_2$) when it is in the cut-off region, i.e. when the gate voltage is below the threshold voltage $V_{th}$. It is due to weak channel inversion. The sub-threshold leakage current decreases exponentially as $V_{th}$ increases.

$$I_2 = \mu_0 C_{ox} \frac{W}{L} (m-1) \left(\frac{kT}{q}\right)^2 e^{\frac{V_G - V_{th}}{m\,kT/q}} \left(1 - e^{-\frac{v_{DS}\,q}{kT}}\right) \tag{1.4}$$

In Deep Sub-Micron technologies, the combination of lower threshold voltage, reduced transistor gate length and a considerably increased number of transistors makes the situation much worse. The leakage current of a single transistor is indeed negligible in absolute terms, but as transistor density reaches billions of transistors per chip, sub-threshold leakage becomes a major problem. Note also the temperature dependence of the leakage current.
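The dependence of the sub-threshold current on the threshold voltage and on temperature can be explored numerically. The Python sketch below evaluates equation (1.4) and the static power of equation (1.3) summed over many transistors. The function names, the body-effect coefficient `m = 1.5`, the lumped prefactor `mu0_cox_w_over_l` (standing for the product of mobility, oxide capacitance and W/L) and all operating points are illustrative assumptions, not values taken from this thesis.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C


def thermal_voltage(temp_k: float) -> float:
    """kT/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON


def subthreshold_current(v_g: float, v_th: float, v_ds: float,
                         temp_k: float = 300.0, m: float = 1.5,
                         mu0_cox_w_over_l: float = 1e-4) -> float:
    """Sub-threshold current I2 of equation (1.4), in amperes.

    m and mu0_cox_w_over_l (in A/V^2) are placeholder device parameters.
    """
    vt = thermal_voltage(temp_k)
    return (mu0_cox_w_over_l * (m - 1.0) * vt ** 2
            * math.exp((v_g - v_th) / (m * vt))
            * (1.0 - math.exp(-v_ds / vt)))


def static_power(v_dd: float, i_off_per_transistor: float, n_transistors: int) -> float:
    """Equation (1.3) with I_off taken as the sum over identical transistors."""
    return v_dd * i_off_per_transistor * n_transistors


if __name__ == "__main__":
    # Transistor in cut-off (V_G = 0 V) with a 1 V drain-source voltage.
    i_high_vth = subthreshold_current(v_g=0.0, v_th=0.4, v_ds=1.0)
    i_low_vth = subthreshold_current(v_g=0.0, v_th=0.3, v_ds=1.0)
    i_hot = subthreshold_current(v_g=0.0, v_th=0.3, v_ds=1.0, temp_k=350.0)
    print(f"lowering V_th by 100 mV multiplies I2 by {i_low_vth / i_high_vth:.1f}")
    print(f"heating from 300 K to 350 K multiplies I2 by {i_hot / i_low_vth:.1f}")
    print(f"static power of 1e9 such transistors at 1 V: "
          f"{static_power(1.0, i_low_vth, 10**9):.3e} W")
```

With these placeholder parameters, lowering $V_{th}$ by 100 mV increases the leakage of a cut-off transistor by more than an order of magnitude, and the current also rises with temperature. The last line shows how a very small per-transistor current adds up once it is multiplied by a billion devices.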
