The Platform - A SystemC Platform for NoC Analysis


7.2 The Platform

NOCEXplore (http://sourceforge.net/projects/nocexplorer/) [24] is a SystemC library for NoC performance comparisons and investigations. It has two main aims: to provide a platform for comparing NoCs chosen from a large and upgradeable design space, and to give the designer a highly detailed view of the events that occur in the network during simulation. Despite this level of detail, the CPU time required for simulation and data post-processing remains acceptable.

In the NOCEXplore platform, the network models are configured by a set of 19 parameters that define the possible network configurations. The traffic description adds three more parameters. Overall, the configuration space has the 22 dimensions listed below, so solutions are explored over a 22-dimensional space. Each dimension can take a physical value, a numeric value or an identifier.

1. The network quality of service is an identifier that describes the global network services and the main router architecture.

2. The network size is a numeric value and indicates how many modules are connected to the network.

3. The topology is a number that identifies how routers are connected by links.

4–7. Link type, link width, link delay and number of physical links per topological arc are four parameters that describe the link: the first identifies the link protocol and communication scheme, the second is the width of the flits circulating in the network, the third is the inter-router flit latency, and the last allows more than one link to be used between a pair of routers for a single topological arc.

8–9. flit_per_packet and packet_per_message specify how many flits make up each packet and, in bursty communications, how many packets make up a message. All flits of a packet follow the same path; different packets, even if they belong to the same message, can be routed differently. The distinction between packet and message is necessary when a large amount of data (more than the maximum that the network protocol can provide) must be transferred efficiently from one IP to another.

7 NOCEXplore 93

10. Each router, seen as a synchronous machine, has a local clock generator of a given frequency; each generator has its own starting delay, independent of the others.

11. The routing algorithm implemented in each router is modeled by three functions:

• the “routing function” is responsible for avoiding deadlock and livelock, and it can use topological information to compute the distance between the current node and the destination node when a particular output port is taken.

• the “selection function” selects one of the admissible output ports returned by the previous function: the choice can take into account the current congestion of the router, the current congestion of the neighbouring routers and/or the overall or partial network status obtained by processing control messages produced and consumed by the routers. The selection function determines the degree of adaptivity of the algorithm.

• the “header function” modifies, if necessary, some fields of the packet's header flit.

12. The arbitration scheme resolves contention where resources are shared in the router.

There are two points of contention: different input buffers requesting the same output buffer, and different output buffers requesting the same output port or output channel. The arbitration scheme can be based on user-defined fields in the header flit (for example a transaction identifier, or a priority related to arrival time or packet lifetime).

13. The switch structure refers to the architecture of the component that transfers flits from the input stage of the router to the output stage; it determines how many flit transfers can be performed in a router in a single clock cycle.

14. The DPM policy is a set of rules that determines the router power state so as to obtain the best trade-off between communication performance and energy consumption. Workload conditions are estimated by measuring buffer occupancy and flit rate. The techniques considered are DVS (Dynamic Voltage Scaling) and DFS (Dynamic Frequency Scaling). The designer must provide the power models, the power states and the state-transition rules.

15. The flow control coordinates the router's internal resources.

16–17. Input port buffer length and input port buffer number describe the depth and the number of the buffers in the input stage: several buffers can be placed in parallel, and flits can be written into a buffer either by direct insertion or by shifting (shift-register structure).

18–19. Output port buffer length and output port buffer number describe the depth and the number of the buffers in the output stage.
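The three functions of item 11 and the arbitration of item 12 can be illustrated with a short Python sketch of a single 2D-mesh router; it is not NOCEXplore's SystemC code. It implements the two routing algorithms used later in the Fig. 7.1 experiment, dimension-order (XY) routing and west-first routing, both of which avoid deadlock in a mesh by forbidding certain turns. The port names, the `occupancy` map consulted by the selection function and the `priority`, `arrival_cycle` and `hops` header fields are hypothetical.

```python
from dataclasses import dataclass

# Output ports of a 2D-mesh router (hypothetical names).
EAST, WEST, NORTH, SOUTH, LOCAL = "E", "W", "N", "S", "L"


@dataclass
class HeaderFlit:
    src_port: int           # input buffer holding the packet
    dest: tuple             # (x, y) of the destination router
    priority: int = 0       # hypothetical user-defined header field
    arrival_cycle: int = 0  # cycle at which the header flit arrived
    hops: int = 0           # hypothetical field updated by the header function


def routing_xy(cur, dest):
    """Routing function, deterministic dimension order: fix x first, then y."""
    (cx, cy), (dx, dy) = cur, dest
    if dx != cx:
        return [EAST if dx > cx else WEST]
    if dy != cy:
        return [NORTH if dy > cy else SOUTH]
    return [LOCAL]


def routing_west_first(cur, dest):
    """Routing function, minimal west-first: all west hops are taken first;
    afterwards east/north/south may be chosen adaptively."""
    (cx, cy), (dx, dy) = cur, dest
    if dx < cx:
        return [WEST]
    ports = []
    if dx > cx:
        ports.append(EAST)
    if dy > cy:
        ports.append(NORTH)
    elif dy < cy:
        ports.append(SOUTH)
    return ports or [LOCAL]


def select_port(admitted, occupancy):
    """Selection function: least-occupied admitted port (first one on ties)."""
    return min(admitted, key=lambda port: occupancy.get(port, 0))


def arbitrate(contenders):
    """Arbitration: highest priority wins; ties go to the oldest header flit,
    then to the lowest input buffer index."""
    return min(contenders, key=lambda f: (-f.priority, f.arrival_cycle, f.src_port))


def header_function(flit):
    """Header function: in this sketch it only increments a hop counter."""
    flit.hops += 1
    return flit


def route_step(cur, flits, routing, occupancy):
    """One decision round: route and select a port for every header flit,
    then arbitrate among flits that want the same output port.
    Returns {output port: winning input buffer}."""
    requests = {}
    for flit in flits:
        port = select_port(routing(cur, flit.dest), occupancy)
        requests.setdefault(port, []).append(flit)
    return {port: header_function(arbitrate(group)).src_port
            for port, group in requests.items()}
```

With XY routing every header flit has exactly one admissible port, so the selection function has nothing to choose; west-first returns two ports whenever the destination lies to the east and in a different row, which is where congestion information comes into play.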

94 S. Gigli and M. Conti

The traffic is described by three parameters:

20. The traffic intensity is the mean flit injection rate of the IPs; this value is normalized to the maximum of one flit per clock cycle, so the traffic offered to the entire network is the product of three terms: traffic intensity, number of IPs connected to the network, and link width.

21. The traffic scenario describes the spatial distribution of message flows: the tool offers a small set of traffic primitives with which designers can define any mixture of message flows given by the CTG (Communication Task Graph) of the applications under examination.

22. The burstiness is the percentage of bursty traffic over the total traffic emitted by each source node.
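The offered-load product of item 20 can be written out as a small Python sketch. The conversion to Gbit/s is an added convenience that assumes a single network clock frequency, whereas item 10 allows each router its own clock; the example values (16 IPs, 32-bit links) are illustrative only.

```python
def offered_traffic_bits_per_cycle(intensity: float, num_ips: int,
                                   link_width_bits: int) -> float:
    """Total traffic offered to the network, in bits per clock cycle.

    intensity is the mean flit injection rate per IP, normalized to the
    maximum of one flit per clock cycle; link_width_bits is the flit width.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    return intensity * num_ips * link_width_bits


def offered_traffic_gbps(intensity: float, num_ips: int,
                         link_width_bits: int, clock_ghz: float) -> float:
    """Same quantity in Gbit/s: bits/cycle times cycles/ns gives bits/ns."""
    return offered_traffic_bits_per_cycle(intensity, num_ips, link_width_bits) * clock_ghz
```

For instance, 16 IPs injecting at half the maximum intensity over 32-bit links offer 0.5 × 16 × 32 = 256 bits per cycle.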

The set of the 22 parameter values defines a network configuration. The nodes attached to the network are traffic generators that act as both sources and sinks.
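A network configuration can be pictured as a record with one field per dimension, and a design-space exploration as a sweep over chosen values of some of those fields. The Python sketch below is only an illustration: field names, types and default values are hypothetical placeholders, not NOCEXplore identifiers (except `flit_per_packet` and `packet_per_message`, which appear in item 8–9).

```python
from dataclasses import dataclass, fields, replace
from itertools import product


@dataclass(frozen=True)
class NetworkConfiguration:
    """One point of the 22-dimensional design space (placeholder values)."""
    # Network (1-3)
    quality_of_service: str = "best_effort"
    network_size: int = 16
    topology: str = "mesh"
    # Link (4-7)
    link_type: str = "synchronous"
    link_width: int = 32               # flit width in bits
    link_delay_ns: float = 1.0
    links_per_arc: int = 1
    # Packetization (8-9)
    flit_per_packet: int = 8
    packet_per_message: int = 1
    # Router (10-19)
    router_frequency_mhz: float = 500.0
    routing_algorithm: str = "west_first"
    arbitration_scheme: str = "priority"
    switch_structure: str = "crossbar"
    dpm_policy: str = "none"
    flow_control: str = "credit_based"
    input_buffer_length: int = 4
    input_buffer_number: int = 1
    output_buffer_length: int = 2
    output_buffer_number: int = 1
    # Traffic (20-22)
    traffic_intensity: float = 0.1
    traffic_scenario: str = "uniform"
    burstiness: float = 0.0            # percentage of bursty traffic


def sweep(base, **axes):
    """Yield every configuration obtained by varying the named parameters
    over the listed values (Cartesian product), others taken from base."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield replace(base, **dict(zip(names, values)))
```

Sweeping `routing_algorithm` over two values and `traffic_intensity` over three yields six configurations; adding a value to one of those lists is all that is needed to extend the exploration along that dimension.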

The platform was designed to be easily upgradeable. Adding a new numeric or physical value, for example a new buffer length, only requires inserting the new value in the list for that parameter. Adding a new behavior, for example a new topology, requires designers to create a new topology class derived from the topology base class and to override one or more of the virtual methods that describe the topology.
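The extension mechanism (a derived topology class overriding virtual methods of a base class) can be mirrored in Python with an abstract base class. The class and method names below (`Topology`, `neighbours`, `links`, `Ring`, `Mesh2D`) are hypothetical and stand in for the platform's C++ classes, whose actual interface is not described here.

```python
from abc import ABC, abstractmethod


class Topology(ABC):
    """Hypothetical topology base class: derived classes override the
    method that describes how routers are connected."""

    def __init__(self, num_routers: int):
        self.num_routers = num_routers

    @abstractmethod
    def neighbours(self, router: int) -> list[int]:
        """Routers directly linked to `router`."""

    def links(self) -> set[tuple[int, int]]:
        """Undirected topological arcs, derived from neighbours()."""
        return {
            (min(r, n), max(r, n))
            for r in range(self.num_routers)
            for n in self.neighbours(r)
        }


class Ring(Topology):
    def neighbours(self, router: int) -> list[int]:
        n = self.num_routers
        return [(router - 1) % n, (router + 1) % n]


class Mesh2D(Topology):
    def __init__(self, width: int, height: int):
        super().__init__(width * height)
        self.width, self.height = width, height

    def neighbours(self, router: int) -> list[int]:
        x, y = router % self.width, router // self.width
        result = []
        if x > 0:
            result.append(router - 1)
        if x < self.width - 1:
            result.append(router + 1)
        if y > 0:
            result.append(router - self.width)
        if y < self.height - 1:
            result.append(router + self.width)
        return result
```

Only `neighbours` has to be written for a new topology; shared services such as `links` are inherited from the base class.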

7.3 Investigations

The information extracted by the platform from simulations of the network under a defined traffic scenario can be saved to files and/or passed to the post-processing phase. Four types of analysis can be carried out: global statistical analysis, global and detailed probabilistic analysis, dynamic analysis and power analysis.

7.3.1 Statistical Analysis

The statistical analysis allows the performance of different network configurations to be compared under identical or different scenarios. Three kinds of global performance can be distinguished. The first, communication performance, is based on message delays: the minimum, maximum, standard deviation and mean delay of all steady-state messages; throughput is also evaluated. The second kind concerns counts of events such as routing calls, number of routed flits, commutations on links (each link seen as a parallel bus with as many wires as there are bits in a flit) and flit shifts when a shift-register structure is adopted in the buffers; these events are collected during the simulation of the network.
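A minimal Python sketch of the global communication statistics, assuming each received message is logged with its injection and reception cycles. The record layout is hypothetical, "steady state" is approximated by discarding messages injected during a warm-up period (the platform's own criterion is not described here), all steady-state messages are assumed to arrive before `end_cycle`, and throughput is reported as accepted flits per node per clock cycle.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Message:
    src: int
    dst: int
    inject_cycle: int
    receive_cycle: int
    flits: int

    @property
    def delay(self) -> int:
        return self.receive_cycle - self.inject_cycle


def global_statistics(messages, warmup_cycles, end_cycle, num_nodes):
    """Min/max/mean/standard deviation of steady-state message delays and
    accepted throughput (flits per node per cycle) over the measurement window."""
    steady = [m for m in messages if m.inject_cycle >= warmup_cycles]
    delays = [m.delay for m in steady]
    window = end_cycle - warmup_cycles
    return {
        "min": min(delays),
        "max": max(delays),
        "mean": statistics.mean(delays),
        "std": statistics.pstdev(delays),  # population standard deviation
        "throughput": sum(m.flits for m in steady) / (window * num_nodes),
    }
```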


The third kind of global performance relates to power consumption: based on user-defined power models of the routers and links, an overall estimate of energy consumption is provided, taking into account all routers and all links together.

In this way, for example, the designer can verify the effectiveness of the dynamic power management policy on routers and links.

As an example, some simulation results are reported below. Figure 7.1 compares the mean message latency of the same network under different routing algorithms. The network consists of 16 nodes in a 4×4 mesh topology. The traffic scenario contains four hotspots whose flows overlap on three links; the other nodes inject uniform traffic with an intensity five times lower than that of the nodes involved in the hotspots. Two routing algorithms are used: the first is deterministic and performs dimension-order routing; the second is partially adaptive and performs the west-first algorithm. The graph shows the mean delay of all packets received from and sent to all nodes in steady-state conditions versus the traffic intensity. The overlapping flows bring the saturation threshold down to half the link bandwidth, and with deterministic routing the network saturates at an intensity of 38% of the maximum; the adaptive algorithm saturates at about 50% of the maximum intensity and with a gentler slope.

Fig. 7.1 Example of statistical analysis of the communication performance of a 4×4 mesh network subject to an overlapped-hotspot and uniform traffic scenario under two different routing algorithms: the adaptive one increases the saturation threshold by about 50%, bringing it from 34 to 50% of traffic intensity

7.3.2 Probabilistic Analysis

The second type of analysis is detailed and local. A probabilistic analysis of message delays highlights the traffic flows that do not meet the task constraints.

A distribution of message delays is computed, taking into account:

• all messages circulating in the network;

• all messages emitted by a specified source node;

• all messages consumed by a specified sink node;

• all messages emitted and consumed by a specified pair of source/sink nodes.

Moreover, the four main statistical indexes (mean, standard deviation, skewness and kurtosis) of each distribution are calculated. This kind of analysis reveals the message flows that satisfy specific communication performance constraints at a given traffic intensity.
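The four filters and the four indexes can be sketched in Python, with messages represented as hypothetical dictionaries holding `src`, `dst` and `delay`. Population (biased) moment estimators are assumed and kurtosis is the non-excess value (3 for a normal distribution); the platform may use different estimators. The histogram helper is likewise an illustrative way to obtain an empirical probability density.

```python
from collections import Counter


def select(messages, src=None, dst=None):
    """The four filters: all messages, by source, by sink, or by
    source/sink pair (None means no constraint on that end)."""
    return [m for m in messages
            if (src is None or m["src"] == src)
            and (dst is None or m["dst"] == dst)]


def delay_moments(delays):
    """Mean, standard deviation, skewness and kurtosis of a delay sample,
    using population moments; the sample must not be constant."""
    n = len(delays)
    mean = sum(delays) / n
    m2 = sum((d - mean) ** 2 for d in delays) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis undefined for a constant sample")
    m3 = sum((d - mean) ** 3 for d in delays) / n
    m4 = sum((d - mean) ** 4 for d in delays) / n
    std = m2 ** 0.5
    return mean, std, m3 / std ** 3, m4 / m2 ** 2


def delay_pdf(delays, bin_width):
    """Empirical probability density: {bin start (cycles): density}."""
    counts = Counter(d // bin_width for d in delays)
    n = len(delays)
    return {b * bin_width: c / (n * bin_width) for b, c in sorted(counts.items())}
```

A long tail of large delays, such as the secondary peaks discussed for Fig. 7.2, shows up as positive skewness and pulls the mean above the main peak.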

The example of Fig. 7.2 shows the probability density function of the delay of the network with adaptive routing at the onset of saturation, at a traffic intensity of 54%. Although the most important peak occurs at low delays, the non-negligible density peaks at 40, 80, 200, 250, 320 and 500 clock cycles push the mean latency up to 125 clock cycles. However, the majority of packets are delivered with reasonable delays.

7.3.3 Dynamic Analysis

The third type of analysis investigates the temporal evolution of certain “events” that take place during the simulation. If the previous analysis indicates which flows do not meet their constraints at the network terminals, the designer can discover which parts of the network are involved in those flows. These “events” are:

• utilization level of buffers of each router;

• moving average of flits traversing the switch;

• moving average of link writings;

• moving average of routing activities.

Moreover, a statistical analysis of these occurrences is provided.

This kind of investigation allows the designer to discover which resources may be oversized or undersized, which traffic conditions cause temporary congestion, and how long the congestion persists. Each processing run produces overall, steady-state and transient plots of buffer utilization.
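Two building blocks of such a dynamic analysis can be sketched in Python: a moving average of a per-cycle event count (switch traversals, link writes or routing calls), and the extraction of congestion periods from a buffer-utilization trace, which tells how long a congestion status is maintained. The window length and the congestion threshold are hypothetical user-chosen parameters.

```python
from collections import deque


class MovingAverage:
    """Moving average of an event count over the last `window` clock cycles."""

    def __init__(self, window: int):
        self.window = window
        self.samples = deque(maxlen=window)
        self.total = 0

    def update(self, events_this_cycle: int) -> float:
        if len(self.samples) == self.window:
            self.total -= self.samples[0]  # oldest sample is about to drop out
        self.samples.append(events_this_cycle)
        self.total += events_this_cycle
        return self.total / len(self.samples)


def congestion_intervals(utilization, threshold):
    """(start cycle, length) of each run of cycles whose buffer
    utilization is at or above the threshold."""
    intervals, start = [], None
    for cycle, u in enumerate(utilization):
        if u >= threshold and start is None:
            start = cycle
        elif u < threshold and start is not None:
            intervals.append((start, cycle - start))
            start = None
    if start is not None:
        intervals.append((start, len(utilization) - start))
    return intervals
```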


7.3.4 Power Analysis

The power analysis of a Network-on-Chip involves energy estimation and dynamic power management, and it can be performed in two ways depending on the level of detail required and the accuracy of the energy models. The coarse analysis considers routers equipped with a dynamic power management module, with a power consumption associated with each power state of the router.

The total energy dissipated by the network is obtained by summing the energy consumed by each router in each of its states, plus the energy required for state changes.
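The coarse computation amounts to E = Σ over routers and cycles of P(state) · T_clk, plus Σ over state changes of E(from, to). A minimal Python sketch, assuming each router's power state is sampled once per clock cycle, power in mW, time in ns and energy in pJ (mW × ns = pJ); per-state powers and transition energies are supplied by the designer, and the dictionary layout is hypothetical.

```python
def coarse_energy_pj(state_traces, state_power_mw, transition_energy_pj, cycle_ns):
    """Coarse-grained network energy in pJ.

    state_traces: for each router, its power state at every clock cycle.
    state_power_mw: power (mW) associated with each power state.
    transition_energy_pj: energy (pJ) of each (from_state, to_state) change.
    cycle_ns: duration of one clock cycle in ns.
    """
    energy = 0.0
    for trace in state_traces:
        # Energy spent residing in each state (mW * ns = pJ).
        for state in trace:
            energy += state_power_mw[state] * cycle_ns
        # Energy spent switching between consecutive, different states.
        for prev, cur in zip(trace, trace[1:]):
            if prev != cur:
                energy += transition_energy_pj[(prev, cur)]
    return energy
```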

In parameter 14 described in the previous section, the DPM policy, the designer has to define both the power models and the power state machine implementing the DPM policy.
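A DPM power state machine of the kind item 14 describes can be sketched as a threshold policy in Python. Everything concrete here is an assumption: three DVS/DFS operating points (the example of Fig. 7.3 uses nine ACPI-style states), the threshold values, estimating the workload as the larger of normalized buffer occupancy and flit rate, and moving by at most one state per decision to limit oscillations.

```python
from dataclasses import dataclass

# Hypothetical DVS/DFS operating points, ordered from most power-saving to
# fastest: (name, supply voltage in V, clock frequency in MHz).
POWER_STATES = [
    ("low", 0.8, 250),
    ("mid", 1.0, 500),
    ("high", 1.2, 1000),
]


@dataclass
class DpmPolicy:
    """Threshold DPM policy; occupancy and flit rate are normalized to [0, 1]."""
    up_threshold: float = 0.6    # above this load, move to a faster state
    down_threshold: float = 0.2  # below this load, move to a slower state

    def next_state(self, state: int, occupancy: float, flit_rate: float) -> int:
        """Index into POWER_STATES for the next decision interval."""
        load = max(occupancy, flit_rate)
        if load > self.up_threshold and state < len(POWER_STATES) - 1:
            return state + 1
        if load < self.down_threshold and state > 0:
            return state - 1
        return state  # between the thresholds: keep the current state
```

The gap between the two thresholds acts as hysteresis; in a real evaluation its width trades power saving against the stability of the power state, one of the criteria listed for judging a policy.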

Figure 7.3 shows a graph in which the power state of each router is reported over time: time is on the x axis and the router identification number on the y axis. The colour indicates the state, and the colour bar on the right side gives the legend, mapping colours to the numbers of the power states.

This analysis highlights the repercussions of a router's energy and performance states on its neighbouring routers and evaluates the effectiveness of the DPM policy for a given topology/routing/scenario triple; the effectiveness of a policy is measured in terms of actual power saving, sensitivity to local and temporary congestion, and stability of the power state.

Fig. 7.2 Example of probabilistic analysis of the communication performance of a 4×4 mesh network subject to an overlapped-hotspot and uniform traffic scenario under the west-first partially adaptive routing algorithm

As mentioned above, the second way of evaluating Network-on-Chip power consumption is to collect “activities” related to energy dissipation. These activities are: data-dependent commutations on the link wires, flits entering and leaving a router, routing function calls and flit crossings in the switch. These measures, combined with technological constants and the power models provided by the designer, give information about power consumption.
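A Python sketch of the activity-based estimate, assuming the flits on one link are given as integers whose bits map onto the link wires: data-dependent commutations are counted as the Hamming distance between consecutive flits (the initial wire state is not counted), and total energy is the sum of each activity count times a per-event energy. The activity names and energy values are hypothetical; the real constants come from the technology and the designer's power models.

```python
def link_toggles(flits):
    """Bit commutations on a parallel link: every bit that differs between
    two consecutive flits (Hamming distance) is one wire transition."""
    return sum(bin(a ^ b).count("1") for a, b in zip(flits, flits[1:]))


def activity_energy(activity_counts, energy_per_event):
    """Energy estimate: sum over activity types of count x per-event energy."""
    return sum(count * energy_per_event[kind]
               for kind, count in activity_counts.items())
```

For example, the flit sequence 0b0000, 0b1111, 0b1010 causes 4 + 2 = 6 wire transitions, which would be weighted by a per-toggle energy alongside the router, routing and switch activity counts.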

The accuracy of the communication and power performance figures depends on the accuracy of the models. Simple models let the designer obtain performance trends for architectural choices, while more detailed models improve accuracy. At the moment, the router model needs three clock cycles for a flit crossing without contention, and the routing decision is made in one clock cycle; the inter-router link latency can vary from one to ten nanoseconds depending on the complexity of the modeled circuitry.
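The router timing quoted above allows a first-order estimate of zero-load packet latency. The Python sketch assumes wormhole switching in which the header flit spends three cycles in every router on its path and one link delay on each inter-router link, while the remaining flits follow at one per clock cycle; contention and the links to and from the IPs are ignored.

```python
def zero_load_latency_ns(routers_on_path, flits_per_packet, clock_period_ns,
                         link_delay_ns, router_cycles=3):
    """First-order zero-load latency of a packet (no contention).

    Header: router_cycles per router plus one link delay per inter-router
    link. Body: the other flits follow in a pipeline, one per clock cycle.
    """
    links = routers_on_path - 1
    header = (routers_on_path * router_cycles * clock_period_ns
              + links * link_delay_ns)
    serialization = (flits_per_packet - 1) * clock_period_ns
    return header + serialization
```

For example, in a 4×4 mesh a corner-to-corner packet under XY routing crosses 7 routers and 6 links; with an illustrative 2 ns clock, 1 ns links and 8 flits per packet the estimate is 42 + 6 + 14 = 62 ns.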

A few words about simulation time. A performance profiling of the platform is planned, and the developers are aware that some performance improvements are possible. Simulation time strongly depends on network size and traffic intensity; in SystemC, simulation time strongly depends on the number of context switches performed by the simulation kernel.

Fig. 7.3 Example of power analysis in which the energy states of the routers are plotted over time. Darker colours denote higher-performance, more power-consuming states, while lighter colours denote lower-performance, more power-saving states. The router power state machine has nine power states and follows the ACPI standard (http://www.acpi.info)


Increasing the network size increases the number of instantiated modules and consequently the number of context switches, causing performance penalties. Moreover, we have observed a non-negligible increase in simulation time when the traffic intensity increases and the network starts to become congested; this behaviour is due to the amount of data allocated in memory.

Currently, about 2 min is needed on a commercial notebook (64-bit, 1.5 GHz CPU with 4 GB of RAM) to simulate and post-process a NoC with 16 nodes and 16 routers under worst-case conditions. We consider this computational performance quite good, because the simulations are cycle-accurate and the user can access many event details for more detailed investigations.

From: Lecture Notes in Electrical Engineering (pages 97-104)