Verification Environment for the Spidergon STNoC Router IP

Case Study of a Router for Network-on-Chip Communication in Embedded Systems

8.4 Verification Environment for the Spidergon STNoC Router IP

The functional verification of the Spidergon STNoC building blocks has been carried out by building a coverage-driven simulation environment. In this case study the DUT is a Spidergon STNoC Router IP core, but the same approach has been followed for NIs and links. This section first describes the functional features of the designed Router and its implementation results in submicron CMOS technology, and then presents the design of the verification environment.

The Router architecture has been defined according to a parametric and modular approach using the VHDL language. The Spidergon router (see Fig. 8.5) can be connected through two unidirectional links with three other routers in the directions Right (R), Left (L) and Across (A), plus a fourth connection to the local Network Interface, used as the network entry/exit point. The physical link consists of two unidirectional data channels, Downstream (DS) and Upstream (US), with the relevant handshake signals to realize a credit-based hop-by-hop flow control (val and credit in Fig. 8.5). The router adopts wormhole packet switching, where a packet is subdivided into flits and all of them follow the same path reserved by the header. The routing algorithm is deterministic, so the same path is always chosen between a source and a destination node, even if multiple paths exist. This choice avoids costly flit reordering at packet reception. The idea is to move along the ring, in the proper direction, to reach nodes which are near the source node,

8 Coverage-Driven Verification of HDL IP Cores 111

using the Across link as first or last hop to jump to a part of the network that is too far away. The router uses simple source-based routing: the entire path is encoded in the packet header, so each router only has to extract the forwarding information, without any computation or look-up table. The routing scheme, along with a proper QoS scheduling policy, is free of starvation issues. The router also avoids deadlock by deploying Virtual Networks (VNs in Figs. 8.5 and 8.6, also called Virtual Channels, VCs). VNs provide logical links over the same shared physical channels by establishing a number of independently allocated flit buffers in the corresponding transmitter/receiver nodes. Currently the request and response logical paths are implemented on top of two disjoint VNs, sharing the physical link bandwidth and maximizing wire efficiency. The parametric number of VNs supported by the router can enable advanced routing schemes or independent QoS traffic classes for real-time and low-latency flows.

Fig. 8.5 S-STNoC router ports breakdown and Downstream (DS) and Upstream (US) channels with the relevant handshake signals

Fig. 8.6 Environment adopted for the Router functional verification

The credit-based flow control works on a per-flit basis. Flits can be sent in the US direction only if there are enough credits, i.e. the DS interface of the receiving component has enough free locations in its input buffer to store the incoming flits. Output Queues on US ports can be instantiated for enhanced performance, avoiding head-of-line blocking. Queues are shared among input flows to limit costly time/space speed-up factors, and they have a bypass feature to reduce the minimum router crossing latency under low traffic conditions. The architecture also supports not instantiating the Output Queue for low-cost implementations, when performance or traffic types do not require output buffering. Optionally, a separate Output Queue can be instantiated for each input port directed to that output. This configuration increases global network performance when a lot of traffic is concentrated towards the considered output.

The applied QoS mechanism is Fair Bandwidth Allocation (FBA). It allows flexible, scalable and low-cost management of the available bandwidth. The requested bandwidth value is programmed at the injection point (Network Interface) and is not explicitly linked to the path of a data flow through the router, as it is in other NoC architectures. FBA avoids complexity inside the router by providing all necessary information in the network header and limiting the router behavior to a simple two-step arbitration. When all data flows have the same bandwidth reservation, the arbitration algorithm reduces to one of the following, configurable by the user: Round Robin (RR), Least Recently Used (LRU) or fixed priority.
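The per-flit credit mechanism described above can be illustrated with a minimal Python sketch. It is a behavioral toy model, not the Router's VHDL implementation: the class name CreditLink, its methods and the buffer depth of 4 are assumptions chosen for illustration. The sender holds one credit per free location in the receiver's input buffer, spends a credit for each flit sent, and gets a credit back each time the receiver reads a flit.

```python
from collections import deque


class CreditLink:
    """Toy model of one credit-based link (hypothetical names, not the RTL)."""

    def __init__(self, buffer_depth):
        # The sender starts with one credit per free slot in the receiver's input buffer.
        self.credits = buffer_depth
        self.input_buffer = deque()

    def send(self, flit):
        """Sender side: transmit a flit only if a credit is available."""
        if self.credits == 0:
            return False  # no free slot at the receiver, the flit must wait
        self.credits -= 1
        self.input_buffer.append(flit)
        return True

    def read(self):
        """Receiver side: pop a flit from the input buffer and return a credit."""
        flit = self.input_buffer.popleft()
        self.credits += 1
        return flit


link = CreditLink(buffer_depth=4)
accepted = [link.send(f"flit{i}") for i in range(5)]
print(accepted)          # the fifth flit is refused: all 4 credits are spent
first = link.read()      # reading frees one slot and returns one credit
retry = link.send("flit4")
print(first, retry)
```

In this sketch the refused flit is simply retried by the caller once a credit has come back; a real sender would hold it until the credit signal is observed.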

The router has been implemented in different configurations and in different STMicroelectronics CMOS standard-cell technologies (90 nm, 65 nm and 45 nm), always achieving an optimal trade-off between performance and complexity.

As an example, in 65 nm 1.1 V standard-cell CMOS technology, a full Router configuration with all 4 ports enabled (Spidergon topology), each with 2 VNs, a size of 72 bits on VN1 (request path) and 64 bits on VN2 (response path), input buffers (IB) and output queues (OQ) able to store 4 and 5 flits respectively, LRU arbitration and FBA management has a circuit complexity of roughly 70 Kgates (including the flip-flop implementation of the IB and OQ memory resources). It achieves a clock frequency of 500 MHz, i.e. at least 32 Gbps data transfer for the US and DS channels, with a low-leakage library version ensuring a static power consumption below 100 µW. By using a standard-cell library version optimized for high speed, with the same IP configuration and CMOS technology node, clock frequencies up to 1 GHz are met, i.e. up to 64 Gbps data transfer per channel, but with an increased static power of 1 mW. Obviously, different results are achieved by changing the Router configuration: as an example, a basic Router with 3 ports (Ring topology without the Across link), 36-bit size for the flits, 1 VN,


no OQs instantiated, has a circuit complexity lower than 9 Kgates and achieves a clock frequency up to 1 GHz, i.e. at least 36 Gbps data rate, with a static power consumption of roughly 200 µW (the static power is 10 µW if targeting 500 MHz frequency).
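The data rates quoted above are consistent with a channel that transfers one flit per clock cycle, so that throughput is clock frequency times flit width. The short Python check below works under that assumption, which the text does not state explicitly; the function name is hypothetical.

```python
def channel_throughput_gbps(clock_mhz, flit_width_bits):
    """Peak throughput of one channel, assuming one flit per clock cycle."""
    return clock_mhz * flit_width_bits / 1000


print(channel_throughput_gbps(500, 64))    # 32.0, full Router, low-leakage library
print(channel_throughput_gbps(1000, 64))   # 64.0, full Router, high-speed library
print(channel_throughput_gbps(1000, 36))   # 36.0, basic 3-port Router
```

Under the same assumption the 72-bit request path at 500 MHz would give 36 Gbps, which fits the "at least 32 Gbps" wording used for the full configuration.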

Proper component operation should be assessed for every Router configuration. However, only a subset of all possible configurations has actually been used to assemble platforms for synthesis. For these Router configurations a full regression set of test simulations has been carried out to check correct component operation under several different traffic scenarios. The DUT of the simulation was the single Router block. Figure 8.6 shows a sample Router DUT surrounded by the corresponding verification environment; in this example, a 4-port Router with 2 VNs and both US/DS directions on each port is considered.

Each Router port has its own relevant BFM units which drive and monitor traffic (master agents are connected to DS ports and slave agents are connected to US ports). All BFM units are connected with both the monitor unit and the checker unit. The former is in charge of protocol checking and data coverage, the latter implements a scoreboard for checking correct routing and other traffic properties.

At the interface level, the following categories of checks were implemented:

• Routing: (i) each transmitted packet exits one and only one time; (ii) each transmitted packet is output from the correct port (according to header information); (iii) flits within a packet are kept in the correct order and are not interleaved with flits from other packets.

• Credit-based protocol: (i) when a flit is read from the Input Buffer, a credit is sent back by the router; (ii) the valid signal is high on an output port when a significant flit is transmitted.

• Network Layer Header: FBA bit management (the FBA bits of a packet are correctly updated when it exits the router).
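The three routing checks listed above can be sketched as a small scoreboard in Python. This is an illustrative model only: the actual checker belongs to the eVC, and the class name, method names and packet identifiers below are hypothetical. For simplicity the sketch ignores Virtual Networks and applies the interleaving rule per output port.

```python
class RoutingScoreboard:
    """Sketch of checks (i)-(iii): exit once, correct port, ordered and not interleaved."""

    def __init__(self):
        self.expected = {}     # packet id -> (output port, list of flits)
        self.received = {}     # packet id -> number of flits seen so far
        self.open_packet = {}  # output port -> id of the packet currently in flight
        self.errors = []

    def add_expected(self, pkt_id, out_port, flits):
        self.expected[pkt_id] = (out_port, list(flits))
        self.received[pkt_id] = 0

    def observe(self, out_port, pkt_id, flit):
        if pkt_id not in self.expected:
            self.errors.append(f"unexpected packet {pkt_id}")
            return
        port, flits = self.expected[pkt_id]
        index = self.received[pkt_id]
        if out_port != port:
            self.errors.append(f"packet {pkt_id} on port {out_port}, expected {port}")
        if index >= len(flits):
            self.errors.append(f"packet {pkt_id} exited more than once")
            return
        current = self.open_packet.get(out_port)
        if current is not None and current != pkt_id:
            self.errors.append(f"packet {pkt_id} interleaved with {current}")
        if flit != flits[index]:
            self.errors.append(f"packet {pkt_id} flit {index} out of order")
        self.received[pkt_id] = index + 1
        # The port is free again once the last flit of the packet has left.
        self.open_packet[out_port] = None if index + 1 == len(flits) else pkt_id

    def final_check(self):
        for pkt_id, (_, flits) in self.expected.items():
            if self.received[pkt_id] < len(flits):
                self.errors.append(f"packet {pkt_id} never fully exited")
        return self.errors


sb = RoutingScoreboard()
sb.add_expected("p0", "R", ["h", "b", "t"])
sb.add_expected("p1", "R", ["h", "t"])
for port, pkt, flit in [("R", "p0", "h"), ("R", "p1", "h"), ("R", "p0", "b")]:
    sb.observe(port, pkt, flit)
errors = sb.final_check()
print(errors)
```

The sample trace deliberately interleaves two packets on port R and truncates both, so the scoreboard reports two interleaving errors and two incomplete packets.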

The different BFMs may be configured to generate different kinds of traffic scenarios so as to reach the desired functional coverage. For example, by properly configuring a test file, it is possible to generate packets that are sourced from 3 ports and all have the same destination port; this is useful for stressing arbiters and output queues as well as for hitting some corner-case coverage points. The developed environment discovered a number of bugs that could not be found with hand-written HDL testbenches.
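As an illustration of the scenario mentioned above (three source ports, one common destination port), the following Python sketch builds such a traffic list. It is not the test-file format used by the BFMs: the function name, the dictionary fields and the packet length range are assumptions made for this example, and only the port names come from the text.

```python
import random

PORTS = ["R", "L", "A", "NI"]  # Right, Left, Across and the local Network Interface


def hotspot_scenario(dest_port, n_sources, packets_per_source, seed=0):
    """Build a traffic list where several input ports target one output port."""
    rng = random.Random(seed)  # seeded so that a failing run can be replayed
    candidates = [p for p in PORTS if p != dest_port]
    sources = rng.sample(candidates, n_sources)
    traffic = []
    for src in sources:
        for _ in range(packets_per_source):
            traffic.append({
                "src": src,
                "dst": dest_port,
                "length": rng.randint(1, 8),  # packet length in flits, arbitrary range
            })
    rng.shuffle(traffic)  # mix the injection order across sources
    return traffic


scenario = hotspot_scenario(dest_port="A", n_sources=3, packets_per_source=4)
print(len(scenario), sorted({p["src"] for p in scenario}))
```

Fixing the seed keeps the scenario reproducible, which matters when a randomly generated sequence exposes a bug that has to be replayed.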

To achieve full code and functional coverage by exercising some corner cases, a deeper level of checking has been implemented for some Router configurations by means of internal probes (i.e. by monitoring internal DUT signals). Indeed, while some DUT functionalities, such as data integrity from one port to another, can easily be checked and covered without knowledge of timing, the check rules for other functionalities depend on the timing of what is happening on the various ports. For example, the buffer status (empty/full) depends on the rate at which packets are injected into the DUT, as well as on the destination of those packets; likewise, the correct behavior of an arbitration algorithm depends on the timing with which the different packets accessing the same resource are served by the Router.

114 S. Saponara et al.

Therefore, to achieve 100% functional verification, the basic coverage-driven approach should be enhanced either through the design of a software golden model able to predict timing-dependent properties (a very time-consuming approach) or, as has been done in this work, by using internal probes to rapidly obtain detailed information about the operation of internal router blocks and state machines. An internal probe monitors a hardware signal within the router: for example, monitoring the inputs of an arbiter block makes it possible to gather extensive coverage information about arbitration scenarios and about the successful application of a specific arbitration algorithm. The drawback of this approach is that using probes requires deep knowledge of the Router VHDL implementation, and the solution is hard to reuse. Figure 8.7 shows the UML diagram of the probes portion of the Router eVC. Thanks to these additional eVC units, some internal arbitrations and QoS mechanisms have been verified, such as the LRU arbitration of the

Fig. 8.7 UML diagram of the probes portion of the Router eVC (members legible in the extracted figure: get_number_of_ls, get_links_needing_ls, number_of_vns)


link scheduler; furthermore, this allowed for the implementation of additional coverage points related to arbiters and internal buffers.
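A probe-based check of an LRU arbiter can be pictured as follows: the probe records, cycle by cycle, the set of active requests and the grant actually issued, and a small reference model replays the requests to see whether each grant matches. The Python sketch below is a generic LRU model, not the link scheduler's real logic; the names and the initial priority order after reset are assumptions.

```python
class LruArbiter:
    """Reference model: grant the requester that was served least recently."""

    def __init__(self, requesters):
        # Front of the list = least recently used = highest priority.
        # The initial order after reset is an assumption of this sketch.
        self.order = list(requesters)

    def grant(self, requests):
        """Return the winner among the active requests, or None if there are none."""
        for candidate in self.order:
            if candidate in requests:
                # The winner becomes the most recently used requester.
                self.order.remove(candidate)
                self.order.append(candidate)
                return candidate
        return None


def check_probe_trace(trace, requesters):
    """Compare probed (requests, grant) pairs against the reference model."""
    model = LruArbiter(requesters)
    mismatches = []
    for cycle, (requests, observed_grant) in enumerate(trace):
        expected = model.grant(requests)
        if expected != observed_grant:
            mismatches.append((cycle, expected, observed_grant))
    return mismatches


trace = [
    ({"R", "L"}, "R"),       # R is first in the initial order
    ({"R", "L"}, "L"),       # R was just served, so L wins
    ({"R", "L", "A"}, "A"),  # A has never been served
    ({"R", "L"}, "L"),       # wrong: R is now the least recently used of the two
]
mismatches = check_probe_trace(trace, ["R", "L", "A"])
print(mismatches)
```

Each mismatch is reported as (cycle, expected grant, observed grant), so the last entry of the sample trace is flagged at cycle 3.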

Note that many checks must be performed independently for each port, so the eVC architecture needs to be highly modular and configurable. All the checks and coverage points may be selectively enabled to meet various user requirements. For example, if there is no interest in testing queue utilization, simulations can be sped up by disabling the internal probes portion of the eVC (which works with cycle-level accuracy), leaving only the transaction-level architecture. It is also possible to disable a single check; for example, by disabling the routing check the Router eVC can be used as a traffic monitor on a single US or DS bus. To conclude the verification flow, several coverage points have been defined to describe all possible traffic scenarios that can stress the Router; they may be conceptually grouped into the three main categories reported in Table 8.1: traffic flow, queue utilization and arbitration mechanisms. Some coverage points (such as those for queue utilization) are available only as part of the internal probes eVC extension, because some information can only be easily retrieved by accessing the micro-architecture of some blocks.
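The selective enabling of checks and coverage groups can be pictured with a small Python sketch. It only mirrors the idea described above; the switch names, the coverage point names and the collector class are hypothetical and do not reproduce the eVC configuration interface or the contents of Table 8.1.

```python
from dataclasses import dataclass, field


@dataclass
class EnvConfig:
    """Hypothetical switches mirroring the options described in the text."""
    routing_check: bool = True
    credit_check: bool = True
    internal_probes: bool = True  # cycle-accurate part; disable for faster simulation


@dataclass
class CoverageCollector:
    config: EnvConfig
    hits: dict = field(default_factory=dict)

    def sample(self, group, point):
        # Queue utilization is only visible through the internal probes.
        if group == "queue_utilization" and not self.config.internal_probes:
            return
        self.hits.setdefault(group, set()).add(point)

    def report(self, goals):
        """Return the fraction of goal points hit, per coverage group."""
        return {
            group: len(self.hits.get(group, set()) & points) / len(points)
            for group, points in goals.items()
        }


goals = {
    "traffic_flow": {"R->L", "R->A", "L->R"},
    "queue_utilization": {"empty", "full"},
}
cov = CoverageCollector(EnvConfig(internal_probes=False))
cov.sample("traffic_flow", "R->L")
cov.sample("queue_utilization", "full")  # ignored: probes are disabled
report = cov.report(goals)
print(report)
```

With the probes disabled, the queue utilization group stays at zero coverage while the transaction-level traffic flow group is still collected.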

In the document Lecture Notes in Electrical Engineering (pages 115-120)