HAL Id: hal-02667533
https://hal.inrae.fr/hal-02667533
Submitted on 31 May 2020
Sequential in-core sorting performance for a SQL data service and for parallel sorting on heterogeneous clusters
Christophe Cérin, Michel Koskas, Hazem Fkaier, Mohamed Jemni
To cite this version:
Christophe Cérin, Michel Koskas, Hazem Fkaier, Mohamed Jemni. Sequential in-core sorting performance for a SQL data service and for parallel sorting on heterogeneous clusters. Future Generation Computer Systems, Elsevier, 2006, 22 (7), pp.776-783. 10.1016/j.future.2006.02.014. hal-02667533
Sequential In-core Sorting Performance for a SQL Data Service and for Parallel Sorting on
Heterogeneous Clusters
Christophe Cérin a , Michel Koskas b , Hazem Fkaier c , Mohamed Jemni c
a Université de Picardie Jules Verne, LaRIA, Bat Curi, 5 rue du moulin neuf, F-80039 Amiens cedex 1 - France
b Université de Picardie Jules Verne, LaMFA/CNRS UMR 6140, 33 rue St Leu, F-80039 Amiens cedex 1 - France
c Ecole Supérieure des Sciences et Techniques de Tunis, Unité de recherche UTIC, 5 Av. Taha Hussein, B.P 56 Bab Menara, 1008 Tunis - Tunisie
Abstract
The aim of this paper is to introduce techniques for tuning sequential in-core sorting algorithms in the framework of two applications. The first application is parallel sorting when the processor speeds in the parallel system are not identical.
The second application is the Zeta-Data Project (Koskas, 2003), whose aim is to develop novel algorithms for database issues. About 50% of the work done in building indexes is devoted to sorting sets of integers. We develop and compare algorithms built to sort with equal keys. The algorithms are variations of Sedgewick's 3way-Quicksort. To observe performance and to optimize the use of the memory system and of the different functional units of the processor, we use the hardware performance counters that are available on most modern microprocessors. We also develop analytical results for one of our algorithms and compare the expected results with the measurements. For the two applications, we show through fine-grained experiments on an Athlon processor (a three-way superscalar x86 processor) that L1 data cache misses are not the central problem; rather, a subtle proportion of independent retired instructions is what should be sought to obtain performance for in-core sorting.
Key words: hardware performance counters, in-core sorting algorithms with equal
keys, two levels memory hierarchy, optimizing memory accesses, parallelism at the
chip level, data structures for databases, parallel sorting.
1 Introduction
The story of sorting is long enough to deserve many books, among them the Knuth series (Knuth, 1998). One reason among others for the popularity of sorting is that sorted data are easier to manipulate than unordered data; for instance, a sequential search is much less costly when the data are sorted.
The nearly unpredictable memory references of sorting algorithms make them good candidates for assessing the performance of processors in real situations. Two additional elements contribute to this never-ending story.
First of all, data are never accessed instantaneously on current processors. Computer architects (Hennessy and Patterson, 2002; Shriver and Smith, 1998) have devised more and more complicated memory designs, called caches, in order to obtain the fastest memory accesses. Cache design implies time penalties, and many parameters must be considered (cache line size, write and read policies, replacement algorithms, etc.).
Second, a majority of the processors available on the market (Athlon and Pentium processors, for instance) are now equipped with special registers that record the activity of the processor at no cost, that is, with no overhead caused by the observation of events. We can therefore observe the behavior of codes finely and analyse their performance without using a simulator. Much work has been done to interface the observable events with codes. Among the current projects, we use Perfctr 1 , which is used by many performance analysis tools, for instance Papi 2 .
Moreover, modern processors include parallelism in the chip. For instance, the Athlon processor is a three-way superscalar x86 processor. The chip has three integer pipelines, three floating point pipelines and three full x86 decoders.
This means that the chip can potentially execute three instructions in parallel.
In this paper, we study how the number of L1 data cache misses and the number of retired instructions influence the “quality” of the parallelism of a modern chip when sorting with equal keys. This problem, studied for instance by Sedgewick 3 (Sedgewick, 1977; Bentley and Sedgewick, 1997), is of crucial importance in the Zeta-Data project. The problem of sequential sorting is also important for the implementation of parallel sorting when processor speeds are not identical (Cérin and Gaudiot, 2000a,b, 2002), because parallel algorithms execute portions of sequential sorting code in parallel.
1 See: http://user.it.uu.se/~mikpe/linux/perfctr/
2 See: http://icl.cs.utk.edu/projects/papi/software/
3 For more information, see also Robert Sedgewick's home page at http://www.cs.princeton.edu/~rs
In this paper we investigate the performance of 7 sorting algorithms. All the algorithms are based on invariants that maintain (in different ways) the keys equal to the pivot in one region of the input array, the keys lower than the pivot in another region and the keys greater than the pivot in yet another region.
One goal is to sort “in place”, i.e. without any supplementary memory. Maintaining the invariant implies memory references that can potentially create excessive cache misses or induce dependent instructions that stall the functional units. We show how to extract temporal and spatial locality in order to better exploit the different functional units (working in parallel) of the Athlon microprocessor. With this tuning, we further improve the sorting (with equal keys) stage of our first implementation by 20% and again beat Sedgewick's 3way-Quicksort implementation.
The paper is organized as follows. In Section 2 we summarize the advantages of working at the register level to collect information about the activity of a processor, as opposed to using simulation techniques. Section 3 presents our motivations and related work on sorting. Section 4 introduces the algorithms selected for our applications. Section 5 presents our experimental results. Section 6 concludes the paper.
2 The opportunities of working at the register level
Monitoring the collected processor events makes it easier to correlate the structure of the source/object code with the efficiency of the mapping, done by the compiler, of that code onto the underlying architecture. This correlation has a variety of uses in performance analysis, including hand tuning, compiler optimization for better resource usage, debugging, benchmarking, monitoring and performance modeling. In addition, it is hoped that this information will prove useful in the discovery of commonly occurring bottlenecks in high performance computing codes.
We address the problem of the effect of the cache on the performance of sorting with equal keys. We consider only the first level of the memory hierarchy: registers and L1 data cache. We prefer monitoring hardware events to using a simulator. Although a (software) simulator has the advantage of requiring no specific hardware, we believe that very fine architectural details are very difficult to simulate with good precision.
For instance, modern processors are based on pipelined functional units, multiple functional units, speculative execution and several levels of cache memory, and sometimes cache lines are shared between CPUs. Factors such as variations in process scheduling and the operating system's virtual-to-physical page mapping policy add to the difficulty of analyzing the cache misses output by a simulator.
To gain access to the hardware events on Linux/x86 platforms, we use Perfctr.
With Perfctr, each Linux process has its own set of “virtual” counters: the counters appear to be private to a process and unrelated to the activities of other processes in the system. The virtual counters can be sampled in user space without the overhead of a system call.
3 Motivations
Much work on sequential sorting has been done in the past to analyse the performance of RISC processors (Agarwal, 1996), (Nyberg et al., 1994), (Larriba-Pey et al., 1997) or to study processors with a small number of registers or with small caches (Arge et al., 2001), (Ranade et al., 2000).
In (Rahman and Raman, 2000), N. Rahman and R. Raman studied radix sort and more precisely the importance of reducing misses in the translation lookaside buffer (TLB). No experimental measurements of the misses are reported.
We do not use radix sort here because LaMarca and Ladner (LaMarca and Ladner, 1999) showed that radix sort does not perform well compared to Quicksort-based sorting algorithms. They showed that, despite having the lowest instruction count, radix sort has poor cache performance; hence its overall performance is worse than that of the memory-optimized versions of mergesort and quicksort.
Moreover, in our Zeta-Data application we sort pairs of values (cell value, line index), and radix-sort-based algorithms are in this case more complicated to implement efficiently.
3.1 Sorting and cache effects
One of the most valuable papers about the influence of the cache and the impact of the memory hierarchy on sorting is the paper of LaMarca and Ladner (LaMarca and Ladner, 1999). However, the experiments are done with ATOM (Srivastava and Eustace, 1994), which is a simulator built at the beginning of the nineties.
The paper of LaMarca and Ladner 4 (LaMarca and Ladner, 1999) explores the performance of four popular sorting algorithms (mergesort, quicksort, heapsort and radix sort) on a DEC Alphastation 250 and with trace-driven simulation using ATOM. For each of the four sorting algorithms they choose “an implementation variant with potential for good overall performance and then heavily optimize this variant using traditional techniques to minimize the number of instructions executed”. They concentrate on three performance measures: instruction count, cache misses and overall performance (time).
4 A prior technical report dates from 1994.
The two best competitors, regarding execution time, are mergesort and quicksort. The main general lesson of the paper is that “Improving an algorithm’s overall performance may require increasing the number of instructions executed while, at the same time, reducing the number of cache misses”. In this paper we confirm only the first part of this sentence. We will show later that what matters (today, with the current technology) is the number of instructions executed per processor cycle.
The authors apply memory optimizations in order to improve cache performance. The optimizations are based on the principles of temporal locality and spatial locality. A program exhibits temporal locality if there is a good chance that an accessed data item will be accessed again in the near future. A program exhibits spatial locality if there is a good chance that subsequently accessed data items are located near each other in memory.
Database applications generate duplicates, and Quicksort-based algorithms cannot provide performance in this case. It is well known for instance that Quicksort, even optimized for caches, has a complexity of O(n^2) in the case of duplicates, i.e. in the case of an input that is “already” sorted.
We absolutely need new algorithms to obtain performance in situations that are part of database benchmarks such as the TPC benchmark. For instance, while building or modifying the table of indexes, one has to sort the fields of each column of the tables. A table with 5 lines (for instance the “region” table of the TPC) may be expanded into a table with 6 million lines (for instance the “lineitem” table of the TPC). One then has to sort the fields of each column of the table “region” expanded into the table “lineitem”, that is, to sort arrays of 6 million items with only 5 different values.
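To make the effect of duplicates concrete, the following minimal Python sketch counts key comparisons on an array holding only 5 distinct values, first for a plain two-way quicksort and then for a three-way (Dutch national flag) quicksort. The function names, the Lomuto partitioning scheme of the two-way version and the input size are illustrative assumptions; these are not the implementations measured in this paper.

```python
import random


def two_way_quicksort(a):
    """Sort a in place with a plain Lomuto quicksort; return the comparison count."""
    comparisons = 0
    stack = [(0, len(a) - 1)]           # explicit stack: avoids deep recursion
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot, i = a[hi], lo
        for j in range(lo, hi):
            comparisons += 1
            if a[j] <= pivot:           # keys equal to the pivot all go left
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return comparisons


def three_way_quicksort(a):
    """Sort a in place with a three-way partition; return the comparison count."""
    comparisons = 0
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = a[lo]
        lt, i, gt = lo, lo + 1, hi      # a[lo:lt] < pivot, a[lt:i] == pivot, a[gt+1:hi+1] > pivot
        while i <= gt:
            comparisons += 1
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            else:
                comparisons += 1
                if a[i] > pivot:
                    a[i], a[gt] = a[gt], a[i]
                    gt -= 1
                else:
                    i += 1              # equal key: stays in the middle, never sorted again
        stack.append((lo, lt - 1))
        stack.append((gt + 1, hi))
    return comparisons


rng = random.Random(0)                  # fixed seed, hypothetical test input
data = [rng.randrange(5) for _ in range(2000)]
plain, flag = data[:], data[:]
print("two-way comparisons:  ", two_way_quicksort(plain))
print("three-way comparisons:", three_way_quicksort(flag))
```

In the three-way version every partitioning pass removes all copies of one key from further consideration, so with 5 distinct values at most 5 levels of partitioning are needed, whereas the two-way version keeps re-partitioning runs of equal keys.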
4 Some algorithms for in-core parallel sorting
Following the results of LaMarca and Ladner (LaMarca and Ladner, 1999), we keep the Quicksort and Mergesort algorithms because they produced the best performance in their simulations. We therefore restrict our study to comparison-based sorting algorithms.
Fastsort (Nilsson and Raman, 1995; Nilsson, 2000) is an O(n log log n) sort that uses properties of the input (integers) to compress it in order to reduce the number of compare instructions to execute. No experimental feedback is known for Fastsort. Due to its potential, we also keep it.
We choose FAME (Ranade et al., 2000), which was designed for sorting on CPUs with a small number of registers. FAME is an m-way mergesort; thus, in essence, it may also reduce cache misses, since reducing register movements “implies” reducing data movement and L1 misses. The m streams are merged by organizing a comparison tournament among the elements at the heads of the streams. The control structure is a finite state machine. Only sorting time results are known for FAME; cache results are not reported in (Ranade et al., 2000). The implementation results in (Ranade et al., 2000) are partly obtained with m = 4, so that FAME makes log_4 n passes over memory. We keep the same setting in this paper.
4.1 ZZZmerge and Zmerge
We now introduce our new cache-conscious algorithms ZZZmerge and Zmerge.
ZZZmerge is simply an optimized version of Zmerge with fewer copies, so we only introduce the principles of ZZZmerge. ZZZmerge is a z-way mergesort with two steps. A virtual binary tree is built. We first sort all the leaves by packets of size z ∗ z, using insertion sort. Then a merge step occurs: at a given level in the tree, we merge, two by two, all the values referenced by pointers below the nodes at that level. As with any mergesort algorithm, the merging step requires a supplementary buffer of size n, but we do not make any copies: during the tree traversal we work alternately on the “original” buffer and on the “supplementary” buffer. A swap of pointers implements the technique. This trick significantly reduces the cache misses. All of our experiments are done with z = 4 to facilitate the comparison with FAME.
We developed a single loop that performs n/(z ∗ z) − 1 iterations, or merge steps. A tedious calculation is required to compute the bounds of the portions to be merged. The last iteration merges the portion [0 · · · n/2 − 1] with the portion [n/2 · · · n − 1]; the two previous ones merge [0 · · · n/4 − 1] with [n/4 · · · n/2 − 1] and [n/2 · · · n/2 + n/4 − 1] with [n/2 + n/4 · · · n − 1], and so on. The first n/(2 ∗ z ∗ z) iterations merge the leaves, the following n/(4 ∗ z ∗ z) iterations merge the nodes at the first level of the virtual tree, and so on. Note that the input size must be a power of 2 and a multiple of z ∗ z. This is a limitation of our code at present.
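The two steps described above (insertion sort on leaves of size z ∗ z, then pairwise merges that alternate between two buffers) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: it uses one loop per tree level rather than the single loop with precomputed bounds, and the names `insertion_sort`, `merge` and `zzz_merge_sketch` are hypothetical.

```python
def insertion_sort(a, lo, hi):
    """Sort the slice a[lo:hi] in place."""
    for i in range(lo + 1, hi):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def merge(src, dst, lo, mid, hi):
    """Merge the sorted runs src[lo:mid] and src[mid:hi] into dst[lo:hi]."""
    i, j, k = lo, mid, lo
    while i < mid and j < hi:
        if src[i] <= src[j]:
            dst[k] = src[i]
            i += 1
        else:
            dst[k] = src[j]
            j += 1
        k += 1
    dst[k:hi] = src[i:mid] + src[j:hi]   # one of the two tails is empty


def zzz_merge_sketch(a, z=4):
    """Return (sorted copy of a, number of merge steps performed)."""
    n, leaf = len(a), z * z
    if n == 0 or n & (n - 1) or n % leaf:
        raise ValueError("length must be a power of 2 and a multiple of z*z")
    src, dst = list(a), [0] * n
    for lo in range(0, n, leaf):         # step 1: sort the leaves
        insertion_sort(src, lo, lo + leaf)
    merges, width = 0, leaf
    while width < n:                     # step 2: merge level by level
        for lo in range(0, n, 2 * width):
            merge(src, dst, lo, lo + width, lo + 2 * width)
            merges += 1
        src, dst = dst, src              # swap the buffer roles, no copy back
        width *= 2
    return src, merges


result, steps = zzz_merge_sketch([(37 * i) % 64 for i in range(64)])
print(result == list(range(64)), steps)
```

For an input of 64 keys and z = 4 there are 4 leaves, so the sketch performs 2 + 1 = 3 merge steps, which matches the count n/(z ∗ z) − 1 given above.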
Theorem 1 The number of cache misses of ZZZmerge is $n/(z * z) + \frac{n}{z * z} \log\big(n/(z * z)\big)$.
Proof: We decompose the proof into two parts corresponding to the two steps of the algorithm. We assume that z ∗ z integers fit in a cache line and that the cache size M is a multiple of z ∗ z, i.e. M = k ∗ z ∗ z, k ≥ 1.
Moreover, a cache miss brings z ∗ z integers into the cache. Since we have n/(z ∗ z) insertion sorts to perform in the first step, each on vectors of size z ∗ z, we get at most n/(z ∗ z) misses, which is optimal.
For the second step now. Let C(p) be the number of cache misses for merging two sorted sub-lists, each of size p. The cache complexity of this step is given by:
$$\sum_{i=1}^{\log (n/(z * z))} 2^{i-1} \times C(n/2^{i})$$
Clearly, $C(n/2^{i}) = 2 * (n/2^{i})/(z * z) = n/(2^{i-1} * z * z)$, thus our sum reduces to:
$$\sum_{i=1}^{\log (n/(z * z))} \frac{n}{z * z} = \frac{n}{z * z} \sum_{i=1}^{\log (n/(z * z))} 1 = \frac{n}{z * z} \log\big(n/(z * z)\big) \qquad (1)$$
The total number of cache misses, $n/(z * z) + \frac{n}{z * z} \log\big(n/(z * z)\big)$, follows.
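As a numerical sanity check of Theorem 1, the short Python sketch below evaluates the sum over merge levels term by term and compares it with the closed form. It assumes a base-2 logarithm (the virtual tree is binary) and an input size that is a power of 2 and a multiple of z ∗ z; the function names are hypothetical.

```python
def merge_step_misses(n, z):
    """Sum 2^(i-1) * C(n/2^i), with C(p) = 2p/(z*z), for i = 1 .. log2(n/(z*z))."""
    levels = (n // (z * z)).bit_length() - 1     # log2(n/(z*z)) when it is a power of 2
    return sum(2 ** (i - 1) * (2 * (n // 2 ** i) // (z * z))
               for i in range(1, levels + 1))


def total_misses(n, z):
    """Closed form of Theorem 1 (base-2 logarithm assumed)."""
    blocks = n // (z * z)
    return blocks + blocks * (blocks.bit_length() - 1)


n, z = 2 ** 20, 4
first_step = n // (z * z)                        # one miss per insertion-sorted leaf
assert first_step + merge_step_misses(n, z) == total_misses(n, z)
print(total_misses(n, z))
```

Each of the log(n/(z ∗ z)) levels contributes the same n/(z ∗ z) misses, which is why the sum collapses to the product in equation (1).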
4.2 Selection of algorithms for the Zeta-Data project
Recall that our goal is to efficiently sort integers with many duplicates.
Our different sorting strategies consider 4-byte integers. All the tested algorithms are variations of the 3way-Quicksort algorithm and are linked to the “Dutch national flag problem”. We maintain here the same invariant as the one used for the Dutch national flag problem, in the sense that we maintain, during the execution, the following condition for the partitioning step:
< = >
The algorithm for the Dutch national flag gathers, in the middle of the array, the elements equal to a partitioning element; on the left, the elements that are lower than the partitioning element; on the right, the elements that are greater than the partitioning element. After the partitioning step, it remains to sort the left and right portions recursively. Our algorithms differ in the way they produce the invariant above.
4.3 The different ways we maintain the invariant
Seven algorithms are considered: TriEntiers, TriEntiers1, TriEntiers2, 3way-Quicksort, TriEntiers4, TriEntiers5 and TriEntiers6. 3way-Quicksort was developed by Sedgewick, all the others by us. All the algorithms are Quicksort-based and they differ in the partitioning step.
The different strategies used to maintain the above invariant are presented in Figure 1. The picture describes the “intermediate invariant” that we use before rearranging the data in order to produce the Dutch national flag invariant.
In practice, our algorithms first check whether the portion under consideration is already sorted (forward or backward); if not, we select a pivot (median element) in order to partition the input.
= < > =
= < > =
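One classical way to use an intermediate invariant of the form “= < > =” is to park the keys equal to the pivot at both ends of the array during the scan, and to swap them into the middle afterwards. The Python sketch below illustrates that idea together with the preliminary sortedness check and a median pivot. It is an assumption-laden illustration, not one of the TriEntiers variants: the median-of-three pivot rule and the names `partition_equal_ends` and `sort_equal_keys` are hypothetical.

```python
def partition_equal_ends(a, lo, hi, v):
    """Partition a[lo..hi] around v; return (first index of '=', first index of '>')."""
    p, i = lo, lo                 # a[lo:p] == v, a[p:i] < v
    q, j = hi, hi                 # a[j+1:q+1] > v, a[q+1:hi+1] == v
    while i <= j:
        if a[i] < v:
            i += 1
        elif a[i] == v:           # park an equal key at the left end
            a[p], a[i] = a[i], a[p]
            p += 1
            i += 1
        elif a[j] > v:
            j -= 1
        elif a[j] == v:           # park an equal key at the right end
            a[j], a[q] = a[q], a[j]
            q -= 1
            j -= 1
        else:                     # a[i] > v and a[j] < v: exchange them
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    # Intermediate invariant reached:  = < > =
    less, greater = i - p, q - i + 1
    for k in range(min(p - lo, less)):       # move left equal keys to the middle
        a[lo + k], a[i - 1 - k] = a[i - 1 - k], a[lo + k]
    for k in range(min(hi - q, greater)):    # move right equal keys to the middle
        a[hi - k], a[i + k] = a[i + k], a[hi - k]
    return lo + less, hi - greater + 1       # final invariant:  < = >


def sort_equal_keys(a, lo=0, hi=None):
    """Sort a[lo..hi] in place with the partition above."""
    if hi is None:
        hi = len(a) - 1
    if hi <= lo:
        return
    if all(a[k] <= a[k + 1] for k in range(lo, hi)):
        return                                # already sorted forward
    if all(a[k] >= a[k + 1] for k in range(lo, hi)):
        a[lo:hi + 1] = a[lo:hi + 1][::-1]     # sorted backward: reverse
        return
    v = sorted((a[lo], a[(lo + hi) // 2], a[hi]))[1]   # median of three (assumed rule)
    eq, gt = partition_equal_ends(a, lo, hi, v)
    sort_equal_keys(a, lo, eq - 1)
    sort_equal_keys(a, gt, hi)


keys = [3, 1, 3, 5, 3, 0, 7, 3]
sort_equal_keys(keys)
print(keys)
```

Because the pivot is always a key of the array, the middle region is never empty and both recursive calls work on strictly smaller portions.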