
6.7 Third Parallel Version: Ordering Sequences to Improve Load Balancing

Having addressed the issue of fitting a large database into a small space (an issue of interest for all database applications), let's tackle the implications of wide variance in sequence length. The first question is "what's the problem?" It comes down to load balance. Suppose we have lots of short sequences and one very long one. Suppose that the last task chosen by the last worker involves the very long one. While every other worker sits idle (all other tasks having been completed), this one unfortunate worker hacks away at his final, extra-long task.

For example: suppose we have a database consisting, in aggregate, of 10⁶ bases, that the longest single sequence is 10⁴ bases and the rest are of roughly equal length. (Say they're 100 bases each.) Now assume that we have deployed 20 comparison workers to search this database. If the long sequence is picked up near the end of the computation, we will have one comparison worker scanning ~60,000 bases (i.e., an extra ten thousand) while the rest scan ~50,000 bases each. Assuming that each worker's compute time is proportional to the total number of database bases it scans, we see that we will achieve a speedup of ~17, significantly less than the ideal 20.

Pop Quiz: Why? Justify the claimed 17-times speedup.

If, on the other hand, the long sequence is picked up early, the natural load balancing properties of our program will result in all workers doing about the same amount of work, and together achieving a speedup close to the ideal.

This last observation suggests an obvious attack on the problem. As part of the work done in preparing the database, order it longest sequence first. Then, instead of a bag of sequences, maintain an ordered list of sequences in a one-source multiple-sink stream. The comparison workers process this stream in order, ensuring that the longest sequences are done first, not last.
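In outline, each worker's end of the stream then looks like this (a sketch only; the complete version appears in Figure 6.5):

while (1) {
    in("index", ? task_id);      /* claim the current position in the stream */
    out("index", task_id + 1);   /* release the index, advanced by one */
    in("task", task_id, ? db_id, ? dbs:d_length);   /* fetch that task */
    if (!d_length) break;        /* poison task: the stream is exhausted */
    /* ... compare the target against this sequence ... */
}

Because the index tuple is removed and immediately reissued, exactly one worker at a time learns the next sequence number, and sequences are claimed in the order the master emitted them.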

Once again, we solve a problem by introducing additional synchronization. This time it isn't between the input and comparison processes (recall there is no interprocess synchronization for the writer of a one-source/many-sink stream), but between the comparison processes themselves, in the form of the tuple used to index the stream. Here, synchronization has been added to address an efficiency concern, which is extremely non-intuitive: synchronization normally costs time rather than saving it. We've balanced the load across the workers, but at the cost of introducing a potential bottleneck. Each worker needs to access the index tuple to acquire a sequence (a task). Suppose that the time required to complete a task is too short for all the other workers to take their turns at the index tuple in the interim. In this case, when the first worker (having finished its task) reaches for the index tuple again, it's likely that at least one other worker will still be waiting to claim it. The more workers in the field, the more time will be spent contending for the index tuple.

For example: suppose that updating the index tuple takes 1 unit of time and every comparison takes the same amount of time, say 100 units. Now if we start 10 workers at once, the times at which they actually start doing comparisons will be staggered: one worker updates the index tuple at time 0 (and begins computing at time 1), some other worker grabs the index tuple at time 1 (and starts computing at time 2), and so on. The last worker gets his shot at the index tuple at time 9 and starts computing at time 10. As of time step 10 all workers are busy, and they will remain so until time step 101, at which point the first worker will have finished its first task. It proceeds to grab the index tuple in order to learn the identity of its next sequence. The process repeats, with the result that all workers (except during a brief start-up and shut-down phase) are kept busy, with only a 1% task-assignment overhead per task.

If we have 200 workers, however, we have a problem. The first round of work will not have been handed out until time 200, but the first worker will be ready for more work at time 101! Since the index tuple can be updated by at most one worker per time unit, it can keep at most 100 workers (the ratio of comparison time to access time) supplied with tasks; on average, half of our 200 workers will be idle at any given time, awaiting access to the index tuple. Note that, under the ideal circumstances assumed here, the performance of the code will improve in a well-behaved way up through 100 workers, then abruptly go flat. (In the less-than-perfect real world, performance may degrade instead of flattening out: heavy contention for the index tuple may increase access time, making things even worse, and the effect will likely kick in sooner than the ratio of average comparison time to access time would predict.) If the user's understanding were based on a purely empirical study of the program's performance, there would be no warning that a performance collapse was imminent.
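To see where the flat spot comes from, here is a small back-of-the-envelope model. It is not part of the database code; it simply encodes the observation that the index tuple serves at most one worker per access time, using the illustrative 100-to-1 ratio from the example above:

#include <stdio.h>

/* Toy model of the index-tuple bottleneck: at most
   task_time/access_time workers can be kept busy. */
double model_speedup(int workers, double task_time, double access_time)
{
    double cap = task_time / access_time;   /* saturation point */
    return workers < cap ? (double) workers : cap;
}

int main(void)
{
    int w;
    for (w = 50; w <= 250; w += 50)
        printf("%3d workers -> speedup of about %.0f\n",
               w, model_speedup(w, 100.0, 1.0));
    return 0;
}

Under these assumptions the modeled speedup climbs linearly up to 100 workers and is flat from then on, which is just the behavior described above.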

This problem needs to be dealt with pragmatically. Actual problems must be solved; problems that are merely "potential" need not be (and should not be, to the extent that a solution complicates the code or increases overhead at runtime). As we discuss in the performance section, for typical genetics database searches using a modest number of workers (under 100), there is no index-tuple bottleneck. For other forms of this problem, in other machine environments, there might be. A solution is obvious: use many index tuples instead of one.

Pop Quiz: Develop a multiple-index program. Workers might be assigned to a sequence stream at initialization time, or they might choose streams dynamically. All streams can be fed by the same master or input process.
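One possible shape for the statically assigned variant, as a sketch only (worker_id and NUM_STREAMS are hypothetical parameters, and the master is assumed to deal sequences into the streams round-robin, longest first):

stream = worker_id % NUM_STREAMS;    /* fixed stream assignment at startup */
while (1) {
    in("index", stream, ? task_id);      /* claim a position in this stream */
    out("index", stream, task_id + 1);
    in("task", stream, task_id, ? db_id, ? dbs:d_length);
    if (!d_length) break;                /* poison task for this stream */
    /* ... comparison as in Figure 6.5 ... */
}

With S streams, contention per index tuple drops by roughly a factor of S, at some cost in load balance across streams.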

We've arrived at a realistic, practical solution to our search problem. Realistic, because we've used a strategy that will enable this code to handle virtually any size of database. Practical, because our program is highly parallel, and we've taken pains to ensure a reasonably good load balance.

We can add one final refinement, however. The actual constraints on update_result() are less severe than we've planned for. We collapsed two distinct jobs, result collation and output, into this one function. In the actual case of interest, we don't need to generate output for every result, just for the best (or n best) results, where n is much smaller than the total number of sequences. Hence we don't need to serialize the invocations of update_result() (or at least, we don't need to serialize them completely). We can migrate the update functionality from update_result() to compare() by having the latter keep track of the best result (or best n results) its worker has seen so far. When a worker's last task is complete, it reports its best results to the master. The master collects these local results and reduces them to one global result. The goal, of course, is to reduce the volume of data exchanged between the workers and the master (and the attendant overhead), and to parallelize the best-result computation.

Figures 6.4 and 6.5 present the final database code. Workers use the index tuple to manage the task stream. The result tuple has now become two tuples, labeled "task done" and "worker done". The first signals task completion to the master, for watermarking purposes; the second holds each worker's local maximum.

Figure 6.4
Database search: Final version (master)

real_main(argc, argv)
int argc;
char *argv[];
{
    t_length = get_target(argv[1], target);
    open_db(argv[2]);
    out("target", target:t_length);
    out("index", 1);                  /* first position in the task stream */

    /* Loop putting sequences into tuples. */
    tasks = 0;
    task_id = 0;
    while (d_length = get_seq(dbe)) {
        out("task", ++task_id, get_db_id(dbe), dbs:d_length);
        if (++tasks > upper_limit)    /* Too many tasks, get some results. */
            do in("task done"); while (--tasks > lower_limit);
    }
    /* Poison tasks. */
    for (i = 0; i < num_workers; ++i) out("task", ++task_id, 0, "":0);
    close_db();
    while (tasks--) in("task done");  /* Clean up. */

    /* Get results: reduce the workers' local maxima to one global maximum. */
    real_max = 0;
    for (i = 0; i < num_workers; ++i) {
        in("worker done", ? db_id, ? max);
        if (real_max < max) {
            real_max = max;
            real_max_id = db_id;
        }
    }
    print_max(real_max_id, real_max);
}

Figure 6.5
Database search: Final version (worker)

char dbe[MAX + HEADER], target[MAX];
char *dbs = dbe + HEADER;

/* Work space for a vertical slice of the similarity matrix. */
ENTRY_TYPE col_0[MAX+2], col_1[MAX+2], *cols[2] = {col_0, col_1};

compare()
{
    SIDE_TYPE left_side, top_side;

    rd("target", ? target:t_length);
    left_side.seg_start = target;
    left_side.seg_end = target + t_length;
    top_side.seg_start = dbs;
    local_max = 0;
    while (1) {
        in("index", ? task_id);       /* claim the next stream position */
        out("index", task_id + 1);    /* pass the index on, advanced by one */
        in("task", task_id, ? db_id, ? dbs:d_length);
        /* If poison task, dump local max and exit. */
        if (!d_length) break;
        /* Zero out column buffers. */
        for (i = 0; i <= t_length+1; ++i) cols[0][i] = cols[1][i] = ZERO_ENTRY;
        top_side.seg_end = dbs + d_length;
        max = 0;
        similarity(&top_side, &left_side, cols, 0, &max);
        out("task done");             /* signal completion for watermarking */
        if (max > local_max) {
            local_max = max;
            local_max_id = db_id;
        }
    }
    out("worker done", local_max_id, local_max);
}