Performance Numbers - Copyright © 2002 by Microsoft Corporation

This section covers performance numbers for the different servers provided in Chapters 5 and 6. The servers tested use blocking sockets, non-blocking sockets with select, WSAAsyncSelect, WSAEventSelect, overlapped I/O with events, and overlapped I/O with completion ports. Table 6-3 summarizes the results of these tests. For each I/O model, there are two or three entries. The first entry is where 7000 connections were attempted from three clients. In all of these tests, the server is an echo server: for each connection that is accepted, data is received and sent back to the client.

The first entry for each I/O model represents a high-throughput server where the clients send data to the server as fast as possible. None of the sample servers limits the number of concurrent connections. The remaining entries represent connections in which the clients limit the rate at which they send data so as not to overrun the bandwidth available on the network. The second entry for each I/O model represents 12,000 connections from clients that rate limit the data they send. If the server was able to handle the majority of the 12,000 connections, the third entry is the maximum number of clients the server was able to handle.

As we mentioned, the servers used are those provided in Chapter 5, except for the I/O completion port server, which is a slightly modified version of the Chapter 5 completion port server that limits the number of outstanding operations. This completion port server limits the number of outstanding send operations to 200 and posts just a single receive on each client connection. The client used in this test is the I/O completion port client from Chapter 5. Connections were established in blocks of 1000 clients by specifying the '-c 1000' option on the client. The two x86-based clients initiated a maximum of 12,000 connections, and the Itanium system was used to establish the remaining clients in blocks of 4000. In the rate-limited tests, each client block was limited to 200,000 bytes per second (using the '-r 200000' switch). The limit applies to the average send throughput of the entire block of clients, not to each client individually.
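To make the per-block limit concrete, the sketch below computes the average send rate left to each connection when a block of clients shares one cap. The function name `avg_rate_per_client` is hypothetical and is not part of the Chapter 5 client; only the 200,000 bytes per second cap and the 1000-client block size come from the text.

```c
#include <assert.h>

/* Average send rate per connection when a whole block of clients
 * shares a single rate cap (hypothetical helper, for illustration). */
unsigned long avg_rate_per_client(unsigned long block_bytes_per_sec,
                                  unsigned long clients_in_block)
{
    assert(clients_in_block > 0);   /* a block holds at least one client */
    return block_bytes_per_sec / clients_in_block;
}
```

For a block of 1000 clients capped at 200,000 bytes per second, this works out to an average of 200 bytes per second per connection.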

Table 6-3 I/O Method Performance Comparison

I/O Model                     Attempted/Connected  Memory Used (KB)  Non-Paged Pool  CPU Usage  Threads  Throughput (Send/Receive, Bytes per Second)
Blocking                      7000/1008            25,632            36,121          10–60%     2016     2,198,148/2,198,148
Blocking                      12,000/1008          25,408            36,352          5–40%      2016     404,227/402,227
Non-blocking                  7000/4011            4208              135,123         95–100%*   1        0/0
Non-blocking                  12,000/5779          5224              156,260         95–100%*   1        0/0
WSAAsyncSelect                7000/1956            3640              38,246          75–85%     3        1,610,204/1,637,819
WSAAsyncSelect                12,000/4077          4884              42,992          90–100%    3        652,902/652,902
WSAEventSelect                7000/6999            10,502            36,402          65–85%     113      4,921,350/5,186,297
WSAEventSelect                12,000/11,080        19,214            39,040          50–60%     192      3,217,493/3,217,493
WSAEventSelect                46,000/45,933        37,392            121,624         80–90%     791      3,851,059/3,851,059
Overlapped (events)           7000/5558            21,844            34,944          65–85%     66       5,024,723/4,095,644
Overlapped (events)           12,000/12,000        60,576            48,060          35–45%     195      1,803,878/1,803,878
Overlapped (events)           49,000/48,997        241,208           155,480         85–95%     792      3,865,152/3,834,511
Overlapped (completion port)  7000/7000            36,160            31,128          40–50%     2        6,282,473/3,893,507
Overlapped (completion port)  12,000/12,000        59,256            38,862          40–50%     2        5,027,914/5,027,095
Overlapped (completion port)  50,000/49,997        242,272           148,192         55–65%     2        4,326,946/4,326,496

The server was a Pentium 4 1.7 GHz Xeon with 768 MB memory. Clients were established from three machines: a Pentium II 233 MHz with 128 MB memory, a Pentium II 350 MHz with 128 MB memory, and an Itanium 733 MHz with 1 GB memory. The test network was a 100 Mbps isolated hub. All of the machines tested had Windows XP installed.

The blocking model is the poorest performing of all the models. The blocking server spawns two threads for each client connection: one for sending data and one for receiving it. In both test cases, the server was able to handle only a fraction of the connections because it hit a system resource limit on creating threads: the CreateThread call was failing with ERROR_NOT_ENOUGH_MEMORY. The remaining client connections failed with WSAECONNREFUSED.
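The thread count reported for the blocking server follows directly from the two-threads-per-client design, and a rough calculation suggests why thread creation stops near that point. The sketch below is only one plausible reading, not something the text states: it assumes the default 1 MB stack reservation per thread and the 2 GB user-mode address space of 32-bit Windows. All function and constant names are hypothetical.

```c
#include <assert.h>

#define THREADS_PER_CLIENT    2ULL          /* from the text: one sender, one receiver */
#define ASSUMED_STACK_RESERVE (1ULL << 20)  /* assumption: default 1 MB stack reservation */
#define ASSUMED_USER_SPACE    (2ULL << 30)  /* assumption: 2 GB user address space */

/* Threads the blocking server needs for a given number of clients. */
unsigned long long threads_for_clients(unsigned long long clients)
{
    return clients * THREADS_PER_CLIENT;
}

/* Address space reserved for thread stacks alone. */
unsigned long long stack_reserve_bytes(unsigned long long threads)
{
    return threads * ASSUMED_STACK_RESERVE;
}

/* Nonzero if the stack reservations fit in the assumed address space. */
int fits_in_user_space(unsigned long long clients)
{
    return stack_reserve_bytes(threads_for_clients(clients)) <= ASSUMED_USER_SPACE;
}
```

With 1008 accepted clients the server runs 2016 threads, whose stack reservations alone come to 2016 MB under these assumptions, just below the assumed 2048 MB of user address space. The rest of that space is needed for code, heaps, and other allocations.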

The non-blocking model fared only somewhat better. It was able to accept more connections but ran into a CPU limitation. The non-blocking server puts all the connected sockets into an FD_SET, which is passed into select. When select completes, the server uses the FD_ISSET macro on each socket to determine whether that socket is signaled. This becomes inefficient as the number of connections increases. Just to determine whether a single socket is signaled, a linear search through the array is required! To partially alleviate this problem, the server can be redesigned so that it iteratively steps through the FD_SETs returned from select. The only issue is that the server then needs to be able to quickly find the SOCKET_INFO structure associated with that socket handle. In this case, the server can provide a more sophisticated cataloging mechanism, such as a hash tree, which allows quicker lookups. Also note that the non-paged pool usage is extremely high. This is because both AFD and TCP are buffering data on the client connections: the server is unable to read the data fast enough (as indicated by the zero-byte throughput) because its CPU is saturated (as indicated by the high CPU usage).
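The redesign described above, stepping through the signaled handles and looking up each connection's state, might be sketched as follows. This is an illustration under stated assumptions, not the Chapter 5 server: it uses a simple chained hash table rather than a hash tree, `sock_t` stands in for Winsock's SOCKET, `ready_set` imitates the count-plus-array layout of Winsock's fd_set, and `socket_info` is a pared-down stand-in for SOCKET_INFO. All names and the bucket count are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKETS 1024u                 /* hypothetical size; must be a power of two */

typedef uintptr_t sock_t;             /* stand-in for Winsock's SOCKET */

typedef struct socket_info {          /* stand-in for the server's SOCKET_INFO */
    sock_t s;
    unsigned long events_seen;
    struct socket_info *next;         /* next entry in the same bucket */
} socket_info;

typedef struct { socket_info *bucket[BUCKETS]; } socket_table;

/* Stand-in for the layout of Winsock's fd_set: a count and an array of handles. */
typedef struct { unsigned count; sock_t array[64]; } ready_set;

static size_t bucket_of(sock_t s)
{
    /* Assumption: handle values are often multiples of 4, so drop the low bits. */
    return (size_t)((s >> 2) & (BUCKETS - 1));
}

void table_insert(socket_table *t, socket_info *si)
{
    size_t b = bucket_of(si->s);
    si->next = t->bucket[b];
    t->bucket[b] = si;
}

socket_info *table_find(const socket_table *t, sock_t s)
{
    socket_info *si = t->bucket[bucket_of(s)];
    while (si != NULL && si->s != s)
        si = si->next;
    return si;
}

/* Visit only the signaled handles; each lookup costs one short bucket scan
 * instead of one FD_ISSET search per connected socket. */
unsigned service_ready(const socket_table *t, const ready_set *r)
{
    unsigned handled = 0;
    for (unsigned i = 0; i < r->count; i++) {
        socket_info *si = table_find(t, r->array[i]);
        if (si != NULL) {
            si->events_seen++;        /* placeholder for the real recv/send work */
            handled++;
        }
    }
    return handled;
}
```

The work per select call then scales with the number of signaled sockets rather than with the total number of connections.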

The WSAAsyncSelect model is acceptable for a small number of clients but does not scale well because the overhead of the message loop quickly bogs down its ability to process messages fast enough. In both tests, the server is able to handle only about a third of the connections made. The clients receive many WSAECONNREFUSED errors, indicating that the server cannot handle the FD_ACCEPT messages quickly enough, so the listen backlog is exhausted. Even for the connections that are accepted, you will notice that the average throughput is rather low (even in the case of the rate-limited clients).

Surprisingly, the WSAEventSelect model performed very well. In all the tests, the server was, for the most part, able to handle all the incoming connections while obtaining very good data throughput. The drawback to this model is the overhead required to manage the thread pool for new connections. Because each thread can wait on only 64 events, new threads have to be created to handle new connections as they are established. Also, in the last test case, in which more than 45,000 connections were established, the machine became very sluggish. This was most likely due to the great number of threads created to service the many connections. The overhead of switching between the 791 threads becomes significant. The server reached a point at which it was unable to accept any more connections due to numerous WSAENOBUFS errors. In addition, the client application reached its limitation and was unable to sustain the already established connections (we'll discuss this in detail later).
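The 64-event limit puts a floor on the size of the thread pool. The sketch below computes that floor; the function name `min_wait_threads` is hypothetical, and the calculation assumes one event per connection with every thread filled to the limit, which the sample server may not do.

```c
#include <assert.h>

#define MAX_WAIT_EVENTS 64UL   /* per-thread wait limit (WSA_MAXIMUM_WAIT_EVENTS) */

/* Fewest threads that can wait on one event per connection
 * (hypothetical helper; rounds up to whole threads). */
unsigned long min_wait_threads(unsigned long connections,
                               unsigned long events_per_thread)
{
    assert(events_per_thread > 0);
    return (connections + events_per_thread - 1) / events_per_thread;
}
```

For the 45,933 connections in the largest WSAEventSelect test, the floor is 718 threads, while the server actually ran 791. The text does not explain the difference; the threads may simply not have been filled to the full 64 events each.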

The overlapped I/O with events model is similar to the WSAEventSelect model in terms of scalability. Both models rely on thread pools for event notification, and both reach a limit at which the thread-switching overhead becomes a factor in how well they handle client communication. The performance numbers for this model almost exactly mirror those of WSAEventSelect. It does surprisingly well until the number of threads increases.

The last entry is for overlapped I/O with completion ports, which is the best performing of all the I/O models. The memory usage (both user and non-paged pool) and the number of accepted clients are similar to those of the overlapped I/O with events and WSAEventSelect models. However, the real difference is in CPU usage. The completion port model used only around 60 percent of the CPU, whereas the other two models required substantially more CPU to maintain the same number of connections. Another significant difference is that the completion port model also allowed for slightly better throughput.

While carrying out these tests, it became apparent that there was a limitation introduced by the nature of the data interaction between client and server. The server is designed to be an echo server, so all data received from the client is sent back. Also, each client continually sends data (even if at a lower rate) to the server. This results in data always pending on the server's socket (either in the TCP buffers or in AFD's per-socket buffers, which are all non-paged pool). For the three well-performing models, only a single receive is performed at a time; however, this means that for the majority of the time, there is still data pending. It is possible to modify the server to perform a non-blocking receive once data is indicated on the connection. This would drain the data buffered on the machine. The drawback to this approach in this instance is that the client is constantly sending, so the non-blocking receive could return a great deal of data, which would lead to starvation of other connections (because the thread or completion thread would not be able to handle other events or completion notices). Typically, calling a non-blocking receive until it fails with WSAEWOULDBLOCK works on connections where data is transmitted in intervals and not in a continuous manner.
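The drain loop and its starvation risk can be illustrated with a small simulation. The sketch below is not Winsock code: `recv_fn` abstracts the non-blocking receive, `WOULD_BLOCK` stands in for recv failing with WSAEWOULDBLOCK, and `fake_conn` simulates a socket with buffered data. The byte budget is an illustrative addition that the text does not propose; it only shows how long a drain could otherwise run on a connection that never goes quiet.

```c
#include <assert.h>
#include <stddef.h>

#define WOULD_BLOCK (-1)   /* stand-in for recv failing with WSAEWOULDBLOCK */

typedef int (*recv_fn)(void *conn, char *buf, int len);

/* Simulated socket holding some number of buffered bytes. */
typedef struct { long pending; } fake_conn;

int fake_recv(void *conn, char *buf, int len)
{
    fake_conn *c = (fake_conn *)conn;
    (void)buf;                               /* the simulation copies no data */
    if (c->pending == 0)
        return WOULD_BLOCK;
    int n = c->pending < len ? (int)c->pending : len;
    c->pending -= n;
    return n;
}

/* Receive until the connection would block or the byte budget is spent.
 * Returns the bytes read; *budget_hit is set when the loop stopped on the
 * budget, meaning data may remain and the connection should be revisited. */
long drain_receive(recv_fn rx, void *conn, char *buf, int buflen,
                   long budget, int *budget_hit)
{
    long total = 0;
    assert(buflen > 0 && budget > 0);
    *budget_hit = 0;
    while (total < budget) {
        int n = rx(conn, buf, buflen);
        if (n <= 0)                          /* would block, closed, or error */
            return total;
        total += n;
    }
    *budget_hit = 1;
    return total;
}
```

A connection that sends in intervals empties quickly and the loop ends on the would-block result. A connection that sends continuously keeps the loop busy until the budget stops it, which is the starvation case described above.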

From these performance numbers it is easily deduced that WSAEventSelect and overlapped I/O offer the best performance. For the two event-based models, setting up a thread pool for handling event notification is cumbersome but still allows for excellent performance for a moderately stressed server. As the number of connections, and with it the number of threads, increases, scalability becomes an issue because more CPU is consumed for context switching between threads. The completion port model still offers the ultimate scalability because CPU usage is less of a factor as the number of clients increases.