Experiments to Eliminate the Data Copy from User to Kernel Space

As observed earlier, data movement operations add significant overhead on the end system. One method to reduce the cost of data movement for a send operation, prototyped on an experimental version of the OSF/1 operating system, is to replace the data copy from user space to the kernel socket buffer by a new virtual memory page remap function. Instead of copying the data from physical pages in the user map to physical pages in the kernel map, the physical pages associated with the user virtual address range in the user map are remapped to kernel virtual addresses. The pages associated with the new kernel virtual addresses are then masqueraded through the network as mbufs.

Preliminary results indicate that a virtual memory mapping technique can be used on the OSF/1 operating system to significantly reduce the overhead associated with the transmission of messages.

The underlying design of the remap operation affects application semantics and performance. The semantics of the application are affected by which underlying page remap operation is selected. Performance may also be affected by the implementation of the page map operation and how well certain TCP/IP configuration variables are tuned to match the processor architecture and the network adapter capabilities.

[...] user ends up with demand-zero pages. On the other hand, in the page borrow operation, the same physical pages are given back to the user. If the user accesses a page that the kernel is still using, the user process either "sleeps," waiting for that page to become available, or (depending upon the implementation) receives a copy of the page. For the page borrow operation, the user buffer size must be greater than the socket buffer size, and the user buffer must be referenced in a round-robin fashion to ensure that the application does not sleep or receive copies of the page.
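The round-robin buffer discipline required by the page borrow operation can be illustrated with a short Python sketch. It is not the prototype's interface: the class name `RoundRobinSendBuffer`, the method `next_slot`, and the choice of one slot more than fits in the socket buffer are all assumptions made for illustration, and on an ordinary system a send from such a slot is still a plain copy.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE        # page size of the machine running this sketch
SOCKET_BUFFER_SIZE = 32 * 1024   # assumed: a 32K-byte socket buffer


class RoundRobinSendBuffer:
    """Hypothetical helper: page-aligned slots of one buffer, used cyclically."""

    def __init__(self, slot_size, socket_buffer_size=SOCKET_BUFFER_SIZE):
        if slot_size <= 0 or slot_size % PAGE_SIZE != 0:
            raise ValueError("slot size must be a whole number of pages")
        self.slot_size = slot_size
        # One slot more than fits in the socket buffer, so the user buffer
        # is strictly larger than the socket buffer.
        self.slots = socket_buffer_size // slot_size + 1
        # An anonymous mapping is page-aligned, unlike a plain bytearray.
        self.buffer = mmap.mmap(-1, self.slots * slot_size)
        self._next = 0

    def next_slot(self):
        """Return (offset, writable view) of the next slot in cyclic order."""
        offset = self._next * self.slot_size
        self._next = (self._next + 1) % self.slots
        view = memoryview(self.buffer)[offset:offset + self.slot_size]
        return offset, view


if __name__ == "__main__":
    ring = RoundRobinSendBuffer(slot_size=PAGE_SIZE)
    for message in (b"first", b"second", b"third"):
        offset, view = ring.next_slot()
        view[:len(message)] = message   # fill the slot, then pass it to send()
        print(offset, bytes(view[:len(message)]))
```

Because a slot is reused only after every other slot has been handed out, the application is unlikely to touch a page while the kernel still holds it, provided the kernel releases pages in the order in which they were sent.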

Both the page steal and the page borrow operations change the semantics of the send() system calls, and some knowledge of these new semantics needs to be reflected in the application. The application's buffer allocation and usage depend on how the underlying remap operation is implemented. An important consideration is the impact on the application programming interface. In particular, the extent to which the semantics of the send system calls (e.g., alignment requirements for the user message buffer) need to change to support the remap operations is an area that is currently under study.

The page remap feature has not yet been incorporated in the DEC OSF/1 version 1.2 product. Inclusion of this feature in the product is expected to reduce CPU utilization. While page remapping does reduce the cost of processing a packet, the design issues outlined above impact applications. To achieve performance benefits and application portability across multiple heterogeneous open systems, future work continues in this area. In addition, integrated hardware solutions to reduce the cost of the copy operation are also under investigation.

The performance numbers presented in this paper did not include the improvements described in this section on experimental work. We anticipate that the overall performance would see substantial improvement with the inclusion of these changes.

Conclusions

Increases in communication link speeds and the dramatic increases in processor speeds have increased the potential for widespread use of distributed computing. The typical throughput delivered to applications, however, has not increased as dramatically. One of the primary causes has been that network I/O is intensive on memory bandwidth, and the increases in memory bandwidths have only been modest. We described in this paper an effort using the new Alpha AXP workstations and the DEC OSF/1 operating system for communication over FDDI to remove this I/O bottleneck from the end system.

We described the characteristics of the DEC 3000 AXP Model 500 workstation, which uses Digital's Alpha AXP 64-bit RISC microprocessor. With the use of wider access to memory and the use of multilevel caches, which are coherent with DMA, the memory subsystem provides the needed bandwidth for applications to achieve substantial throughput while performing network I/O.

Vol. 5 No. 1 Winter 1993 Digital Technical Journal

High-performance TCP/IP and UDP/IP Networking in DEC OSF/1 for Alpha AXP

We described the implementation of the internet protocol suite, TCP/IP and UDP/IP, on the DEC OSF/1 operating system. One of the primary characteristics of the design is the need for data movement across the kernel-user address space boundary. In addition, both TCP and UDP use checksums for the data. Both these operations introduce increasing overhead with the user message size and comprise a significant part of the total processing cost. We described the optimizations performed to make these operations efficient by taking advantage of the wider cache lines for the systems and the use of 64-bit operations.
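To illustrate why 64-bit operations help with checksums, the following Python sketch computes the 16-bit ones'-complement Internet checksum by accumulating 64-bit words and folding the result, and compares it with a word-at-a-time reference. It is a sketch of the general wide-word technique described in RFC 1071, not the DEC OSF/1 kernel routine; the function names are assumptions, and Python integers only emulate a 64-bit register.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def checksum_16bit_reference(data: bytes) -> int:
    """Reference: sum 16-bit big-endian words, fold carries, complement."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_64bit_words(data: bytes) -> int:
    """Same checksum, accumulated eight bytes at a time."""
    data += b"\x00" * (-len(data) % 8)       # pad to a whole 64-bit word
    total = 0
    for (word,) in struct.iter_unpack("!Q", data):
        total += word
        # End-around carry: emulate a 64-bit register by wrapping the carry.
        total = (total & MASK64) + (total >> 64)
    # Fold 64 -> 32 -> 16 bits; each width needs two steps to absorb carries.
    for bits, mask in ((32, 0xFFFFFFFF), (16, 0xFFFF)):
        total = (total & mask) + (total >> bits)
        total = (total & mask) + (total >> bits)
    return ~total & 0xFFFF


if __name__ == "__main__":
    sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
    print(hex(checksum_16bit_reference(sample)))
    print(hex(checksum_64bit_words(sample)))
```

The wide-word version touches each byte once but performs a quarter as many additions as the 16-bit loop, which is the kind of saving a 64-bit processor makes available.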

We incorporated several optimizations to the implementation of TCP in the DEC OSF/1 operating system. One of the first was to increase the default socket buffer size (and hence the window size) used by TCP from the earlier, more conservative 4K bytes to 32K bytes. With this, the throughput of a TCP connection over FDDI between two Alpha AXP workstations reached 76.6 Mb/s. By increasing the window size even further, we found that the throughput increases essentially to the full FDDI bandwidth. To increase the window size beyond 64K bytes requires the use of recent extensions to TCP using the window scale option. The window scale option, which is set up at the connection initialization time, allows the two end systems to use much larger windows. We showed that, when using a window size of 150K bytes, the peak throughput of the TCP connection increases to 94.5 Mb/s.
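The arithmetic behind the window scale option can be sketched as follows. The TCP header carries a 16-bit window field, and RFC 1323 lets each end announce a shift count (at most 14) by which that field is scaled. The helper names below are hypothetical, and "K bytes" is assumed to mean 1024 bytes.

```python
MAX_UNSCALED_WINDOW = 0xFFFF   # largest value of the 16-bit TCP window field
MAX_SHIFT = 14                 # largest shift count permitted by RFC 1323


def window_scale_shift(window_bytes: int) -> int:
    """Smallest shift count that lets the window field cover window_bytes."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < window_bytes:
        shift += 1
        if shift > MAX_SHIFT:
            raise ValueError("window too large for the window scale option")
    return shift


def window_field(window_bytes: int, shift: int) -> int:
    """Value carried in the 16-bit window field for a given shift count."""
    return window_bytes >> shift


if __name__ == "__main__":
    for kbytes in (32, 64, 150):
        size = kbytes * 1024
        shift = window_scale_shift(size)
        print(kbytes, shift, window_field(size, shift))
```

Under these assumptions a 32K-byte window needs no scaling, while the 150K-byte window calls for a shift count of 2, with each unit of the window field then standing for 4 bytes.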

We also improved the performance of UDP through implementation optimizations. Typical BSD-derived systems experience substantial loss at the receiver when two peer systems communicate using UDP. Through simple modifications in the processing for UDP and reordering the processing steps, we improved the delivered throughput to the receiving application substantially. The UDP receive throughput achieved at the application was 97.56 Mb/s for the typical NFS message size of 8K bytes.

Even at this throughput, we found that the CPU of the transmitter was not saturated. When a transmitter was allowed to transmit over two different rings (thus removing the communication link as the bottleneck) to two receivers, a single Alpha AXP workstation (DEC 3000 AXP Model 500) was able to transmit an aggregate throughput of more than 149 Mb/s for a message size of 8K bytes.

We also described throughput measurements with the FDDI full-duplex mode between two Alpha AXP workstations. With full-duplex mode there are no latencies associated with token rotation, lost token recovery, or limitations on the amount of data transmitted at a time as imposed by the FDDI timed-token protocol. As a result, with full-duplex mode there are performance improvements. With TCP, we achieve a throughput of 94.5 Mb/s even with the default socket buffer of 32K bytes. This is smaller than the buffer size needed in token passing mode to achieve the same level of throughput. Since the link becomes the bottleneck at this point, there is no substantial increase in throughput achieved with the use of window scaling when FDDI is being used in full-duplex mode.

An increase in peak transmit throughput with UDP is also seen when using FDDI in full-duplex mode.

Finally, a few implementation ideas currently under study were presented.

Acknowledgments

This project could not have been successful without the help and support of a number of other individuals. Craig Smelser, Steve Jenkins, and Kent Ferson were extremely supportive of this project and ensured that the important ideas were incorporated into the OSF V1.2 product. Tim Hoskins helped tremendously by providing many hours of assistance in reviewing ideas and the code before it went into the product. In addition, we thank the [...] short notice. We would like to thank Gary Lorenz in particular for his help with the DEFTA adapters.

