From: Donald Becker
To: Peter Jay Salzman

> > For a communication-intensive Beowulf system I recommend using a card
> > that doesn't require bounce buffers. The RTL8139 requires bounce copying on
> > both Tx and Rx. Ideally you should use a card that can receive into
> > (dword-misaligned) Rx buffers that paragraph-aligns the IP header. This
> > latter detail is more important for Alpha systems than x86 machines.
>
> can you give me a reference to read on what bounce buffers and dword
> misaligned rx buffers are? i'm eager to learn about anything that i
> haven't even heard of.

There is a brief discussion at the top of most drivers about their
structure. I wrote most of the drivers on my own initiative, so I'm
pretty explicit about any performance or capability flaws.

First, some definitions for alignment:

  word        "word"         16 bit (2 byte) alignment
  dword       "double word"  32 bit (4 byte) alignment
  paragraph                  16 byte alignment
  cache line                 16 or 32 byte alignment

An Ethernet frame carrying an IP packet is laid out at these byte offsets:

  0----       destination address
  6----       source address
  12----      protocol type, e.g. 0x0806
  14----      IP header
  14+4*N ---- payload, after an IP header of N dwords

IP header elements are always 4 byte values, and thus maintain whatever
dword alignment they started with. So if you start with an aligned
buffer, adding a 14 byte Ethernet header misaligns every IP header
element, and the payload data is misaligned as well!

There are two data paths, Tx and Rx.

Tx

There are a few chips that can only transmit dword-aligned data
buffers. Linux creates most of its packets with dword-aligned payloads,
prepends a dword-multiple IP header, then adds a 14 byte Ethernet header
and ends up with a misaligned packet. The only way to transmit this is
to *copy* the entire packet to an aligned "bounce buffer". Some of the
copy cost is mitigated because the data is usually hot in the cache, but
we are still doing an extra copy of every byte transmitted.
Example chips with this issue: Via-Rhine, RTL8139.

Rx

There are many chips that can only receive into dword-aligned data
buffers.
Similar to the Tx case, the IP header is always misaligned. For most
architectures this creates a diffuse, difficult-to-measure performance
loss, as every read of a dword data element triggers the hardware to
read two dwords and assemble the result.

For a few architectures (e.g. Alpha) the situation is even worse. The
generic dword read only works with dword-aligned addresses. The Linux
protocol stack isn't written with re-alignment in mind, so the driver
must always copy-align the received data into a new buffer. The
"rx_copybreak" parameter exists in many drivers to effect this changed
behavior without impacting architectures that don't require aligned
memory operations. Some of the copy cost is mitigated by doing
copy-checksum and preloading the cache, but the cache preloading turns
into cache flushing when heavy network traffic causes Rx packets to be
queued.

For chips that can handle Rx buffers on arbitrary boundaries, we start
with a paragraph-aligned buffer and don't use the first two bytes.
After the 14 byte Ethernet header, we end up with the IP header
paragraph aligned. In an ideal world we would have 16 bytes of IP
header, resulting in paragraph- and cache-aligned payload data. Alas,
current Linux protocol stacks go wild with extra IP options.

Chip designs with the unaligned Rx issue are
   tulip (yes, Digital's own chip works badly with the Alpha!)
   eepro100 (in some modes)
   rtl8139 (special case: always requires a copy, even on the x86)
   starfire
   via-rhine
   winbond-840

Designs without this problem are
   3Com (3c900 series)
   epic100
   hamachi
   natsemi
   yellowfin

Note that designs that require dword-aligned buffers never allow buffers
that are not dword multiples, so you cannot realign with e.g. a 14 byte
preliminary data buffer.

Donald Becker				becker@scyld.com
Scyld Computing Corporation
410 Severn Ave. Suite 210
Annapolis MD 21403
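
The alignment arithmetic in the message above can be checked with a
few lines of C. This is only a rough sketch and not code from any
driver: the helper names (misalign, ip_header_offset, tx_needs_bounce,
rx_should_copy) are made up for illustration, and the copybreak policy
shown (copy frames shorter than the threshold) is an assumption about
how such a parameter is typically used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    ETH_HLEN = 14,  /* Ethernet header: 6 + 6 + 2 bytes */
    RX_PAD   = 2    /* unused bytes at the start of an Rx buffer */
};

/* Remainder of 'offset' modulo a power-of-two alignment. */
unsigned misalign(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (unsigned)(offset & (align - 1));
}

/* Byte offset of the IP header when the frame starts 'pad' bytes
 * into an aligned buffer. */
size_t ip_header_offset(size_t pad)
{
    return pad + ETH_HLEN;
}

/* Tx (hypothetical helper): a chip limited to dword-aligned buffers
 * needs a bounce copy when the frame does not start on a 4-byte
 * boundary. */
int tx_needs_bounce(uintptr_t frame_addr)
{
    return misalign((size_t)frame_addr, 4) != 0;
}

/* Rx (assumed policy): frames shorter than the copybreak threshold
 * are copied into a fresh, aligned buffer. */
int rx_should_copy(size_t pkt_len, size_t rx_copybreak)
{
    return pkt_len < rx_copybreak;
}
```

With no padding, ip_header_offset(0) is 14, which is 2 modulo 4, so
every IP header field is dword-misaligned. With the two-byte pad,
ip_header_offset(RX_PAD) is 16, a paragraph boundary. Under the assumed
policy, a threshold larger than the maximum frame size makes
rx_should_copy() true for every frame, which corresponds to the
always-copy case described for architectures that need aligned reads.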