From: Donald Becker <becker@scyld.com>
To: Peter Jay Salzman <covenant@dirac.org>


> > For a communication-intensive Beowulf system I recommend using a card
> > that doesn't require bounce buffers.  The RTL8139 requires bounce copying on
> > both Tx and Rx.  Ideally you should use a card that can receive into
> > (dword-misaligned) Rx buffers that paragraph-aligns the IP header.  This
> > latter detail is more important for Alpha systems than x86 machines.
>  
> can you give me a reference to read on what bounce buffers and dword
> misaligned rx buffers are?   i'm eager to learn about anything that i
> haven't even heard of.

There is a brief discussion at the top of most drivers about their
structure.  I wrote most of the drivers on my own initiative, so I'm pretty
explicit about any performance or capability flaws.

First, some definitions for alignment:
  word        "word"          16 bit (2 byte) alignment
  dword       "double word"   32 bit (4 byte) alignment
  paragraph                   16 byte alignment
  cache line                  16 or 32 byte alignment

An Ethernet header looks like
 0----
   <six-byte-destaddr>
 6----
   <six-byte-srcaddr>
12----
   <two-byte-type>	e.g. 0x0800 for IP
14----
   <IP header start, N elements>
14+4*N ----
   <payload data start>


IP header elements are always 4 byte values, and thus maintain whatever
dword alignment they started with.  So if you start with a dword-aligned
buffer, adding a 14 byte Ethernet header misaligns every IP header element,
and the payload data ends up misaligned as well!
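
To make the arithmetic concrete, here is a minimal C sketch.  The function
names are invented for this example and are not taken from any driver.  It
computes where the IP header and the payload land when the frame starts
"pad" bytes into an aligned buffer.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HDR_LEN 14   /* 6 byte dest + 6 byte src + 2 byte type */

/* Offset of the IP header from the start of the buffer, given the
 * number of pad bytes skipped before the Ethernet header. */
static size_t ip_hdr_offset(size_t pad)
{
    return pad + ETH_HDR_LEN;
}

/* Offset of the payload, for an IP header of n 4-byte elements. */
static size_t payload_offset(size_t pad, size_t n)
{
    return ip_hdr_offset(pad) + 4 * n;
}

/* Nonzero if 'offset' is a multiple of 'align' bytes. */
static int is_aligned(size_t offset, size_t align)
{
    assert(align != 0);
    return offset % align == 0;
}
```

With no padding the IP header sits at offset 14 and, for a five-element
header, the payload at offset 34.  Neither is a multiple of 4, so
is_aligned() reports both as dword-misaligned.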

There are two data paths, Tx and Rx.
Tx
  There are a few chips that can only transmit from dword-aligned data
  buffers.  Linux creates most of its packets with dword-aligned payloads,
  prepends an IP header that is a multiple of a dword in length, then adds a
  14 byte Ethernet header and ends up with a misaligned packet.  The only way
  to transmit this is to *copy* the entire packet to an aligned "bounce
  buffer".  Some of the copy cost is mitigated because the data is usually
  hot in the cache, but we are still doing an extra copy of every byte
  transmitted.

  Example chips with this issue: Via-Rhine, RTL8139.
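
  The Tx bounce copy can be sketched roughly as follows.  This is only an
  illustration: the union, the buffer size, and tx_prepare() are assumptions
  made for the example, not code from any real driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_BUF_SIZE 1536   /* assumed: room for one full-size frame */

/* Hypothetical bounce buffer; the uint32_t member forces dword alignment. */
union tx_bounce {
    uint32_t      align;
    unsigned char bytes[TX_BUF_SIZE];
};

/* Return the address the chip should transmit from.  A packet that is
 * already dword aligned is used in place; otherwise every byte is
 * copied into the aligned bounce buffer first. */
static const unsigned char *tx_prepare(union tx_bounce *slot,
                                       const unsigned char *pkt, size_t len)
{
    assert(len <= TX_BUF_SIZE);
    if (((uintptr_t)pkt & 3) == 0)
        return pkt;                  /* aligned: no copy needed */
    memcpy(slot->bytes, pkt, len);   /* the extra copy of every byte */
    return slot->bytes;
}
```

  A chip that can transmit from any address never needs the memcpy() branch,
  which is the whole advantage.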

Rx
  There are many chips that can only receive into dword-aligned data
  buffers.  Similar to the Tx case, the IP header is always misaligned.  For
  most architectures this creates a diffuse, difficult to measure
  performance loss as every read of a dword data element triggers the
  hardware to read two dwords and assemble the result.

  For a few architectures (e.g. Alpha) the situation is even worse.  The
  generic dword read only works with dword-aligned addresses.  The Linux
  protocol stack isn't written with re-alignment in mind, so the driver must
  always copy-align the received data into a new buffer.  The "rx_copybreak"
  parameter exists in many drivers to effect this changed behavior without
  impacting architectures that don't require aligned memory operations.
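
  A rough sketch of the copy-align decision, assuming rx_copybreak acts as a
  size threshold (frames shorter than it are copied, longer ones are passed
  up in place).  The union, the buffer size, and rx_deliver() are made up for
  the example and are not kernel structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 1536   /* assumed: room for one full-size frame */
#define RX_PAD      2      /* skip 2 bytes so the IP header is aligned */

/* Hypothetical receive buffer; the uint32_t member forces dword alignment. */
union rx_buf {
    uint32_t      align;
    unsigned char bytes[RX_BUF_SIZE + RX_PAD];
};

/* Decide how to hand a received frame up the stack.  Frames shorter than
 * rx_copybreak are copied into 'fresh' at offset 2, so the IP header
 * (14 bytes into the frame) lands on a dword boundary.  Longer frames
 * are returned in place, with whatever alignment the chip gave them. */
static unsigned char *rx_deliver(unsigned char *ring_data, size_t len,
                                 size_t rx_copybreak, union rx_buf *fresh)
{
    assert(len <= RX_BUF_SIZE);
    if (len < rx_copybreak) {
        memcpy(fresh->bytes + RX_PAD, ring_data, len);
        return fresh->bytes + RX_PAD;
    }
    return ring_data;
}
```

  Under that assumption, a threshold larger than any frame forces every
  frame through the copy, and a threshold of 0 disables the copy entirely.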

  Some of the copy cost is mitigated by doing copy-checksum and preloading
  the cache, but the cache preloading turns into cache flushing when heavy
  network traffic causes Rx packets to be queued.

  For chips that can handle Rx buffers on arbitrary boundaries we start with
  a paragraph aligned buffer, and don't use the first two bytes.  After the
  14 byte Ethernet header, we end up with the IP header paragraph aligned.
  In an ideal world we would have 16 bytes of IP header, resulting in
  paragraph and cache aligned payload data.  Alas, current Linux protocol
  stacks go wild with extra IP options.
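
  The two-byte skip can be sketched like this.  The helper names are
  hypothetical, and the caller is assumed to supply a buffer that is already
  paragraph aligned.

```c
#include <assert.h>
#include <stdint.h>

#define ETH_HDR_LEN 14   /* 6 byte dest + 6 byte src + 2 byte type */
#define RX_PAD      2    /* unused bytes at the start of the buffer */

/* Given a paragraph-aligned buffer, return the address at which the
 * chip should start writing the received frame. */
static unsigned char *rx_frame_start(unsigned char *buf)
{
    assert(((uintptr_t)buf & 15) == 0);   /* must be 16-byte aligned */
    return buf + RX_PAD;
}

/* Address at which the IP header lands once the frame is received:
 * 2 pad bytes + 14 header bytes = 16 bytes past the buffer start. */
static unsigned char *rx_ip_header(unsigned char *buf)
{
    return rx_frame_start(buf) + ETH_HDR_LEN;
}
```

  The frame itself starts at a dword-misaligned address (offset 2), which is
  exactly why the chip must accept arbitrary Rx buffer boundaries for this
  to work.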

  Chip designs with the unaligned Rx issue are the
     tulip (yes, Digital's own chip works badly with the Alpha!),
     eepro100 (in some modes),
     rtl8139 (special case: always requires a copy, even on the x86),
     starfire,
     via-rhine, and
     winbond-840.

  Designs without this problem are the
     3Com (3c900 series),
     epic100,
     hamachi,
     natsemi, and
     yellowfin.

Note that designs that require dword-aligned buffers never allow buffer
lengths that are not dword multiples, so you cannot realign by e.g. placing
the 14 byte Ethernet header in its own preliminary data buffer.


Donald Becker				becker@scyld.com
Scyld Computing Corporation
410 Severn Ave. Suite 210
Annapolis MD 21403