
Jeff Boles

Virtual Frontiers

VTL Under Fire -- Is that an Opteron I see? Part 3 - bad northbridges, and why your x86 server is falling all over itself.

Last time I reviewed some of the bus architecture of current Intel Xeon servers, and talked about some of the speeds and feeds across that bus.  The fact of the matter is, there's a big design problem in Northbridge-architected x86 servers, and the numbers I gave you last post just don't add up to what you really get.  Today, let's take a look at why this is.

 

In an optimal, maximum-throughput scenario with the fastest adapters out there (say dual-port 4Gb Fibre Channel adapters or 10Gb Ethernet NICs), three PCI-X 133MHz cards feeding three PCI hubs in a Northbridge chipset could potentially occupy half of the entire CPU and memory bus bandwidth available in the latest Xeon servers (3 * 1066MB/s = 3.2GB/s, and the latest Xeon bus is 6.4GB/s).  While this will change with new Front Side Bus (FSB) speeds in the future, the fact is that you don't get to push data across the server at half of the total bandwidth of this bus to start with, because there's a lot more going on than straight throughput of data.
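The arithmetic behind that one-half figure can be checked with a quick sketch.  The numbers are the peak figures quoted above; the variable names are my own and purely illustrative.

```python
# Peak figures quoted in the text; the names are illustrative, not from any spec.
PCIX_133_MB_S = 1066   # peak of one PCI-X 133MHz bus, in MB/s
FSB_MB_S = 6400        # peak of the 800MHz Xeon front side bus, in MB/s
NUM_PCIX_BUSES = 3     # three cards, each on its own PCI hub

# Aggregate peak I/O the three buses could push toward the Northbridge.
io_peak_mb_s = NUM_PCIX_BUSES * PCIX_133_MB_S

# Fraction of the front side bus that this aggregate would occupy.
fsb_share = io_peak_mb_s / FSB_MB_S

print(f"Aggregate PCI-X peak: {io_peak_mb_s} MB/s")
print(f"Share of FSB bandwidth: {fsb_share:.0%}")
```

These are peak numbers only; the rest of this post is about why the sustained figure is much lower.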

 

What I hope to give you from this discussion is an answer to a question I've struggled with: how much throughput should you expect from a server?  The answer is a rule of thumb for estimating it.

 

But to get there, let's talk about what's really going on inside the memory and CPU bus as data is being transferred.  Keep in mind, this refers to current, typical PCI-X 133MHz-borne I/O and doesn't address some of the capabilities you might find in more atypical solutions, such as InfiniBand and RDMA.  I may talk about those a bit in a final section of this Opteron series.

 

When you're moving data across an I/O bus -- and that is a basic assumption in an I/O oriented server -- you must move that data "from somewhere" and "to somewhere".  This happens in two basic phases: the first is receiving data, and the second is transmitting data.  Let's break down each of these and look at what happens with data in each of them.

 

Receive Data:

  1. Data arrives at the interface.  Data is received by the interface of an adapter card (NIC, HBA, etc.)
  2. Data is moved to memory.  If using Direct Memory Access (DMA), which is usually the case today, the data is written directly into an area of system memory identified as the socket buffer.  The socket buffer is essentially the parking lot for incoming data before an application starts working with it.  (Since this is new, untouched data, any L2 cache lines holding the previous contents of this space are invalidated at high speed by the Northbridge, and the only delay incurred is moving the data to system memory.)
  3. Data is moved to the processor/cache.  The CPU reads this data from the socket buffer area of system memory and it is transferred to L2 cache (since it's new data, it has to be pulled fresh into L2 cache).
  4. Data is moved to the application area of memory.  As the incoming data is manipulated and moved across the system, it's not worked with directly in the socket buffer area of memory, but is moved to a new place in system memory allocated to an application process.  This is actually a fairly intense process, requiring the read of memory into L2 cache described above, the transfer of that data into a cache area associated with the application buffer, and the transfer of this data back to the application area of system memory.  If some area of L2 cache is dirty (meaning a change has been made in cache and it needs to be written back to system memory), then there may be additional transactions from the CPU to system memory.

What this all means for the receive cycle is that a data block moves from a server interface across the memory bus into memory (1 time), and may then generate as many as 3 or more further data block movements across the memory bus (read into L2 socket buffer cache and transfer to L2 application buffer cache [1], clearing out application buffer cache if dirty [2], and writing application buffer cache to system memory [3]).  The number may be higher than this depending upon cache performance, but 4 data movements across the memory bus is a good general assessment of receive transactions.

 

When receiving data into the system, every data block received may generate up to 4 times as much data traffic across the memory bus.
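As a rough tally of the receive side, the crossings described above can be written down and summed.  This is only a sketch of the counting in this post; the step labels are my own shorthand, and real cache behavior can push the number higher.

```python
# Memory-bus crossings per received data block, as counted in the text.
# Labels are illustrative shorthand, not chipset terminology.
RECEIVE_CROSSINGS = [
    ("DMA write from adapter into the socket buffer", 1),
    ("CPU read of the socket buffer into L2 cache", 1),
    ("write-back of dirty application-buffer cache lines", 1),  # only if dirty
    ("write of the application buffer to system memory", 1),
]


def receive_amplification(crossings=RECEIVE_CROSSINGS):
    """Return how many times one received block crosses the memory bus."""
    return sum(count for _label, count in crossings)


print(receive_amplification())
```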

 

Transmit Data:

 

Once some piece of data has made it into the application buffer area of system memory (by a receive process or some application process), it can be transmitted out another interface.  Since we're discussing an I/O oriented server, it's a safe assumption that we're going to exhaust L2 cache in many cases and be moving data to and from system memory before it goes from receive to transmit.  So in most cases, as an application is working with data or as new data is being received, we're likely to find the data that is to be transferred sitting in system memory.  Consequently, the transmit process is similar to this:

 

  1. Get data from system memory.  The data is read from the application buffer area of system memory into L2 cache across the memory bus, where it is subsequently moved into the L2 cache area associated with the socket buffer.  This represents one data movement across the bus (unless different buffer sizes end up pushing the socket buffer out of cache, in which case two more transfers might take place to push it into and retrieve it from system memory).
  2. The interface card uses Direct Memory Access (DMA) to retrieve the data from either cache or system memory.

What this means is that for data to be transmitted out another interface, data is pulled from system memory (1 transfer), and retrieved through the Northbridge into the socket buffer of the interface card (1 transfer).  Assuming no data is pushed from cache, this means two data transfers are required to send data, and if cache is pushed out to memory, potentially two more data transfers are required. 

 

When transmitting data out of the system, every data block transmitted may generate 2 to 4 times as much data traffic across the memory bus.
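The transmit side can be tallied the same way.  This is a sketch of the counting above, not a model of any particular chipset; the function name and its `socket_buffer_evicted` flag are hypothetical names I've made up for illustration.

```python
def transmit_amplification(socket_buffer_evicted: bool = False) -> int:
    """Return how many times one transmitted block crosses the memory bus.

    socket_buffer_evicted is a hypothetical flag for the case where the
    socket buffer has been pushed out of L2 cache to system memory.
    """
    crossings = 1   # application buffer read from system memory into L2 cache
    crossings += 1  # adapter retrieves the data through the Northbridge via DMA
    if socket_buffer_evicted:
        crossings += 2  # push the socket buffer to memory, then fetch it back
    return crossings


print(transmit_amplification())                            # best case
print(transmit_amplification(socket_buffer_evicted=True))  # cache pushed out
```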

 

Conclusions:

As you can tell from the direction my discussion above has leaned, the real limitation in throughput on I/O oriented servers is in moving data across the front side bus - between the PCI buses and Northbridge, to the processor or memory, and back and forth between the processor and memory.  For data that is in a processor (in L2 cache), we don't generally feel the limitations of the front side bus, as the L2 cache and processor are operating at as much as 6 or 7 times the speed of the memory bus where data transfer is taking place.

 

So what does examining these cycles of transfer tell us about what we can expect from typical Intel server hardware?

 

First, when you're doing input and output operations across a server with large volumes of data, that data will likely generate 6 to 8 times as much traffic on the memory bus itself (up to 4 times on receive plus 2 to 4 times on transmit).  On top of this, you lose a little headroom with registered ECC memory.  Couple this with sustained memory access, which operates at a lower speed than the peak bandwidth of the memory (by spec, sometimes as low as 65% of the advertised memory bandwidth), and your server isn't nearly as fast as you think it should be.  For example:

 

Assume you're running the latest Xeon architecture with an 800MHz FSB and 6.4GB/s of bandwidth.  Shaving 5% off memory bandwidth for registered ECC overhead, and taking 65% of what's left for the sustained memory transfer that I/O saturation is going to require, leaves us with about 3.95GB/s of bandwidth to start with.  If we assume that on average every byte of I/O generates 7 times as much traffic on this bus through the receive and transmit processes above, we're down to about 565MB/s of I/O throughput on this server's CPU bus.  My rule of thumb formula is this: 6.4GB/s * .65 for sustained memory speed * .95 for ECC overhead / 7 for data duplication on the memory bus.  Factor in some overhead for other processes and the OS, and things start looking pretty bad pretty quickly.
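Here is the rule of thumb formula as a small function, using the factors from the formula itself (.65, .95 and 7).  The function and parameter names are my own, and 1GB is treated as 1000MB to keep the arithmetic simple.

```python
def io_throughput_mb_s(fsb_gb_s=6.4, sustained=0.65, ecc=0.95, amplification=7):
    """Estimate I/O throughput in MB/s from peak bus bandwidth.

    fsb_gb_s      -- peak front side bus bandwidth in GB/s
    sustained     -- fraction of peak available for sustained memory access
    ecc           -- fraction left after registered ECC overhead
    amplification -- memory-bus traffic generated per byte of I/O
    """
    usable_gb_s = fsb_gb_s * sustained * ecc
    return usable_gb_s * 1000 / amplification


estimate = io_throughput_mb_s()
print(f"Estimated I/O throughput: {estimate:.0f} MB/s")
```

Changing `amplification` to 6 or 8 gives a feel for how sensitive the estimate is to cache behavior.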

 

Rule of Thumb for Northbridge Bus I/O Throughput: Because of the multiple data transfers across the memory bus and lower capabilities for sustained memory throughput, in current Intel architectures, expect maximum I/O throughput of about 9% of your total memory bus bandwidth.  This represents an ideal world, with low overhead, large data blocks and minimal interrupt processing.
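The 9% figure falls straight out of the factors in the formula above, since the bus bandwidth itself cancels out of the ratio:

```latex
\[
\frac{\text{I/O throughput}}{\text{peak bus bandwidth}}
  = \frac{0.65 \times 0.95}{7}
  = \frac{0.6175}{7}
  \approx 0.088 \approx 9\%
\]
```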

 

In our case, with a Xeon-based server, this would mean we might see in the area of 500MB/s to 600MB/s of total I/O throughput when receiving and transmitting across different interfaces, assuming your system is highly optimized.

 

Also, if you're building a Virtual Tape Library (VTL) or other I/O throughput intensive server, you will hit your throughput constraint on the input (receiving) side first, because of the extra overhead of receiving data.  So don't over-build your disk until you're sure your system is optimized.

 

As usual, I have a caveat here.  This is a theoretical guideline built on gross generalizations and simplifications, and I'm not a system hardware engineer.  Actual implementation may vary.  We'll be building a couple of servers, both Xeon and Opteron, in the coming weeks, and I'll report back on findings here.  If your experience differs, comment below.

 

Next week, I'll move on to talking about what the Opteron architecture really gives you that the current Intel Xeon server architecture can't.