Here is an interesting contribution on the topic from Preston Elder.
I've worked on multiple extremely large-scale applications (including one sustaining 70,000 concurrent connections, 10,000 new connections per minute, and 15,000 concurrent throttled file transfers, all in a single application instance on one machine).
The biggest problem I have seen is that people don't know how to properly define a thread's purpose and requirements, and don't know how to decouple tasks that have built-in latency, or how to avoid thread blocking (and locking).
For example, in a high-performance network app you will often have some kind of multiplexor (or more than one) for your connections, so that you don't need a thread per connection. But people often make the mistake of doing too much in the multiplexor's thread. The multiplexor should ideally exist only to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of thread pool for the actual processing. Anything more, and your multiplexor can't get back to retrieving the next piece of data fast enough.
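As a sketch of how little the multiplexor thread should do, here is a minimal framing function. This is a hypothetical illustration (the function name and the 4-byte big-endian length-prefix wire format are my assumptions, not from the original): it only chops buffered bytes into complete packets and returns the leftover for the next read, leaving all real work to the pool.

```python
import struct

def frame_packets(buf: bytes):
    """Split a byte stream into length-prefixed packets.

    Assumes each packet is a 4-byte big-endian length header followed
    by the payload.  Returns (complete_packets, leftover_bytes); the
    leftover is the trailing partial packet to retry after the next
    read.  A multiplexor thread should do roughly this much and no more
    before handing the packets off to a worker pool.
    """
    packets = []
    while len(buf) >= 4:
        (n,) = struct.unpack_from(">I", buf)   # peek at the length header
        if len(buf) < 4 + n:
            break                              # partial packet; wait for more data
        packets.append(buf[4:4 + n])
        buf = buf[4 + n:]
    return packets, buf
```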
Similarly, when moving data from a multiplexor to a thread pool, you should move it in bulk (lock the queue once, not once per message), and you should use the Least Loaded pattern: each thread in the pool has its OWN queue, and you move the entire batch of messages to the thread that is currently least loaded. The next time the multiplexor has a batch, it will move it to a different thread, because that one is now the least loaded. Assuming your processing takes longer than splitting the data into packets (it should!), all your threads will stay busy, yet there is no lock contention between them, and only occasional contention, once, when a thread receives a new batch of messages to process.
Finally, decouple your I/O-bound processes. Make your I/O-bound tasks (e.g. reporting via socket back to some kind of stats/reporting system) happen in their own thread if they are allowed to block. And make sure your worker threads aren't waiting to hand the I/O-bound thread data. Here a similar pattern to the above, in reverse, works well: each thread PUSHING to the I/O-bound thread has its own queue, and the I/O-bound thread has its own queue; when its queue is empty, it just swaps in the contents of all the worker queues (or just the next one, in round-robin fashion), so the workers can put data onto their queues at their leisure again, without lock contention with each other.
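The reverse pattern can be sketched as a queue swap. In this hypothetical illustration (names `ProducerQueue` and `collect` are mine), the I/O-bound thread locks each worker's outbox only for the instant it takes to swap in an empty queue, so workers almost never block on it or on each other.

```python
import threading
from collections import deque

class ProducerQueue:
    """A worker thread's private outbox for the I/O-bound thread."""
    def __init__(self):
        self.items = deque()
        self.lock = threading.Lock()

    def push(self, item):
        with self.lock:
            self.items.append(item)

def collect(producers):
    """Called by the I/O-bound thread when its own queue runs dry.

    Each producer is locked only for an O(1) swap, then the full
    queues are drained with no locks held at all.
    """
    drained = deque()
    for p in producers:
        with p.lock:
            full, p.items = p.items, deque()   # swap, don't copy, under the lock
        drained.extend(full)                   # drain outside any lock
    return drained
```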
Never underestimate the value of your memory. If you are doing something like reporting to a stats/reporting server via socket, you should implement some kind of Store and Forward (SAF) system. This is both for integrity (if your app crashes, you still have the data to send) and so you don't blow out your memory. The same is true if you are doing SQL inserts to an off-system database server: spool the data out to local disk (local solid-state is even better!) and then have a dedicated thread, untouched by anything else, continually reading from disk and doing the inserts. Make sure your SAF uses *CYCLING FILES* that cycle on both max size AND time; you don't want to keep appending to a file that can never be erased. Preferably, make that file a memory-mapped file. Similarly, when sending data to your end users, make sure you can overflow the data to disk, so you don't have 3 MB of data sitting in memory for a single client who happens to be too slow to take it fast enough.
And one last thing: make sure you have architected things so that you can simply start up a new instance on another machine and both machines can work IN TANDEM, allowing you to throw hardware at the problem once you reach your hardware's limit. I've personally scaled an app from about 20 machines to over 650 by ensuring the collector could handle multiple collections, and by making sure I could run multiple collectors side by side for when the data is too much for one collector to crunch.
I don't know of any papers on this, but this is my experience writing extremely high-performance network apps.
[ This post was originally written by Preston on Slashdot. Reproduced here with his permission. If you have an interesting comment or a writeup you want to share on this blog, please forward it to me or submit a comment to this post. ]