Sep 9, 2013

The Technology That Drives Trade-Ideas

Written by Katie Gomez

I’ve been proud of our servers at Trade-Ideas from the beginning.  But I never discussed them much because the technology was so esoteric.  Now that I see some of our core ideals gaining wider acceptance, I think it’s worth sharing our experiences.  

A lot of this recent activity can be summarized in one article.  This article describes how a server can efficiently talk with a lot of clients at once.  Let me add just one bit of advice on top of this excellent paper.  select() is not just slow, but if you try to listen to more than a thousand or so sockets at once, it can cause your program to dump core.

The bulk of this article is about 13 years old.  However, the ideas are just now becoming mainstream.  Nginx, as described in a recent article in Wired Magazine, is a web server based on these ideals.  A more interesting example is Node.js.  This has recently gained attention from google and others.
It’s funny that Node.js and other modern platforms use a single threaded model to maximize concurrency, while multiple cores have become so common.  A lot of these examples work well because they are essentially proxies.  They copy data from one place to another, and do very little else.  Here at Trade-Ideas we need some proxies, but the interesting part of our operation is a cluster of servers doing a lot of work.
To understand the right way to solve this problem, let’s start by looking at older, less scalable solutions.  The best example is the Apache web server.  This is great software because it can do so much.  The problem is that there’s no organization.  Each request tries to do its own thing, grabbing whatever resources it needs.  That’s not scalable.
The normal complaint about Apache is that it’s big.  One person asks for a big database query, followed by some work in PHP.  That Apache process asks for a lot of resources.  The next person wants an icon, a very small file that gets copied as is.  The Apache process isn’t smart enough to release the extra resources.  If you typically have 99 people making simple requests and one person making a big request at any given time, that takes up almost the same resources as 100 people making big requests.
As far as I’m concerned, that’s only part of the problem.  Imagine you configure Apache to use 100 processes.  That’s fine for a while, but what happens when someone makes a request that involves a very slow database query?  What happens when several people do at once?  You know, if it’s slow, some people will keep hitting refresh and start even more sessions.  As things get slower, new requests will come in faster than old ones can be processed.  The phrase “wound around the axle” is vividly appropriate.
In this case, adding more more processes won’t help.  You’ll quickly run out of resources.  If you add more web servers, that’s a temporary solution at best.  You will quickly overload your database server.  So what if you don’t add any more processes?  Then you find people who make simple requests waiting behind these slow requests.  It might take a minute to serve a single icon.  For that matter, simple database queries will take a lot longer than they should.  This problem can be solved by finesse, but not by brute force.
Our server design turns the Apache model on its head.  This design is based on experiences from my previous job designing realtime systems for a defense contractor.  We most definitely do not have one thread or process per request.  We typically have one thread per job.  The C10k paper does a good job describing our listener thread.  That thread grabs data from the network and nothing more.  Each time it gets some bytes, it sends a message to another piece of software in another thread.  That thread unmarshals the bytes into messages.  Those messages are sent to other threads, like the history thread, the user management thread, and the icon thread, just to name a few.  These threads are all connected by queues.
(This model is a very similar to the Erlang model.  We actually have one server written in Erlang.  That language is not right for all tasks.  But that one server has been working for years without any trouble.  I can’t remember the last time I had to tune or restart it.)
There are a number of advantages to this model.  First is simplicity.  I can write one small piece of software without thinking about the other pieces.  The inputs and outputs are clearly defined.  There are almost no mutexes (outside of the queues) because almost no data is shared between the threads (aside from the queues).  Reducing the number of mutexes can offer performance benefits, but that’s not my biggest concern.  In my experience, the hardest part of multithreaded programming is the proper use of a mutex. 
The next benefit is control.  I know exactly how many threads I have.  I know exactly how many database connections I have.   I know how many people are trying to read from the disk at once.  I know how many tasks are doing CPU intensive work at once.  And I can tweak these.  If I have a lot of history requests, I can create multiple copies of this history thread, each with its own database connection.  If my database is being overwhelmed, I can make fewer of these threads.  If a lot of history requests come in at once, they will patiently wait their turn in the queue.  Because I can fine tune the rate at which we’re executing different types of requests, the request at the end of the queue will probably still be serviced sooner than if all requests tried to run at once.
The next benefit is segregation.  What if we get a request for an icon and a complicated history request at the same time?  These require different resources.  With no effort on the programmer’s part, they can run independently.  Presumably the icon request will be orders of magnitude faster than the history request, so the icon should never have to wait for the database.  But what if I’m wrong?  Maybe the icon is not in the cache, and the disk is busy doing overnight maintenance.  In that case the server will automatically queue up the icon requests, and these won’t slow down the history requests.  The programmer doesn’t even have to think about that case; it just works.  A login request will require the database but should be much faster than a history request.  We use the same solution:  we created one new thread to handle all login requests.
Segregation is the key to performance, but it’s also essential for robustness.  Really these two bleed into one another.  What if one component is poorly tuned or overloaded with user requests?  It might go a little slow.  It might go very slow.  If one component completely dies, maybe it must be manually restarted.  In any case, some requests will be slow or will not be satisfied at all, but everything else goes at full speed.  The secret to all of these is segregating the tasks, attaching the tasks with queues, and making sure the the requests sitting in the queue use very few resources.  That last part is typically easy.  A request in the queue is often exactly what the user gave us, like “price > 5 && average volume > 100,000” or “custom formula #29”.  We can easily store millions of those in memory, even with older hardware.
The design I’ve described above is very clean, and works very well.  But there was one unexpected problem.  What happens when you need the Apache-like solution?  Maybe you’re throwing together a quick prototype.  You have several well behaved servers all sharing a database with one server that is out of control.  In that case the one bad server can cause the good ones to starve.
(Prototypes and quick-and-dirty jobs are essential!  There’s no way to build a large, complex system without these.)
The solution to this is quite simple and elegant.  I created a semaphore manager.  When the Apache-style server needs to access a database or similar resource, it requests a semaphore.  This limits the number of simultaneous database connections that the server can make.  I can easily tweak that number.  I can even segregate the requests on that server.  I can make one semaphore for slow database requests, one for simple database requests, and one for memory or CPU intensive requests.  This successfully keeps the old-style server from killing the better servers.  And it even helps the performance of the old-style server because of the segregation.  It’s not perfect because it doesn’t solve the queuing problem.  As in a typical Apache server, there is only one queue when you run out of processes.  When a process is waiting for a semaphore, it is sitting in a type of queue, but that’s very inefficient.  A process uses so much more memory than a request in the new servers.
(A side note, I took a class from Edsger Dijkstra, the inventor of the semaphore.  He really was as smart as everyone said!)
In order to  make this new server architecture work, we needed a custom communications protocol.  Typical HTTP requests are expensive and you don’t want to have a lot of outstanding requests.  Often a client is limited to two or four outstanding HTTP requests to a server.  This is the exact same problem that we keep seeing over and over.  What if the client has sent four slow requests and now wants to send a request that should be fast?  Our custom protocol was inspired by HTTP but has some key differences.  First of all, you only open one TCP/IP connection between the client and the server.  You can send any number of requests to the server at once.  These do not have to be answered in order.  With very few exceptions, the client software doesn’t have to worry about any of these issues we’ve mentioned.  The client will make requests.  The server will queue then answer the requests.  And we added other useful features, like automatic grouping of messages, session based compression (rather than message based compression), a text-based debug mode, and more.
Sometimes we can’t avoid HTTP.  Sometimes a firewall or other issues will get in the way.  In that case we use an HTTP tunnel on top of our protocol.  It’s less efficient.  It takes more bytes, seconds, and CPU cycles to do the same task.  But we still get a lot of the benefits of our protocol.  In particular, the ability to send several requests at once and receive responses in whatever order is best for the server.  And it’s nice from a software engineering perspective.  Because this a just a tunnel, not a new protocol or new server, the bulk of the client and server software can be reused without any changes.
When discussing efficient, scalable servers, it’s traditional to discuss “zero copy”.  The good news is that avoiding copies is often easier than it sounds.  GCC’s std::string class typically reuses your data rather than copying it.  Strings are implemented with smart pointers.  (The C++ standard allows but does not require smart pointers.)  And, when used correctly, standard strings are thread safe.
The message passing aspect of our software means that we don’t copy a lot of data.  We don’t even need smart pointers in this case.  One thread creates a request as an object on the heap.  A pointer to the request gets passed around from one thread to the next.  Exactly one thread at a time has access to the request.  Even within a thread, typically exactly one data structure will be responsible for each request at any given time.  It’s common for a thread to use a simple FIFO to listen for new requests from other threads.  But within a thread, there will be some sort of priority queue.  The pointer moves from one object to the next in an organized fashion.  
We do use smart pointers in some places.  A lot of times this isn’t about performance, but ease of development.  A database result is a good example.  We have our own object that is a wrapper around the database result.  We are storing an opaque handle that is only meaningful to the database API.  We couldn’t duplicate that if we wanted to.  The smart pointer means that we can easily generate the result in one function and process it in another.  It means that we can store results in data structures, as we often do.  And it means that we will never accidentally forget to release the resources when we are done with the data.  Most of the time (including this database example) the data is not thread safe, so there is no need for the smart pointer to be thread safe.
Sometimes we use thread safe smart pointers.  This allows several threads to use the same data without copying it.  This also makes the software easier to write.  We can queue up a lot of requests without worrying about the order in which they are executed.  If two threads happen to need the same object at the same time, we don’t worry about it.  When an object needs to be thread safe, it typically uses mutexes.  It’s worth a moment to understand the performance characteristics of mutexes.  When you ask linux for a mutex, it actually gives you a futex.  The “f” is for “fast”!  As long as there is no contention for the mutex, this might be as fast as a single memory access.  The first time I tested this I thought I made a mistake because time required to lock and unlock a mutex was so short.  There is only a penalty when two threads try to lock the mutex at the same time.  The way our servers are designed, this can happen but it is rare.
Const smart pointers are a special case.  This is often used when we have one set of instructions that is shared by a lot of requests.  This allows us to split one problem up into smaller pieces and let multiple threads work on the pieces in parallel.  This is especially useful for CPU intensive operations on machines with lots of cores.  Most of the time read only operations are naturally thread safe.  So it’s common to separate the constant parts of the data from the parts that might change.  In this case, no mutexes are needed at all.  Note that incrementing and decrementing the reference count does not require a mutex.  On a modern system, that’s a simple interlocked increment statement.
These ideas aren’t new, they’re just becoming more common.  For example, I’ve run into several variations of the interlocked increment statement recently.  But I first saw them in grad school about 20 years ago!  Of course, they weren’t in the compiler yet.  They were in the assembler, but not really.  Some of the instructions weren’t standardized yet, so my 486 CPU didn’t match my assembler.  I had to hand code some of these in machine language!  These ideas have been waiting, and their time has finally come!