As of version 0.11, old is multithreaded. The way it uses threads is a bit
unusual, so this document describes it in detail, so readers of the code
don't panic when reading it.


How it used to work
-------------------

Before these changes, old had a simple select() loop which waited for
connections, and then called net_get_cmd() and net_parse() to process
input from the network.

This was straightforward, simple and fast: as we are CPU bound, there was
no need for a more complex scheme, which would add overhead and offer no
gain at all.

But there was one drawback: if we had more than one processor, old only
used one, because it did all the work in a single loop, and thus we didn't
use the machine's capacity at full scale.


How it works now
----------------

I didn't want to slow anything down or complicate the code too much, so I
started with something quite simple: split the old loop into three
functions: net_init() (to initialize basic structures and open the
listening fd), net_select_loop() (to do select and only select), and
net_proc_loop() (to do the processing of the network i/o).

So now we call net_init() as expected, then create threads which run
net_proc_loop() to do the processing, and finally call net_select_loop()
to get notified when i/o is ready and, in turn, tell the processing
threads that they have work to do. So far so good, this is a normal
approach.

But now comes the interesting part: how does net_select_loop() tell
net_proc_loop() it has work to do? There are several "classic" ways of
doing it, the most popular being wait queues to hold pending work, and the
condition variables provided by pthreads. Those work very well but have
some drawbacks, mainly the overhead: because the operations we do are
really fast, the overhead of those approaches was too big for my taste.

So I came up with a simple and fast way of doing it that has minimal
overhead: we rely on three arrays: one of locks, one of integers holding
the file descriptor each thread has to process input from, and another one
of integers used to tell if a thread is busy or idle. The thread number is
used to index each array, so we get O(1) access to per-thread data.

Each processing thread tries to lock its own lock in the array, which is
normally held (we will see why below). When it grabs the lock, it takes
the fd to do i/o on from the array and does the work; finally, it marks
itself idle and goes back to trying to lock its own lock.

The select loop, of course, calls select() on all the file descriptors to
see when there is data waiting on a given fd. When that happens, it looks
for an idle thread by walking the busy array, assigns the fd to that
thread (using the fd_to_process array), and finally unlocks the thread's
lock, which wakes it up (remember, from the previous paragraph, that the
thread is sitting idle trying to lock its lock). If all threads are busy,
it just moves on and goes back to select(), which will return immediately
because there are fds with data pending; this is done for performance
reasons explained below in the 'Scheduler interaction' section.

And basically that's it. I don't think the code is _that_ messy (in fact I
found it quite simple), but I felt the need to write a small document so
first-time readers don't get lost because of this unusual approach.
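To make the mechanism more concrete, here is a minimal sketch of the idea.
It is not the actual old source: names like MAX_THREADS, process_fd(),
dispatch() and start_threads() are made up for illustration, and the
per-thread "locks" are modelled as binary semaphores, since a semaphore
lets the select thread "unlock" something a processing thread is waiting
on.

    #include <pthread.h>
    #include <semaphore.h>
    #include <unistd.h>

    #define MAX_THREADS 8

    static sem_t locks[MAX_THREADS];          /* "normally locked": start at 0 */
    static int fd_to_process[MAX_THREADS];    /* fd each thread must handle */
    static volatile int busy[MAX_THREADS];    /* 1 = working, 0 = idle */

    /* Stand-in for the real work (think net_get_cmd() and net_parse()). */
    static void process_fd(int fd)
    {
        char buf[512];
        read(fd, buf, sizeof(buf));
    }

    /* The core of what each processing thread runs (net_proc_loop()). */
    static void *proc_loop(void *arg)
    {
        int tnum = *(int *) arg;

        for (;;) {
            /* Blocks until the select loop "unlocks" our lock. */
            sem_wait(&locks[tnum]);

            process_fd(fd_to_process[tnum]);

            /* Mark ourselves idle and go back to waiting. */
            busy[tnum] = 0;
        }
        return NULL;
    }

    /* Called from the select loop when select() reports data on fd.
     * Returns 1 if the work was handed to an idle thread, 0 if all the
     * threads were busy (in which case we just go back to select()). */
    static int dispatch(int fd)
    {
        int i;

        for (i = 0; i < MAX_THREADS; i++) {
            if (!busy[i]) {
                busy[i] = 1;
                fd_to_process[i] = fd;
                sem_post(&locks[i]);    /* wakes the thread up */
                return 1;
            }
        }
        return 0;
    }

    /* Rough equivalent of the setup done from net_init(): create the
     * "locks" already held and start the processing threads. */
    static pthread_t threads[MAX_THREADS];
    static int tnums[MAX_THREADS];

    static void start_threads(void)
    {
        int i;

        for (i = 0; i < MAX_THREADS; i++) {
            sem_init(&locks[i], 0, 0);
            busy[i] = 0;
            tnums[i] = i;
            pthread_create(&threads[i], NULL, proc_loop, &tnums[i]);
        }
    }

Handing work to an idle thread costs just the post/wait pair plus a short
walk over the busy array, which is where the low overhead comes from.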
Benefits of the new approach
----------------------------

The most important thing is the old one-processor case: the added overhead
is minimal, basically just one lock/unlock call pair, which is
unmeasurable in normal conditions. I've tested it against the old big
select loop and there is no slowdown at all.

Furthermore, it allows old to scale pretty well, because now it can use
more than one processor efficiently by processing commands simultaneously.
The rest of the server was already thread-safe, so no other code had to be
modified, and internal lock contention is expected to be pretty small.


Scheduler interaction
---------------------

The nature of the lock server and the way it hands work to its threads
make it an interesting case for the system scheduler. This shows up mostly
on UP machines, with one select thread and one processing thread, when
continuously streaming work at the server.

In the first implementation (0.11) we did a busy loop inside the select
thread, looking for idle threads; the loop only stopped when it found one.
This led to very bad performance, because the scheduler identified the
select thread as a CPU hog (while looping it would use all of its
available timeslice), gave it its timeslice all at once, and then kept it
aside for a while. The problem was that while the loop was running the
processing thread couldn't run, because the CPU was being held by the
select thread, so a lot of time was wasted spinning on the busy loop.

To fix this we had several options, most of them combinable.

First of all, we could stop spinning endlessly looking for an idle thread,
and just keep going like nothing happened and go back to select() (which
will return immediately because there is an fd with pending data). This
has the small advantage that we go back to the kernel in the select()
call, which might or might not return immediately or reschedule, but we
are certainly being nicer this way.

Another option was to use priorities to lower the priority of the select
thread, so it gets shorter timeslices and gives the processing threads
more chances to run (a small sketch of this appears at the end of this
document).

And finally, on top of the first option, we could call sched_yield()
before select() to give up the rest of our timeslice and let other threads
run. But this has a very big downside: the behaviour becomes quite
dependent on the scheduler, which might decide to put the select thread to
sleep for a _very_ long time, and that ends up in a performance loss.

After doing performance testing (using a Linux 2.6 kernel), the
combination of the first and second approaches performed much better, as
expected; the third one actually slowed things down a bit.


Recommendations
---------------

Be aware that if you start more processing threads than there are
processors on your system, you'll probably see slowdowns, because the
server is completely CPU bound; so it's not recommended. Also bear in mind
that there will always be one, and only one, select thread.

By default we start only one processing thread. In the future we'll
probably switch that to as many processing threads as there are processors
in the system, but that's not set in stone.

Please let me know if you have any doubts or thoughts about this.

Thanks,
		Alberto
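As a small illustration of the priority option mentioned in the 'Scheduler
interaction' section: this is one way the select thread could lower its
own priority. It is only a sketch, not the actual old code; it assumes
Linux/NPTL behaviour, where the nice value is effectively a per-thread
attribute, and the nice increment of 5 is an arbitrary example.

    #include <stdio.h>
    #include <sys/resource.h>

    /* Called from the select thread before entering the select loop.
     * On Linux/NPTL, setpriority() with who = 0 affects only the
     * calling thread; a higher nice value means a lower priority, so
     * the processing threads get more chances to run. */
    static void lower_select_priority(void)
    {
        if (setpriority(PRIO_PROCESS, 0, 5) < 0)
            perror("setpriority");
    }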