git » nmdb » commit adc64ec

Add some documentation.

author Alberto Bertogli
2006-09-15 05:14:47 UTC
committer Alberto Bertogli
2006-09-15 05:14:47 UTC
parent cabc09732e593f63218bdc2032a52010a31611fc

Add some documentation.

doc/design.rst +207 -0
doc/network.rst +144 -0

diff --git a/doc/design.rst b/doc/design.rst
new file mode 100644
index 0000000..2acb447
--- /dev/null
+++ b/doc/design.rst
@@ -0,0 +1,207 @@
+
+=====================================
+nmdb - A TIPC-based database manager
+=====================================
+:Author: Alberto Bertogli (albertito@gmail.com)
+
+
+Introduction
+============
+
+nmdb_ is a simple and fast cache and database for TIPC clusters. It allows
+applications in the cluster to use a centralized, shared cache and database in
+a very easy way. It stores *(key, value)* pairs, with each key having only one
+associated value.
+
+This document explains how the server works internally, and why it works that
+way.
+
+
+Network interface
+=================
+
+The server communicates with its clients using TIPC_ messages, which are
+limited to 66000 bytes in size. It's completely connectionless, and uses the
+reliable datagram layer provided by TIPC_. The network protocol is specified
+in another document, and will not be the subject of analysis here.
+
+The interaction is very simple: the client sends a request message for an
+action, and the server replies to it. There is only one message per request,
+and only one message per reply. This imposes a limit on the size of keys and
+values: each key or value must be shorter than 64Kb, and the sum of their
+sizes must be shorter than 64Kb too.
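+
+For illustration, these limits could be checked before building a message. A
+minimal sketch in C (the constant and helper names are assumptions, not part
+of the actual source)::
+
+  #include <stddef.h>
+
+  /* Hypothetical check for the 64Kb limits described above. */
+  #define NMDB_MAX_SIZE (64 * 1024)
+
+  static int sizes_ok(size_t ksize, size_t vsize)
+  {
+          return ksize < NMDB_MAX_SIZE && vsize < NMDB_MAX_SIZE &&
+                  ksize + vsize < NMDB_MAX_SIZE;
+  }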
+
+There are several requests that can be made to the server:
+
+get *key*
+  Retrieves the value for the given key. If the key is in the cache, it
+  returns immediately. If not, it performs a query in the database (see the
+  sketch after this list).
+
+set_async *key* *value*
+  Stores the *(key, value)* pair in the database. It does the set in the cache,
+  queues the operation for the database, and returns.
+
+del_async *key*
+  Removes the key and its associated value from the database. It does the del
+  in the cache, queues the operation for the database, and returns.
+
+set_sync *key* *value*
+  Like *set_async*, but returns only after the database has completed the
+  operation.
+
+del_sync *key*
+  Like *del_async*, but returns only after the database has completed the
+  operation.
+
+cache_get *key*
+  Like *get*, but only affects the cache and not the database. If the key is
+  not in the cache, returns a special value indicating "miss".
+
+cache_set *key* *value*
+  Like *set*, but only affects the cache and not the database.
+
+cache_del *key*
+  Like *del*, but only affects the cache and not the database.
+
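+For instance, the *get* path can be pictured as a cache lookup with a
+database fallback. A rough sketch, where all the names are illustrative and
+not the actual nmdb internals::
+
+  /* Hypothetical server-side get path: try the cache first, and only
+   * query the database on a miss. */
+  static void *do_get(struct cache *c, struct db *d,
+                      const void *key, size_t ksize, size_t *vsize)
+  {
+          void *val = cache_get(c, key, ksize, vsize);
+
+          if (val == NULL)
+                  val = db_get(d, key, ksize, vsize);
+          return val;
+  }
+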
+
+As you can see, it's possible to operate exclusively with the cache, ignoring
+the database completely. This is very similar to what memcached_ does. The
+downside is that it's possible to mess with the cache and leave it out of
+sync with the database. That can only happen if you mix *cache_set* with
+*set_async* or *set_sync*, which is hard to do by accident, so it's unlikely
+to happen unintentionally.
+
+The server assumes you have a brain, and that you will not make a mess out
+of it.
+
+
+Request processing
+==================
+
+The server consists of two threads: the main thread (or event thread), and the
+database thread.
+
+The main thread is event-based, using libevent_ for network polling, and
+acting on incoming messages. Each message goes through an initial decoding
+stage, and then depending on the requested command, different functions are
+invoked, all of which go through the same basic steps.
+
+The database thread waits on an operation queue for operations to perform. The
+operations are added by the main thread, and removed by the database thread.
+When an operation appears, it processes it by invoking the corresponding
+database functions, and goes back to wait. This is completely synchronous, and
+all the operations are processed in order.
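+
+A minimal sketch of that loop, using the condition variable described later
+in this document (names and structure are illustrative, not the actual nmdb
+code)::
+
+  #include <pthread.h>
+
+  /* Illustrative main loop of the database thread; struct queue,
+   * queue_isempty(), queue_pop(), db_process() and send_reply() are
+   * assumed helpers. */
+  void *db_thread(void *arg)
+  {
+          struct queue *q = arg;
+
+          for (;;) {
+                  pthread_mutex_lock(&q->lock);
+                  while (queue_isempty(q))
+                          pthread_cond_wait(&q->cond, &q->lock);
+                  struct operation *op = queue_pop(q);
+                  pthread_mutex_unlock(&q->lock);
+
+                  db_process(op);         /* backend get/set/del */
+                  if (op->sync)
+                          send_reply(op); /* sync ops are replied here */
+          }
+  }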
+
+After a request has been received, the steps performed to execute it are:
+
+#. Do some basic sanity checks while parsing the message.
+#. Act upon the cache. If it's a cache operation, send the reply and it's done.
+#. Queue the request in the operation queue.
+#. If the operation is asynchronous, send the reply and it's done.
+#. If not, signal the database thread that it has work to do, and it's done.
+
+
+Note that for asynchronous operations the signal is not sent. This is because
+sending the signal is a relatively expensive operation, and avoiding it
+reduces latency. It also means that the database might take a while to notice
+the operation, but that's not usually a problem (after all, that's why the
+request was asynchronous in the first place). The signalling method chosen is
+a condition variable, on which the database thread waits if the queue is
+empty.
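+
+The producer side can be pictured like this (a sketch using the same
+hypothetical names as above)::
+
+  /* Illustrative enqueueing of an operation by the main thread; only
+   * synchronous operations pay the cost of the wakeup signal. */
+  void queue_op(struct queue *q, struct operation *op)
+  {
+          pthread_mutex_lock(&q->lock);
+          queue_push(q, op);
+          pthread_mutex_unlock(&q->lock);
+
+          if (op->sync)
+                  pthread_cond_signal(&q->cond);
+  }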
+
+While some operations are asynchronous, they are always processed in order. If
+an application issues two operations in a row, they're guaranteed to be
+completed in order. This avoids "get after del/set" issues that would
+complicate the application using the database.
+
+
+The database thread
+===================
+
+There is only one database thread, which makes the overall design much simpler
+(there are no races between different operations, as they're all executed in
+order), but reduces the potential synchronous performance.
+
+This trade-off was chosen on the basis that most applications will use
+asynchronous operations, and most DBMs do not support multithreaded
+operation. A backend-specific solution could have been adopted, and the
+database backend code is isolated enough to allow that to happen in the
+future if the necessity arises.
+
+QDBM_ was chosen as the backend because of its speed, although most DBMs
+would have been equally suitable feature-wise, since the server is not at all
+demanding.
+
+The processing is performed by taking requests from the aforementioned queue,
+and acting upon the database accordingly, which involves calling the backend's
+get, set or del depending on the operation in question. Then, if the operation
+was synchronous, a response is sent to the client.
+
+As mentioned in the previous section, a condition variable is used for
+notification. When the queue is empty, the thread waits on it until the main
+thread wakes it up. This provides low-latency wakeups when necessary
+(synchronous operations, especially *get*, which is quite common), and very
+low CPU usage when the database is idle.
+
+
+Passive mode
+============
+
+The server has a special mode, *passive mode*, where it listens to requests,
+acts upon them internally, but never sends any replies. It is used for
+redundancy purposes, allowing the administrator to have an up-to-date copy of
+the database in case the main one fails.
+
+The implementation is quite simple, because the code paths are exactly the
+same, except that the network replies are skipped; they are sent
+conditionally, depending on the passive setting.
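+
+In code, this boils down to guarding every reply with the mode flag; a
+hypothetical sketch::
+
+  /* Illustrative: a passive server processes the request but stays
+   * quiet; the names are assumptions, not the actual nmdb code. */
+  if (!passive_mode)
+          send_reply(req, REP_OK, payload, psize);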
+
+Live switching of a server from passive to active (and vice-versa) should be
+possible, although it is not yet implemented.
+
+
+The cache layer
+===============
+
+The cache layer is implemented with a modified hash table, in order to make
+eviction efficient and cheap.
+
+The hash table is quite normal: several buckets (the size is decided at
+initialization time), each of which contains a linked list with the objects
+assigned to it.
+
+There are some tricks, though (see the sketch below):
+
+- In order to keep a bound on the number of objects in the cache, the number
+  of elements in each linked list is limited to 4.
+- Whenever a lookup is made, the entry that matched is promoted to the head of
+  the list containing it.
+- When inserting a new element in the cache, it's always inserted to the top
+  of the list, as its first element.
+- When the number of elements in a list exceeds that limit, the bottom one is
+  removed.
+
+This causes a natural *LRU* behaviour on each list, which is quite desirable for
+a cache of this kind. The size of the linked lists was chosen to be short
+enough to keep lookups fast, but long enough for the *LRU* mechanism to be
+useful.
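+
+A minimal sketch of a bucket lookup with this promotion policy, where the
+types and names are illustrative and not the actual nmdb structures
+(insertion at the head and eviction of the bottom element follow the same
+pattern)::
+
+  #include <string.h>     /* memcmp() */
+
+  #define BUCKET_MAX 4    /* per-bucket list length limit */
+
+  struct entry {
+          void *key, *val;
+          size_t ksize, vsize;
+          struct entry *next;
+  };
+
+  /* Look up a key in a bucket's list; on a hit, promote the entry to
+   * the head so hot objects stay away from the eviction end. */
+  static struct entry *bucket_get(struct entry **head,
+                                  const void *key, size_t ksize)
+  {
+          struct entry *e, *prev = NULL;
+
+          for (e = *head; e != NULL; prev = e, e = e->next) {
+                  if (e->ksize == ksize &&
+                      memcmp(e->key, key, ksize) == 0) {
+                          if (prev != NULL) {
+                                  prev->next = e->next;
+                                  e->next = *head;
+                                  *head = e;
+                          }
+                          return e;
+                  }
+          }
+          return NULL;    /* miss */
+  }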
+
+If two "hot" objects end up in the same bucket, the cache still behaves
+properly, because the chances of them being evicted by a third "cold" object
+are pretty low. Under stress, cold objects move to the bottom of the list
+fast, so the cache does not misbehave easily.
+
+This makes the choice of inserting new objects to the top an easy one. In
+other cache implementations, adding new objects as "hot" is dangerous because
+it might be easy for them to cause unwanted evictions; but on the other hand
+some workloads perform better if the new entries are well ranked. Here, due
+to the short list size, it's quite difficult for a new entry to cause a hot
+object to be evicted, so it's not a problem.
+
+Nonetheless, it's advisable to use a large cache size, especially if the usage
+pattern involves handling lots of different keys.
+
+
+.. _nmdb: http://auriga.wearlab.de/~alb/nmdb/
+.. _libevent: http://www.monkey.org/~provos/libevent/
+.. _TIPC: http://tipc.sf.net
+.. _memcached: http://www.danga.com/memcached/
+.. _QDBM: http://qdbm.sf.net
+
diff --git a/doc/network.rst b/doc/network.rst
new file mode 100644
index 0000000..78b26a1
--- /dev/null
+++ b/doc/network.rst
@@ -0,0 +1,144 @@
+
+======================
+nmdb_ Network Protocol
+======================
+:Author: Alberto Bertogli (albertito@gmail.com)
+
+**NOTE:** All integers are in network byte order.
+
+
+Requests
+========
+
+All requests begin with a common header, and then have a request-specific
+payload. They look like this::
+
+  +-----+------------+------------------+--- - - - ---+
+  | Ver | Request ID |   Request code   |   Payload   |
+  +-----+------------+------------------+--- - - - ---+
+
+Where the fields are:
+
+Ver
+  Version of the protocol. 4 bits. Must be 1.
+Request ID
+  A number identifying the request. A request is uniquely represented by the
+  *(ID, sender)* pair, where *sender* is the sender host information. IDs can
+  be reused once a matching response has reached the sender. 28 bits.
+Request code
+  The code of the operation to be performed. They're defined in the server
+  source code. 32 bits.
+Payload
+  The payload is specific to the request code. Some requests can carry no
+  associated payload.
+
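+For illustration, the header can be packed into two 32-bit words, assuming
+the version occupies the high 4 bits of the first one (a sketch, not the
+actual client code)::
+
+  #include <arpa/inet.h>  /* htonl() */
+  #include <stdint.h>
+  #include <string.h>
+
+  /* Illustrative packing of the common request header. */
+  static size_t put_header(unsigned char *buf, uint32_t id, uint32_t code)
+  {
+          uint32_t ver_id = htonl((1u << 28) | (id & 0x0FFFFFFFu));
+          uint32_t c = htonl(code);
+
+          memcpy(buf, &ver_id, 4);
+          memcpy(buf + 4, &c, 4);
+          return 8;
+  }
+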
+
+Request codes
+-------------
+
+The following table was taken from the server source, which should be the
+authoritative source of information. The codes are included here just for
+completeness.
+
+============== ======
+     Name       Code
+============== ======
+REQ_CACHE_GET  0x101
+REQ_CACHE_SET  0x102
+REQ_CACHE_DEL  0x103
+REQ_GET        0x104
+REQ_SET_SYNC   0x105
+REQ_DEL_SYNC   0x106
+REQ_SET_ASYNC  0x107
+REQ_DEL_ASYNC  0x108
+============== ======
+
+
+Request payload formats
+-----------------------
+
+REQ_GET and REQ_CACHE_GET
+  These requests have the same payload format, and differ only in the code.
+  First the key size (32 bits), and then the key.
+REQ_SET_* and REQ_CACHE_SET
+  Like the previous requests, they share the payload format. First the key
+  size (32 bits), then the value size (32 bits), then the key, and then the
+  value.
+REQ_DEL_* and REQ_CACHE_DEL
+  You guessed it, they share the payload format too: first the key size (32
+  bits), and then the key.
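+
+As an example, a complete *REQ_GET* message could be built on top of the
+hypothetical *put_header()* helper sketched earlier::
+
+  /* Illustrative: header, then key size (network byte order), then the
+   * key itself. */
+  static size_t build_req_get(unsigned char *buf, uint32_t id,
+                              const void *key, uint32_t ksize)
+  {
+          size_t len = put_header(buf, id, 0x104);        /* REQ_GET */
+          uint32_t ks = htonl(ksize);
+
+          memcpy(buf + len, &ks, 4);
+          memcpy(buf + len + 4, key, ksize);
+          return len + 4 + ksize;
+  }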
+
+
+Replies
+=======
+
+Replies begin with the ID they correspond to, then a reply code, and then a
+reply-specific payload. They look like this::
+
+  +------------+------------------+--- - - - ---+
+  | Request ID |    Reply code    |   Payload   |
+  +------------+------------------+--- - - - ---+
+
+Where the fields are:
+
+Request ID
+  The request ID this reply corresponds to. 32 bits.
+Reply code
+  The code of the reply. They're defined in the server source code. 32 bits.
+Payload
+  The payload is specific to the reply code. Some replies can carry no
+  associated payload.
+
+
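+Symmetrically to requests, the fixed part of a reply can be read with a
+small helper; an illustrative sketch, reusing the includes from the request
+examples::
+
+  /* Illustrative: extract the request ID and the reply code. */
+  static void parse_reply(const unsigned char *buf,
+                          uint32_t *id, uint32_t *code)
+  {
+          uint32_t t;
+
+          memcpy(&t, buf, 4);
+          *id = ntohl(t);
+          memcpy(&t, buf + 4, 4);
+          *code = ntohl(t);
+  }
+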
+
+Reply codes
+-----------
+
+The following table was taken from the server source, which should be the
+authoritative source of information. The codes are included here just for
+completeness.
+
+================ ======
+      Name        Code
+================ ======
+REP_ERR          0x800
+REP_CACHE_HIT    0x801
+REP_CACHE_MISS   0x802
+REP_OK           0x803
+REP_NOTIN        0x804
+================ ======
+
+
+Reply payload formats
+---------------------
+
+REP_ERR
+  The payload is a 32-bit error code, according to the table below.
+REP_CACHE_MISS and REP_NOTIN
+  These replies have no payload.
+REP_CACHE_HIT
+  The first 32 bits are the value size, then the value.
+REP_OK
+  Depending on the request, this reply does or doesn't have an associated
+  payload. For *REQ_SET_SYNC*, *REQ_SET_ASYNC*, *REQ_DEL_SYNC* and
+  *REQ_DEL_ASYNC* there is no payload; but for *REQ_GET* the first 32 bits
+  are the value size, and then the value.
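+
+Where a value is present (*REP_CACHE_HIT*, or *REP_OK* for a *REQ_GET*), it
+can be extracted as in this sketch::
+
+  /* Illustrative: the payload is a 32-bit size followed by the value. */
+  static const unsigned char *get_value(const unsigned char *payload,
+                                        uint32_t *vsize)
+  {
+          uint32_t t;
+
+          memcpy(&t, payload, 4);
+          *vsize = ntohl(t);
+          return payload + 4;
+  }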
+
+
+Reply error codes
+-----------------
+
+============ ====== =========================
+    Name      Code         Description
+============ ====== =========================
+ERR_VER      0x101  Version mismatch
+ERR_SEND     0x102  Error sending data
+ERR_BROKEN   0x103  Broken request
+ERR_UNKREQ   0x104  Unknown request
+ERR_MEM      0x105  Memory allocation error
+ERR_DB       0x106  Database error
+============ ====== =========================
+
+
+.. _nmdb: http://auriga.wearlab.de/~alb/nmdb/
+