Introduction
------------

The networking interface to the locking daemon is lightweight and simple,
based on small requests and responses that represent the locking primitives.

We use the term 'server' to refer to the locking daemon, and 'client' to
refer to the applications that use it through the network.

All integers sent over the network are in network byte order.

This document describes version 1 of the protocol.


Principles of operation
-----------------------

The client opens a connection to the server and talks to it by making
requests and waiting for the server to respond.

Note that if a lock is granted, the connection must be kept open so that the
server can notice when a client crashes. When a socket that holds object
locks is closed, those locks become "orphans", and if they are not "adopted"
within a given timeframe by some other client, they are automatically
unlocked.

This is done to support fault tolerance, with two or more servers working at
the same time and holding the same locks, but without knowing who owns them:
if one crashes, the client holding the lock can go to the second server and
adopt it. The same goes for clients: if a client crashes while holding a
lock, a second one can step up and adopt its locks.

As with normal threads, nobody 'owns' a lock, so anybody can unlock somebody
else's lock; however, while a lock is held, we do keep track of the
connection holding it, for the reasons given above.

Network operations are divided into requests and responses, the former sent
by the client and the latter by the server. They can be batched, but they
will be processed in order.

If an operation would block (for instance, a lock acquire request), an ACK is
sent to mark that the command was received, and when the lock becomes
available another response is sent. In the meantime, other commands can be
sent and will be processed while the blocking operation is pending.

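The orphan and adoption rules described above can be illustrated with a small
in-memory model. This is only a sketch and not the daemon's implementation:
the class name LockTable, its method names, the 5 second timeout and the
choice to let a release also clear an orphan are all assumptions made for the
example.

```python
import time


class LockTable:
    """Toy model of lock holding, orphaning and adoption (hypothetical)."""

    def __init__(self, orphan_timeout=5.0, clock=time.monotonic):
        self.orphan_timeout = orphan_timeout  # assumed value, not from the spec
        self.clock = clock
        self.holders = {}  # object name -> id of the connection holding it
        self.orphans = {}  # object name -> time at which it is auto-unlocked

    def expire(self):
        """Unlock orphans that nobody adopted in time."""
        now = self.clock()
        for obj in [o for o, deadline in self.orphans.items() if deadline <= now]:
            del self.orphans[obj]

    def acquire(self, obj, conn):
        """Grant the lock only if it is neither held nor orphaned."""
        self.expire()
        if obj in self.holders or obj in self.orphans:
            return False
        self.holders[obj] = conn
        return True

    def release(self, obj):
        """Anybody may release a lock; returns False if it was not locked."""
        self.expire()
        if obj in self.holders:
            del self.holders[obj]
            return True
        # Assumption: releasing an orphan is allowed as well.
        return self.orphans.pop(obj, None) is not None

    def connection_closed(self, conn):
        """Turn every lock held by a closed connection into an orphan."""
        deadline = self.clock() + self.orphan_timeout
        for obj in [o for o, c in self.holders.items() if c == conn]:
            del self.holders[obj]
            self.orphans[obj] = deadline

    def adopt(self, obj, conn):
        """Hand an orphan lock over to a new connection."""
        self.expire()
        if obj not in self.orphans:
            return False
        del self.orphans[obj]
        self.holders[obj] = conn
        return True
```

Passing a fake clock makes the timeout easy to exercise: after
connection_closed(), the object stays locked until it is adopted or the
deadline passes, whichever comes first.
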
Inter-server communication
--------------------------

In order to support fault tolerance and distributed operation, servers
communicate through the network using the same protocol as the clients, with
some additional operations.

All servers are peers, and while we discuss only a two-server setup here,
there is no essential difference in using many of them.

For normal operations, servers consult each other before taking decisions,
like granting or releasing locks: this is done using the same commands
clients use, and in fact, to the consulted server, the other server looks
like a normal client and is processed that way.

This, while it might sound strange, is beautifully simple: a server (let's
call it S1) acts as a client of another server (S2), consulting it before
taking a lock and notifying it of releases. To S2, S1 is just another client
that acquires and releases locks. If S1 crashes, its locked objects become
orphans in S2, as explained before, and the client can go there and adopt
them. Of course, as they are peers, the same goes the other way.

Now, what happens when S1 comes back after a crash? It sends a SYNC command
to S2, telling it that it wants to synchronize its locks, so S2 gives it a
list of the objects it has locked, and from then on they talk as usual.

The main problem is when the communication between servers is lost but
neither of them crashes: clients can still talk to them, but they can't talk
to each other. This is worse than both crashing, because it means that two
clients can grab the same lock. It's also the hardest case to fix, and this
is why most fault-tolerant systems have unusual designs and dedicated backend
connections.

What we do in this case is called 'The Great Poncio': we delegate the
decision to another program chosen (and probably written) by the
administrator, who is the one with knowledge about the network layout and how
to determine the best course of action.
Much like in ancient Roman times, it decides whether the server lives or
dies.


Protocol specification
----------------------

The main structure of the command header is:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |Version|    Op. Code   |                 Length                |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Overall, the header takes 32 bits: 4 for the version, 8 for the operation
code, and 20 for the payload length.

After the header, a payload of the given length follows (a length of 0 means
there is no payload); its content depends on the operation, but in most cases
it is a string that uniquely identifies the object to lock.

The maximum payload size is (2^20 - 1) = 1048575 bytes, just under 1 MiB,
which is much more than we need.

The following operations are defined:

    Name                Code    Description
    REQ_ACQ_LOCK           1    Request to acquire a lock
    REQ_REL_LOCK           2    Request to release a lock
    REQ_TRY_LOCK           3    Request to attempt to acquire a lock
    REQ_PING               4    Ping request
    REQ_ADOPT              5    Request to adopt an orphan lock
    REQ_SYNC               6    Synchronize request for inter-server operation
    REP_LOCK_ACQUIRED    128    The lock has been acquired
    REP_LOCK_WBLOCK      129    The operation would block
    REP_LOCK_RELEASED    130    The lock has been released
    REP_PONG             131    Ping response
    REP_ACK              132    Generic acknowledge response
    REP_ERR              133    Generic error response
    REP_SYNC             134    Synchronize reply for inter-server operation

For the following operations:

    REQ_ACQ_LOCK
    REQ_REL_LOCK
    REQ_TRY_LOCK
    REQ_ADOPT
    REP_ACK (in response to the four previous requests)
    REP_ERR (in response to the four previous requests)
    REP_LOCK_ACQUIRED
    REP_LOCK_WBLOCK
    REP_LOCK_RELEASED

the payload is a null-terminated string that uniquely identifies the object
involved.

For REQ_SYNC, no payload is used, and the REP_SYNC payload contains the
null-terminated strings of all the locked objects, one after another
(e.g. "obj 1\0obj 2\0obj3\0").

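As an illustration of the header layout and payload formats above, here is a
minimal Python sketch of packing and unpacking messages. The function names
are hypothetical. It assumes that the version sits in the four most
significant bits of the 32-bit word (reading the diagram in the usual
bit-0-is-most-significant convention) and that the length field counts the
terminating null byte; neither detail is stated explicitly in this document.

```python
import struct

PROTO_VERSION = 1
MAX_PAYLOAD = (1 << 20) - 1  # 20-bit length field

REQ_ACQ_LOCK = 1
REP_SYNC = 134


def pack_header(op, length, version=PROTO_VERSION):
    """Build the 4-byte header: 4 bits version, 8 bits op code, 20 bits length."""
    if not 0 <= version < 16:
        raise ValueError("version does not fit in 4 bits")
    if not 0 <= op < 256:
        raise ValueError("op code does not fit in 8 bits")
    if not 0 <= length <= MAX_PAYLOAD:
        raise ValueError("payload too long")
    word = (version << 28) | (op << 20) | length
    return struct.pack("!I", word)  # "!" selects network byte order


def unpack_header(data):
    """Return (version, op, length) from the first 4 bytes of data."""
    (word,) = struct.unpack("!I", data[:4])
    return word >> 28, (word >> 20) & 0xFF, word & 0xFFFFF


def encode_message(op, name):
    """Header plus a null-terminated object name (name is a bytes object)."""
    payload = name + b"\0"
    return pack_header(op, len(payload)) + payload


def split_sync_payload(payload):
    """Split a REP_SYNC payload into the list of locked object names."""
    if not payload:
        return []
    if not payload.endswith(b"\0"):
        raise ValueError("payload is not null-terminated")
    return payload[:-1].split(b"\0")
```

For example, encode_message(REQ_ACQ_LOCK, b"obj 1") produces a 4-byte header
followed by the 6-byte payload b"obj 1\0", and split_sync_payload() applied
to the REP_SYNC example above returns the three object names.
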
For REQ_PING the payload can be any arbitrary string, which is returned
untouched in REP_PONG.

The possible responses to commands are:

    Command         Valid responses
    REQ_ACQ_LOCK    REP_ACK, REP_LOCK_ACQUIRED
    REQ_REL_LOCK    REP_LOCK_RELEASED
    REQ_TRY_LOCK    REP_LOCK_ACQUIRED, REP_LOCK_WBLOCK
    REQ_PING        REP_PONG
    REQ_ADOPT       REP_ACK
    REQ_SYNC        REP_SYNC

Note that REP_ERR is always a valid response, and it always indicates an
error in the request (e.g. if you tried to unlock an object that wasn't
locked).
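
A client can use the table above to sanity-check what it receives. The
following is a minimal sketch; the dictionary and the helper
is_valid_response() are hypothetical names introduced for this example, and
operations are identified by name rather than by numeric code to keep it
short.

```python
# Valid responses per request, copied from the table above.
VALID_RESPONSES = {
    "REQ_ACQ_LOCK": {"REP_ACK", "REP_LOCK_ACQUIRED"},
    "REQ_REL_LOCK": {"REP_LOCK_RELEASED"},
    "REQ_TRY_LOCK": {"REP_LOCK_ACQUIRED", "REP_LOCK_WBLOCK"},
    "REQ_PING": {"REP_PONG"},
    "REQ_ADOPT": {"REP_ACK"},
    "REQ_SYNC": {"REP_SYNC"},
}


def is_valid_response(request, response):
    """Check whether a response is acceptable for the given request."""
    if response == "REP_ERR":
        # REP_ERR is always a valid response.
        return True
    return response in VALID_RESPONSES.get(request, set())
```

Note that for REQ_ACQ_LOCK both responses may arrive for a single request:
first REP_ACK if the operation would block, and REP_LOCK_ACQUIRED later, once
the lock becomes available.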