Wednesday, November 29, 2006

Why another database manager?

A friend asked me recently why I spend my time implementing a DBMS when there are already a number of open source databases. It is a good question, because clearly I am not breaking any new ground here. In fact, I find myself incapable of inventing great new technology. Most of what I am implementing is well known stuff. I am a software engineer rather than a scientist, going by C.A.R. Hoare's definitions of engineers and scientists. It seems that this project is pure self-indulgence, when I could be spending my time more fruitfully, either contributing to projects like Apache Derby, or working on something more relevant.

I guess that if I am honest with myself, I have to admit that there is an element of self-indulgence here. But there is also some utility. The knowledge I gain in the process benefits me in my work life. Apart from that, I think that my DBMS implementation is better documented and easier to understand than other open source implementations. This is for a couple of reasons:
  1. I use well established algorithms, which are well documented in the computer science literature. I am also putting more effort into documentation than is typical of many open source projects.
  2. The system is decomposed into well defined modules which are loosely coupled. I find most other implementations are far more integrated, and therefore difficult to understand. I have traded off performance and efficiency in favour of ease of understanding.
There are still no books that describe how to build a real DBMS, and also show you with real code how DBMS features such as transactions, locking, recovery, btrees, etc. work. The only book that comes close is Transaction Processing: Concepts and Techniques, but the sample code contained in this book is not complete. In some ways, my project provides a sample implementation of many of the techniques described in this book.

Finally, there is the question of pride. When I started this project many years ago in C++, I never thought I could do it. I had no training in this field, and no access to people who do this type of stuff at work. Having come this far, it seems a shame to give up.

Tuesday, November 28, 2006

Premature Optimisation

Here is an example of C.A.R. Hoare's statement that premature optimization is the root of all evil.

In implementing the Lock Manager in SimpleDBM, I was concerned about the impact thread synchronisation would have on the scalability of the Lock Manager. The Lock Manager uses a hash table to speed up lookups of the locks. Many implementations of Lock Managers synchronize the entire hash table while manipulating locks. There is an issue within the Apache Derby project discussing the scalability problems posed by such global synchronisation. The SimpleDBM implementation does not use a global lock. Instead it uses a custom hash table, where every hash bucket contains a ReentrantLock. The hash bucket lock is acquired before any traversal or modification is performed on the bucket's chain. I went even further and put a ReentrantLock into each Lock Header object. A Lock Header is required for each active lock in the system. The Lock Header maintains a queue of lock requests (clients) for the lock. Lock Headers are the objects that get put on the hash chains.
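The per-bucket locking scheme described above can be sketched roughly as follows. This is an illustrative reconstruction, not SimpleDBM's actual code; the class and method names are invented for the example.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch (not SimpleDBM's actual code) of a lock table in which
// each hash bucket carries its own ReentrantLock, so threads that hash to
// different buckets never contend on a global latch.
public class StripedLockTable {

    // A Lock Header: one per actively locked object, chained off a bucket.
    static final class LockHeader {
        final Object lockable;          // the thing being locked
        LockHeader next;                // hash chain link
        LockHeader(Object lockable) { this.lockable = lockable; }
    }

    static final class Bucket {
        final ReentrantLock latch = new ReentrantLock();
        LockHeader chain;               // head of the hash chain
    }

    private final Bucket[] buckets;

    StripedLockTable(int size) {
        buckets = new Bucket[size];
        for (int i = 0; i < size; i++) buckets[i] = new Bucket();
    }

    private Bucket bucketFor(Object lockable) {
        return buckets[(lockable.hashCode() & 0x7fffffff) % buckets.length];
    }

    // Find or create the header for a lockable, holding only the bucket latch.
    LockHeader findOrAdd(Object lockable) {
        Bucket b = bucketFor(lockable);
        b.latch.lock();
        try {
            for (LockHeader h = b.chain; h != null; h = h.next) {
                if (h.lockable.equals(lockable)) return h;
            }
            LockHeader h = new LockHeader(lockable);
            h.next = b.chain;
            b.chain = h;
            return h;
        } finally {
            b.latch.unlock();
        }
    }
}
```

In the scheme described in the post, each LockHeader would additionally carry its own ReentrantLock, so the bucket latch can be dropped as soon as the header is found.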

The idea was that the hash bucket lock could be released early by obtaining a lock on the Lock Header, once the Lock Header had been found. This would increase concurrency by reducing the duration for which the hash bucket needs to be kept locked.

In my attempt to further reduce locking, I tried to find ways of reducing the time for which the hash bucket lock was held. One such optimisation involved setting a field in the Lock Header to null, rather than manipulating the hash bucket chain, when releasing a lock. This code worked fine until the day I tested it in a multi-processor environment (a Core Duo), when one of my test cases, designed to stress concurrent operations on the same hash bucket, started failing. In attempting to optimise, I had broken the code. The optimisation was flawed because one section of the code modified the field in the Lock Header while holding only the Lock Header lock, while another section searched the hash bucket while holding only the hash bucket lock, and that search inspected the very field being updated.
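The flawed protocol can be reconstructed roughly like this (a hypothetical sketch with invented names, not the real SimpleDBM code). Because the release path and the search path synchronize on different locks, there is no happens-before edge between the write and the read, so the search may see a stale value on a multi-processor machine.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical reconstruction of the flawed optimisation (names invented for
// illustration): release marks the header unused under the *header* lock,
// while a concurrent search reads that field under the *bucket* lock. The two
// paths never synchronize on the same lock, so the write is not guaranteed
// to be visible to the reader.
public class FlawedRelease {

    static final class LockHeader {
        final ReentrantLock headerLock = new ReentrantLock();
        Object lockable;                 // null means "slot free" -- the shortcut
        LockHeader next;
    }

    final ReentrantLock bucketLock = new ReentrantLock();
    LockHeader chain;

    // Release path: avoids touching the chain, but holds only the header lock.
    void release(LockHeader h) {
        h.headerLock.lock();
        try {
            h.lockable = null;           // BUG: not ordered with the search below
        } finally {
            h.headerLock.unlock();
        }
    }

    // Search path: holds only the bucket lock, so it may see a stale value
    // of h.lockable written by release() on another processor.
    LockHeader find(Object lockable) {
        bucketLock.lock();
        try {
            for (LockHeader h = chain; h != null; h = h.next) {
                if (lockable.equals(h.lockable)) return h;
            }
            return null;
        } finally {
            bucketLock.unlock();
        }
    }
}
```

Single-threaded, the code behaves correctly, which is exactly why the bug only surfaced under concurrent stress on a multi-processor machine.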

A lesson from this is that multi-threaded code must be tested on a multi-processor machine.

Anyway, when I hit the bug described above, I decided it was time to revisit the thread synchronisation code in the Lock Manager. I realised that my tuning efforts had been attacking the wrong problem: there were bigger scalability issues that needed fixing than the one I was trying to solve.

The first problem was memory usage. I was using too many ReentrantLocks: one per hash bucket, plus one per Lock Header. In a system with 10,000 concurrent active locks, the total number of ReentrantLocks would be 10,000 plus the number of buckets in the hash table. If I wanted to make the Lock Manager scalable, I needed to reduce the amount of memory used per lock.

The second problem was that the hash table was not dynamic. Its size was fixed when the Lock Manager was created. This was likely to cause severe performance and scalability problems due to hash collisions. It would be virtually impossible to choose the correct hash table size statically, and an undersized table would cause more hash collisions, leading to long hash chains and greater contention between threads accessing the same bucket.

I decided that locking at the level of Lock Headers was not giving me enough bang for the buck, whereas if I made the hash table dynamic and used Java's built-in monitors instead of ReentrantLocks to synchronise the buckets, the system's overall efficiency and scalability would be greater than what the current code was achieving. The increase in the number of hash buckets would reduce hash collisions and thereby lock contention amongst threads. The reduction in the number of ReentrantLocks would reduce memory usage and therefore make the Lock Manager more scalable. The refactored code would also be simpler and easier to maintain.
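The direction of the refactoring might look something like the sketch below. This is purely illustrative (invented names, naive resizing policy), not SimpleDBM's actual code; in particular, a real implementation would have to coordinate resizing carefully with in-flight bucket operations.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the refactored direction: buckets are plain objects
// guarded by their built-in monitors, so no ReentrantLock is allocated per
// bucket or per Lock Header. Growth is shown naively and is not safe under
// full concurrency -- a real implementation must coordinate resizing with
// concurrent bucket operations.
public class MonitorLockTable {

    static final class Bucket {
        final List<Object> chain = new ArrayList<>();
    }

    private volatile Bucket[] buckets;
    private int count;

    MonitorLockTable(int initialSize) {
        buckets = newTable(initialSize);
    }

    private static Bucket[] newTable(int size) {
        Bucket[] t = new Bucket[size];
        for (int i = 0; i < size; i++) t[i] = new Bucket();
        return t;
    }

    void add(Object lockable) {
        Bucket[] t = buckets;
        Bucket b = t[(lockable.hashCode() & 0x7fffffff) % t.length];
        synchronized (b) {                  // intrinsic monitor, no extra lock object
            b.chain.add(lockable);
        }
        maybeGrow();
    }

    boolean contains(Object lockable) {
        Bucket[] t = buckets;
        Bucket b = t[(lockable.hashCode() & 0x7fffffff) % t.length];
        synchronized (b) {
            return b.chain.contains(lockable);
        }
    }

    // Naive growth policy: double the table when the average chain length
    // exceeds one, rehashing every entry into the new buckets.
    private synchronized void maybeGrow() {
        count++;
        Bucket[] t = buckets;
        if (count <= t.length) return;
        Bucket[] bigger = newTable(t.length * 2);
        for (Bucket b : t) {
            synchronized (b) {
                for (Object o : b.chain) {
                    bigger[(o.hashCode() & 0x7fffffff) % bigger.length].chain.add(o);
                }
            }
        }
        buckets = bigger;                   // volatile publish of the new table
    }
}
```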

Thursday, November 16, 2006

Lock Manager updates

SimpleDBM's Lock Manager is based upon the algorithms described in Transaction Processing: Concepts and Techniques. Until now, the Lock Manager has had no support for deadlock detection; a simple timeout mechanism was used to abort transactions caught in a deadlock. Yesterday I decided to start implementing a deadlock detector. While researching this, I discovered a problem in the way rollback to a savepoint is implemented by the Transaction Manager. When rolling back to a savepoint, any locks acquired after the savepoint are released. However, any locks released after the savepoint are not re-acquired. There are times when locks can be released early; for example, in cursor stability mode, the lock on the current row is released when the cursor moves to the next row. When rolling back to a savepoint, we need to restore cursors to their state as at the time of the savepoint, and ensure that they reacquire locks on their current rows. Since SimpleDBM does not have cursors yet, this is not a problem right now.

The difficulty is how to implement this type of interaction between cursors and the Transaction Manager. It seems to me that the Transaction Manager needs to know which cursors are being used by a transaction, and for each cursor, save the current row in the savepoint. When the transaction rolls back to a savepoint, the saved information can be used to reposition the cursors on the correct rows and reacquire locks. Since I want to avoid this type of close dependency between the Transaction Manager and other modules, it will most likely be necessary to implement the Observer pattern.
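One way the Observer-style hookup could work is sketched below. All names here are hypothetical, not SimpleDBM's actual API: the transaction sees cursors only through a narrow observer interface, so the Transaction Manager stays decoupled from the cursor implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (invented names): cursors register themselves with the
// transaction as observers; a savepoint snapshots each cursor's state, and
// rollback hands the snapshot back so the cursor can reposition itself and
// reacquire its row lock.
interface CursorObserver {
    Object saveState();                  // snapshot position (and lock info)
    void restoreState(Object state);     // reposition and reacquire the row lock
}

class Savepoint {
    final Map<CursorObserver, Object> cursorStates = new HashMap<>();
}

class Transaction {
    private final List<CursorObserver> cursors = new ArrayList<>();

    void register(CursorObserver cursor) { cursors.add(cursor); }

    Savepoint createSavepoint() {
        Savepoint sp = new Savepoint();
        for (CursorObserver c : cursors) {
            sp.cursorStates.put(c, c.saveState());
        }
        return sp;
    }

    void rollback(Savepoint sp) {
        // ... release locks acquired after the savepoint, undo changes ...
        for (Map.Entry<CursorObserver, Object> e : sp.cursorStates.entrySet()) {
            e.getKey().restoreState(e.getValue());
        }
    }
}
```

The point of the pattern is that Transaction knows nothing about scans, B-link trees, or row locks; it only knows that each registered observer can save and restore an opaque piece of state.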

Speaking of the Observer pattern, I recently had to add a LockListener interface to the Lock Manager module. This was required to allow me to write better test cases. Designing test cases where multiple threads interact in a certain way is complicated. Earlier I used a simple strategy: threads went to sleep for certain intervals, and the pauses between thread executions were timed just so, to achieve the desired interaction. Unfortunately, this meant the test cases often contained a number of sleeps, which slowed everything down. I also dislike the approach because it is a hack. In some of the new and revised test cases, I now use the LockListener facility to identify wait points within the locking subsystem. An event is generated when a lock requester is about to start waiting for a lock, and this event is trapped by the test cases to synchronize multiple concurrent threads of activity.
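The technique can be demonstrated with a toy example (the interface and class names here are illustrative, not SimpleDBM's actual LockListener API): the lock fires a callback at its wait point, and the test uses that event, rather than sleeps, to sequence its threads deterministically.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: the locking code fires a callback just before a
// requester starts to wait, and the test uses that event to coordinate
// threads instead of timed sleeps.
public class LockListenerDemo {

    interface LockListener {
        void beforeLockWait(Object lockable);   // fired at the wait point
    }

    // A toy exclusive lock that notifies the listener when it must block.
    static final class ExclusiveLock {
        private final ReentrantLock lock = new ReentrantLock();
        private final LockListener listener;

        ExclusiveLock(LockListener listener) { this.listener = listener; }

        void acquire(Object lockable) {
            if (!lock.tryLock()) {              // contended: about to wait
                listener.beforeLockWait(lockable);
                lock.lock();
            }
        }

        void release() { lock.unlock(); }
    }

    // Two threads contend for one lock; the main thread releases only after
    // the listener reports that the contender has reached its wait point.
    static boolean demo() throws InterruptedException {
        CountDownLatch waiting = new CountDownLatch(1);
        CountDownLatch acquired = new CountDownLatch(1);
        ExclusiveLock lock = new ExclusiveLock(lockable -> waiting.countDown());

        lock.acquire("row-1");                  // main thread holds the lock
        Thread contender = new Thread(() -> {
            lock.acquire("row-1");              // blocks; fires the listener
            acquired.countDown();
        });
        contender.start();

        waiting.await();                        // contender is at its wait point
        lock.release();                         // now let it proceed
        acquired.await();
        contender.join();
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo() ? "ok" : "failed");
    }
}
```

No sleeps are needed: the interleaving is forced by the wait-point event, so the test is both fast and deterministic.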

The deadlock detection algorithm I am using is based upon the simple deadlock detector described in the Transaction Processing book mentioned earlier. However, the code presented in the book requires tight coupling between the Lock Manager and the Transaction Manager, whereas in my implementation the Lock Manager has no knowledge of the Transaction Manager. Loose coupling makes code more flexible, but there is a cost: tight coupling can reduce memory consumption quite a bit and improve efficiency, as there is less need for redundant information. For example, in the book each transaction contains a pointer to the lock request it is waiting for, and the Lock Manager freely accesses this information. In SimpleDBM, the Lock Manager maintains a separate HashMap in which lock waits are recorded, so that it can navigate the waits-for graph without any knowledge of the clients invoking it.
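The essence of that decoupled approach can be sketched as follows (an illustration of the idea, not SimpleDBM's code). Because each waiter waits for at most one holder at a time, the waits-for "graph" is a chain, and detecting a deadlock amounts to following the chain and seeing whether it loops back to the starting client.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: the Lock Manager records "waiter -> holder" edges in
// its own map, keyed by opaque client objects, so it can search for cycles
// without knowing anything about transactions.
public class WaitForGraph {

    private final Map<Object, Object> waitsFor = new HashMap<>();

    void addWait(Object waiter, Object holder) { waitsFor.put(waiter, holder); }

    void removeWait(Object waiter) { waitsFor.remove(waiter); }

    // Follow the chain of waits from `start`; a deadlock exists if the chain
    // comes back around to `start`. The `seen` set guards against cycles that
    // do not include `start`.
    boolean causesDeadlock(Object start) {
        Object current = waitsFor.get(start);
        Set<Object> seen = new HashSet<>();
        while (current != null && seen.add(current)) {
            if (current.equals(start)) return true;
            current = waitsFor.get(current);
        }
        return false;
    }
}
```

A real detector must also pick a victim and run either at lock-wait time or periodically in the background; the sketch shows only the graph navigation.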

Saturday, November 04, 2006

New source repository for SimpleDBM

SimpleDBM has a new source code repository. I moved the source code from ObjectWeb's repository to Google Code because I found the ObjectWeb repository somewhat slow and unresponsive. The SimpleDBM web pages are still hosted at ObjectWeb.

In the past few months, SimpleDBM's source code repository has moved from ObjectWeb to Google Code, which I hope will be its final resting place. Compared with ObjectWeb, Google Code has the usual Google stamp of a clean, uncluttered interface, and noticeably faster response. Another thing I like is that as soon as code is checked in, it appears on the web site.

Thursday, November 02, 2006

Package structure

There has been a revision in the package structure within SimpleDBM.

In the earlier structure, the API and the implementation of a module were kept together. Hence, if there was a module named test, the packages would have been org.simpledbm.rss.test for the API, and org.simpledbm.rss.test.impl for the implementation.

The current structure favours separating the whole of the API from the implementation. In the new structure, all APIs are under org.simpledbm.rss.api.<module>, and all implementations are under org.simpledbm.rss.impl.<module>. I separated the API from the implementation in this way for two reasons:
  1. It would be easier to generate JavaDoc for the API alone.
  2. The API could be packaged separately in its own jar file without too much effort.