In a customer project right now I need to be able to work and evolve code fast, with a relatively complex model. And by fast I mean that I want to cut away as much as possible of the efforts related to persistence. Generally this is what OODBs excel at.
In the Squeak world we have GemStone (commercial), GOODS and Magma as “full-fledged” OODBs. Last century :) I worked with GemStone (both GemStone/S and GemStone/J) and it's a great product - but I want something lightweight and open source. And simple. And hackable. And new. :)
I also used Magma in the Gjallar project, and while I respect it highly - this time I want to try something with an “externally supported backend”. I also had a mixed performance experience, but this was “pre Cog” and Magma has also surely evolved lots since then, and I am not sure we did everything the way we should have either.
SandstoneDB could also be interesting to look more closely at, but since I have been working with Nicolas Petton on improving Phriak (Riak interface for Pharo) it was natural to take a look at one of his “under the radar” projects - Oak, an “OODBish” solution on top of Riak. At this point I have been doing much more than looking, in fact I am hacking on it! And oh, yeah, of course there are lots more persistence options available too.
What is Riak?
To understand Oak one should probably learn a bit about Riak first. Riak is the primary backend for Oak, although one can use different backends quite easily - the core Oak functionality only needs a key/value store. As Oak evolves, however, I suspect that some features will become more tightly associated with specific functionality in Riak, like the use of indexing and map/reduce. Nicolas has experimented with using MongoDB and combining it with memcached, and I am intrigued to also test if Oak can run nicely on top of my Tokyo Tyrant binding. The bottom line is that while Oak is indeed abstracted on top of a key/value store - we are still focusing on Riak as the primary platform.
Riak is an open source ultra scalable key/value store modelled after the highly influential Dynamo database built and used internally at Amazon. The architecture of Riak sets it apart from most of its competitors - it is completely masterless. All nodes participate in a “hash ring” and key/value pairs are stored redundantly on multiple nodes in parallel. Automatically. If one box burns up, nothing happens, because data is spread out and Riak will compensate for the loss of a node by automatically rebalancing data.
If you want more requests/second (reads or writes) just add more boxes. On the fly. Anyway, this article is not about Riak, but having Riak at the bottom gives us the ability to scale and, best of all, we can sleep calmly at night. There are a lot of advanced algorithms involved in Riak to make all this work, but from the outside it is a beautifully simple system.
The binding we use for Riak is called Phriak and it's an HTTP binding (Riak also offers a Protocol Buffers API) written using Zinc. Phriak can be used all by itself and covers quite a lot of Riak features, including secondary indexes and map/reduce. Finally, Riak is “data agnostic” so we don't need to store JSON if we do not want to.
Installing Oak
I am using Pharo 1.4 and there is a Metacello configuration available on SmalltalkHub so load it:
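The original listing was lost in this copy of the article, but a Metacello load from SmalltalkHub in Pharo 1.4 typically looks like the sketch below; the exact repository URL and configuration name are assumptions on my part, so check the project page:

```smalltalk
"Load Oak via its Metacello configuration (repository URL is an assumption)"
Gofer it
	url: 'http://smalltalkhub.com/mc/NicolasPetton/Oak/main';
	package: 'ConfigurationOfOak';
	load.
(Smalltalk at: #ConfigurationOfOak) load.
```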
Then you need Riak; there are detailed instructions available and installation is generally quite simple. Currently Oak does not use secondary indexing, so at this point a default Riak install should work fine. Later, when we start messing with secondary indexing, you will need to switch to LevelDB as the backend library, but that is just a simple config tweak.
Now, open up TestRunner, filter on “Oak”, select “Oak-Tests”, then click “Run Selected” - if all is fine you should have 40 green tests. OAOakSessionTest is running against your local Riak. OASessionMockTest is just running against a Dictionary in the image.
How does Oak work?
Oak is a layer on top of a key/value store creating a semi-transparent OODB. By “semi-transparent” I mean that it is slightly less “automagic” than a real OODB, but on the other hand there is much less code and logic that can go wrong, and we get the luxury of leaning on a rock-solid, industry-strength backend. We are also gradually making it more and more transparent.
Oak has some fundamental mechanisms:
- Serialization of objects using Fuel, no mapping to JSON.
- The concept of a transaction/”unit of work” which can be committed as a whole or aborted, no explicit write operations.
- The concept of “persistence by reachability”, no explicit insert operations (but we still need explicit delete).
- Proxies to do automatic “faulting” of more objects from disk, no explicit read operations.
So Oak doesn’t convert to JSON, instead we store Fuel blobs. Fuel is fast, heavily tested and supports schema changes. When it comes to generic serialization I would guess it is the best we have. We simply give Fuel an object to serialize into a ByteArray and Fuel follows all references and makes a blob for us.
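As a quick illustration of what Oak leans on, this is roughly how one round-trips an object graph through Fuel using its convenience class-side messages (a sketch; details may differ between Fuel versions):

```smalltalk
"Serialize an object graph to a ByteArray, then materialize a copy back"
| bytes copy |
bytes := FLSerializer serializeToByteArray: (OrderedCollection with: 'a' with: 'b').
copy := FLMaterializer materializeFromByteArray: bytes.
"copy is now an equal but distinct OrderedCollection"
```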
The Oak transaction keeps track of all changes during a “unit of work” and applies them at the end, or not at all. The transactions are not truly atomic, but at least we postpone all operations until the end, so if one decides to abort halfway through - nothing is written. If the actual writing of the changes were to fail, the commit would still not be atomic. The operations we collect are basically delete, insert and update of Oak objects. An Oak object is a partial object graph.
Partial? Well, let’s say we have a domain object Person. It might be arbitrarily complex but since it is a domain object it is designed to be self contained and not referencing UI stuff etc. We can save this single object in Oak like this:
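The original listing did not survive in this copy, but based on the mechanisms described in the rest of the article it would look roughly like this sketch; the session-creation message and the #root: setter are my assumptions, while the commit block and the root object are described below:

```smalltalk
"Save a single Person as the database root.
 OASession new and #root: are assumed selectors - check the Oak tests for the real API."
| session person |
session := OASession new.
person := Person new.
person name: 'Göran'.
session commit: [
	session root: person ].
```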
First of all, Oak doesn’t require anything special from the objects it saves. No need to inherit a base class or do any other kind of pre-processing.
One wouldn't normally save a Person as the root. A more normal approach would be to save a Dictionary there, and then save other objects in it. Or set the top-level domain object of your system as the root.
When objects are saved, Oak will use the result of “UUID new asString36” as the key. We call this key the “oid” (object id) of the object. So in order to get an object back we either need to know the oid, or we get the object through some other object by reachability, typically starting at the root object. The root object is saved using a hard coded oid, so we can always find that one.
Now, let’s modify the person. Reuse the session you have, or create a new one. Let’s focus on the interesting bit:
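A sketch of the interesting bit, reconstructed from the description that follows; the #commit: selector is my assumption, while #root, #name: and #oakSave are taken from the text:

```smalltalk
"Fetch the root (a proxy), modify it and mark it dirty with #oakSave"
session commit: [
	| person |
	person := session root.
	person name: 'Göran Krampe'.
	person oakSave ]
```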
When we send #root to the session it will actually not read the Person, it will instead create a proxy object which has a single instance variable holding the oid, which in this case is the hard coded oid for the root object. So the temp variable “person” will hold the proxy. The magic happens when we send #name: to this proxy. It will trap the message, fetch the Person instance from Riak (as a Fuel ByteArray) and materialize it into a “real” Person instance. Then the proxy will forward the #name: message to this Person instance and return whatever it returns.
Most OODBs tend to use #become: to actually turn the proxy into the real object. Oak currently does not do this; instead Oak keeps forwarding. There are advantages and disadvantages to both approaches. If one uses “short” transactions - quickly find your way down to your object, modify and commit - then you will send few messages and the commit will be quick. In this scenario one might argue that the cost of #become: is higher than doing a few message forwards. Perhaps Oak will use #become: later; the obvious advantage is that it enables more foolproof handling of identity.
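The forwarding trick is the classic #doesNotUnderstand: pattern. A minimal sketch of such a proxy might look like the following - class and selector names here are mine, not necessarily Oak's actual implementation:

```smalltalk
"A minimal forwarding proxy. Subclassing ProtoObject means almost every
 message falls through to #doesNotUnderstand: where we can trap it."
ProtoObject subclass: #OAProxySketch
	instanceVariableNames: 'oid session'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Oak-Sketch'

"OAProxySketch >> doesNotUnderstand: aMessage"
doesNotUnderstand: aMessage
	| target |
	"Fetch the Fuel blob for oid from the backend and materialize it,
	 then forward the trapped message to the real object"
	target := session materializeOid: oid.
	^ target
		perform: aMessage selector
		withArguments: aMessage arguments
```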
Then we send #oakSave to the proxy, and this schedules an update operation which in turn is executed after the commit block has run to the end. #oakSave has two functions here. First, it tells Oak that the object is dirty and needs to be written to disk. Second, it tells Oak that this object should be a separate Oak object, serialized, read and written as a separate key/value pair in Riak with its own oid.
In fact, Oak also implements #oakInsert, but there is no real need to use it, because one can always use #oakSave and it will itself decide to use insert if the object is new. Finally we have #oakDelete to actually delete an object. Oak has no garbage collection at this point, so it is not enough to just stop referencing an object. Well, nothing bad happens if you forget to send #oakDelete, but the key/value pair will remain in Riak consuming disk space.
Now… lately we have added autoSave so one does not need to use #oakSave at all! This mechanism does two things automatically:
- It detects which objects have been changed during the commit block and saves them automatically. This costs a little bit of performance, but nothing dramatic.
- It also detects insertions automatically, but this depends on the instance answering true when asked #isOakPersistentWannabe. If an object answers true it means that it wants to be a separate Oak object if it happens to be reachable from another persistent Oak object. If it answers false it will not be saved as a separate key/value pair; instead it will be contained within the serialization of the object referencing it.
So if we use autoSave we simply implement #isOakPersistentWannabe as ”true” in those domain objects that we think should be serialized as separate key/value pairs. Neat!
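So with autoSave, opting a class into being its own key/value pair is a one-liner, something like:

```smalltalk
"Person instances become separate Oak objects whenever they are
 reachable from another persistent Oak object"
Person >> isOakPersistentWannabe
	^ true
```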
Next in Oak
Next steps for Oak are probably:
- Polishing code by using it for real. :)
- Adding custom Oak collection classes that behave like “normal” collections but leverage things like secondary indexing and map/reduce.
- Possibly adding some kind of automatic deletion.
The custom collection classes one would like to have are of course a large scale OrderedCollection and a large scale Dictionary, for starters.
When it comes to deletion there are some things we can do. First, we could automatically handle the special case of a single owner. If an object “promises” us that it is only referenced by ONE other Oak object, typically an owner, then we can automatically delete it when that owner “drops” it. For hierarchical models this would work, but of course it depends on the developer being sure these objects are referenced from one object and one object only. Since it wouldn't solve the general case it might not be worth implementing at all, not sure.
This turned into a rather long article, I hope that I have managed to tickle your interest and that you take Oak out for a spin and perhaps help out in making it better.