I have gotten into researching these new
breeds
of databases popping up all over the place. I started looking closer at
CouchDB after skimming a whole bunch, I even
implemented a "view server" for it (so that you can write
map/reduce functions in Squeak instead of javascript - why? Because
"you can":)). And yeah, CouchDB is really cool, no doubt about
that. They are just about to get 0.9 out the door btw.
But I started looking around again and this time I looked longer at the
Sourceforge homepage of
Tokyo Cabinet than the 3 seconds staring at japanese symbols I did
the first time. Mikio Hirabayashi, the author of the Tokyo
"products" seems to be an extremely talented and productive
developer. So what is all this stuff?
- Tokyo Cabinet. A new db
library seemingly beating the
crap out of all others on most aspects. You know, Berkeley DB and all
those. Floored!
- Tokyo Tyrant.
The remote server part on top, a thread-pool modeled implementation using
epoll/kqueue mechanisms and offering memcache/HTTP/socket protocols and Lua
scripting inside. It seems to scale. A lot.
- Tokyo Dystopia. An
advanced text search engine or a so called "inverted index". Not
yet reached 1.0 but since Mikio wrote Hyper Estraier I expect a solid
offering there too.
And it is all under the LGPL, which is fine with me (being primarily an
MIT/BSD guy).
I got dazzled by all the buzzwords and functionality in Tokyo Cabinet and
when I realized that the binary Socket protocol
for Tokyo Tyrant actually is quite small and well documented - what the
hell! Let’s hack up a Squeak port of this, and so I did. :)
TokyoTyrantProtocol
The binary Socket protocol actually supports all the different database
types that Cabinet offers:
- On memory hash and tree database
- On disk hash, B+ tree, fixed-length and table database
…and it also offers a way to call Lua extension code inside the
Tyrant server. Yum, yum!
So I wrote an API for Squeak. How to use it? Well, first you need Tokyo
Cabinet and Tokyo Tyrant. Just suck down the source and do the dance
(presuming you are on a real OS of course):
wget http://switch.dl.sourceforge.net/sourceforge/tokyocabinet/tokyocabinet-1.4.11.tar.gz
tar xzf tokyocabinet-1.4.11.tar.gz
cd tokyocabinet-1.4.11
./configure; make; make install
wget http://garr.dl.sourceforge.net/sourceforge/tokyocabinet/tokyotyrant-1.1.18.tar.gz
tar xzf tokyotyrant-1.1.18.tar.gz
cd tokyotyrant-1.1.18
./configure --enable-lua; make; make install
Well, ok, you might stumble on some dependencies of course, on Ubuntu for
me (but I have lots of dev libraries already installed) I only needed to
add these IIRC, your miles may wary:
apt-get install libbz2-dev liblua5.1-0-dev zlib1g-dev
Having installed TC and TT, let’s do the trivial thing and just fire
up a server on a disk based hash database (tch = Tokyo Cabinet Hash):
ttserver mydb.tch
If you run "ttserver" without a filename it will create an
on-memory hash database and then some of my unit tests fail. :) This server
is now up and listening on default port 1978.
Let’s fire up Squeak, I use Squeak 3.10.2 so no promises for anything
else. Open SqueakMap Package Loader, find "TokyoTyrant" and install.
Then open up Test Runner, select "TokyoTyrant" category and
select all test classes except the TokyoTyrantTableTest - you should get
all green.
Okidoki let’s do "hello world":
| db |
db := TokyoTyrantDB new.
db at: 'hello' put: 'world'.
Transcript show: 'Hello is: ', db stringAt: 'world';cr.
db close
…and you should get some output in Transcript. Using #new we get a
TokyoTyrantDB instance set for "localhost:1978", otherwise use
#host:port: to instantiate it. Then we send #at:put:, using a String key
and a String value. This will trigger the db to lazily open a SocketStream
to ttserver and then send key and value to be stored onto disk. Finally we
do a read by asking for it again, note that we use #stringAt: instead of
just #at:. If we use #at: then TokyoTyrantDB does not know the type of the
data being returned and we would get an instance of ByteArray.
Ok, so a TokyoTyrantDB instance behaves more or less Dictionary-like.
Let’s toy around:
| db |
db := TokyoTyrantDB new.
db removeAll.
db
at: 'a string' put: 'abc';
at: 'an integer' put: 1;
at: 'a float' put: 12.3;
at: 'a large integer' put: 100000000000000;
at: #(255 0 0 255) put: #(1 2 3 4).
Transcript show: 'Number of records: ', db size asString; cr.
Transcript show: 'Size in bytes: ', db byteSize asString; cr.
Transcript show: 'Status: ', db status; cr.
db at: 'a string' putCat: 'def'.
db at: 'an integer' add: 2.
db at: 'a float' add: 0.3.
Transcript show: 'String: ', (db stringAt: 'a string') ;cr.
Transcript show: 'Integer: ', (db integerAt: 'an integer') asString;cr.
Transcript show: 'Float: ', (db floatAt: 'a float') asString;cr.
Transcript show: 'Large integer: ', (db integerAt: 'a large integer') asString;cr.
Transcript show: 'Bytearray: ', (db at: #(255 0 0 255)) printString;cr.
We open it up, removeAll (="vanish" command), then put in some
key/value pairs. At this level of the API there are some simple
conventions for type conversion:
- Integers are stored as 32 bit signed integers if they fit, otherwise as a
ByteArray (byte contents of LargeNegative(Positive)Integer) with first byte
signifying sign.
- Floats are stored as 64 bit doubles. Sometimes they are sent as fractions,
64 + 64 bit style.
- All other objects are sent #asByteArray and we use the result of that. This
means that an Array of 0-255 integers will be turned into a ByteArray.
If we send a Float into #at:add: we send it according to the protocol which
specifies that we should send it as a fraction of two 64 bit signed
integers. This format was probably selected for its simplicity. Result is
still stored by TT as a double (8 bytes) in native endianness, so in order
to do the Right Thing we need to keep track of native endianness. Anyway,
those are details… :)
Classes
I started out with a plain single class resembling current TokyoTyrantDB.
Then I recently refactored it into three classes, TokyoTyrantProtocol which
only contains enough to "talk" to TT, and using ByteArrays. Not
very comfortable to use. Then a subclass called TokyoTyrantBinaryDB which
adds a Dictionary-like protocol, nicer but still no conversions.
Then TokyoTyrantDB which has at least some smarts on converting Smalltalk
objects into ByteArrays, and some convenience methods for doing conversions
from TT ByteArray keys and values to Smalltalk objects.
Finally there is also TokyoTyrantTableDB which is a special subclass of
TokyoTyrantDB that has extra methods to work with the "table"
database type. This database type is not fully compatible with the rest of
the protocol, so I might need to "plug" some methods or change
the inheritance, not sure yet.
Following the design that Mikio used in his Ruby package there is also a
TokyoTyrantTableQuery class representing a query. These bits are brand new
and just cleared the smoke test.
The Typing Problem
Note however that since TT does not use "types" we simply get a
ByteArray back from TT. And since TT has special functions for 32 signed
ints and 64 bit doubles, we do want to use those "native
formats". This leaves us in a slight predicament, if we get 4 bytes
back - is it a String of 4 chars or a 32 bit signed int?
So far I have "punted" on this problem, but the next step is
probably to have some kind of configuration of the db instance so that it
knows what to expect back, perhaps given certain kinds of keys. Open for
all suggestions. :)
Can you say FAST?
To wrap up, let’s take this Ferrari for a short spin, code:
| payload num size db time keys |
size := 512.
num := 10000000.
payload := ByteArray new: size.
db := TokyoTyrantDB new.
time := [1 to: num do: [:i | db at: i putNoResponse: payload]] timeToRun.
db size = num ifFalse: [self error: 'Bulk failed!'].
Transcript show: 'Type: ', db type, ', ', num asString, ' * ',size asString,' bulk: ', time asString, ' ms, (', (num*1000/time) asInteger asString, '/sec)';cr.
keys := (1 to: num) collect: [:i | num atRandom].
time := [keys do: [:i | db at: i]] timeToRun.
Transcript show: 'Type: ', db type, ', ',num asString,' random lookups: ', time asString, ' ms, (', (num*1000/time) asInteger asString, '/sec)';cr.
time := [keys do: [:i | db includesKey: i]] timeToRun.
Transcript show: 'Type: ', db type, ', ',num asString,' random includes: ', time asString, ' ms, (', (num*1000/time) asInteger asString, '/sec)';cr.
time := [db do: [:each | ]] timeToRun.
Transcript show: 'Type: ', db type, ', full iteration: ', time asString, ' ms, (', (num*1000/time) asInteger asString, ' values/sec)';cr.
db close
…and some results on different database types:
Type: hash, 100000 * 1024 bulk: 23555 ms, (4245/sec)
Type: hash, 100000 random lookups: 30358 ms, (3294/sec)
Type: hash, 100000 random includes: 32547 ms, (3072/sec)
Type: hash, full iteration: 55899 ms, (1789 values/sec)
Type: on-memory hash, 100000 * 1024 bulk: 5669 ms, (17639/sec)
Type: on-memory hash, 100000 random lookups: 41812 ms, (2391/sec)
Type: on-memory hash, 100000 random includes: 54319 ms, (1841/sec)
Type: on-memory hash, full iteration: 74022 ms, (1351 values/sec)
Type: B+ tree, 100000 * 1024 bulk: 5931 ms, (16860/sec)
Type: B+ tree, 100000 random lookups: 38659 ms, (2586/sec)
Type: B+ tree, 100000 random includes: 37315 ms, (2679/sec)
Type: B+ tree, full iteration: 76465 ms, (1307 values/sec)
Type: on-memory tree, 100000 * 1024 bulk: 5066 ms, (19739/sec)
Type: on-memory tree, 100000 random lookups: 31364 ms, (3188/sec)
Type: on-memory tree, 100000 random includes: 29093 ms, (3437/sec)
Type: on-memory tree, full iteration: 44394 ms, (2252 values/sec)
Type: B+ tree, 1000000 * 100 bulk: 48651 ms, (20554/sec)
Type: B+ tree, 1000000 random lookups: 378111 ms, (2644/sec)
Type: B+ tree, 1000000 random includes: 377373 ms, (2649/sec)
Type: B+ tree, full iteration: 633555 ms, (1578 values/sec)
Type: hash, 100000 * 1024 bulk: 17857 ms, (5600/sec)
Type: hash, 100000 random lookups: 32060 ms, (3119/sec)
Type: hash, 100000 random includes: 33954 ms, (2945/sec)
Type: hash, full iteration: 47621 ms, (2099 values/sec)
Note that I am using ByteArray values here, in order to focus entirely on
TokyoTyrant protocol code and
TT itsel, and not measure serialization speed or conversion speeds. A few
interesting things to note:
- The "includes" test is as fast as the "lookup" test for
all these payloads. Includes does not transmit the value back to Squeak, so
obviously that does not cost much!
- Bulk load on hash disk is relatively slow, ONLY 4Mb (4000 * 1Kb) per
second. :)
- On memory hash is slower than on disk hash?! Funky.
- Hash on disk is generally slightly faster than B+ tree, but not much.
- On memory trees beat on memory hash.
- The last hash on disk used TCP_NODELAY, might have caused iteration and
bulk speedup, not sure.
- All dbs seem to occupy more or less the calculated theoretical space,
clearly very little overhead.
- Being able to store 16860 1kB values per second means (if sustained) a bulk
load capacity of about 17Mb/sec = 62Gb in 1 hour. On my mini laptop
running both client Squeak and ttserver.
One issue though: When trying to do a 5Gb db test something broke when I
hit the 2Gb filesize. Ehum??? Time for an email to Mikio. …thinking a
little bit more - perhaps I need a 64-bit OS to use larger db sizes. Mmm.
Conclusion
I really, really like TC/TT. Simple, fast, robust. The only thing I have
seen that was slightly odd was when I ran out of disk space and the
"db size" returned 1000 but obviously some of my bulk inserts had
not gone through - but the count in the db was ok. Doing "db keys
size" noted missing keys though, so clearly some inconsistency there.
Will report that to Mikio.
I can see tons of ways to build on top of this layer - especially using the
table database type which I didn’t have time to cover in this first
article - but it seems extremely interesting.
Over and out, Goran