Roads Less Taken

10 Mar 10

First FOSS-Stockholm meeting

On the 24th of February the first FOSS-Stockholm meeting was held in Kista, and I dare say it turned out to be a success!

My company MSC sponsored the event together with Nohup so that there were enough sandwiches and drinks to keep the crowd happy. :)

I taped all of the talks. You will find these movies and more here, and here are the original movies if you are interested.

/Goran

06 Nov 09

Breakfast seminar on the new "super databases" and CouchDB

Earlier this week I held a 90-minute presentation for about 30 people about the new "super databases" and CouchDB in particular. It went fine, and although it was a "high level sweep" over the field I think most attendees got what they expected. The slides are available here, translated to English, although some of them may be less valuable without the accompanying explanation.

The interest is mounting in this field, partly because developers and architects are looking for alternatives but also because there is indeed quite an explosion going on with new interesting databases popping up every week. My personal experience covers mainly TokyoTyrant and CouchDB but I intend to try out:

  • MongoDB, since it is quite close to an object database and has come further on sharding etc.
  • One of the "Dynamo clones", not sure yet which one. Dynomite is not interesting since Microsoft has put the lid on it.
  • One of the "Bigtable clones", also not yet sure which one. :)

Finally, some good and fresh info from the NoSQL community can be found at the two summaries made from the recent meetup in the US. It’s funny that I too made the "Cambrian explosion" connection in my presentation, and so apparently did one of the keynotes there. I didn’t steal it - honestly :)

/Goran

23 Sep 09

Git and Github, where the cool kids hang out!

After releasing Divan on github I of course had to learn basic git as well as some github/git workflow. I am abnormally interested in DSCMs and have used Mercurial, a bit of Bazaar and the lovely Darcs, so the time has finally come to learn git.

My perception is that git has really pulled ahead during the last year, quickly adopting good features from the competition and turning into the "cool tool" to use. Github is also a great boost to adoption. Mercurial and Bazaar are still fighting for second place with Darcs probably set for fourth. Personally it didn’t click for me when I tried Mercurial, and it is hard to say what made me uneasy about it. Bazaar felt nicer but I have only dabbled with it. I did use Darcs a bit and it still has a special place in my heart for its simplicity and amazing super hero powers.

In this article I try to outline some daily usage in maintaining Divan on github. It is nothing special, but if you are just diving into git/github it might be worth reading through.

Getting set up

It is actually quite easy. I just signed up on Github, followed the guides, like this one to get my proper personal clone of my repository at Github and to get it all working using SSH for pushing. There is no point in repeating all that.

Churning out code

If we disregard the rest of the world for a second, making commits and pushing them to github is what you want to do first. I typically use git from command line, on Windows I use "Git bash here" from the explorer, and on Ubuntu I just use the regular git. Sure, there are lots of UIs around, but the need is not that pressing for me.

Git status and commit

First thing - you are going to type "git status" every other second, at least I do :). Some kind of compulsion…

        gokr@yoda:~/divan/github/gokr/Divan$ git status
        # On branch master
        nothing to commit (working directory clean)
        gokr@yoda:~/divan/github/gokr/Divan$

This shows current branch but more importantly it shows a list of dirty/new files and a list of staged files. Staged files are those that I have "added to the index" (also called "cache" or "staging area"), which means that I have "staged them for commit". "The Index" is a relatively unique feature of git, but hey, it is not rocket science. You just prepare your commit by adding stuff into a "staged area" before actually committing it, no big deal. It is just unfortunate that there are three names for it (cache, index, staging area).

When you do have dirty stuff the status command also mentions useful commands to use. If you just did some modifications like for example fixing class comments in two files, it might look like this:

        gokr@yoda:~/divan/github/gokr/Divan$ git status
        # On branch master
        # Changed but not updated:
        #   (use "git add <file>..." to update what will be committed)
        #
        #      modified:   src/CouchTest.cs
        #      modified:   src/Lucene/CouchLuceneTest.cs
        #
        no changes added to commit (use "git add" and/or "git commit -a")
        gokr@yoda:~/divan/github/gokr/Divan$

Let’s add one to the staging area and look again:

        gokr@yoda:~/divan/github/gokr/Divan$ git add src/CouchTest.cs
        gokr@yoda:~/divan/github/gokr/Divan$ git status
        # On branch master
        # Changes to be committed:
        #   (use "git reset HEAD <file>..." to unstage)
        #
        #      modified:   src/CouchTest.cs
        #
        # Changed but not updated:
        #   (use "git add <file>..." to update what will be committed)
        #
        #      modified:   src/Lucene/CouchLuceneTest.cs
        #
        gokr@yoda:~/divan/github/gokr/Divan$

Now since I am slightly senile I need to remind myself what I am going to commit, so let’s diff:

        gokr@yoda:~/divan/github/gokr/Divan$ git diff
        diff --git a/src/Lucene/CouchLuceneTest.cs b/src/Lucene/CouchLuceneTest.cs
        index 791c2e8..1fd6755 100644
        --- a/src/Lucene/CouchLuceneTest.cs
        +++ b/src/Lucene/CouchLuceneTest.cs
        @@ -9,6 +9,8 @@ namespace Divan.Lucene
             /// <summary>
             /// Unit tests for the Lucene part in Divan. Operates in a separate CouchDB databa
             /// Requires a working Couchdb-Lucene installation according to Couchdb-Lucene's d
        +    /// Run from command line using something like:
        +    ///        nunit-console2 --labels -run=Divan.Lucene src/bin/Debug/Divan.dll
             /// </summary>
             [TestFixture]
             public class CouchLuceneTest
        gokr@yoda:~/divan/github/gokr/Divan$

Ehum, ok… so "git diff" only shows the unstaged changes, not the staged ones. But we can see those if we want to using "git diff --cached". This is a good example of the "terminology confusion" appearing here and there in git country, why is it not called --staged or --index? Well, whatever:

        gokr@yoda:~/divan/github/gokr/Divan$ git diff --cached
        diff --git a/src/CouchTest.cs b/src/CouchTest.cs
        index 3454d37..912ec8c 100644
        --- a/src/CouchTest.cs
        +++ b/src/CouchTest.cs
        @@ -10,6 +10,8 @@ namespace Divan
         {
             /// <summary>
             /// Unit tests for Divan. Operates in a separate CouchDB database called divan_uni
        +    /// Run from command line using something like:
        +    ///        nunit-console2 --labels -run=Divan.CouchTest src/bin/Debug/Divan.dll
             /// </summary>
             [TestFixture]
             public class CouchTest
        gokr@yoda:~/divan/github/gokr/Divan$

…and we could see all changes by doing "git diff HEAD". Just type "git help diff" to get a mouthful of options. :)
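The three flavours of diff can be tried out safely in a throwaway repository. This is only a minimal sketch: the file names and the identity are made up, and "--name-only" is used just to keep the output short.

```shell
# Throwaway repository; file names and identity are placeholders.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'one\n' > a.txt
printf 'two\n' > b.txt
git add a.txt b.txt
git commit -q -m "Initial commit"

# Change both files, but stage only a.txt.
printf 'one, changed\n' > a.txt
printf 'two, changed\n' > b.txt
git add a.txt

unstaged=$(git diff --name-only)          # working directory vs index
staged=$(git diff --cached --name-only)   # index vs last commit
both=$(git diff --name-only HEAD)         # working directory vs last commit

echo "unstaged: $unstaged"
echo "staged:   $staged"
echo "all:      $both"
```

Only the last form compares the working directory with the last commit, so it lists staged and unstaged changes together.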

Doing a commit at this point would only commit the change in CouchTest.cs, so I add the second file (just typing a partial path is fine), run status again for extreme educational purposes and finally commit:

        gokr@yoda:~/divan/github/gokr/Divan$ git add src/Lucene/
        gokr@yoda:~/divan/github/gokr/Divan$ git status
        # On branch master
        # Changes to be committed:
        #   (use "git reset HEAD <file>..." to unstage)
        #
        #      modified:   src/CouchTest.cs
        #      modified:   src/Lucene/CouchLuceneTest.cs
        #
        gokr@yoda:~/divan/github/gokr/Divan$ git commit -m "Class comment changes."
        Created commit b2242f2: Class comment changes.
         2 files changed, 4 insertions(+), 0 deletions(-)
        gokr@yoda:~/divan/github/gokr/Divan$

We could have done all the above (adding both files and committing) in one simple line:

        git commit -a -m "A commit message"

..but if you are as confused as I am you have typically done 3-4 different things that you don’t remember so you want to investigate and possibly split it up into several logical different commits. You can in fact also do chunkwise (only selected parts of files) staging, but I am not going into that here.
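Splitting things up into several logical commits could look roughly like this. Again a sketch in a throwaway repository, with made-up file names and commit messages:

```shell
# Throwaway repository; file names, messages and identity are placeholders.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'one\n' > a.txt
printf 'two\n' > b.txt
git add a.txt b.txt
git commit -q -m "Initial commit"

# Two unrelated changes end up in the working directory...
printf 'bug fix\n' >> a.txt
printf 'new feature\n' >> b.txt

# ...and are committed separately instead of with one "git commit -a".
git add a.txt
git commit -q -m "Fix bug in a"
git add b.txt
git commit -q -m "Add feature to b"

count=$(git rev-list --count HEAD)              # number of commits so far
last=$(git diff --name-only HEAD~1 HEAD)        # files touched by the last commit
echo "$count commits, last one touched: $last"
```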

Git push

Since we are going through the vanilla track, let’s push too:

        gokr@yoda:~/divan/github/gokr/Divan$ git push
        Counting objects: 11, done.
        Compressing objects: 100% (6/6), done.
        Writing objects: 100% (6/6), 721 bytes, done.
        Total 6 (delta 5), reused 0 (delta 0)
        To git@github.com:gokr/Divan.git
           e28819b..b2242f2  master -> master
        gokr@yoda:~/divan/github/gokr/Divan$

Yaddayadda, but that’s it.

Someone else forked your repo!

Great! In the github/git world forks are really good news, the more the merrier! Even better when they actually start doing commits, but a fork is a first step. It might be worth waiting for some commits on that fork, but let’s pretend we know they will come - thus we want to prepare to receive all that code goodness.

I have opted to use so called tracking branches for this. This means that I create a local branch that is set to "track" a remote branch (typically the "master" branch in the foreign fork). Let’s say Henrik actually is going to deliver some code to Divan. We first add his repository as a "remote" called "henrik". We also use "-f", which fetches from the new remote right away and thereby creates a remote branch pointing at the "master" branch in "henrik":

        gokr@yoda:~/divan/github/gokr/Divan$ git remote add -f henrik git://github.com/whenrik/Divan.git
        Updating henrik
        From git://github.com/whenrik/Divan
         * [new branch]      master     -> henrik/master
        gokr@yoda:~/divan/github/gokr/Divan$

So now we have an extra known repository that we named "henrik" and we have a remote branch called "henrik/master", all remote branches use that naming convention: <remote-name> + "/" + <branch-name>. If we had skipped "-f" we would have had to follow up with "git fetch henrik" to get that remote branch.
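The difference between adding a remote and fetching from it can be seen without touching github at all. In this sketch a local directory called "fork" stands in for Henrik's repository; the paths and the identity are assumptions made up for the example.

```shell
work=$(mktemp -d) && cd "$work"

# A stand-in for Henrik's fork: a local repository with one commit on master.
git init -q fork
(
  cd fork
  git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
  git config user.email "henrik@example.com"
  git config user.name "Henrik"
  echo "hello" > README
  git add README
  git commit -q -m "Initial commit"
)

# My own repository.
git init -q mine
cd mine

# Without -f the remote is only registered, nothing is fetched yet.
git remote add henrik "$work/fork"
before=$(git branch -r)

# The fetch is what creates the remote branch henrik/master.
git fetch -q henrik
after=$(git branch -r)

echo "before fetch: '$before'"
echo "after fetch:  '$after'"
```

Using "git remote add -f" simply does both steps in one go.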

We can see all remotes we now have (using "-v" to see their URLs):

        gokr@yoda:~/divan/github/gokr/Divan$ git remote -v
        henrik git://github.com/whenrik/Divan.git
        kolosy git://github.com/kolosy/Divan.git
        origin git@github.com:gokr/Divan.git
        upstream       git://github.com/foretagsplatsen/Divan.git
        gokr@yoda:~/divan/github/gokr/Divan$

…and all branches (both local and remote, use "git branch -r" for only remotes or "git branch" for only locals):

        gokr@yoda:~/divan/github/gokr/Divan$ git branch -a
          kolosy
        * master
          upstream
          henrik/master
          kolosy/master
          origin/HEAD
          origin/master
          upstream/master
        gokr@yoda:~/divan/github/gokr/Divan$

Here we see the remote branch "henrik/master" just created (and more). The top three entries are local branches and easily recognizable as such since they do not have a "/" in them.

With git one can merge directly from remote branches (I think), but I guess most of us would like the ability to pull down, take a look and then merge - which makes it necessary for us to first create a local branch that is a mirror of the remote branch. In git terminology this is a "tracking branch", since it is set up to easily track a remote branch, meaning that it knows from where to pull etc, nothing magic.

For all forks that I want to collaborate with I am using "tracking branches" so let’s create one for Henrik. We use the checkout command with "-b" for creating a new branch called "henrik" from remote branch "henrik/master" and "-t" for tracking:

        gokr@yoda:~/divan/github/gokr/Divan$ git checkout -t -b henrik henrik/master
        Branch henrik set up to track remote branch refs/remotes/henrik/master.
        Switched to a new branch "henrik"
        gokr@yoda:~/divan/github/gokr/Divan$

In fact, the "-t" is not needed when we branch from a remote branch, it is the default. Note that we could have done the above in two steps as "git branch henrik henrik/master" followed by "git checkout henrik".
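What "tracking" amounts to can be inspected in the repository configuration. The sketch below again uses a local directory as a stand-in for Henrik's fork (paths and identity are made up) and then shows that a plain "git pull" on the tracking branch knows where to pull from.

```shell
work=$(mktemp -d) && cd "$work"

# Stand-in for Henrik's fork with one commit on master.
git init -q fork
(
  cd fork
  git symbolic-ref HEAD refs/heads/master
  git config user.email "henrik@example.com"
  git config user.name "Henrik"
  echo "hello" > README
  git add README
  git commit -q -m "Initial commit"
)

git init -q mine
cd mine
git remote add -f henrik "$work/fork" >/dev/null 2>&1

# Create the local tracking branch, as in the transcript above.
git checkout -q -t -b henrik henrik/master

# Tracking is just two config entries on the local branch.
remote=$(git config branch.henrik.remote)
merge=$(git config branch.henrik.merge)
echo "henrik tracks $merge on remote $remote"

# Henrik commits something new; a plain pull on the tracking branch picks it up.
( cd "$work/fork" && echo "more" >> README && git commit -q -am "More text" )
git pull -q
local_head=$(git rev-parse HEAD)
fork_head=$(cd "$work/fork" && git rev-parse HEAD)
```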

Let’s list all our branches once more:

        gokr@yoda:~/divan/github/gokr/Divan$ git branch -a
        * henrik
          kolosy
          master
          upstream
          henrik/master
          kolosy/master
          origin/HEAD
          origin/master
          upstream/master
        gokr@yoda:~/divan/github/gokr/Divan$

So now we have a local branch which is also current, since the checkout switched to it. When we are there we can do "git pull" to get all new commits from the remote branch.

Git merge

If I want to merge work that Henrik has made I first switch to henrik using "git checkout henrik" and do a "git pull" there. Next I switch back to my own branch, say "git checkout master", and there I do "git merge henrik".

Git will automatically commit if a merge is successful. If there are conflicts it will stop in the middle and let me take care of the files, which will have regular conflict markers in them. In that case I have to add the fixed files manually to the staging area (which typically already has a partial merge in it) and then commit.
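A conflicting merge is easy to provoke in a throwaway repository. In this sketch a plain local branch called "henrik" plays the role of the tracking branch, and the file name, contents and identity are made up:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.email "demo@example.com"
git config user.name "Demo"

echo "original" > file.txt
git add file.txt
git commit -q -m "Initial commit"

# Henrik's work, here simply a local branch.
git checkout -q -b henrik
echo "henrik's version" > file.txt
git commit -q -am "Henrik's change"

# My own conflicting work on master.
git checkout -q master
echo "my version" > file.txt
git commit -q -am "My change"

# The merge stops in the middle and returns a non-zero exit status.
if git merge henrik >/dev/null 2>&1; then outcome=clean; else outcome=conflict; fi
conflicted=$(git diff --name-only --diff-filter=U)   # files with unresolved conflicts
echo "merge: $outcome, conflicted files: $conflicted"

# Fix the file by hand (here: just overwrite it), stage it and commit.
echo "merged version" > file.txt
git add file.txt
git commit -q -m "Merge branch 'henrik'"

# A merge commit has two parents: the hash itself plus two parent hashes.
words=$(git rev-list --parents -n 1 HEAD | wc -w)
```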

And then the natural thing to do, after verifying that unit tests are green :), is of course to do git push.

Final word

I would have liked to show more on merging etc, but my time is limited so better to publish and move on. :)

Over and out, Goran

15 Sep 09

The ESUG conference in Brest

Having been an active developer in the Smalltalk community for more than 15 years, it is rather funny that I have never been to ESUG. The annual conference is traditionally held somewhere in Europe. This year was in fact the 17th, and it ended up in Brest, which incidentally is where the first ESUG conference was held in 1993.

That year the conference was called "Summer School" and Mario Wolczko taught a good deal of the lessons. Mario, whom several of you may have heard of, is a recognized expert on implementations of object-oriented languages (perhaps best known for his work on GC and in Self) and has worked at Sun ever since.

It is quite interesting to note some of the topics that were covered in 1993 BJ (i.e. "Before Java"): Effectively using blocks, Exception handling, Metaclasses, Weak referencing etc.

For you non-Smalltalkers, "blocks" in Smalltalk are roughly the same thing as the lambdas or closures that languages like C# and Java only now, about… 16 years later, finally have or perhaps will get. And well, metaclasses of course do not exist at all in those languages :)

Now that I have revealed my unconcealed preference for Smalltalk over these "modern" languages, I would like to point out that I have worked professionally in Java since 1998 and have later also added C# to my profile (Divan).

Smalltalk is, however, so fantastically much better on almost every point, and for you "whiz kids" thinking "ruuuuby d00d!", imagine Ruby… but with:

  • A really good development environment including refactoring, a debugger, live migration of instances and dynamic incremental compilation.
  • Plus a mature community and several commercial implementations.
  • And oh yes, a well-defined minimalistic language with a nicer syntax and really good virtual machines.

Then you have Smalltalk.

But enough of the gospel - now that I have lost everyone with my ranting anyway - how was ESUG with 149 Smalltalkers? Very pleasant and exciting!

To begin with, the conference was well organized, with very good lunches including both wine and dessert. The low number of attendees also gave it a completely different atmosphere compared to the larger conferences such as OOPSLA, which I have attended countless times.

Stephane Ducasse, who is the "engine" behind ESUG, did a good job and it was fun to finally meet him after all these years of mail conversations in the Squeak community.

Worth noting is that the commercial Smalltalk vendors were well represented, each with at least one person (Smalltalk developers and not clueless salespeople…):

  • Cincom, for many years the vendor behind VisualWorks, in direct line of descent from the original Smalltalk implementation.
  • GemStone, a distributed persistent transactional super-scalable Smalltalk. Rocks.
  • Instantiations, the current company behind IBM's Smalltalk, i.e. the original engine under IBM VisualAge (the gang that later built Eclipse).
  • Except, a Smalltalk that has always kept a low profile but has had a very strong technical side.

And another obvious part is of course everyone who is active in Squeak and Pharo, with some emphasis on Pharo of course.

The schedule was filled with technical talks from recognized Smalltalk names and it was almost always interesting. Seaside naturally takes up a lot of room, but other topics were represented too, for example multicore, cloud computing, advanced new tools, mobile phones (iPhone), meta-programming and interesting techniques around unit testing.

I will return with reflections on the various things that were presented and also briefly summarize what I presented myself, Deltastreams.

/Göran

25 Feb 09

Erlang and the new databases

A few weeks ago I stumbled upon CouchDB - a document oriented database with lots of interesting attributes, web buzzword compliant and with the promise of very high scalability.

There are lots of similar new databases appearing, especially from the companies running very large websites like LinkedIn, Facebook, Digg etc. Some of these databases are open source but haven’t yet been able to gather strong developer communities. CouchDB seems to have done that.

This is probably due to being an Apache project but also since it actually comes from a private person and not from a large company - just my guess.

What makes these databases special? It seems to me that they share a few traits:

  • Simplicity
  • Aimed at the web
  • Extreme scalability

CouchDB for example uses a REST API, so any little language that can do HTTP can talk to it trivially. It also means you can do queries using Curl or just right in your web browser. As document format it uses JSON, the serialization format "du jour" in these web-2.0-days. JSON is a trivial, readable syntax for describing data structures, kinda like a very lightweight XML (shudder). Finally it uses by default an embedded Javascript engine (Spidermonkey from Mozilla) for doing server side data manipulation.

In order to scale, all these databases use replication in various ways and that in turn often requires some kind of underlying revision mechanism in order to deal with consistency. And no, we are not talking scaling on a few servers - we are talking scaling over hundreds or thousands of servers.

Some of these databases are simply very efficient key-value stores, often inspired by Dynamo, a store of that kind written and used internally at Amazon. Amazon has not released Dynamo but they have described in detail how it works and there are several noteworthy open source attempts at implementing it - perhaps Dynomite comes closest. CouchDB is a bit different since it also implements a model around the map/reduce technique pioneered by Google. In short, map/reduce is a slightly formalized way of partitioning work over multiple nodes (map) and then aggregating it all back together into an answer (reduce).
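The basic idea of map/reduce can be sketched with a few lines of shell and awk. This is not how CouchDB does it (its views are normally written in Javascript) and everything runs in a single process here, so the partitioning over nodes is left out. The "documents" and field names are made up for the example.

```shell
# Hypothetical "documents", one per line: <id> <type> <amount>
docs='1 book 10
2 cd 5
3 book 7
4 cd 1
5 dvd 3'

# Map: look at one document at a time and emit a key/value pair for it.
# Since each document is handled independently, this step could be spread
# over many nodes.
map() {
  awk '{ print $2, $3 }'
}

# Reduce: aggregate all values that share the same key into one answer.
reduce() {
  awk '{ sum[$1] += $2 } END { for (key in sum) print key, sum[key] }' | sort
}

result=$(printf '%s\n' "$docs" | map | reduce)
echo "$result"
```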

I will post more fun stuff about CouchDB later - especially some information on my implementation of a "view server" in Squeak so that you can use Squeak Smalltalk as data manipulation language instead of Javascript.

Erlang to the rescue

A very large share of these new extremely scalable systems are written in Erlang, I would guess about half of them! And yesterday Computer Sweden (large biweekly IT paper) noted this on its front page too. Erlang is back. Well, at least until a "cool and hip" language can steal its thunder, but that can actually take a while because Erlang is one of the few languages that have been all about extreme scalability and robustness from the very start, and Erlang has been around for a while to mature. You can read the details about Erlang in other places but it is a functional language with built-in support for "shared nothing" asynchronous message passing.

Erlang is a compelling choice when implementing extremely scalable and robust software. At the same time Erlang demands a bit from the developer - functional languages aren’t easy fits for many developer brains and Erlang is definitely "non mainstream" in other ways too, which just makes me love it. :)

Erlang runs on top of a VM written in C called BEAM. Current releases of BEAM have HiPE integrated, which is a "JIT" native compiler that can be selectively used to compile "hot" parts of the code. Since 2006 BEAM has been able to utilize SMP or in other words multiple cores.

One interesting part of all this is that BEAM/HiPE can be targeted by other languages, for example by producing "core Erlang" (a subset of the language), BEAM bytecodes directly or so called "Erlang forms". Reia is an example of such a language.

Perhaps it is time to look hard at Erlang? :)
