

8
May 12

Literal arrays vs JSON vs STON vs Tirade

Recently there was a range of threads on the pharo-dev mailing list discussing which textual format to use for Smalltalk source code metadata. The discussion veered off from the specific use case, but basically four different formats were discussed and compared, one of which I am the author of. And oh, sorry for the formatting of this article, I need to change the theme on this blog for better readability.

JSON

The first format is JSON, JavaScript Object Notation. JSON is a simple, language-neutral (despite its name), readable format that is very small to implement. It is a restricted variant of the native JavaScript literal syntax for objects (dictionaries) and arrays. It excels in simplicity but lacks a bit in features, and people tend to ignore those shortcomings due to its widespread adoption. I will not go into describing it, json.org does a very good job and there are TONS of JSON implementations around.

STON

Sven Van Caekenberghe recently created a variation on JSON he calls STON, Smalltalk Object Notation. STON is basically JSON plus the following:

  • Object references, the concept of being able to refer to other previously described arrays/objects in the STON file. This is done by number using the @-sign, so “@2” refers to the second array/object in the file.
  • Class prefixing, the idea of annotating arrays and objects (JSON terminology) so that one can instantiate a reasonable class when reading.
  • Symbols, simply adding support for a primitive data type for Smalltalk symbols, although I do note – a limited form of Symbols not allowing the same range of characters in them as Squeak/Pharo does.

Then there are a few subtle differences from JSON, like using $' instead of $" as string delimiter and nil instead of null, but not much else that I can see. Numbers seem to be exactly the same as in JSON, and escape codes inside strings are also the same, obviously by design.

First I admit that I have not played with STON, my comparison is purely in theory. STON has the same basic positive notes that JSON has, it is small, simple and well defined. But are the differences worth it?

JSON is everywhere and there are already tons of parsers for it, probably in every Smalltalk on earth, and of course all other languages too. STON on the other hand is Smalltalk only, and as of this writing probably Pharo only, although I admit it must be simple to port.

It boils down to whether the additions are worth it, and I don’t think they are. Embedding class names, if needed, could be done in JSON, although slightly inelegantly of course. One approach would be to wrap each “typed” object/array in an object like this:

ByteArray [1, 2, 3] ==> {"type": "ByteArray", "data": [1, 2, 3]}

I agree, clunky, but on the other hand I tend to think that the parsing end needs to know the semantics and construction of the JSON anyway – JSON is too “simplistic” to be used as a true generic serialization mechanism and trying to turn it into such a beast by adding types and references, like STON does, is IMHO not that useful.
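To make the wrapping idea concrete, here is a minimal sketch in Python (not Smalltalk) using only the standard `json` module. The `wrap` and `revive` helpers and the "type"/"data" key names are assumptions taken from the example above, not an established convention; the point is only that the reading end has to know about the wrapper.

```python
import json

def wrap(class_name, data):
    """Wrap a value in a plain JSON object that carries a class name (hypothetical convention)."""
    return {"type": class_name, "data": data}

def revive(obj):
    """object_hook for json.loads: turn known wrappers back into typed values."""
    if set(obj) == {"type", "data"} and obj["type"] == "ByteArray":
        return bytes(obj["data"])
    return obj  # anything else stays a plain dict

text = json.dumps(wrap("ByteArray", [1, 2, 3]))
restored = json.loads(text, object_hook=revive)
```

Here `text` is ordinary JSON that any parser can read, while `restored` becomes a `bytes` object only because `revive` knows what "ByteArray" means.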

STON looks neat, but in practice I don’t think the benefits outweigh the ubiquity and availability of JSON. Had it been even more different it might have been another story. But if we don’t think we will use type annotations and circular references – then why not simply use JSON?

Literal Smalltalk arrays

The simplest notation of all in the lineup is the literal array syntax in Smalltalk. The example below covers all its capabilities AFAIK (in Pharo/Squeak), please tell me if I missed anything:

#(4711 3.4 16r3F 'string' #symbol #'another-symbol' (nested array) #(one more) true false nil $x #[12 32])

So we have space separated elements and arrays that can nest, with or without #-prefix inside the array. Primitive literals are numbers (full numeric Smalltalk parser, not as limited as JSON/STON), strings (no escape codes, single quotes need to be doubled), symbols (can handle more characters than STON symbols), character literals, byte array literals and true/false/nil.

Literal arrays are quite nice but they lack the concept of “associations” and thus no simple readable way to represent a Dictionary. And that is a BIG negative. Funny enough, if we added support for literal dictionaries to Smalltalk then literal arrays would match JSON, with a few extras on the side!

Amber has recently added support for dynamic literal HashedCollections using this syntax:

#{'hey'->12 . aString->'123123'}

It is simply a dynamic {} array (originally introduced in Squeak, I believe) but with the assumption that the expressions all evaluate to Associations that are limited to a string as key. This is because it will be turned into a HashedCollection, which is the Amber counterpart of a JavaScript object, and JavaScript objects are limited to having strings as keys (Sidenote: Amber also has a generic Dictionary without that limitation).

Without a syntax for dictionaries, literal arrays, although nifty and syntactically quite compact, are still limited in expression. And of course, while Smalltalk literals are fairly simple to parse, other languages do not typically know how to do it – and when it comes to numbers, the Smalltalk full range of syntax is perhaps a bit of an overkill if we aim at cross language portability. Having literal syntax for Characters is also clearly of less value, ByteArrays on the other hand are obviously useful.

Sidestory: Adding literal Dictionaries to Smalltalk?

Smalltalk only evolves in micro steps every other 10 years, but with the current onslaught of Pharo perhaps there is an opportunity to actually take a few more such steps.

We will see below that Tirade has added support for “->” as a literal syntax instead of being a message send and as I mentioned above Amber has added a special syntax for dynamic Dictionaries, and that was actually done in order to more easily match JavaScript object syntax when interacting with JavaScript.

So perhaps the Smalltalk/Pharo community could decide to add literal Dictionaries to Smalltalk using the Amber “#{” syntax? In such a syntax the separators between Associations can probably not be spaces, it gets confusing to read:

#{ key -> value key2 -> value2 }

A separator is clearly needed and since we generally use periods for that in Smalltalk it’s a good choice. Syntactically it could lead people to think it’s a dynamic Dictionary, but let’s continue the thought experiment. How would it look? As is customary for #() we can omit the # inside the array:

#(123 'hey' {key -> value. key2 -> value} 456)

It looks fairly nice. However I do admit that we probably should take a long hard look at all our syntaxes and try to bring some harmony to them. Currently, due to legacy, we have literal and dynamic Arrays using #() and {}. A bit unfortunate since we then use both $( and ${ as delimiters for Arrays and make it harder to find good characters for Dictionaries.

It would be nice to have a symmetric syntax. Ideally the leading # could indicate “literalness” – and perhaps we could use another character to indicate dynamic evaluation? Again, just a thought:

  • #() – literal Array
  • §() – dynamic Array, expressions separated by periods.
  • #{} – literal Dictionary, literals separated by periods, support for associations as literals.
  • §{} – dynamic Dictionary, expressions separated by periods, associations created as usual using sends.

Yeah, right, how would we ever be able to reach consensus on a leading dynamic character? :) Also, I do think it is wise to syntactically indicate literal vs dynamic, heuristics only lead to developer traps. Better to clearly indicate intention.

Tirade

Tirade is a format I created for Deltas (ChangeSets improved) and I have written three articles about it earlier. Now, if I were to subjectively rank the formats at this point along a few axes it could look like this:

  • Interoperability
    • JSON: 100% (all languages have it)
    • STON: 70% (one could probably tweak a JSON parser in any language to work)
    • Litarrays: 30% (could get higher score if we limit them, a parser would still have to be written)
    • Tirade: 20% (same problem as with literal arrays, but even more advanced to parse)
  • Capability
    • Tirade: 100% (has the most features and options, by some margin)
    • STON: 60% (second best, still not much better than JSON)
    • JSON: 50%
    • Litarrays: 40% (severely limited by lack of associations but has some features to compensate)
  • Grokkability
    • JSON: 100% (well documented, we all know it and so does the rest of the world)
    • STON: 90% (rides on JSON)
    • Litarrays: 80% (not hard but has quite a few quirks)
    • Tirade: 70% (more or less as hard as literal arrays, but with a few more concepts added)

Conclusions from the above? Before looking at Tirade I think we can safely say that JSON is a strong choice. STON is IMHO in limbo, I can’t see picking it instead of any of the others in a given situation, sorry. Literal arrays could easily become the obvious “JSON for Smalltalk” if they had associations/literal dictionaries, but they suck for interoperability.

Tirade on the other hand has associations (on two levels one could even claim) so it can be viewed as “JSON++ for Smalltalk”. But with more features comes a slightly higher learning curve and a penalty in interoperability. We now have set the scene for the last section about Tirade.

Tirade

Obviously I am partial, since I created Tirade. But let me try to contrast Tirade to all the others. Note that Tirade was never meant to be interoperable with other languages, it was however designed to be interoperable between different Smalltalk implementations, or at least all Squeak derivatives.

A stream of messages

First of all, Tirade is slightly different than the others. They describe a single structure. A valid Tirade “document” on the other hand, is a series of “records” terminated by periods. Each such “record” looks like a Smalltalk message (but without a receiver on the left side), either a unary or a keyword message, like this:

unaryMessage.
key: 'Hello' word: 'world' message: 4711.

This high level view as a “stream of messages” gives us several nice properties:

  • The selector of the Tirade message is a kind of record “type”. It normally maps to a method on the receiving end that handles this record. That method then knows what to do with the arguments, and thus we don’t need to hard code class names into Tirade, like STON does. NOTE: This is not a security problem. There is nothing forcing the parsing end to just blindly perform these messages. In fact, there is nothing forcing the parsing end to be specific at all, it could just be a generic Tirade parser.
  • If we look at a keyword message we realize that it is very similar to a JSON object, it is basically a “naked dictionary” where each key word is… right, a key! :) So for simple data we need perhaps not make it more complicated than this.
  • It makes it very easy to extend a Tirade format by simply adding new message selectors that the receiving end can ignore if it wants to.
  • Since Tirade is a flow of messages instead of a single, potentially quite large, structure like the other three formats, we can naturally stream it and handle each message one by one.
  • And since we have this flow we can also use “control messages” that can instruct the receiving end on how to receive the messages coming next in the flow. One could even use Tirade over a bidirectional link (a SocketStream for example) and do handshaking and client server communication with it.
  • Finally, in between Tirade messages one can add Smalltalk style comments which are simply skipped by the parser. JSON and STON have no concept of comments.
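The properties above can be sketched roughly as follows, in Python rather than Smalltalk and without any real Tirade parsing: records are assumed to arrive already parsed as (selector, arguments) pairs. The names `Receiver`, `process`, `method_name` and the `on_...` naming convention are hypothetical, chosen only for this illustration.

```python
def keyword_record_to_dict(selector, args):
    """View a keyword message as a 'naked dictionary': each keyword is a key."""
    if ":" not in selector:
        return {}  # unary message, no arguments
    keywords = [k for k in selector.split(":") if k]
    if len(keywords) != len(args):
        raise ValueError("keyword/argument count mismatch")
    return dict(zip(keywords, args))

def method_name(selector):
    """Map a Smalltalk-style selector to a Python method name (made-up convention)."""
    return "on_" + "_".join(k for k in selector.split(":") if k)

class Receiver:
    """The receiving end: one method per record 'type' it cares about."""
    def __init__(self):
        self.seen = []

    def on_key_word_message(self, key, word, message):
        self.seen.append((key, word, message))

def process(records, receiver):
    """Handle records one by one; selectors without a handler are simply skipped."""
    ignored = []
    for selector, args in records:
        handler = getattr(receiver, method_name(selector), None)
        if handler is None:
            ignored.append(selector)
            continue
        handler(*args)
    return ignored

records = [
    ("unaryMessage", []),
    ("key:word:message:", ["Hello", "world", 4711]),
]
receiver = Receiver()
ignored = process(records, receiver)
```

Because `process` consumes one record at a time, `records` could just as well be a generator reading from a socket, which is the streaming property described above.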

Smalltalk literals

The next level of Tirade is what kind of arguments we are allowed to put in between the keywords. Basically it’s most kinds of Smalltalk literals with some additional constructs. I would also like to point out that this part is not carved in stone, I am still contemplating the best mix of literal support here. But the main point is that we only allow literals, no expressions, so there is no generic “eval” going on here.

Notable differences again compared to JSON/STON on the atomic level are just like with literal arrays:

  • Strings are Smalltalk strings, no escape codes except for double single quote for single quote.
  • Numbers are Smalltalk literal numbers, in fact we rely on the number parser of Pharo/Squeak. This gives us a rich notation for numbers, at the expense of possible portability issues with other Smalltalks.

NOTE: Tirade doesn’t currently implement Character literals nor ByteArrays, both can of course be added.

Let’s continue with the added features for literals.

Literal feature: Verbatim strings

A problem with JSON for dealing with readability is that JSON strings can’t have newlines in them! So if you want to store source code in JSON it will end up as a single very long line.
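A small Python check of that claim, using only the standard `json` module: a newline inside a string has to be written as the escape `\n`, and a raw newline inside a JSON string is rejected by a conforming parser. The sample source text is made up for the illustration.

```python
import json

source = "at: pos put: arg\n    ^array at: pos put: arg"

# Encoding escapes the newline, so the JSON text stays on a single line.
encoded = json.dumps(source)

# A raw (unescaped) newline inside a JSON string is a parse error.
try:
    json.loads('"first line\nsecond line"')
    raw_newline_accepted = True
except json.JSONDecodeError:
    raw_newline_accepted = False
```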

Smalltalk strings, like in Tirade, can have newlines in them, but they suffer from double quoting of single quotes and the problem that the single quotes surrounding the string need to be first on the first line and last on the last line, which makes it less readable.

This is why I came up with verbatim strings in Tirade, specifically for being able to contain unmodified source code in a readable way with no escapes whatsoever. I am not sure if this is the best approach, perhaps here-docs would be a simpler approach, but currently a verbatim string looks like this:

some: 1 message: 'hey' withVerbatimStringForCode: [
 This is untouched, perfectly unescaped source code, ANY character combinations will work!
 Tirade will split the input on each CR (byte = 13) and then prepend each line with a TAB character.
 This means that the parser can detect the end by looking for the first line starting with "]",
 that must be the end of the verbatim string since all other lines start with TAB.
 Copy paste will work but you will need to care for the TAB indentation, but most editors
 can do that easily. Also, right before and after the string there is a newline added to improve readability.
].

Literal feature: Associations

Since we really want to be able to do dictionaries I first added literal support for Associations. This means “->” is a literal syntax for creating an Association, it doesn’t need to be in a Dictionary, you can use them wherever you like and the key and value can be ANY literal construct allowed by Tirade, even an Association!

Note though that we do not have parentheses in Tirade (no expressions at all), and the current Tirade parser is a recursive descent parser that parses the right-hand side of “->” first, so the code below will produce an Association with key #key and value an Association 123->'123'. In Smalltalk, where #-> is a message, this is instead evaluated from left to right, creating a different result.

cool: #key->123->'123'.
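The difference between the two readings can be shown with a tiny Python sketch, where a tuple stands in for an Association. The helper names are hypothetical and no actual Tirade or Smalltalk parsing is involved; the list simply holds the three operands of the example above.

```python
from functools import reduce

def assoc(key, value):
    """A 2-tuple stands in for a Smalltalk Association in this sketch."""
    return (key, value)

def nest_right(items):
    """Tirade reading as described above: the value of the first -> is itself an association."""
    result = items[-1]
    for item in reversed(items[:-1]):
        result = assoc(item, result)
    return result

def nest_left(items):
    """Smalltalk reading: -> is a binary message, evaluated left to right."""
    return reduce(assoc, items)

operands = ["#key", 123, "'123'"]
tirade_result = nest_right(operands)    # ('#key', (123, "'123'"))
smalltalk_result = nest_left(operands)  # (('#key', 123), "'123'")
```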

This also means that Tirade can have associations inside literal arrays, which is not syntactically possible in Squeak/Pharo:

cool: #(12->'123').

Finally, since Amber lately added #{} syntax for Dictionaries I think it could be a worthwhile addition to Tirade also.

Literal feature: Dynamic arrays as literal

Tirade supports {} style arrays, but doesn’t allow expressions so they are very much like normal arrays except they do not remove #-prefixes from nested arrays/symbols and they look more natural to Squeakers since Squeak allows Association literals inside them:

cool: {12->'123'. 'banana'->true}.

Is it worth supporting both kinds of arrays? It depends, either Tirade defines a literal subset that is as small as possible, or Tirade tries to cover all literals of Pharo. I was leaning towards a subset but perhaps a superset is more attractive to people.

Ending thoughts

I hope this article explained a few things and made at least Tirade a bit clearer. There are several things not fully settled in Tirade and if anyone wants to dig in and tweak it, feel free to email me.

regards, Göran


7
Feb 12

Current Smalltalk obsessions…

These days I am, as usual, torn between several interesting technical projects.

Amber

The new Smalltalk called Amber (by Nicolas Petton) that compiles to JavaScript is pretty awesome and there are tons of interesting things one can do with it. My contributions so far include the beginning of a package model, a faster simpler chunk format exporter/importer, a command line compiler, a Makefile system so that Amber can be built fully from the command line, a bunch of examples running on top of Nodejs and webOS, and a few other odds and ends.

I would like to port Deltas to Amber in order to create a powerful toolset for managing code changes. Using local storage it would, among other things, enable undo and change logging to prevent accidental code loss. It could also easily form the basis for a “commit tool”, functionality similar to what git stash offers, etc.

Another thing I would like to build is a dead simple public shared package repository. And play with Socket.IO, or just fool around with the compiler trying to add optimizations like various type inferencing, optimizing self and super sends etc :) . So much fun stuff to do!

STOMP and Apollo

For a personal “secret project X” I need scalability so it is being designed with lots of daemons each taking care of a specific task. I want to be able to implement these daemons primarily in either Nodejs (in plain js or using Amber) or Pharo Smalltalk, but also in any other language that fits.

This requires some kind of messaging infrastructure to tie them together. So… after looking hard and long and reading a lot about messaging, job scheduling, AMQP, 0MQ, STOMP, Beanstalkd, RabbitMQ, ActiveMQ Apollo (and tons of other things) I decided to try to use the new ActiveMQ Apollo together with STOMP 1.1 (which should also be supported by the STOMP plugin for RabbitMQ etc).

The new Apollo implementation is written in Scala using HawtDispatch so the architecture seems modern and the JVM of course has very good performance these days. So, while I generally am very tired of Java and its eco system, this actually seems like a solid product and has already shown very impressive numbers in benchmarks.

So a sound asynchronous architecture with good performance is nice but the other thing I like with ActiveMQ is their focus on STOMP. Since I intend to use Pharo as one major component I need to be able to hook it into the messaging backbone. And sure, Tony Garnock Jones, one of the main developers behind RabbitMQ, actually has an AMQP client library written for Squeak 3.9, so I could probably use AMQP, but I somehow foresee a “world of hurt” in the complexity given that AMQP is an order of magnitude more complex than STOMP.

I have already implemented STOMP 1.0 for Pharo, actually tried it with RabbitMQ at the time, so I am now upgrading that library to work with 1.1 of the specification.

Riak

The other important piece of the puzzle for true “Internet scalability” is of course the choice of persistence. I am a long time fan of the new NoSQL databases and having played with a few of them, implemented a C# binding for CouchDB, hacked some bindings in Squeak for both CouchDB and Tokyo Tyrant… I now have decided to focus on Riak. Riak is IMHO the most interesting NoSQL database out there right now, at least for worry free ultra scaling. Sure, it may not be the fastest on a single box – but if you are really serious about scaling – one box is totally uninteresting. :)

Runar Jordahl had already started a Riak binding in Pharo, I took it and changed quite a lot of it. Not really because it was “bad” or anything, I just have a different style of coding I guess. So I decided to fork because I didn’t feel comfortable, and thus Phriak was born. Now Nicolas Petton is getting hard into Riak too and has pushed Phriak forward quite a LOT in the last few days, much further than I had time to do. It now has a clean command style protocol implementation, an object model similar to the one in Ripple (Ruby Riak client) and initial working code for secondary indexing, link walking and map/reduce! Quite impressive stuff.

Nicolas is also experimenting with writing an “OODB-ish” database using Fuel called Oak and after I managed to get him hooked on Riak he has been moving that codebase over onto Phriak. The initial experience we have with Phriak and Oak is extremely promising and who knows where this will lead.

Happy coding, Göran


15
Apr 11

Tirade, supporting embedded text

Two years ago I ended up creating Tirade – a new “file format” for Smalltalkers. Or rather, a way to serialize stuff into a sequence of Smalltalk messages with literals as arguments. I have written a few blog articles about Tirade so I will not go into details in this one.

One thing that has been disturbing with Tirade is that I wanted it to be the main format for serializing Deltas, the new implementation of “21st Century ChangeSets”. This means I want Tirade to handle Smalltalk source code in the best possible way. Ideally I would want the Tirade file to be editable in a text editor, without the file being broken by that.

So, what properties do we want:

  1. No escaping of special characters. Regular Tirade strings (just like in Smalltalk) need to escape the single quote as a doubled single quote, and that would suck for Smalltalk code of course.
  2. No length encoding. One way to avoid escaping is to store the length of the data before the actual data – like a Netstring for example. This prohibits easy editing in a text editor though, since that would change the length.
  3. A reasonable syntax. Tirade so far has been a subset of Smalltalk (disregarding lack of receiver to the left), but I think we might have to break that a bit here.

After pondering this for a while I have come up with this solution which feels kinda nice, but if someone has an even better idea I am all ears. This is how it could look embedding a method source in Tirade:

class: #MyClass selector: #at:put: source: [
      at: pos put: arg
      "Put something here"

      ^array at: pos put: arg
].

So what gives here?  We are reusing the syntax for Smalltalk blocks without arguments. Simply [...content...]. The content will be delivered as a String and the guarantee is that it will be received exactly as sent. There is a trick here – this is what Tirade will do:

  1. Write the starter $[ and then a CR
  2. Before each line in the string (a line being all characters up to and including the next CR or up to end) we insert a TAB. This means that the String begins on the line after the opening $[ and all lines will be prefixed with a TAB.
  3. Then, regardless if the last line ended with a CR or not - we add a CR before the closing $]. This makes sure the closing $] ends up on its own line.

The above trick gives us the ability to detect the end of the string because if a line starts with something other than a TAB then we have reached the end. Thus we do not have to escape the $] inside the string and we still don’t need to do length encoding. We DO however need to make sure all lines begin with a TAB, but if you are editing a Tirade file you should just learn that fact. :)
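The three writing steps and the end detection can be sketched like this in Python. This is only an illustration of the scheme as described, not the actual Tirade code; the function names are made up, and the handling of text that already ends with a CR is one possible reading of step 3.

```python
CR = "\r"
TAB = "\t"

def split_lines_keep_cr(text):
    """Split into lines, each being all characters up to and including the next CR (or up to the end)."""
    lines, start = [], 0
    while start < len(text):
        end = text.find(CR, start)
        if end == -1:
            lines.append(text[start:])
            break
        lines.append(text[start:end + 1])
        start = end + 1
    return lines

def encode_verbatim(text):
    """Steps 1-3: '[' and CR, a TAB before every line, then always a CR before ']'."""
    body = "".join(TAB + line for line in split_lines_keep_cr(text))
    return "[" + CR + body + CR + "]"

def decode_verbatim(stream, pos=0):
    """Read a verbatim string starting at pos; return (text, position after the closing ']')."""
    if stream[pos:pos + 2] != "[" + CR:
        raise ValueError("verbatim string must start with '[' followed by CR")
    pos += 2
    chunks = []
    while stream.startswith(TAB, pos):  # content lines always start with TAB
        end = stream.find(CR, pos)
        if end == -1:
            raise ValueError("unterminated verbatim string")
        chunks.append(stream[pos + 1:end + 1])
        pos = end + 1
    text = "".join(chunks)
    if stream.startswith("]", pos):
        return text[:-1], pos + 1       # the last CR read was the one the writer added
    if stream.startswith(CR + "]", pos):
        return text, pos + 2            # the original text itself ended with a CR
    raise ValueError("expected closing ']' on its own line")

source = "at: pos put: arg" + CR + "\"Put ']' and 'quotes' here\"" + CR + "^array at: pos put: arg"
encoded = encode_verbatim(source)
decoded, next_pos = decode_verbatim(encoded + ".")
```

Note that a `]` or a single quote inside the content needs no escaping, since the reader only looks at the first character of each line.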

I am not sure if the above is a good solution, but it is ONE solution and I can’t come up with a better one, unless we would use a really “odd” marker at the end in order to not have to escape it, but that feels “dirty” to me.


20
Apr 09

Tirade, first trivial use

Last night I started hooking Tirade into Deltas. Quick background: Deltas is “Changesets for the 21st century”, or in other words an intelligent patch system under development for Squeak. Tirade is a Smalltalk/Squeak centric “JSON”-kinda-thingy. I made Tirade in order to get a nice file format for Deltas. Just wanted to share how the first trivial code looks, and thus illustrate simple use of Tirade.

I have a DSDelta (a Delta being almost like a ChangeSet). It consists of some metadata (a UUID, a Dictionary of properties and a TimeStamp) and a DSChangeSequence (which holds the actual DSChange instances). As a first shot I only implemented the metadata bit. So step by step:

  1. Write a unit test, first let’s set up our readers and writers on a common stream:
         setUp
             | stream |
             stream := RWBinaryOrTextStream on: String new.
             reader := DSTiradeReader on: stream.
             writer := DSTiradeWriter on: stream

…then a trivial write, read and compare test – note that they both look at the same stream:

        testEmptyDelta

            | delta same |
            delta := DSDelta new.
            writer nextPut: delta.
            reader reset.
            same := reader next.
            self assert: same = delta.
            self assert: delta timeStamp = same timeStamp.
            self assert: delta properties = same properties.
            self assert: delta uuid = same uuid
  2. Create DSTiradeWriter. It turns out that DSTiradeWriter at this point is just an empty subclass of TiradeRecorder! Eventually we might need to add behaviors but at this point there is no need. The TiradeRecorder uses DNU to intercept messages and encode them as Tirade.
  3. Implement #tiradeOn: in our domain object DSDelta. This will be used by the writer and looks like this:
         tiradeOn: recorder
    
             recorder
                 delta: uuid asString36
                 stamp: timeStamp printString
                 properties: properties

…here we convert the UUID to a String (base 36) and the timeStamp too. The properties Dictionary just holds “simple” data that Tirade can represent, so no need to convert it. The rule is that we make up a message (in this case #delta:stamp:properties:) which will be used in the Tirade stream, and we make sure our arguments are “Tirade proper” which basically means Booleans, Strings, Symbols, Arrays, Numbers, Associations and Dictionaries thereof. Note that the recorder being a DSTiradeWriter inherits the implementation of #doesNotUnderstand: from TiradeRecorder that will write this Tirade message onto the stream typically looking like this:

        delta: 'd71oknvt1bwswhno6iwgund07' stamp: '20 April 2009 11:20:50 am' properties: nil.
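The doesNotUnderstand: trick has a rough analogue in Python's `__getattr__`, which is only invoked for attributes that are not found normally. The sketch below is an assumption-laden illustration, not a port of TiradeRecorder: the class and function names are hypothetical, underscores stand in for the colons of a keyword selector, and only a handful of literal types are handled.

```python
def format_literal(value):
    """Print a value as a Smalltalk-style literal (only a few types, for illustration)."""
    if value is None:
        return "nil"
    if value is True:
        return "true"
    if value is False:
        return "false"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"  # single quotes are doubled
    if isinstance(value, (int, float)):
        return repr(value)
    raise TypeError("not representable in this sketch: %r" % (value,))

def format_message(name, args):
    """Turn delta_stamp_properties(a, b, c) into 'delta: a stamp: b properties: c.'"""
    if not args:
        return name + "."
    keywords = name.split("_")
    if len(keywords) != len(args):
        raise ValueError("keyword/argument count mismatch")
    return " ".join(k + ": " + format_literal(a) for k, a in zip(keywords, args)) + "."

class RecorderSketch:
    """Intercepts unknown method calls and records them as message text."""
    def __init__(self):
        self.lines = []

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        def record(*args):
            self.lines.append(format_message(name, args))
            return self
        return record

recorder = RecorderSketch()
recorder.delta_stamp_properties("d71oknvt1bwswhno6iwgund07", "20 April 2009 11:20:50 am", None)
```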

And then the final step, our reader:

  4. Create a DSTiradeReader. We simply create an implementation of the above Tirade message #delta:stamp:properties: and put it in the method category “tirade” so that the default security mechanism is happy:
         delta: uuidString36 stamp: timeStampString properties: properties
    
             result := DSDelta new.
             result uuid: (UUID fromString36: uuidString36); properties: properties; timeStamp: (TimeStamp fromString: timeStampString)

…this class inherits an instvar called ‘result’, which is fine to reuse. As you see the properties need no conversion, the others are converted from Strings.

And tada – the unit test is green! So we implemented reading and writing in more or less two lines of code. Kinda neat! :)


20
Mar 09

Tirade, part 2

In an article recently I described Tirade – a new generic “file format” for Smalltalk/Squeak, or actually a sub language! Since that article I have refined Tirade a bit. Tirade consists today of 4 classes (parser, reader, writer, recorder) totalling about 500 lines of code, excluding tests. Tests are green in 3.10.2, pharo-10231, 3.9, 3.8 and 3.7. It does turn red in 3.6 due to old initialize behavior, some missing methods etc, probably easily fixed if anyone cares. There are no dependencies on other packages. Compared to using the old Compiler>>evaluate: it is about 5-7 times faster.

Tirade is a very small “language” similar to JSON (see below) and probably fits similar use cases as JSON fits.

Numbers

In my first Tirade description I opted out and only supported plain integers, no frills at all. Then after subsequent discussion I came to the conclusion that syntactically there is no problem to let TiradeParser>>parseInteger become TiradeParser>>parseNumber and just let it handle all kinds of Number literals that Squeak supports, by either using SqNumberParser if present (in Squeak 3.9+) or by falling back on regular old Number class>>readFrom: which Scanner still uses in 3.10.2.

So now Tirade deals perfectly fine with:

  • 23.45 (Floats)
  • 16rFE (radix)
  • 1.0034e-5 (scientific notation)
  • 243s2 (scaled decimals)
  • “NaN”, “Infinity” and “-Infinity”

…and whatever else should be there.
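To give a feel for what these notations mean, here is a Python sketch that reads a small subset of them. It is not SqNumberParser and makes no claim to cover the full Smalltalk number syntax; the function name and the choice of `Decimal` for scaled decimals are assumptions for the illustration.

```python
import re
from decimal import Decimal

def parse_smalltalk_number(text):
    """Parse a small subset of Smalltalk number literals (illustrative only)."""
    special = {"NaN": float("nan"), "Infinity": float("inf"), "-Infinity": float("-inf")}
    if text in special:
        return special[text]
    # Radix notation: <base>r<digits>, e.g. 16rFE
    match = re.fullmatch(r"(-?)(\d+)r([0-9A-Z]+)", text)
    if match:
        sign, base, digits = match.groups()
        value = int(digits, int(base))
        return -value if sign else value
    # Scaled decimals: <number>s<scale>, e.g. 243s2 keeps two decimal places
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)s(\d+)", text)
    if match:
        number, scale = match.groups()
        return Decimal(number).quantize(Decimal(1).scaleb(-int(scale)))
    if re.fullmatch(r"-?\d+", text):
        return int(text)
    return float(text)  # plain floats and scientific notation
```

For example, `parse_smalltalk_number("16rFE")` gives the integer 254 and `parse_smalltalk_number("243s2")` gives a `Decimal` printed as 243.00.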

The performance penalty if we use SqNumberParser (Squeak 3.9+) is not that bad, about 20% on my little trivial benchmark. Using Number class>>readFrom: hurts more, increasing the benchmark time by around 50%.

Security…

First I played with having the builder object (that is typically fed the Tirade messages from the Tirade reader) implement isSelectorAllowed: etc. I finally ended up encoding a simple security scheme in the default TiradeReader that relies on finding the implementations of the Tirade messages in the builder in a method category beginning with “tirade”. It seems simple enough for most uses.

I also added a global “whitelist” of Tirade messages that can be registered in the reader before starting to parse. If selectors are found in this whitelist they are considered “ok”. This can be useful in some situations.

If the builder relies on catching Tirade messages using doesNotUnderstand: then it is on its own for security, but that seems fine.

Finally you can turn off all selector checks by using #unsafe:.
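A rough Python analogue of this selector check is sketched below. Python has no method categories, so a decorator that marks methods stands in for the “tirade” category; the names `tirade`, `ReaderSketch`, `whitelist` and `unsafe` are hypothetical and do not mirror the real TiradeReader API.

```python
def tirade(method):
    """Mark a method as callable from input (stand-in for the 'tirade' method category)."""
    method.tirade_allowed = True
    return method

class Builder:
    def __init__(self):
        self.log = []

    @tirade
    def delta_stamp(self, uuid, stamp):
        self.log.append((uuid, stamp))
        return self

    def delete_everything(self):  # not marked, so input must not reach it
        self.log.clear()
        return self

class ReaderSketch:
    def __init__(self, builder, whitelist=(), unsafe=False):
        self.builder = builder
        self.whitelist = set(whitelist)  # globally registered "ok" selectors
        self.unsafe = unsafe             # True turns off all selector checks

    def is_allowed(self, name):
        if self.unsafe or name in self.whitelist:
            return True
        method = getattr(type(self.builder), name, None)
        return getattr(method, "tirade_allowed", False)

    def send(self, name, *args):
        if not self.is_allowed(name):
            raise PermissionError("selector not allowed: " + name)
        return getattr(self.builder, name)(*args)

reader = ReaderSketch(Builder())
reader.send("delta_stamp", "abc", "20 April 2009")
```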

Receiver juggling

Tirade is meant to separate “concerns” between Tirade “code”, parser, reader and the builder object supplied by you. The Tirade “code” has no control over the receiver of the messages, Tirade “code” is just a sequential flow of messages separated with periods. The TiradeParser also doesn’t care, it just parses and then does “self processMessage”, if you are using TiradeParser directly it has a default implementation of #processMessage that prints them out in Transcript and collects them in an OrderedCollection.

So yes, you can use TiradeParser to just gobble up some Tirade input and then muck about with the OrderedCollection afterwards – similar to how you work with JSON or an XML DOM. But the better approach is of course to subclass TiradeParser and implement #processMessage to actually do something – in a streaming SAX-ish fashion.

Then we have the reader. There is a default TiradeReader that implements the security described above and also implements logic for deciding the “next receiver” of the Tirade messages. The logic goes like this:

  • If the builder supplied implements Tirade messages by always returning self, it will always be the receiver. Simple.
  • If the builder returns another object X, X will be used as the “next receiver”.
  • As long as X returns self it stays as the “next receiver”.
  • If object X returns another object Y, X will be put on a “stack of old receivers” and Y will be used as the “next receiver”.
  • If Y returns nil, X will be popped and be used as the “next receiver”.
  • If X returns nil we are back to the original builder, and if it returns nil nothing changes.

So if the above is “enough” for controlling the receivers, then the builder object handles it by simply returning the “right” objects. These objects can of course be “sub builders” or domain objects themselves or whatever.
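The receiver rules listed above can be sketched in Python as follows, with `None` playing the role of nil. This covers only the basic logic, not control messages, and all class names are hypothetical. Note that Python methods return `None` unless they explicitly return something, so "returning self" has to be spelled out.

```python
class ReceiverJuggler:
    """Tracks the 'next receiver' based on what each message returns."""
    def __init__(self, builder):
        self.receiver = builder
        self.stack = []  # stack of old receivers

    def send(self, selector, *args):
        current = self.receiver
        result = getattr(current, selector)(*args)
        if result is current:
            return                      # returned self: receiver stays
        if result is None:
            if self.stack:              # nil: pop back to the previous receiver
                self.receiver = self.stack.pop()
            return                      # nil from the original builder changes nothing
        self.stack.append(current)      # another object: it becomes the next receiver
        self.receiver = result

class Delta:
    def __init__(self):
        self.changes = []

    def change(self, name):
        self.changes.append(name)
        return self

    def end(self):
        return None

class Builder:
    def __init__(self):
        self.deltas = []

    def note(self, text):
        return self

    def delta(self):
        self.deltas.append(Delta())
        return self.deltas[-1]

    def done(self):
        return None

builder = Builder()
reader = ReceiverJuggler(builder)
reader.send("note", "hello")    # builder stays the receiver
reader.send("delta")            # the new Delta becomes the receiver
reader.send("change", "a")      # Delta returns self and stays
reader.send("end")              # Delta returns None, builder is popped back
reader.send("done")             # builder returns None, nothing changes
```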

If the above is not enough you can register “control messages” in TiradeReader. A control message can be any selector and will result in TiradeReader pushing the current receiver on the stack and setting the original builder object as the “next receiver”. There is also a small twist, if the control message returns self the reader will consider that to be equivalent to “nil” and thus pop the previous receiver back. This is because the common use is to make sure all control messages are sent to the original builder without disrupting the current stack of receivers. But… why? This enables the builder to explicitly control the reader during the parse, perhaps manipulating the current stack, even though it is not the “next receiver” receiving the regular Tirade stream of messages.

One very good reason to use this is when the current receiver is a domain object that does not “know” when to return nil to pop itself.
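A small Python sketch of how such control messages could behave, under my reading of the description above. Again this is a toy model and not the real TiradeReader: the names (Reader, Builder, Item, note, end_item) are hypothetical, and letting the builder reach directly into the reader's stack is just one assumed way of "manipulating the current stack".

```python
# Toy model of control messages; NOT the real TiradeReader.

class Reader:
    def __init__(self, builder, control_selectors=()):
        self.builder = builder
        self.control = set(control_selectors)
        self.receiver = builder
        self.stack = []

    def send(self, selector, *args):
        is_control = selector in self.control
        if is_control:
            # Control messages always go to the original builder.
            self.stack.append(self.receiver)
            self.receiver = self.builder
        result = getattr(self.receiver, selector)(*args)
        if result is None or (is_control and result is self.builder):
            # nil, or self from a control message: pop the previous receiver
            if self.stack:
                self.receiver = self.stack.pop()
        elif result is not self.receiver:
            self.stack.append(self.receiver)
            self.receiver = result
        return result


class Item:
    """A domain object that always returns self and never pops itself."""
    def __init__(self):
        self.attribs = {}

    def attrib(self, key, value):
        self.attribs[key] = value
        return self


class Builder:
    def __init__(self):
        self.items = []
        self.notes = []
        self.reader = None        # set once the reader exists

    def begin_item(self):
        item = Item()
        self.items.append(item)
        return item

    def note(self, text):
        # Control message that leaves the receiver stack alone.
        self.notes.append(text)
        return self

    def end_item(self):
        # Control message that drops the Item the reader just pushed,
        # so the pop that follows lands on whatever was below it.
        self.reader.stack.pop()
        return self


builder = Builder()
reader = Reader(builder, control_selectors={"note", "end_item"})
builder.reader = reader

reader.send("begin_item")            # receiver is now the Item
reader.send("attrib", "size", 43)    # Item returns self, stays
reader.send("note", "hi")            # goes to the builder, Item is restored
reader.send("attrib", "color", None)
reader.send("end_item")              # builder pops the Item on its behalf
```

Here the Item never has to know about nil or the stack. The builder ends it from the outside through the end_item control message.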

I am not perfectly happy with the current mechanisms, but it will do for now and I will revisit this when I see how it works out in practice. The important bits are in place though – Tirade input has no control over receivers and the builder object can control it if needed.

Compared to JSON again

The differences compared to JSON that I see right now:

  • Smalltalk syntax and parsing rules for Strings. This means no escapes except for doubled single quotes. JSON has 8 other escape codes. The immediate advantage for me is being able to store readable code in Tirade, including newlines.
  • Smalltalk syntax for Numbers. This means more capabilities for parsing numbers than JSON has (radix, NaN, Infinity, scaled decimals).
  • Symbols. JSON only has Strings.
  • Associations. JSON has an “object” which is a Dictionary restricted to String keys. JSON does not have a free standing Association. In Tirade any of the allowed objects can of course be keys or values and Associations can be “standalone”. So there is a little more flexibility here.
  • Comments. Hmmm, JSON has no comments. Tirade allows Smalltalk comments, but ONLY between messages.
  • Messages. This is the big difference, Tirade consists of a sequence of unary or keyword messages, with the “data” described above as arguments.

The addition of messages adds an important extra level of “classification”, “control” or “typing” or call it what you want. It also lends Tirade to easy streaming and concatenation. A JSON document consists at its top level of either a Dictionary or an Array. A parser could of course parse that in a “streaming fashion” one element or pair at a time, but they normally don’t do that I think.

Having messages of course makes it much more natural to map these messages onto a builder or multiple builders and also to use messages to control the message flow. I think this makes Tirade much more expressive in itself.

In summary, Tirade is similar to JSON but extended with messages and comments, has more advanced Numbers, deals with text more easily (no escaping of CRs etc), has a little bit more flexibility in its data model (Associations) and uses Smalltalk syntax for it all.

Potential uses for Tirade

I started out with a focus on replacing “chunk format” with something simpler and secure for Deltas (Deltastreams project), eliminating the use of Compiler to parse it. Beyond that, there are several interesting places where Tirade could be used, for example:

  • DSLs
  • RPC-ish communication
  • Transaction logs

…and a few more things :)

But hey, one thing at a time.


16
Mar 09

Tirade, a file format for Smalltalkers

In my revived work on Deltastreams in Squeak I ended up facing the choice of native file format for Deltas. Matthew has made an advanced format called InterleavedChangeset which manages to squeeze a binary representation of a Delta into a Changeset file (which is in Smalltalk chunk format). An impressive feat, and it has the advantage of being backwards compatible in the sense that a Delta in this format can be filed in as a plain old Changeset into an old Squeak image.

But I must say I don’t think that benefit alone is enough to justify these tricks. Oh well, time will tell – and multiple formats for Deltas are fine to have.

Looking at another file format for Deltas I decided I want these properties:

  • Readable. Changesets are nice because they are indeed readable. You can look at them in emacs if you like.
  • Editable. Same here, if you really need to you can edit changesets manually. The syntax is not totally bad.
  • Secure. Ouch, changesets fail here because they rely on Compiler>>eval: which of course opens up tons of tricks you can do. I don’t want that with Deltas.
  • Declarative. This is a spectrum but I would like the “style” for the file format to be declarative.
  • Streamable. The parser should not have to read it all into a “DOM structure” before actually doing something with it.
  • Extendable. It should be easy to extend the format in the future and have older code “ignore” such extensions.
  • Fast. It should be fast to parse and fast to produce. As always. :-)
  • Small. I mean both conceptually (easy to understand), codewise (parser etc) and syntactically.

JSON or YAML

First I looked at JSON which nails 5-6 out of these 8. It is indeed “Readable”, “Editable”, “Secure”, “Fast” and “Small”. And it is very “hip” in the web arena. It suffers from some problems though:

  • Strings can not have CRs in them, which will make “larger texts” such as source code a pure pain to read. This is a killer! Sorry.
  • It is actually not so well suited for streaming. A JSON document is a single “thing”, either a Dictionary or an Array. Sure, we could use convention and parse one key-value-pair (or element if it is an Array) at a time, but then I would need to write a new parser anyway.
  • It is declarative since it is “just data” but on the other hand this means it also lacks semantics. We would need to use Strings and keys (in key-value pairs) to denote our own semantics. Works, but probably would make the structure less apparent.
  • Same goes for extendability. Sure, we can add new keys in our key-value pairs etc, but what we really want is to be able to add new semantic elements, and again, those would need to be “encoded as data” in JSON. I think it would get messy.

…so JSON is out. I do like JSON though, in its utter simplicity and above all – it is available everywhere.

What about YAML then – JSON’s big brother? YAML is a superset of JSON, well, it aims to be a strict superset at least. I did find a very good blog article comparing them. Looking at YAML it seemed to have lots of better ways to represent source code with CRs etc, but hey… simple? Mmmm, not. Sure, it may be “simple” to look at, but I sure didn’t think the specification was simple. It is in fact HUGE and I had a hard time reading it due to its style. Sorry, not “small” enough for me! ;-)

When discussing all this on squeak-dev the good ole “Hey, why not Smalltalk?”-mantra popped up. I have been there lots of times, using Smalltalk as “representation language” is very neat since it enables so called “internal DSLs” at such ease, but it does fail pretty hard on “Secure”, “Declarative” (depends on how you view that one, I know), “Streamable” and in some ways “Small”. Although it would be strange if we didn’t have Compiler available when dealing with Deltas :-) .

Smalltalk… or Angry Smalltalk?

But bringing it up did get my mind working – how about a nice subset of Smalltalk? That does what JSON does but fixes the problems I identify above? Pretty quickly a subset of Smalltalk crystallized itself that I named “Tirade” as in:

“a long angry speech or scolding. Synonyms: diatribe, harangue, rant”

Tirade, as the name somewhat implies, is just a sequence of Smalltalk unary or keyword messages with optional Smalltalk comments in between.

The only thing making Tirade NOT a strict syntactic subset of Smalltalk is that there is no receiver to the left, this is up to the reader/parser to decide. One seemingly useful pattern for determining the receiver is to use the result of each message as the next receiver.

Here is a fantasy example of Tirade input:

        "We allow unary messages, the period after each message is mandatory.
        A unary message is probably used mainly for semantic structure."
        beginWumpus.

        "We allow Symbols."
        push: #superbat.

          "Whitespace is just fine, so indentation can be used, but has no meaning.
          In Strings we use double single quote for a single quote. No other escaping exists."
          name: 'Bat from hell'
          description:

        'Black as devil''s tar, evil. Appear in flocks
        and when they come you better duck.
        Duck fast.'.

          "Integers are fine, positive or negative. No scientific notation, no floats."
          power: 'Can screech' damage: 3.
          power: 'Can bite' damage: -6.

          attrib: #size->43.
          attrib: #color->nil.
          attrib: #dangerous->true.

          moreattribs: { #wings-> 2 . #ears -> nil. #teeth -> {'one'.'two'.'three'}}.

        pop: #superbat.

        endWumpus.

Well, you get the idea. So source code with CRs will work fine, just like in Changesets. We could even add support for indenting Strings with embedded CRs (as you can see above they otherwise break indentation), but I haven’t done that. So we have Strings, Symbols, Integers, Associations, true/false/nil and brace Arrays. Nested in any combination and depth. In Smalltalk all these are “literals” except brace Arrays, which are evaluated at runtime, and Associations, which are actually created using the message #-> sent to the key object. In Tirade it is not implemented “as a message”, but rather built into the parser. I also chose the “brace array” instead of a regular literal array because it has separating periods and it allows Associations inside it.

Ok, so Tirade is just a sequence of messages with “data” as arguments. And the “data” is expressed very similarly to JSON, but with Smalltalk syntax. Is this better than plain JSON? I think so:

  • We got rid of the “CR in Strings”-issue, Tirade Strings are just like Smalltalk Strings – the only escape character is doubling a single quote.
  • Tirade is streaming since we parse and process the messages “one by one” and we get IMHO better “semantics” in the form of having something more than “just data”: keyword messages.
  • Extending is easily done with new messages that old code can happily ignore at will – or capture using #doesNotUnderstand: and log or whatever.

I just created the Squeaksource project Tirade for this. The code there has a reader, a writer and green tests. It also has nice class comments :-) and it is not rocket science to get going with. The parser is a simple “predictive recursive descent” parser which anyone should be able to step through and understand. It has no ambiguity and it only checks the next Character for choice of production, which made it very easy to create.
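To show what “only checks the next Character for choice of production” can look like, here is a minimal sketch in Python of a predictive recursive descent parser for just the data part of Tirade (Strings, Symbols, Integers, nil/true/false, brace Arrays and Associations). It is not the TiradeParser from Squeaksource. The class names, the restriction of Symbols to letters and digits, and the choice to represent an Association as a (key, value) tuple are all assumptions made for the example.

```python
class Symbol(str):
    """Marker type so Symbols stay distinguishable from Strings."""


class DataParser:
    LITERALS = {"nil": None, "true": True, "false": False}

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def next(self):
        ch = self.peek()
        self.pos += 1
        return ch

    def skip_space(self):
        while self.peek().isspace():
            self.pos += 1

    def expect(self, ch):
        if self.next() != ch:
            raise SyntaxError(f"expected {ch!r} at {self.pos - 1}")

    def value(self):
        left = self.primary()
        self.skip_space()
        while self.peek() == "-":          # only '->' can follow a value
            self.next()
            self.expect(">")
            left = (left, self.primary())  # Association as (key, value)
            self.skip_space()
        return left

    def primary(self):
        # The next character alone decides which production to use.
        self.skip_space()
        ch = self.peek()
        if ch == "'":
            return self.string()
        if ch == "#":
            return self.symbol()
        if ch == "{":
            return self.array()
        if ch == "-" or ch.isdigit():
            return self.integer()
        if ch.isalpha():
            return self.literal()
        raise SyntaxError(f"unexpected {ch!r} at {self.pos}")

    def string(self):
        self.expect("'")
        out = []
        while True:
            ch = self.next()
            if ch == "":
                raise SyntaxError("unterminated string")
            if ch == "'":
                if self.peek() != "'":
                    return "".join(out)
                self.next()                # doubled quote means one quote
            out.append(ch)

    def symbol(self):
        self.expect("#")
        start = self.pos
        while self.peek().isalnum():
            self.pos += 1
        return Symbol(self.text[start:self.pos])

    def integer(self):
        start = self.pos
        if self.peek() == "-":
            self.pos += 1
        while self.peek().isdigit():
            self.pos += 1
        return int(self.text[start:self.pos])

    def literal(self):
        start = self.pos
        while self.peek().isalpha():
            self.pos += 1
        name = self.text[start:self.pos]
        if name not in self.LITERALS:
            raise SyntaxError(f"unknown literal {name!r}")
        return self.LITERALS[name]

    def array(self):
        self.expect("{")
        items = []
        self.skip_space()
        while self.peek() != "}":
            items.append(self.value())
            self.skip_space()
            if self.peek() == ".":
                self.next()
                self.skip_space()
            elif self.peek() != "}":
                raise SyntaxError(f"expected '.' or '}}' at {self.pos}")
        self.next()
        return items


attribs = DataParser(
    "{ #wings-> 2 . #ears -> nil. #teeth -> {'one'.'two'.'three'}}"
).value()
text = DataParser("'Black as devil''s tar'").value()
```

Since there are no binary messages in the data, a “-” right after a value can only be the start of “->”, which is why a single character of lookahead is enough in this sketch.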

Are we there?

Did we meet the objectives we started out with? Do we have the “dream format” we want? Let’s look at it again:

  • Readable. Oh yes, and for a Smalltalker VERY readable. We have simplicity, comments, indentation, CRs inside Strings etc.
  • Editable. Definitely, for a Smalltalker VERY editable. It is Smalltalk after all.
  • Secure. Very secure. There are no expressions and no access to globals. You can only send messages with data as arguments. You do not decide the receiver, the reader does. The base reader also does a security check asking the builder if it allows the selector.
  • Declarative. Given only messages (which creates structure and semantics) with data and no full Compiler I think it is very declarative.
  • Streamable. Yep. We just keep going and going! Since the period is mandatory after a message this also means you can concatenate Tirade streams together without syntactic problems.
  • Extendable. Just add new kinds of messages and make sure the old code can ignore “new messages”, either by having a tolerant #doesNotUnderstand: or by letting #isSelectorAllowed: not allow unknown messages, which then will not be sent at all.
  • Fast. My first benchmark shows TiradeParser is about 7 times faster than Compiler. It is pretty nippy.
  • Small. It is definitely small in all aspects mentioned.
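The “Secure” and “Extendable” points both hinge on the reader asking the builder before it sends anything. A minimal Python sketch of that idea follows, assuming the messages have already been parsed into (selector, arguments) pairs. The is_selector_allowed method mirrors #isSelectorAllowed: in spirit only, and the rest of the names are hypothetical.

```python
# Toy model of the selector check; NOT the real TiradeReader.

class Builder:
    ALLOWED = {"name", "damage"}

    def __init__(self):
        self.values = {}
        self.wiped = False

    def is_selector_allowed(self, selector):
        return selector in self.ALLOWED

    def name(self, value):
        self.values["name"] = value

    def damage(self, value):
        self.values["damage"] = value

    def wipe(self):
        # Exists on the builder but is not in ALLOWED, so input cannot reach it.
        self.wiped = True


def dispatch(builder, messages):
    """Send allowed messages to the builder, return the selectors skipped."""
    skipped = []
    for selector, args in messages:
        if builder.is_selector_allowed(selector):
            getattr(builder, selector)(*args)
        else:
            skipped.append(selector)   # never sent at all
    return skipped


builder = Builder()
skipped = dispatch(builder, [
    ("name", ("Bat from hell",)),
    ("wipe", ()),                # implemented, but not allowed
    ("wingspan", (2,)),          # a "new message" this old builder ignores
    ("damage", (3,)),
])
```

The same check handles both cases: a message the builder implements but does not want exposed, and a newer message the builder has never heard of.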

Just one gripe…

Why is it not a true subset of Smalltalk – in other words, why did I leave out the receiver to the left?!? Well, since we don’t have any kind of variables there is not much you can put there! Should we still write out “self”? If we did, then it would be confusing if the reader uses the “result of message is the next receiver”-logic.

We could encode that policy by enforcing the code to say “me := me <message>”, but then we would need to introduce a special variable “me” and it would look hackish and odd and confusing as hell. And it would also force this logic. I think leaving receiver out is ok. :-)

So why do we so dearly want to make it BE Smalltalk? In theory it could enable us to select some Tirade code with the mouse and press “alt-d” on it, but it is still quite easy to debug Tirade code directly. If someone makes a compelling argument I am willing to listen, but for now – sorry, no, Tirade is not == Smalltalk and it does not have an explicit receiver on the left. As Jecel pointed out though, it may be legal Self code. :-)

Conclusion

I think Tirade has interesting potential, especially as a readable serialization or configuration (!) format for Smalltalk. For “data interchange” JSON is probably still KING because then you often need to send data across language barriers and there are JSON parsers everywhere. Tirade is a Squeak only thing, although it should be trivially portable to other Smalltalks.

I now intend to go full steam ahead and use Tirade as a file format for Deltas in the DeltaStreams project. We will see how it turns out when put to some real world usage.