Roads Less Taken

A blend of programming, boats and life.

Tirade, Part 2


In a recent article I described Tirade - a new generic “file format” for Smalltalk/Squeak, or actually a sub-language! Since that article I have refined Tirade a bit. Tirade today consists of 4 classes (parser, reader, writer, recorder) totalling about 500 lines of code, excluding tests. The tests are green in 3.10.2, pharo-10231, 3.9, 3.8 and 3.7. They do turn red in 3.6 due to old initialize behavior, some missing methods etc, which is probably easily fixed if anyone cares. There are no dependencies on other packages. Compared to using the old Compiler>>evaluate: it is about 5-7 times faster.

Tirade is a very small “language” similar to JSON (see below) and probably fits similar use cases as JSON fits.

Numbers

In my first Tirade description I opted out and only supported plain integers, no frills at all. After subsequent discussion I came to the conclusion that syntactically there is no problem in letting TiradeParser>>parseInteger become TiradeParser>>parseNumber. It now handles all kinds of Number literals that Squeak supports, either by using SqNumberParser if present (in Squeak 3.9+) or by falling back on regular old Number class>>readFrom:, which Scanner still uses in 3.10.2.

So now Tirade deals perfectly fine with:

  • 23.45 (Floats)

  • 16rFE (radix)

  • 1.0034e-5 (scientific notation)

  • 243s2 (scaled decimals)

  • “NaN”, “Infinity” and “-Infinity”

…and whatever else should be there.

The performance penalty if we use SqNumberParser (Squeak 3.9+) is not that bad, about 20% on my little trivial benchmark. Using Number class>>readFrom: hurts more, increasing the benchmark time by around 50%.
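To make the literal forms listed above concrete, here is a rough Python sketch that recognizes the same kinds of input. It is not how TiradeParser, SqNumberParser or Number class>>readFrom: work internally; the function name `parse_number` and the two regular expressions are made up for this illustration, and only a subset of Smalltalk number syntax is covered.

```python
import re
from fractions import Fraction

# Hypothetical helper, not part of Tirade: a Python stand-in that accepts
# a small subset of Smalltalk number literals.
_RADIX = re.compile(r"^(-?)(\d+)r([0-9A-Z]+)$")      # e.g. 16rFE
_SCALED = re.compile(r"^(-?\d+(?:\.\d+)?)s(\d*)$")   # e.g. 243s2


def parse_number(text):
    """Return a Python number for a Smalltalk-style literal."""
    if text in ("NaN", "Infinity", "-Infinity"):
        return float(text)  # Python's float() accepts these spellings
    match = _RADIX.match(text)
    if match:
        sign, base, digits = match.groups()
        value = int(digits, int(base))
        return -value if sign else value
    match = _SCALED.match(text)
    if match:
        # Scaled decimals are exact. The scale (digits after 's') only
        # affects printing in Smalltalk, so this sketch drops it.
        return Fraction(match.group(1))
    try:
        return int(text)
    except ValueError:
        return float(text)  # covers 23.45 and 1.0034e-5


print(parse_number("16rFE"), parse_number("23.45"), parse_number("243s2"))
```

Every branch is chosen from the shape of the literal alone, which is why widening integers to general numbers does not complicate the surrounding syntax.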

Security…

First I played with having the builder object (that is typically fed the Tirade messages from the Tirade reader) implement isSelectorAllowed: etc. I finally ended up encoding a simple security scheme in the default TiradeReader that relies on finding the implementations of the Tirade messages in the builder in a method category beginning with “tirade”. It seems simple enough for most uses.

I also added a global “whitelist” of Tirade messages that can be registered in the reader before starting to parse. If selectors are found in this whitelist they are considered “ok”. This can be useful in some situations.

If the builder relies on catching Tirade messages using doesNotUnderstand: then it is on its own for security, but that seems fine.

Finally you can turn off all selector checks by using #unsafe:.
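The checks described in this section can be sketched roughly as follows. This is Python, not the real TiradeReader: Python has no method categories, so a hypothetical `tirade` decorator stands in for "the method lives in a category beginning with tirade", and the class and parameter names (`SafeReader`, `whitelist`, `unsafe`) are invented for the example.

```python
def tirade(method):
    """Mark a builder method as reachable from Tirade input.

    Stand-in for putting the method in a 'tirade...' method category.
    """
    method.tirade_allowed = True
    return method


class SafeReader:
    """Hypothetical reader that checks each selector before sending it."""

    def __init__(self, builder, whitelist=(), unsafe=False):
        self.builder = builder
        self.whitelist = set(whitelist)  # selectors that are always ok
        self.unsafe = unsafe             # True turns off every check

    def is_selector_allowed(self, selector):
        if self.unsafe or selector in self.whitelist:
            return True
        method = getattr(type(self.builder), selector, None)
        return getattr(method, "tirade_allowed", False)

    def send(self, selector, *args):
        if not self.is_selector_allowed(selector):
            raise PermissionError(f"selector not allowed: {selector}")
        return getattr(self.builder, selector)(*args)


class WumpusBuilder:
    def __init__(self):
        self.names = []

    @tirade
    def name(self, value):
        self.names.append(value)
        return self

    def delete_everything(self):  # unmarked, so input cannot reach it
        self.names.clear()
        return self


reader = SafeReader(WumpusBuilder())
reader.send("name", "Bat from hell")       # allowed
# reader.send("delete_everything")         # would raise PermissionError
```

The point of the design is that the input can only name selectors; whether a selector is reachable is decided entirely on the receiving side.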

Receiver juggling

Tirade is meant to separate “concerns” between Tirade “code”, parser, reader and the builder object supplied by you. The Tirade “code” has no control over the receiver of the messages; it is just a sequential flow of messages separated by periods. The TiradeParser doesn’t care either: it just parses and then does “self processMessage”. If you are using TiradeParser directly, it has a default implementation of #processMessage that prints the messages in the Transcript and collects them in an OrderedCollection.

So yes, you can use TiradeParser to just gobble up some Tirade input and then muck about with the OrderedCollection afterwards - similar to how you work with JSON or an XML DOM. But the better approach is of course to subclass TiradeParser and implement #processMessage to actually do something - in a streaming SAX-ish fashion.
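The difference between the two styles can be shown with a small Python sketch. Actual parsing is left out: messages arrive already split into (selector, arguments) pairs, and the classes `MessageParser` and `DamageTotal` are hypothetical stand-ins, not TiradeParser itself.

```python
class MessageParser:
    """Stand-in parser: parsing is elided, messages arrive pre-parsed."""

    def __init__(self):
        self.messages = []

    def process_message(self, selector, args):
        # Default behaviour: just collect, to be inspected afterwards.
        self.messages.append((selector, args))

    def parse(self, message_stream):
        for selector, args in message_stream:
            self.process_message(selector, args)


class DamageTotal(MessageParser):
    """Streaming use: act on each message as it arrives, keep no list."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def process_message(self, selector, args):
        if selector == "power:damage:":
            self.total += args[1]


stream = [
    ("beginWumpus", ()),
    ("power:damage:", ("Can screech", 3)),
    ("power:damage:", ("Can bite", -6)),
    ("endWumpus", ()),
]

collector = MessageParser()
collector.parse(stream)            # DOM-ish: everything ends up in a list

summer = DamageTotal()
summer.parse(iter(stream))         # SAX-ish: nothing is retained
print(len(collector.messages), summer.total)
```

The subclass never holds more than one message at a time, so it works equally well on input of any length.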

Then we have the reader. There is a default TiradeReader that implements the security described above and also implements logic for deciding the “next receiver” of the Tirade messages. The logic goes like this:

  • If the builder supplied implements Tirade messages by always returning self, it will always be the receiver. Simple.

  • If the builder returns another object X, X will be used as the “next receiver”.

  • As long as X returns self it stays as the “next receiver”.

  • If object X returns another object Y, X will be put on a “stack of old receivers” and Y will be used as the “next receiver”.

  • If Y returns nil, X will be popped and be used as the “next receiver”.

  • If X returns nil we are back to the original builder, and if it returns nil nothing changes.

So if the above is “enough” for controlling the receivers, then the builder object handles it by simply returning the “right” objects. These objects can of course be “sub builders” or domain objects themselves or whatever.
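Read as an algorithm, the rules above amount to a stack with the original builder as the floor. Here is a minimal Python sketch of that reading; `ReceiverTracker`, `Cave` and `Bat` are invented for the example, Python's `None` plays the role of nil, and the real TiradeReader may well differ in details.

```python
class ReceiverTracker:
    """Hypothetical model of the 'next receiver' rules."""

    def __init__(self, builder):
        self.builder = builder
        self.receiver = builder
        self.stack = []  # older receivers, most recent last

    def send(self, selector, *args):
        result = getattr(self.receiver, selector)(*args)
        if result is None:
            # nil: pop back one level; the builder is the floor.
            self.receiver = self.stack.pop() if self.stack else self.builder
        elif result is not self.receiver:
            # Another object: remember where we were and switch to it.
            self.stack.append(self.receiver)
            self.receiver = result
        # Returning self changes nothing.
        return result


class Bat:
    def __init__(self):
        self.powers = []

    def power(self, name):
        self.powers.append(name)
        return self          # stay the receiver

    def done(self):
        return None          # hand control back


class Cave:
    def __init__(self):
        self.bats = []
        self.cave_name = None

    def bat(self):
        new_bat = Bat()
        self.bats.append(new_bat)
        return new_bat       # the bat becomes the next receiver

    def name(self, value):
        self.cave_name = value
        return self


cave = Cave()
tracker = ReceiverTracker(cave)
for selector, args in [("bat", ()), ("power", ("bite",)), ("done", ()),
                       ("name", ("Deep cave",))]:
    tracker.send(selector, *args)
```

After the loop the bat has one power and the cave has its name, even though the input never mentioned a receiver.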

If the above is not enough you can register “control messages” in TiradeReader. A control message can be any selector and will result in TiradeReader pushing the current receiver on the stack and setting the original builder object as the “next receiver”. There is also a small twist: if the control message returns self, the reader will consider that to be equivalent to “nil” and thus pop the previous receiver back. This is because the common use is to make sure all control messages are sent to the original builder without disrupting the current stack of receivers. But… why? This enables the builder to explicitly control the reader during the parse, perhaps manipulating the current stack, even though it is not the “next receiver” receiving the regular Tirade stream of messages.

One very good reason to use this is when the current receiver is a domain object that does not “know” when to return nil to pop itself.

I am not perfectly happy with the current mechanisms, but it will do for now and I will revisit this when I see how it works out in practice. The important bits are in place though - Tirade input has no control over receivers and the builder object can control it if needed.

Compared to JSON again

The differences compared to JSON that I see right now:

  • Smalltalk syntax and parsing rules for Strings. This means no escapes except for double-single quotes. JSON has 8 other escape codes. The immediate advantage for me is being able to store readable code in Tirade, including newlines.

  • Smalltalk syntax for Numbers. This means more capabilities for parsing numbers than JSON has (radix, NaN, Infinity, scaled decimals).

  • Symbols. JSON only has Strings.

  • Associations. JSON has an “object” which is a Dictionary restricted to String keys. JSON does not have a free standing Association. In Tirade any of the allowed objects can of course be keys or values and Associations can be “standalone”. So there is a little more flexibility here.

  • Comments. Hmmm, JSON has no comments. Tirade allows Smalltalk comments, but ONLY between messages.

  • Messages. This is the big difference, Tirade consists of a sequence of unary or keyword messages, with the “data” described above as arguments.

The addition of messages adds an important extra level of “classification”, “control” or “typing” or call it what you want. It also lends Tirade to easy streaming and concatenation. JSON consists at its top level of either a Dictionary or an Array. A parser could of course parse that in a “streaming fashion” one element or pair at a time, but they normally don’t do that I think.

Having messages of course makes it much more natural to map these messages onto a builder or multiple builders and also to use messages to control the message flow. I think this makes Tirade much more expressive in itself.

In summary, Tirade is similar to JSON but extended with messages and comments, has more advanced Numbers, deals with text more easily (no escaping of CRs etc), has a little more flexibility in its data model (Associations) and uses Smalltalk syntax for it all.

Potential uses for Tirade

I started out with a focus on replacing “chunk format” with something simpler and secure for Deltas (Deltastreams project), eliminating the use of Compiler to parse it. Beyond that there are several interesting places where Tirade could be used, for example:

  • DSLs

  • RPC-ish communication

  • Transaction logs

…and a few more things :)

But hey, one thing at a time.

Tirade, a File Format for Smalltalkers


In my revived work on Deltastreams in Squeak I ended up facing the choice of native file format for Deltas. Matthew has made an advanced format called InterleavedChangeset which manages to squeeze a binary representation of a Delta into a Changeset file (which is in Smalltalk chunk format). An impressive feat, and it has the advantage of being backwards compatible in the sense that a Delta in this format can be filed in as a plain old Changeset into an old Squeak image.

But I must say I don’t think that benefit alone is enough to justify these tricks. Oh well, time will tell - and multiple formats for Deltas are fine to have.

Looking at another file format for Deltas I decided I want these properties:

  • Readable. Changesets are nice because they are indeed readable. You can look at them in emacs if you like.

  • Editable. Same here, if you really need to you can edit changesets manually. The syntax is not totally bad.

  • Secure. Ouch, changesets fail here because they rely on Compiler>>evaluate: which of course opens up tons of tricks you can do. I don’t want that with Deltas.

  • Declarative. This is a spectrum but I would like the “style” for the file format to be declarative.

  • Streamable. The parser should not have to read it all into a “DOM structure” before actually doing something with it.

  • Extendable. It should be easy to extend the format in the future and have older code “ignore” such extensions.

  • Fast. It should be fast to parse and fast to produce. As always. :-)

  • Small. I mean both conceptually (easy to understand), codewise (parser etc) and syntactically.

JSON or YAML

First I looked at JSON which nails 5-6 out of these 8. It is indeed “Readable”, “Editable”, “Secure”, “Fast” and “Small”. And it is very “hip” in the web arena. It suffers from some problems though:

  • Strings cannot have literal CRs in them (line breaks have to be escaped), which will make “larger texts” such as source code a pure pain to read. This is a killer! Sorry.

  • It is actually not so well suited for streaming. A JSON document is a single “thing”, either a Dictionary or an Array. Sure, we could use convention and parse one key-value-pair (or element if it is an Array) at a time, but then I would need to write a new parser anyway.

  • It is declarative since it is “just data” but on the other hand this means it also lacks semantics. We would need to use Strings and keys (in key-value pairs) to denote our own semantics. Works, but probably would make the structure less apparent.

  • Same goes for extendability. Sure, we can add new keys in our key-value pairs etc, but what we really want is to be able to add new semantic elements, and again, those would need to be “encoded as data” in JSON. I think it would get messy.

…so JSON is out. I do like JSON though, in its utter simplicity and above all - it is available everywhere.
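The first two problems are easy to check against any strict JSON parser. The small sketch below uses Python's standard json module; the helper `parses_as_json` and the sample strings are made up for the illustration.

```python
import json


def parses_as_json(text):
    """Return True if text is accepted as one JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


raw_newline = '"line one\nline two"'       # a real line break inside the quotes
escaped_newline = '"line one\\nline two"'  # the two characters \ and n instead
two_documents = '{"a": 1}{"b": 2}'         # two documents simply concatenated

print(parses_as_json(raw_newline))      # rejected: control character in string
print(parses_as_json(escaped_newline))  # accepted
print(parses_as_json(two_documents))    # rejected: extra data after first value
```

So multi-line source code has to be stored with its line breaks escaped, and appending one document to another does not give a valid document.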

What about YAML then - JSON’s big brother? YAML is a superset of JSON, well, it aims to be a strict superset at least. I did find a very good blog article comparing them. Looking at YAML it seemed to have lots of better ways to represent source code with CRs etc, but hey… simple? Mmmm, not. Sure, it may be “simple” to look at, but I sure didn’t think the specification was simple. In fact it is HUGE, and I had a hard time reading it due to its style. Sorry, not “small” enough for me! ;-)

When discussing all this on squeak-dev the good ole “Hey, why not Smalltalk?”-mantra popped up. I have been there lots of times, using Smalltalk as “representation language” is very neat since it enables so called “internal DSLs” at such ease, but it does fail pretty hard on “Secure”, “Declarative” (depends on how you view that one, I know), “Streamable” and in some ways “Small”. Although it would be strange if we didn’t have Compiler available when dealing with Deltas :-).

Smalltalk… or Angry Smalltalk?

But bringing it up did get my mind working - how about a nice subset of Smalltalk? That does what JSON does but fixes the problems I identify above? Pretty quickly a subset of Smalltalk crystallized itself that I named “Tirade” as in:

“a long angry speech or scolding. Synonyms: diatribe, harangue, rant”

Tirade, as the name somewhat implies, is just a sequence of Smalltalk unary or keyword messages with optional Smalltalk comments in between.

The only thing making Tirade NOT a strict syntactic subset of Smalltalk is that there is no receiver to the left, this is up to the reader/parser to decide. One seemingly useful pattern for determining the receiver is to use the result of each message as the next receiver.

Here is a fantasy example of Tirade input:

        "We allow unary messages, the period after each message is mandatory.
        A unary message is probably used mainly for semantic structure."
        beginWumpus.

        "We allow Symbols."
        push: #superbat.

          "Whitespace is just fine, so indentation can be used, but has no meaning.
          In Strings we use double single quote for a single quote. No other escaping exists."
          name: 'Bat from hell'
          description:

        'Black as devil''s tar, evil. Appear in flocks
        and when they come you better duck.
        Duck fast.'.

          "Integers are fine, positive or negative. No scientific notation, no floats."
          power: 'Can screech' damage: 3.
          power: 'Can bite' damage: -6.

          attrib: #size->43.
          attrib: #color->nil.
          attrib: #dangerous->true.

          moreattribs: { #wings-> 2 . #ears -> nil. #teeth -> {'one'.'two'.'three'}}.

        pop: #superbat.

        endWumpus.

Well, you get the idea. So source code with CRs will work fine, just like in Changesets. We could even add support for indenting Strings with embedded CRs (as you can see above they otherwise break indentation), but I haven’t done that. So we have Strings, Symbols, Integers, Associations, true/false/nil and brace Arrays. Nested in any combination and depth. In Smalltalk all these are “literals” except Associations, which are actually created using the message #-> sent to the key object, and brace Arrays, which are built at runtime. In Tirade the Association is not implemented “as a message”, but rather built into the parser. I also chose the “brace array” instead of a regular array because it has separating periods and it allows Associations inside it.
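The String rule is small enough to show in full. The following Python sketch reads and writes Smalltalk-style string literals with the doubled single quote as the only escape; the function names are invented here and this is not the code in TiradeParser.

```python
def read_smalltalk_string(text, start=0):
    """Read one quoted string at text[start]; return (value, next_index)."""
    if text[start] != "'":
        raise ValueError("expected opening quote")
    chars = []
    i = start + 1
    while i < len(text):
        if text[i] == "'":
            if i + 1 < len(text) and text[i + 1] == "'":
                chars.append("'")          # doubled quote -> one quote
                i += 2
                continue
            return "".join(chars), i + 1   # closing quote
        chars.append(text[i])  # anything else, line breaks included, is literal
        i += 1
    raise ValueError("unterminated string")


def write_smalltalk_string(value):
    """Inverse operation: double every single quote and wrap in quotes."""
    return "'" + value.replace("'", "''") + "'"


source = "'Black as devil''s tar, evil.\nDuck fast.' damage: 3."
value, rest_index = read_smalltalk_string(source)
print(value)
print(source[rest_index:])
```

Because nothing but the quote itself needs escaping, source code can be stored verbatim apart from doubling its quotes.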

Ok, so Tirade is just a sequence of messages with “data” as arguments. And the “data” is expressed very similarly to JSON, but with Smalltalk syntax. Is this better than plain JSON? I think so:

  • We got rid of the “CR in Strings”-issue, Tirade Strings are just like Smalltalk Strings - the only escape character is doubling a single quote.

  • Tirade is streaming since we parse and process the messages “one by one” and we get IMHO better “semantics” in the form of having something more than “just data”: keyword messages.

  • Extending is easily done with new messages that old code can happily ignore at will - or capture using #doesNotUnderstand: and log or whatever.

I just created the Squeaksource project Tirade for this; the code there has a reader, a writer and green tests. It also has nice class comments :-) and it is not rocket science to get going with. The parser is a simple “predictive recursive descent” parser which anyone should be able to step through and understand. It has no ambiguity and it only checks the next Character for choice of production, which made it very easy to create.
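To show what “only checks the next Character” looks like in practice, here is a small Python sketch of a predictive recursive descent parser for the data part only (Strings, Symbols, Integers, true/false/nil, brace Arrays and Associations). It is written from the description above, not ported from TiradeParser; the class names and the choice to return Associations as Python tuples are assumptions of the sketch, and comments and messages are left out.

```python
class Symbol(str):
    """Marker type so Symbols can be told apart from Strings."""


class DataParser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def skip_space(self):
        while self.peek().isspace():
            self.pos += 1

    def parse_value(self):
        """value := primary [ '->' value ]"""
        left = self.parse_primary()
        self.skip_space()
        if self.peek() == "-":             # only '->' can follow a primary
            self.pos += 1
            if self.peek() != ">":
                raise SyntaxError("expected '->'")
            self.pos += 1
            return (left, self.parse_value())
        return left

    def parse_primary(self):
        self.skip_space()
        ch = self.peek()                   # one character picks the production
        if ch == "'":
            return self.parse_string()
        if ch == "#":
            self.pos += 1
            return Symbol(self.parse_word())
        if ch == "{":
            return self.parse_array()
        if ch == "-" or ch.isdigit():
            return self.parse_integer()
        if ch.isalpha():
            word = self.parse_word()
            constants = {"true": True, "false": False, "nil": None}
            if word in constants:
                return constants[word]
        raise SyntaxError(f"unexpected input at position {self.pos}")

    def parse_word(self):
        start = self.pos
        while self.peek().isalnum():
            self.pos += 1
        return self.text[start:self.pos]

    def parse_integer(self):
        start = self.pos
        if self.peek() == "-":
            self.pos += 1
        while self.peek().isdigit():
            self.pos += 1
        return int(self.text[start:self.pos])

    def parse_string(self):
        self.pos += 1                      # opening quote
        chars = []
        while True:
            ch = self.peek()
            if ch == "":
                raise SyntaxError("unterminated string")
            self.pos += 1
            if ch == "'":
                if self.peek() != "'":
                    return "".join(chars)  # closing quote
                self.pos += 1              # doubled quote, keep one
            chars.append(ch)

    def parse_array(self):
        self.pos += 1                      # '{'
        items = []
        self.skip_space()
        while self.peek() != "}":
            items.append(self.parse_value())
            self.skip_space()
            if self.peek() == ".":
                self.pos += 1
                self.skip_space()
            elif self.peek() != "}":
                raise SyntaxError("expected '.' or '}'")
        self.pos += 1                      # '}'
        return items


text = "{ #wings-> 2 . #ears -> nil. #teeth -> {'one'.'two'.'three'}}"
print(DataParser(text).parse_value())
```

Each method corresponds to one production, and none of them ever needs to back up, which is what makes this style of parser easy to step through.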

Are we there?

Did we meet the objectives we started out with? Do we have the “dream format” we want? Let’s look at it again:

  • Readable. Oh yes, and for a Smalltalker VERY readable. We have simplicity, comments, indentation, CRs inside Strings etc.

  • Editable. Definitely, for a Smalltalker VERY editable. It is Smalltalk after all.

  • Secure. Very secure. There are no expressions and no access to globals. You can only send messages with data as arguments. You do not decide the receiver, the reader does. The base reader also does a security check asking the builder if it allows the selector.

  • Declarative. Given only messages (which create structure and semantics) with data and no full Compiler I think it is very declarative.

  • Streamable. Yep. We just keep going and going! Since the period is mandatory after a message this also means you can concatenate Tirade streams together without syntactic problems.

  • Extendable. Just add new kinds of messages, make sure the old code can ignore “new messages” either by having a tolerant #doesNotUnderstand: or by letting #isSelectorAllowed: not allow unknown messages, which then will not be sent at all.

  • Fast. My first benchmark shows TiradeParser is about 7 times faster than Compiler. It is pretty nippy.

  • Small. It is definitely small in all aspects mentioned.

Just one gripe…

Why is it not a true subset of Smalltalk - in other words, why did I leave out the receiver to the left?!? Well, since we don’t have any kind of variables there is not much you can put there! Should we still write out “self”? If we did, then it would be confusing if the reader uses the “result of message is the next receiver”-logic.

We could encode that policy by enforcing the code to say “me := me ”, but then we would need to introduce a special variable “me” and it would look hackish and odd and confusing as hell. And it would also force this logic. I think leaving receiver out is ok. :-)

So why do we so dearly want to make it BE Smalltalk? In theory it could enable us to select some Tirade code with the mouse and press “alt-d” on it, but it is still quite easy to debug Tirade code directly. If someone makes a compelling argument I am willing to listen, but for now - sorry, no, Tirade is not == Smalltalk and it does not have an explicit receiver on the left. As Jecel pointed out though, it may be legal Self code. :-)

Conclusion

I think Tirade has interesting potential, especially as a readable serialization or configuration (!) format for Smalltalk. For “data interchange” JSON is probably still KING because then you often need to send data across language barriers and there are JSON parsers everywhere. Tirade is a Squeak only thing, although it should be trivially portable to other Smalltalks.

I now intend to go full steam ahead and use Tirade as a file format for Deltas in the DeltaStreams project. We will see how it turns out when put to some real world usage.