OOPSLA 2008 - Compilation replay, String optimization and code-copying VMs
The first paper addressed irregular results in performance tests caused by execution variation outside the benchmarked code itself. The cause is nondeterministic behavior in the VM and the levels below it: timing, scheduling, interrupts, and even the sampling itself affect the results.
The variation was claimed to be around 5-10% in the regular Java benchmark suite. This was noted as "quite a lot", ehm… no? Some people may consider 5-10% a lot, but I sure don’t. Hmmm, I guess there might be some circumstances where this would matter, but I am not sure you would use Java in those cases. Ok, so let’s disregard that fact. :)
The idea of the paper was to use multiple compilation plans, run the benchmarks multiple times and then do matched-pair analysis on the results. Well, something like that anyway; I definitely know too little about "compilation plans" in Java etc.
They tested this approach using the Jikes RVM and the SPECjvm98 + DaCapo benchmarks, on Athlon and Intel running Linux. Then they did statistical analysis on the results.
The recommendation at the end is that replay compilation is good, but you need to use multiple plans. Also, use matched pairs comparison for tighter confidence and when in trouble, increase the plan count before increasing the measurement count. Or something like that. :)
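To make the matched-pairs idea a bit more concrete, here is a minimal sketch of how I understand it, not the paper's actual method. It assumes two VM configurations A and B were each measured under the same compilation plans, so that a[i] and b[i] belong to the same plan. The class name MatchedPairs and the tCritical parameter (the Student t value for n - 1 degrees of freedom, looked up in a table) are my own inventions for illustration.

```java
// Hypothetical helper for matched-pair comparison of benchmark timings.
final class MatchedPairs {

    /** Per-plan differences a[i] - b[i]; both arrays are indexed by compilation plan. */
    static double[] differences(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two equally sized samples, n >= 2");
        }
        double[] d = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            d[i] = a[i] - b[i];
        }
        return d;
    }

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    /** Half-width of the confidence interval for the mean: t * s / sqrt(n). */
    static double halfWidth(double[] xs, double tCritical) {
        return tCritical * sampleStdDev(xs) / Math.sqrt(xs.length);
    }
}
```

The point of pairing is that plan-to-plan variation cancels out in the differences, so the interval around mean(differences(a, b)) can be much tighter than the intervals around the two raw means. If that interval excludes zero, the two configurations probably do differ.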
My conclusion is that benchmarking is hard (no surprise there) and modern dynamically jitting VMs are even harder to benchmark. I think it was John Maloney (implementor of Morphic) who said that he preferred the slower but predictable performance of the Squeak VM over the faster but quite unpredictable behavior of the Self VM. This was for developing Morphic, the graphical user interface of Self/Squeak, where of course responsiveness etc is crucial to the user experience.
But I am still curious about the case where 5-10% would matter. Roger, who went with me to OOPSLA, has been working on performance issues in a very large critical Java system and he also thought 5-10% is "nothing at all". :)
The second paper was about Java String memory optimization techniques and String inefficiencies. I found this paper more interesting and pragmatic. The basic problem is wasted heap space due to lots of String instances in a typical large Java program. Three different kinds of waste were identified:
Memory waste A: Unused areas in the internal character array. I had no idea a String was an internal array with an offset and a count variable! The unused areas are largely the result of String manipulation.
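A tiny sketch of how waste A can arise, assuming the array/offset/count layout described in the talk. SharedString is a made-up class that mimics that layout, it is not the real java.lang.String, and compact() is my own addition to show what trimming the unused area would look like.

```java
import java.util.Arrays;

// Hypothetical stand-in for a String built on a shared char array.
final class SharedString {
    private final char[] value;  // backing array, possibly shared with other instances
    private final int offset;    // index of the first character in use
    private final int count;     // number of characters in use

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Shares the backing array instead of copying, so the whole array stays alive. */
    SharedString substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new IndexOutOfBoundsException();
        }
        return new SharedString(value, offset + begin, end - begin);
    }

    /** Characters held in the backing array but not used by this instance. */
    int wastedChars() {
        return value.length - count;
    }

    /** Copies only the used range into a fresh, exactly sized array. */
    SharedString compact() {
        return new SharedString(Arrays.copyOfRange(value, offset, offset + count), 0, count);
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

Taking substring(6, 11) of "hello world" gives a five character string that still holds on to all eleven characters. That is cheap at the time of the call, but if the short string outlives the long one, the rest of the array is pure waste.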
Memory waste B: A lot of Strings with the same value. In large apps even more so, easily 500-1000 of the same - like "name" etc.
Memory waste C: Lots of unused literal Strings. Error messages are one example: they are instantiated on first use, which unfortunately typically happens in class initialization.
The presenter claimed that over 50% of all "String related objects" are unnecessary. Looking at a specific case it seems that B and C are dominant.
Trick 1: Unify same-value strings when they are long lived. Implemented on J9. Seemed kinda straightforward, they called it StringGC. Result looked good, no comments there. I did a quick check in my Squeak VM for Gjallar and sure, given my 220000 ByteString instances I could theoretically (ignoring the fact that code may rely on identity) squeeze out 3 MB.
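The unification idea can be imitated at application level with a canonicalizing map. This is only a sketch of the principle: the StringGC described in the talk works inside the VM during garbage collection, and I do not know its details. StringUnifier is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical application-level imitation of same-value String unification.
final class StringUnifier {
    private final Map<String, String> canonical = new HashMap<>();

    /** Returns the first instance seen with this value, so duplicates can be collected. */
    String unify(String s) {
        String existing = canonical.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct String values seen so far. */
    int distinctValues() {
        return canonical.size();
    }
}
```

Two separately created instances of "name" come back from unify() as the same object, which is exactly why code comparing Strings with == can change behavior after unification. The standard String.intern() method gives a similar effect using the VM's own table.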
Trick 2: Convert a class to instantiate only actually used String literals. Instead of instantiating a Java Array in a class initializer you can just use a case-switch and instantiate and return the String you want on demand. Since most of these Strings are in java.util.ListResourceBundle you can fix this by redefining a method in your ListResourceBundle subclasses: handleGetObject(key). They whipped up something called BundleConverter that automates this.
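Here is roughly what I imagine a converted bundle could look like. This is a guess at the shape, not the output of BundleConverter. The class name, keys and messages are made up, and the sketch extends ResourceBundle directly because handleGetObject is declared final in the standard JDK's ListResourceBundle. It also uses a switch on Strings, which needs Java 7 or later.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.ResourceBundle;

// Hypothetical bundle that hands out each message only when it is asked for.
class LazyMessages extends ResourceBundle {
    private static final String[] KEYS = {"error.notFound", "error.denied"};

    /** Looks up one value on demand instead of filling an Object[][] up front. */
    @Override
    protected Object handleGetObject(String key) {
        switch (key) {
            case "error.notFound":
                return "File not found";
            case "error.denied":
                return "Access denied";
            default:
                return null;  // ResourceBundle treats null as a missing resource
        }
    }

    @Override
    public Enumeration<String> getKeys() {
        return Collections.enumeration(Arrays.asList(KEYS));
    }
}
```

Callers use it like any other bundle, for example new LazyMessages().getString("error.denied"). Exactly when the VM materializes each literal is up to the VM, so how much this saves would have to be measured.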
Trick 3: More bad guys are around, like DateFormatZoneData for example with a lot of timezone names. But now it gets harder so they hacked up "Lazy Body Creation" in the VM. The idea is to let the String offset point into the Constant pool of the class before actual use - and then lazily create the character array when needed.
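As far as I understood it, the lazy body trick looks something like the sketch below when imitated in plain Java. The real thing lives inside the VM and operates on java.lang.String itself. LazyString is a hypothetical class, the byte array stands in for the class's constant pool, and I use plain UTF-8 as a simplification of the encoding class files actually use.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical user-level imitation of "Lazy Body Creation".
final class LazyString {
    private final byte[] pool;   // stands in for the constant pool bytes of a class
    private final int offset;    // where this string's bytes start in the pool
    private final int length;    // number of bytes belonging to this string
    private char[] body;         // created on first use (not thread-safe in this sketch)

    LazyString(byte[] pool, int offset, int length) {
        this.pool = pool;
        this.offset = offset;
        this.length = length;
    }

    boolean isMaterialized() {
        return body != null;
    }

    /** Decodes the pool bytes into a character array the first time it is needed. */
    char[] body() {
        if (body == null) {
            body = new String(pool, offset, length, StandardCharsets.UTF_8).toCharArray();
        }
        return body;
    }

    @Override
    public String toString() {
        return new String(body());
    }
}
```

A table of hundreds of timezone names would then cost only three small fields per entry until somebody actually reads a name.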
The evaluation on real benchmarks showed 8-13% smaller heap (ehm, ok last slide says 18% not sure) and the db benchmark actually got 30% faster due to the String unification that evidently sped up String.equals. :)
Discussion afterwards was about the unification of Strings. The trick of course does not conform to the Java spec and would break code relying on String object identity. One idea was to only unify the bodies.
I can’t help but reflect on the fact that waste C is not a problem in Smalltalk since a class in Smalltalk is an object and each CompiledMethod holds its own literals. So in Smalltalk the Strings are instantiated at compile time - and we don’t have any constant pool in some other place - meaning we only have one String in memory.
The other thing is that a Smalltalk String is in fact a "variableByteSubclass" so we don’t suffer from waste A either.
But what about waste B? Since we have Symbol (unified Strings) we should simply use that more than we do, I guess. All in all, I think we are in pretty good shape in Smalltalk compared to Java on this one. :)
The final paper was about code copying to speed up VMs. I honestly don’t really know the details on this one but it was still interesting.
The technique was applied to SableVM, OCaml and YARV (Ruby) and they tried it on Intel, AMD and PPC. The end result was more or less that OCaml got a great boost, probably due to its simple bytecode set, Java got a smaller improvement and YARV actually got worse in many cases - and only slightly faster in some benchmarks. Exactly why was a bit foggy.
Three interesting papers and quite aptly presented I think.