I've posted 1.50a; this changes only the software (no new bitstream) and has three main features:
1. Fixes a rare but serious bug that can cause a lot of stales
2. Maintains a single pool of signcrypted work (which means work loads faster, which means fewer stales)
3. Eliminates "false negatives" when detecting errors, which helps the clock converge faster
I've marked this as 1.50a instead of 1.51 since it hasn't received the amount of testing I normally do before a release, but the garbage collection bug was important enough that I wanted to get this out there. I am running 1.50a on my own cluster.
. . . . . . . . . . . . . . . . . . . . . . . . .
1. Google MapMaker is a nifty library, and the fact that you can make it act like a WeakHashMap simply by calling weakKeys() is really neat. Except that's not how it works. The semantics for Google MapMaker aren't the same as the semantics for WeakHashMap. In fairness to the people who wrote it, this is in the documentation, but it's subtle and dangerous enough that I would have put it in big red bold <blink> text: when you call weakKeys() on a map, it changes its semantics from using .equals()-based equality (like WeakHashMap) to using pointer equality (==). Unfortunately, if you're using the map to do interning, this is exactly what you don't want.
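The difference is easy to see with JDK classes alone. What follows is only a sketch, not the TML's actual code: the BlockID class here is a hypothetical stand-in with a single assumed field, and IdentityHashMap stands in for a weakKeys() map because both compare keys with ==. IdentityHashMap holds strong references, so this shows only the lookup semantics, not the garbage collection behavior.

```java
import java.lang.ref.WeakReference;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.WeakHashMap;

class InternDemo {

    /** Hypothetical stand-in for BlockID: a value type with equals()/hashCode(). */
    static final class BlockID {
        final String hash;
        BlockID(String hash) { this.hash = hash; }
        @Override public boolean equals(Object o) {
            return o instanceof BlockID && ((BlockID) o).hash.equals(hash);
        }
        @Override public int hashCode() { return hash.hashCode(); }
    }

    /** Returns the canonical instance for id, registering id itself if the pool has no match. */
    static BlockID intern(Map<BlockID, WeakReference<BlockID>> pool, BlockID id) {
        WeakReference<BlockID> ref = pool.get(id);
        BlockID canonical = (ref == null) ? null : ref.get();
        if (canonical == null) {
            // The value is a weak reference so it does not pin its own key.
            pool.put(id, new WeakReference<>(id));
            canonical = id;
        }
        return canonical;
    }

    /** Interns two equal-but-distinct BlockIDs and reports whether they collapsed to one object. */
    static boolean collapses(Map<BlockID, WeakReference<BlockID>> pool) {
        BlockID a = new BlockID("0000abcd");
        BlockID b = new BlockID("0000abcd");
        return intern(pool, a) == intern(pool, b);
    }

    public static void main(String[] args) {
        // Keys compared with equals(), as in WeakHashMap: prints true.
        System.out.println(collapses(new WeakHashMap<>()));
        // Keys compared with ==, standing in for a weakKeys() map: prints false.
        System.out.println(collapses(new IdentityHashMap<>()));
    }
}
```

With the identity-keyed pool the second lookup misses, so the "interned" result is a second heap object for the same logical block.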
So the bottom line is that due to MapMaker not working the way I thought it did, under certain circumstances the TML can wind up with multiple heap objects for the same BlockID floating around. This meant that when a new block was detected (usually due to long poll), only the WorkJobs associated with one of these BlockID objects would get revoked. What made this so hard to track down is that whether or not it happens is totally nondeterministic, based on the vagaries of garbage collection. On the machine that runs my cluster I've never been able to get it to happen; I guess it just has "lucky" heap settings.
2. Maclane and all subsequent bitstreams still encrypt nonces the way earlier bitstreams did, using a two-way handshake, but the signcryption of the work no longer involves a two-way handshake; this means that in a large cluster any signcrypted job can be loaded onto any ring -- it is no longer necessary to maintain separate pools of signcrypted jobs for each ring of each chip. The 1.50 software does not take advantage of this; in 1.50a there's one big pool of signcrypted work.
3. Prior to 1.50a the software simply flipped a "flag" saying "I recently loaded new work; ignore any errors since they might be the result of partially-loaded work". This flag would stay set for quite a long time (sometimes up to 3-4 seconds), leading the software to disregard a lot of errors. Starting with 1.50a the "output pointer" is checked immediately before starting to load new work and immediately after loading the last word of the new work, so the time window during which false negatives might occur is now extremely small -- basically however long it takes to do one read operation.
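One way to picture the bracketing is sketched below. This is my reading of the description, not the shipped code: the Ring interface, readOutputPointer(), loadWork(), LoadWindow and FakeRing are all hypothetical names, the output pointer is assumed to be a monotonically increasing slot index (wraparound is ignored), and the assumption is that an error is disregarded only if its output slot falls between the two pointer reads.

```java
class LoadWindowDemo {

    /** Hypothetical view of one ring: an output position plus a way to load work. */
    interface Ring {
        int readOutputPointer();      // index of the next output slot the ring will write
        void loadWork(int[] words);   // writes the job to the ring, word by word
    }

    /** Range of output slots [start, end) written while a load was in progress. */
    static final class LoadWindow {
        final int start, end;
        LoadWindow(int start, int end) { this.start = start; this.end = end; }
        boolean contains(int slot) { return slot >= start && slot < end; }
    }

    /** Brackets the load with two pointer reads instead of setting a long-lived flag. */
    static LoadWindow loadAndBracket(Ring ring, int[] words) {
        int before = ring.readOutputPointer();
        ring.loadWork(words);
        int after = ring.readOutputPointer();
        return new LoadWindow(before, after);
    }

    /** An error is counted unless its slot was written during the load itself. */
    static boolean shouldCountError(LoadWindow window, int slot) {
        return !window.contains(slot);
    }

    /** Test double: pretends the ring emitted a fixed number of outputs during each load. */
    static final class FakeRing implements Ring {
        private int pointer;
        private final int outputsDuringLoad;
        FakeRing(int pointer, int outputsDuringLoad) {
            this.pointer = pointer;
            this.outputsDuringLoad = outputsDuringLoad;
        }
        @Override public int readOutputPointer() { return pointer; }
        @Override public void loadWork(int[] words) { pointer += outputsDuringLoad; }
    }

    public static void main(String[] args) {
        LoadWindow w = loadAndBracket(new FakeRing(10, 2), new int[] {1, 2, 3});
        System.out.println(shouldCountError(w, 9));   // before the load: true
        System.out.println(shouldCountError(w, 11));  // during the load: false
        System.out.println(shouldCountError(w, 12));  // after the load: true
    }
}
```

Under these assumptions the only errors that can still be wrongly ignored are ones produced between the first pointer read and the first word of the load, which matches the "one read operation" window described above.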