Some context for multithreaded fragment shaders: Texture lookups are really slow. I think we eyeballed them as having a latency of around 20 QPU instructions. We can hide latency sometimes if there's some unrelated math to be done after the texture coordinate calculation but before using the texture sample. However, in most cases, you need the texture sample results right away for the next bit of work.
To allow programs to hide that latency, there's a cooperative hyperthreading mode that a fragment shader can opt into. The shader stays in the bottom half of register space, and before it collects the results of a texture fetch, it issues a thread switch signal, which the hardware will use to run a second fragment shader thread until that one issues its own thread switch. For the second thread, the top bit of the register addresses gets flipped, so the two threads' states don't overlap (except for the accumulators and the flags, which the shaders have to save and restore).
I had delayed working on this because the full solution was going to be really tricky: queue up as many lookups as we can, then thread switch, then collect all those results before switching again, all while respecting the FIFO sizes. However, Jonas's huge contribution here was in figuring out that you don't need to be perfect, you can get huge gains by thread switching between each texture lookup and reading its results.
The upshot was a 0-20% performance improvement on glmark2 and a performance hit to only one testcase. With this we're down to 3 subtests that we're slower on than the closed source driver. Jonas's kernel patch is out on the list, and I rewrote the Mesa series to expose threading to the register allocator and landed all but the enabling patch (which has to wait on the kernel side getting merged). Hopefully I'll finish merging it in a week.
In the process of writing multithreading, Jonas noticed that we were scheduling our TLB_Z writes early in the program, which can cut into fragment shader parallelism between the QPUs (of which we have 12) because a TLB_Z write locks the scoreboard for that 2x2 pixel quad. I merged a patch to Mesa that pushes the TLB_Z down to the bottom, at the cost of a single extra QPU instruction. Some day we should flip QPU scheduling around so that we pair that TLB_Z up better, and fill our delay slots in thread switches and branching as well.
I also tracked down a major performance issue in Raspbian's desktop using the open driver. I asked them to use xcompmgr a while back, because readback from the front buffer using the GPU is really slow (The texture unit can't read from raster textures, so we have to reformat them to be readable), and made window dragging unbearable. However, xcompmgr doesn't unredirect fullscreen windows, so full screen GL apps emit copies (with reformatting!) instead of pageflipping.
It looks like the best choice for Raspbian is going to be using compton, an xcompmgr fork that does pageflipping (you get tear free screen updates!) and unredirection of full screen windows (you get pageflipping directly from GL apps). I've also opened a bug report on compton with a description of how they could improve their GL drawing for a tiled renderer like the Pi, which could improve its performance for windowed updates significantly.
Simon ran into some trouble with compton, so he hasn't flipped the default yet, but I would encourage anyone running a Raspberry Pi desktop to give it a shot -- the improvement should be massive.
Other things last week: More VCHIQ patch review, merging more to the -next branches for 4.10 (Martin Sperl's thermal driver, at last!), and a shadow texturing bugfix.