The multithreaded fragment shaders are now in drm-next and Mesa master. I think this was the last big win for raw GL performance and we're now down to the level of making 1-2% improvements in our commits. That is, unless we're supposed to be using double-buffer non-MS mode and the closed driver was just missing that feature. With the glmark2 comparisons I've done, I'm comfortable with this state, though. I'm now working on performance comparisons for 3DMMES Taiji, which the HW team often uses as a benchmark. I spent a day or so trying to get it ported to the closed driver and failed, but I've got it working on the open stack and have written a couple of little performance fixes with it.
The first was just a regression fix from the multithreading patch series, but it was impressive that multithreading hid a 2.9% instruction count penalty and still showed gains.
One of the new fixes I've been working on is folding ALU operations into texture coordinate writes. This came out of frustration with the instruction selection research I had done over the last couple of weeks, where every algorithm seemed like it would still need significant peephole optimization after selection. I finally said "well, how hard would it be to just finish the big selection problems I know are easily doable with peepholes?" and it wasn't all that bad. The win came out to about 1% of instructions, with a similar benefit to overall 3DMMES performance (it's shocking how ALU-bound 3DMMES is).
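To give a rough idea of the shape of that peephole, here's a minimal sketch in Python rather than the actual driver code. The `Inst` type, the register names in `TEX_COORD_REGS`, and the opcode names in `ALU_OPS` are all made up for illustration, and it assumes every temporary is written at most once. When a `mov` into a texture coordinate register reads a temporary that has exactly one producer and no other reader, the producer is rewritten to write the coordinate register directly and the `mov` goes away.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical register and opcode names, for illustration only.
TEX_COORD_REGS = {"tex_s", "tex_t", "tex_r", "tex_b"}
ALU_OPS = {"fadd", "fsub", "fmul", "fmin", "fmax"}


@dataclass
class Inst:
    op: str
    dst: str
    srcs: Tuple[str, ...]


def fold_alu_into_tex_writes(insts: List[Inst]) -> List[Inst]:
    """Fold single-use ALU results into the texture coordinate write."""
    defs: Dict[str, List[int]] = {}
    uses: Dict[str, int] = {}
    for i, inst in enumerate(insts):
        defs.setdefault(inst.dst, []).append(i)
        for src in inst.srcs:
            uses[src] = uses.get(src, 0) + 1

    out = list(insts)
    dead = set()
    for i, inst in enumerate(insts):
        # Only look at plain moves into a texture coordinate register.
        if inst.op != "mov" or inst.dst not in TEX_COORD_REGS:
            continue
        src = inst.srcs[0]
        # The temporary needs exactly one writer and this mov as its only reader.
        if len(defs.get(src, [])) != 1 or uses.get(src, 0) != 1:
            continue
        d = defs[src][0]
        producer = insts[d]
        if d > i or producer.op not in ALU_OPS:
            continue
        # Moving the ALU op down to the mov is only safe if its inputs
        # cannot change in between (written at most once in this sketch).
        if any(len(defs.get(s, [])) > 1 for s in producer.srcs):
            continue
        # Put the ALU op where the mov was, so the order of texture
        # coordinate writes is preserved, and drop the original producer.
        out[i] = Inst(producer.op, inst.dst, producer.srcs)
        dead.add(d)

    return [inst for j, inst in enumerate(out) if j not in dead]
```

The ALU operation replaces the `mov` in place rather than the other way around, on the assumption that the order of coordinate writes matters to the hardware.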
I also started on a branch to jump to the end of the program when all 16 pixels in a thread have been discarded. This had given me a 7.9% win on GLB2.7 on Intel, so I hoped for similar wins here. 3DMMES seemed like a good candidate for testing, too, with a lot of discards that are followed by reams of code that could be skipped, including texturing. Initial results didn't seem to show a win, but I haven't actually done any stats on it yet. I also haven't done the usual "draw red where we were applying the optimization" hack to verify that my code is really working.
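The idea is easier to see in a toy model than in the real compiler, so here's a small Python sketch. It is not how the driver implements it: the program format, the `run_thread` helper, and the instruction costs are all hypothetical. The point it illustrates is that the jump can only be taken when every pixel in the thread is dead, since all 16 share one program counter.

```python
THREAD_WIDTH = 16  # pixels per thread, as described above


def run_thread(program, pixels, early_exit=True):
    """Toy cost model: return (instructions executed, per-pixel live mask).

    `program` is a list of (kind, arg) steps:
      ("alu", n)        -> n instructions of ordinary work
      ("discard", pred) -> one instruction that kills pixels where pred(p) is true
    """
    live = [True] * len(pixels)
    executed = 0
    for kind, arg in program:
        if kind == "discard":
            live = [alive and not arg(p) for alive, p in zip(live, pixels)]
            executed += 1
            # The whole thread shares one program counter, so we can only
            # skip ahead when no pixel at all is still live.
            if early_exit and not any(live):
                break
        else:
            executed += arg
    return executed, live


# Hypothetical shader: a little setup, a discard, then a lot of work.
shader = [("alu", 4), ("discard", lambda alpha: alpha < 0.5), ("alu", 100)]

all_dead = [0.0] * THREAD_WIDTH
mixed = [0.0] * (THREAD_WIDTH - 1) + [1.0]

cost_skipped, _ = run_thread(shader, all_dead)
cost_full, _ = run_thread(shader, all_dead, early_exit=False)
cost_mixed, _ = run_thread(shader, mixed)
```

In this model a single surviving pixel is enough to make the thread pay for all of the work after the discard, which is one reason the real-world win depends heavily on how spatially coherent the discards are.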
While I've been working on this, Jonas Pfeil (who originally wrote the multithreading support) has been working on a couple of other projects. He's been trying to get instructions scheduled into the delay slots of thread switches and branching, which should help reduce any regressions those two features might have caused. More exciting, he's just posted a branch for doing nearest-filtered raster textures (the primary operation in X11 compositing) using direct memory lookups instead of our shadow-copy fallback. Hopefully I get a chance to review, test, and merge in the next week or two.
On the kernel side, my branches for 4.10 have been pulled. We've got ETC1 and multithread FS for 4.10, and a performance win in power management. I've also been helping out and reviewing Boris Brezillon's work for SDTV output in vc4. Those patches should be hitting the list this week.