More fun, though, was finally taking a look at optimizing the tiled texture load/store code. This got started with Patrick Walton tweeting a link to a blog post on clever math for implementing texture tiling, given a couple of assumptions.
As with all GPUs these days, VC4 swizzles textures into a tiled layout so that when a cacheline is loaded from memory, it will cover nearby pixels in the Y direction as well as X, so that you are more likely to get cache hits for your neighboring pixels in drawing. And the tiling tends to be multiple levels, so that nearby cachelines are also nearby on the screen, reducing DRAM access latency.
For small textures on VC4, we have a single level of tiling: 4x4@32bpp blocks ("utiles") of raster order data, themselves arranged in up to a 4x4 block, and this mode is called LT. Once things go over that size, we go to T tiling, where we call a 4x4 LT block a subtile, and arrange subtiles either clockwise or counterclockwise order within a 2x2 block (a tile, which at this point is now 1kb of data), and tiles themselves are arranged left-to-right, then right-to-left, then left-to-right again.
The first thing I did was implement the blog post's clever math for LT textures. One of the nice features of the math trick is that it means I can do partial utile updates finally, because it skips around in the GPU's address space based on the CPU-side pixel coordinate instead of trying to go through GPU address space to update a 4x4 utile at a time. The downside of doing things this way is that jumping around in GPU address space means that our writes are unlikely to benefit from write combining, which is usually important for getting full write performance to GPU memory. It turned out, though, that the math was so much more efficient than what I was doing that it was still a win.
However, I found out that the clever math made reads slower. The problem is that, because we're write-combined, reads are uncached -- each load is a separate bus transaction to main memory. Switching from my old utile-at-a-time load using memcpy to the new code meant that instead of doing 4 loads using NEON to grab a row at a time, we were now doing 16 loads of 32 bits at a time, for what added up to a 30% performance hit.
Reads *shouldn't* be something that people do much, except that we still have some software fallbacks in X on VC4 ("core" unantialiased text rendering, for example, which I need to write the GLSL 1.20 shaders for), which involve doing a read/modify/write cycle on a texture. My first attempt at fixing the regression was just adding back a fast path that operates on a utile at a time if things are nicely utile-aligned (they generally are). However, some forced inlining of functions per cpp that I was doing for the unaligned case meant that the glibc memcpy call now got inlined back to being non-NEON, and the "fast" utile code ended up not helping loads.
Relying on details of glibc's implementation (their tradeoff for when to do NEON loads) and of gcc's implementation (when to go from memcpy calls to inlined 32-bits-at-a-time code) seems like a pretty bad idea, so I decided to finally write the NEON acceleration that Eben and I have talked about several times.
My first hope was that I could load a full cacheline with NEON's VLD. VLD1 only loads up to 1 "quadword" (16 bytes) at a time, so that doesn't seem like much help. VLD4 can load 64 bytes like we want, but it also turns AOS data into SOA in the process, and there's no corresponding "SOA-back-to-AOS store 8 or 16 bytes at a time" like we need to do to get things back into the CPU's strided representation. I tried VLD4+VST4 into a temporary, then doing my old untiling path on the cached temporary, but that still left me a few percent slower on loads than not doing any of this work at all.
Finally, I hit on using the VLDM instruction. It seems to be intended for stack loads/stores, but we can also use it to get 64 bytes of data in from memory untouched into NEON registers, and then I can use 4 (32bpp) or 8 (8 or 16bpp) VST1s to store it to the CPU side. With this, we get a 208.256% +/- 7.07029% (n=10) improvement to GetTexImage performance at 1024x1024. Doing the same NEON code for stores gave a 41.2371% +/- 3.52799% (n=10) improvement, probably mostly due to not calling into memcpy and having it go through its size/alignment-based memcpy path choosing process.
I'm not yet hitting full memory bandwidth, but this should be a noticeable improvement to X, and it'll probably help my piglit test suite runtime as well. Hopefully I'll get the code polished up and landed next week when I get back from LCA.