I first pulled out valgrind's massif tool on piglit's glsl-algebraic-add-add-1.shader_test. This works as a minimum "how much memory does it take to render *anything* with this driver?" test. We were consuming 1605k of heap at the peak, and there were some obvious fixes to be made.
First, the gallium state_tracker was allocating 659KB of space at context creation so that it could bytecode-interpret TGSI if needed for glRasterPos() and glRenderMode(GL_FEEDBACK). Given that nobody should ever use those features, and luckily they rarely do, I delayed the allocation of the somewhat misleadingly-named "draw" context until the fallbacks were needed.
Second, Mesa was allocating the memory for the GL 1.x matrix stacks up front at context creation. We advertise 32 matrices for modelview/projection, 10 per texture unit (32 of those), and 4 for programs. I instead implemented a typical doubling array reallocation scheme for storing the matrices, so that only the top matrix per stack is allocated at context creation. This saved 63KB of dirty memory per context.
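The doubling scheme can be sketched roughly as follows. This is a minimal illustration, not Mesa's actual code: the `matrix_stack` struct and the `stack_*` functions are hypothetical names, and error handling is reduced to a return code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not Mesa's actual structures. */
struct matrix { float m[16]; };

struct matrix_stack {
    struct matrix *stack;  /* heap storage, grown on demand */
    unsigned depth;        /* index of the current top matrix */
    unsigned alloc;        /* number of matrices currently allocated */
    unsigned max_depth;    /* advertised limit, e.g. 32 for modelview */
};

static const struct matrix identity = {{
    1, 0, 0, 0,
    0, 1, 0, 0,
    0, 0, 1, 0,
    0, 0, 0, 1,
}};

/* Allocate only the top matrix at context creation. Returns 0 on success. */
static int stack_init(struct matrix_stack *s, unsigned max_depth)
{
    s->stack = malloc(sizeof(*s->stack));
    if (!s->stack)
        return -1;
    s->stack[0] = identity;
    s->depth = 0;
    s->alloc = 1;
    s->max_depth = max_depth;
    return 0;
}

/* Push duplicates the top matrix; storage doubles only when it is full. */
static int stack_push(struct matrix_stack *s)
{
    if (s->depth + 1 >= s->max_depth)
        return -1; /* the caller would raise GL_STACK_OVERFLOW here */

    if (s->depth + 1 >= s->alloc) {
        unsigned new_alloc = s->alloc * 2;
        if (new_alloc > s->max_depth)
            new_alloc = s->max_depth;
        struct matrix *grown = realloc(s->stack, new_alloc * sizeof(*grown));
        if (!grown)
            return -1;
        s->stack = grown;
        s->alloc = new_alloc;
    }

    s->stack[s->depth + 1] = s->stack[s->depth];
    s->depth++;
    return 0;
}

/* Pop never shrinks the allocation; it only moves the top index. */
static int stack_pop(struct matrix_stack *s)
{
    if (s->depth == 0)
        return -1; /* the caller would raise GL_STACK_UNDERFLOW here */
    s->depth--;
    return 0;
}

static void stack_free(struct matrix_stack *s)
{
    free(s->stack);
    s->stack = NULL;
    s->alloc = 0;
}
```

A context that never calls glPushMatrix() pays for one matrix per stack under this scheme, and a stack that is pushed to its advertised depth ends up with the same allocation it would have had up front.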
722KB for these two fixes may not seem like a whole lot of memory to readers on fancy desktop hardware with 8GB of RAM, but the Raspberry Pi has only 1GB of RAM, and when you exhaust that you're swapping to an SD card. You should also expect a desktop to have several GL contexts created: the X Server uses one to do its rendering, you have a GL-based compositor with its own context, and your web browser and LibreOffice may each have one or more. Additionally, trying to complete our piglit testsuite on the Raspberry Pi is currently taking me 6.5 hours (when it even succeeds and doesn't see piglit's python runner get shot by the OOM killer), so I could use any help I can get in reducing context initialization time.
However, malloc()-based memory isn't all that's involved. The GPU buffer objects that get allocated don't get counted by massif in my analysis above. To try to approximately fix this, I added in valgrind macro calls to mark the mmap()ed space in a buffer object as being a malloc-like operation until the point that the BO is freed. This doesn't get at allocations for things like the window-system renderbuffers or the kernel's overflow BO (valgrind requires that you have a pointer involved to report it to massif), but it does help.
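As a rough sketch of that annotation: valgrind's client-request header provides `VALGRIND_MALLOCLIKE_BLOCK` and `VALGRIND_FREELIKE_BLOCK`, which can bracket the lifetime of a mapping. Everything else here is an assumption made for illustration: the `struct bo` layout, the `bo_map()`/`bo_free()` names, the `HAVE_VALGRIND` build flag, and the anonymous mmap() standing in for the mapping of a real buffer object through the DRM device.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Compile the client requests out entirely when valgrind's headers are
 * not available. HAVE_VALGRIND is an assumed build-system define. */
#ifdef HAVE_VALGRIND
#include <valgrind/valgrind.h>
#define VG(x) x
#else
#define VG(x) ((void)0)
#endif

/* Hypothetical buffer object; a real one would also carry a GEM handle. */
struct bo {
    void *map;
    size_t size;
};

static void *bo_map(struct bo *bo)
{
    if (bo->map)
        return bo->map;

    /* Stand-in for mapping the BO through the DRM fd. */
    void *p = mmap(NULL, bo->size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    bo->map = p;
    /* Tell valgrind tools to treat the mapping as a heap block:
     * (address, size, redzone bytes, is_zeroed). */
    VG(VALGRIND_MALLOCLIKE_BLOCK(p, bo->size, 0, 0));
    return p;
}

static void bo_free(struct bo *bo)
{
    if (!bo->map)
        return;
    /* End the block's lifetime before the mapping goes away. */
    VG(VALGRIND_FREELIKE_BLOCK(bo->map, 0));
    munmap(bo->map, bo->size);
    bo->map = NULL;
}
```

Because the block is only reported once a CPU pointer exists, a BO that is never mapped stays invisible to massif, which matches the limitation described above.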
Once I had massif reporting more, I noticed that glmark2 -b terrain was allocating a *lot* of memory for shader BOs. Going through them, an obvious problem was that we were generating a lot of shaders for glGenerateMipmap(). A few weeks ago I improved performance on the benchmark by fixing glGenerateMipmap()'s fallback blits that we were doing because vc4 doesn't support the GL_TEXTURE_BASE_LEVEL that the gallium aux code uses. I had fixed the fallback by making the shader do an explicit-LOD lookup of the base level if the GL_TEXTURE_BASE_LEVEL==GL_TEXTURE_MAX_LE
Reducing extra shaders was fun, so I set off on another project I had thought of before. VC4's vertex shader to fragment shader IO system is a bit unusual in that it's just a FIFO of floats (effectively), with none of these silly "vec4"s that permeate GLSL. Since I can take my inputs in any order, and more flexibility in the FS means avoiding register allocation failures sometimes, I have the FS compiler tell the VS what order it would like its inputs in. However, the list of all the inputs in their arbitrary orders would be expensive to hash at draw time, so I had just been using the identity of the compiled fragment shader variant in the VS and CS's key to decide when to recompile it in case output order changed. The trick was that, while the set of all possible orders is huge, the number that any particular application will use is quite small. I take the FS's input order array, keep it in a set, and use the pointer to the data in the set as the key. This cut 712 shaders from shader-db.
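The interning trick can be illustrated with a small sketch. It is not the driver's code: the `input_order` and `order_set` types are hypothetical, the one-byte-per-slot encoding is invented for the example, and a linear search stands in for whatever set implementation the driver actually uses (which is reasonable here only because the number of distinct orders is small).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One FS input order. The encoding (one byte per FIFO slot) is a
 * hypothetical choice for illustration. */
struct input_order {
    uint32_t num_inputs;
    uint8_t slots[];
};

/* Set of unique orders. Entries are allocated individually, so pointers
 * handed out stay valid when the entries array is reallocated. */
struct order_set {
    struct input_order **entries;
    unsigned count;
    unsigned alloc;
};

/* Return the canonical pointer for this order, adding it if it is new.
 * Equal contents always yield the same pointer, so a shader key can
 * store and compare the pointer instead of hashing the whole array. */
static const struct input_order *
order_set_intern(struct order_set *set, const uint8_t *slots, uint32_t n)
{
    for (unsigned i = 0; i < set->count; i++) {
        const struct input_order *e = set->entries[i];
        if (e->num_inputs == n && memcmp(e->slots, slots, n) == 0)
            return e;
    }

    if (set->count == set->alloc) {
        unsigned new_alloc = set->alloc ? set->alloc * 2 : 8;
        struct input_order **grown =
            realloc(set->entries, new_alloc * sizeof(*grown));
        if (!grown)
            return NULL;
        set->entries = grown;
        set->alloc = new_alloc;
    }

    struct input_order *e = malloc(sizeof(*e) + n);
    if (!e)
        return NULL;
    e->num_inputs = n;
    memcpy(e->slots, slots, n);
    set->entries[set->count++] = e;
    return e;
}

static void order_set_free(struct order_set *set)
{
    for (unsigned i = 0; i < set->count; i++)
        free(set->entries[i]);
    free(set->entries);
    set->entries = NULL;
    set->count = set->alloc = 0;
}
```

Two fragment shader variants that request the same input order then map to the same pointer, so the vertex shader key is unchanged and no recompile is triggered.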
Also, embarrassingly, remember when I mentioned tracking the FS in the CS's key above? Coordinate shaders don't output anything to the fragment shader. Like the name says, they just generate coordinates, which get consumed by the binner. So, by removing the FS from the CS key, I trivially cut 754 shaders from shader-db. Between the two, piglit's gl-1.0-blend-func test now passes instead of OOMing, so we get test coverage on blending.
Relatedly, while working on fixing a kernel oops recently, I had noticed that we were still reallocating the overflow BO on every draw call. This was old debug code from when I was first figuring out how overflow worked. Since each client can have up to 5 outstanding jobs (limited by Mesa) and each job was allocating a 256KB BO, we could be saving a MB or so per client assuming they weren't using much of their overflow (likely true for the X Server). The solution, now that I understand the overflow system better, was just to not reallocate and let the new job fill out the previous overflow area.
Other projects for the week that I won't expand on here: Debugging GPU hang in piglit glsl-routing (generated fixes for vc4-gpu-tools parser, tried writing a GFXH30 workaround patch, still not fixed) and working on supporting direct GLSL IR to NIR translation (lots of cleanups, a couple fixes, patches on the Mesa list).