VC4 now generates HTML documentation from the comments in the kernel code. It's just a start, but I'm hoping to add more technical details of how the system works here. I don't think doxygen-style documentation of individual functions would be very useful (if you're calling vc4 functions, you're editing the vc4 driver, unlike library code), so I haven't pursued that.
I've managed to get permission to port my 3D driver to another platform, the 911360 enterprise phone that's already partially upstreamed. It's got a VC4 V3D in it (slightly newer than the one in Raspberry Pi), but an ARM CLCD controller for the display. The 3D came up pretty easily, but for display I've been resurrecting Tom Cooksey's old CLCD KMS driver that never got merged. After 3+ years of DRM core improvements, we get to delete half of the driver he wrote and just use core helpers instead. Completing this work will require that I get the dma-buf fencing to work right in vc4, which is something that Android cares about.
Spurred on by a report from one of the Processing developers, I've also been looking at register allocation troubles again. I've found one good opportunity for reducing register pressure in Processing by delaying FS/VS input loads until they're actually used, and landed and then reverted one patch trying to accomplish it. I also took a look at DEQP's register allocation failures again, and fixed a bunch of its testcases with a little scheduling fix.
I've also started on a fix to allow arbitrary amounts of CMA memory to be used for vc4. The 256MB CMA limit today exists to make sure that the tile state buffer and tile alloc/overflow buffers are in the same 256MB region, due to a HW addressing bug. If we allocate a single buffer that can contain tile state, tile alloc, and overflow all at once within a 256MB area, then all the other buffers in the system (texture contents, particularly) can go anywhere in physical address space. The set top box group did some testing, and found that of all of their workloads, 16MB was enough to get any job completed, and 28MB was enough to get bin/render parallelism on any job. The downside is that a job that's too big ends up just getting rejected, but you'll surely run out of CMA for your textures before you hit the bin limit by having too many vertices per pixel.
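To make the constraint concrete, here's a rough sketch of carving one binner buffer into its three pieces and checking that the whole thing stays inside one 256MB region. This is not the actual vc4 code: the struct, the function names, and the way the sizes are split are all hypothetical, and I'm assuming "same 256MB region" means the same naturally aligned 256MB window (address bits above bit 27 identical).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: "same 256MB region" means the same naturally aligned
 * 256MB window, so the address bits above bit 27 have to match.
 */
#define BIN_WINDOW_SHIFT 28

/* Hypothetical layout of the single binner buffer (not a vc4 struct). */
struct bin_layout {
	uint64_t tile_state;	/* start of the tile state area */
	uint64_t tile_alloc;	/* start of the tile alloc area */
	uint64_t overflow;	/* start of the overflow area */
	uint64_t end;		/* one past the last byte of the buffer */
};

/* True if the non-empty range [start, end) sits in one 256MB window. */
static bool same_256mb_window(uint64_t start, uint64_t end)
{
	return end > start &&
	       (start >> BIN_WINDOW_SHIFT) == ((end - 1) >> BIN_WINDOW_SHIFT);
}

/* Packs tile state, tile alloc, and overflow back to back starting at
 * base.  Returns false if the combined buffer would straddle a 256MB
 * boundary, in which case it must not be handed to the binner.
 */
static bool bin_layout_init(struct bin_layout *l, uint64_t base,
			    uint64_t state_size, uint64_t alloc_size,
			    uint64_t overflow_size)
{
	assert(l);

	l->tile_state = base;
	l->tile_alloc = l->tile_state + state_size;
	l->overflow = l->tile_alloc + alloc_size;
	l->end = l->overflow + overflow_size;

	return same_256mb_window(base, l->end);
}
```

With a 16MB buffer (the split into 1MB/8MB/7MB is made up for illustration), a base of 0x10000000 passes the check, while a base of 0x1ff00000 fails because the buffer would run past the 0x20000000 boundary. Only this one allocation needs the check; textures and everything else can land wherever CMA puts them.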
Eben recently pointed out to me that the HVS can in fact scan out tiled buffers, which I had thought it wasn't able to. Having our window system buffers be linear means that when we try to texture from them (your compositing manager's drawing and uncomposited window movement, for example) we have to first do a copy to the tiled format. The downside of tiled scanout is that, in my measurements, we lose up to 1.4% of system performance (as measured by memcpy bandwidth tests with the 7" LCD panel for display) due to the HVS being inefficient at reading from tiled buffers (since it no longer gets to do large burst reads). However, this is probably a small price to pay for the massive improvement we get in graphics operations, and that cost could be reduced by X noticing that the display is idle and swapping to a linear buffer instead. Enabling tiled scanout is currently waiting for the new buffer modifiers patches to land, which Ben Widawsky at Intel has been working on.
There's no update on HDMI audio merging -- we're still waiting for any response from the ALSA maintainers, despite having submitted the patch twice and pinged on IRC.