hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4). Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.
Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4. The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful). That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.
The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory. For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it. It also makes sure that you bounds-check all reads through the uniforms or texture sampler. For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.
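A toy sketch of the kind of bounds check involved (illustrative only, not the kernel's actual validation code): any read through a uniform or texture relocation has to land entirely within the backing buffer, with care taken that the arithmetic itself can't overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds check in the spirit of the kernel's shader
 * validation: a read of `size` bytes at `offset` into a buffer of
 * `bo_size` bytes must fit entirely inside the buffer.  The names
 * here are invented for illustration.
 */
static bool
access_in_bounds(uint32_t bo_size, uint32_t offset, uint32_t size)
{
        /* Equivalent to offset + size <= bo_size, but written so that
         * a huge offset can't wrap around and pass the check.
         */
        return offset <= bo_size && size <= bo_size - offset;
}
```

The overflow-safe formulation matters: with 32-bit math, a malicious `offset` near `UINT32_MAX` would make a naive `offset + size <= bo_size` wrap and succeed.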
If it were just this one demo, we might be willing to lose support for it in the transition to the open driver. However, some modes of accelerated video decode also use QPU shaders at some stage, so we need some sort of solution.
My plan is basically to not do validation and have root-only execution of QPU shaders for now. The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4. VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run).
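The queuing scheme can be sketched in miniature like this (a userspace toy; the names and structures are made up, not the driver's actual API): jobs go into a small FIFO, and the interrupt raised by each job's final shader retires the oldest outstanding entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the request queue described above.  Each submitted job
 * is retired when the interrupt from its last shader's "mov
 * interrupt, 1" is collected.  Purely illustrative.
 */
#define QUEUE_DEPTH 8

struct qpu_job {
        bool done;      /* set when the end-of-job interrupt fires */
};

struct qpu_queue {
        struct qpu_job jobs[QUEUE_DEPTH];
        size_t head;    /* oldest job still running */
        size_t tail;    /* next free slot */
};

static bool
qpu_submit(struct qpu_queue *q, struct qpu_job job)
{
        if (q->tail - q->head == QUEUE_DEPTH)
                return false;   /* queue full, caller must wait */
        q->jobs[q->tail++ % QUEUE_DEPTH] = job;
        return true;
}

/* Simulated interrupt handler: retire the oldest job in order. */
static void
qpu_irq(struct qpu_queue *q)
{
        q->jobs[q->head++ % QUEUE_DEPTH].done = true;
}
```

Because the interrupts arrive in submission order, a single head/tail pair is enough; no per-job completion bookkeeping is needed.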
Dom is now off experimenting with what it will take to redirect video decode's QPU usage onto this code.
The other little project last week was fixing Processing's performance on vc4. It's one of the few apps that has been ported to the closed driver, so it's a likely candidate for comparisons between the two.
Unfortunately, Processing was clearing its framebuffers really inefficiently. On non-tiled hardware that was mostly OK, costing just a double clear of the depth buffer, but on vc4 its repeated glClear()s (rather than a single glClear() with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full-screen quad (preceded by a full-screen load) rather than a fast clear.
The solution was to improve the tracking of which buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears while just updating the colors to be cleared. There's always a tension in this sort of optimization: should the GL driver be clever to work around apps behaving badly, or should it push that off to the app developer (more like Vulkan does)? In this case, tiled-renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it. For Processing, I saw an improvement of about 10% on the demo I was looking at.
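The coalescing logic might look roughly like this (an illustrative sketch with invented names, not the actual vc4 driver code): as long as no primitives have been drawn in the current job, a new glClear() just ORs its buffer bits into the pending set and updates the clear values, instead of flushing the scene.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-job clear tracking for a tiled renderer. */
#define BUF_COLOR 0x1
#define BUF_DEPTH 0x2

struct job {
        uint32_t cleared;       /* buffers to fast-clear at job start */
        uint32_t clear_color;   /* packed RGBA clear value */
        float clear_depth;
        bool drawn;             /* any primitives since last flush? */
};

static void
flush(struct job *job)
{
        /* Submit the scene to the hardware (elided), then reset. */
        job->cleared = 0;
        job->drawn = false;
}

static void
do_clear(struct job *job, uint32_t bits, uint32_t color, float depth)
{
        if (job->drawn)
                flush(job);     /* can't coalesce past real draws */
        job->cleared |= bits;   /* merge with any earlier clears */
        if (bits & BUF_COLOR)
                job->clear_color = color;
        if (bits & BUF_DEPTH)
                job->clear_depth = depth;
}
```

With this, a sequence of separate color and depth clears at the start of a frame collapses into one fast clear at job setup, and only a clear that arrives after actual drawing forces a flush.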
Sadly, Processing won't be a good comparison between the open and closed drivers even after these fixes. Against the closed driver, Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers(). In its default mode, though, Processing uses GLX, which says that the depth buffer is retained across glXSwapBuffers(). To avoid that extra overhead in GLX mode, you'd need to call glDiscardFramebufferEXT() right before the swap. Unfortunately, that isn't wired up in gallium yet, but it shouldn't be hard to do once we have an app we can test with.