This week in vc4 (2016-12-05): SDTV, 3DMMES, HDMI audio, DSI

The Raspberry Pi Foundation recently started contracting with Free Electrons to give me some support on the display side of the stack.  Last week I got to review and release their first big piece of work: Boris Brezillon's code for SDTV support.  I had suggested that we use this as the first project because it should have been small and self-contained.  It turned out that there were some clock bugs Boris had to fix, plus a bug in my core VC4 CRTC code, but he got a working patch series together shockingly quickly.  He did one respin for a couple more fixes once I had tested it, and it's now out on the list waiting for devicetree maintainer review.  If nothing goes wrong, we should have composite out support in 4.11 (we're probably a week late for 4.10).

On my side, I spent some more time on HDMI audio and the DSI panel.  On the audio side, I'm now emitting the GCP packet for audio mute appropriately (I think), and with some more clocking fixes it's now accepting the audio data at the expected rate.  On the DSI front, I fixed a bit of sequencing and added debugfs for the registers like we have in our other encoders.  There's still no actual audio coming out of HDMI, and only white coming out of the panel.

The DSI situation is making me wish for someone else's panel that I could attach to the connector, so I could see if my bugs are in the Atmel bridge programming or in the DSI driver.

I did some more analysis of 3DMMES's shaders, and improved our code generation, for wins of 0.4%, 1.9%, 1.2%, 2.6%, and 1.2%.  I also experimented with changing the UBO (indirect addressed uniform array) upload path, which showed no effect.  3DMMES's uniform arrays are tiny, though, so it may be a win in some other app later.

I also got a couple of new patches from Jonas Pfeil.  I went through his thread switch delay slots patch, which is pretty close to ready.  He has a similar patch for branching delay slots, though apparently that one isn't showing wins yet in things he's tested.  Perhaps most exciting, though, is that he went and implemented an idea I had dropped on github: replacing our shadow copies of raster textures with a bunch of math in the shader and using general memory loads.  This could potentially fix X performance without a compositor, which we otherwise really don't have a workaround for other than "use a compositor."  It could also improve GL-in-a-window performance: right now all of our DRI surfaces are raster format, because we want to be able to get full screen pageflipping, but that means we do the shadow copy if they aren't fullscreen.  Hopefully this week I'll get a chance to test and review it.
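
A minimal sketch of the idea, with plain C standing in for the generated shader code (the 4-byte RGBA format and the helper name are assumptions for illustration):

    #include <stdint.h>

    /* A raster (linear) texture stores texels row by row, so a shader
     * can compute a texel's address itself and fetch it with a general
     * memory load, instead of sampling from a tiled shadow copy through
     * the texture unit. */
    static uint32_t fetch_raster_texel(const uint8_t *base, uint32_t stride,
                                       uint32_t x, uint32_t y)
    {
        /* y * stride + x * bytes-per-pixel, here assuming RGBA8888 */
        return *(const uint32_t *)(base + y * stride + x * 4);
    }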

This week in vc4 (2016-11-28): performance tuning

I missed last week's update, but with the holiday it ended up being a short week anyway.

The multithreaded fragment shaders are now in drm-next and Mesa master.  I think this was the last big win for raw GL performance and we're now down to the level of making 1-2% improvements in our commits.  That is, unless we're supposed to be using double-buffer non-MS mode and the closed driver was just missing that feature.  With the glmark2 comparisons I've done, I'm comfortable with this state, though.  I'm now working on performance comparisons for 3DMMES Taiji, which the HW team often uses as a benchmark.  I spent a day or so trying to get it ported to the closed driver and failed, but I've got it working on the open stack and have written a couple of little performance fixes with it.

The first was just a regression fix from the multithreading patch series, but it was impressive that multithreading hid a 2.9% instruction count penalty and still showed gains.

One of the new fixes I've been working on is folding ALU operations into texture coordinate writes.  This came out of frustration with the instruction selection research I had done over the last couple of weeks, where all the algorithms seemed like they would still need significant peephole optimization after selection.  I finally said "well, how hard would it be to just finish the big selection problems I know are easily doable with peepholes?" and it wasn't all that bad.  The win came out to about 1% of instructions, with a similar benefit to overall 3DMMES performance (it's shocking how ALU-bound 3DMMES is).
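
As a sketch of the peephole over a hypothetical IR (none of these names are the real vc4 compiler API), the transform looks like this:

    /* Fold "fadd temp, a, b; mov tmu0_s, temp" into "fadd tmu0_s, a, b"
     * when the ALU result's only use is the coordinate write, saving one
     * instruction per folded texture lookup. */
    static bool fold_alu_into_tmu_write(struct qinst *mov, struct qinst *def)
    {
        if (mov->op != OP_MOV || !is_tmu_coord_reg(mov->dst))
            return false;
        if (!is_foldable_alu_op(def) || !has_single_use(def))
            return false;

        def->dst = mov->dst;      /* retarget the ALU op at the TMU register */
        remove_instruction(mov);  /* the mov is now dead */
        return true;
    }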

I also started on a branch to jump to the end of the program when all 16 pixels in a thread have been discarded.  This had given me a 7.9% win on GLB2.7 on Intel, so I hoped for similar wins here.  3DMMES seemed like a good candidate for testing, too, with a lot of discards that are followed by reams of code that could be skipped, including texturing.  Initial results didn't seem to show a win, but I haven't actually done any stats on it yet.  I also haven't done the usual "draw red where we were applying the optimization" hack to verify that my code is really working, either.
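
Conceptually, the generated code gets a shape like this (C pseudocode standing in for QPU instructions; the mask handling is simplified):

    #include <stdint.h>

    /* Each fragment shader thread covers 16 pixels.  Accumulate which of
     * them have been discarded; once all are dead, nothing later in the
     * shader can affect the framebuffer, so skip straight to the end. */
    static void after_discard(uint16_t *discard_mask, uint16_t newly_discarded)
    {
        *discard_mask |= newly_discarded;
        if (*discard_mask == 0xffff)
            return;   /* the "jump to the end of the program" */

        /* ... texturing and the rest of the shader ... */
    }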

While I've been working on this, Jonas Pfeil (who originally wrote the multithreading support) has been working on a couple of other projects.  He's been trying to get instructions scheduled into the delay slots of thread switches and branching, which should help reduce any regressions those two features might have caused.  More exciting, he's just posted a branch for doing nearest-filtered raster textures (the primary operation in X11 compositing) using direct memory lookups instead of our shadow-copy fallback.  Hopefully I get a chance to review, test, and merge it in the next week or two.

On the kernel side, my branches for 4.10 have been pulled.  We've got ETC1 and multithread FS for 4.10, and a performance win in power management.  I've also been helping out and reviewing Boris Brezillon's work for SDTV output in vc4.  Those patches should be hitting the list this week.

This week in vc4 (2016-11-14): Multithreaded fragment shaders

I dropped HDMI audio last week because Jonas Pfeil showed up with a pair of branches to do multithreaded fragment shaders.

Some context for multithreaded fragment shaders: Texture lookups are really slow.  I think we eyeballed them as having a latency of around 20 QPU instructions.  We can hide latency sometimes if there's some unrelated math to be done after the texture coordinate calculation but before using the texture sample.  However, in most cases, you need the texture sample results right away for the next bit of work.

To allow programs to hide that latency, there's a cooperative hyperthreading mode that a fragment shader can opt into.  The shader stays in the bottom half of register space, and before it collects the results of a texture fetch, it issues a thread switch signal, which the hardware will use to run a second fragment shader thread until that one issues its own thread switch.  For the second thread, the top bit of the register addresses gets flipped, so the two threads' states don't overlap (except for the accumulators and the flags, which the shaders have to save and restore).

I had delayed working on this because the full solution was going to be really tricky: queue up as many lookups as we can, then thread switch, then collect all those results before switching again, all while respecting the FIFO sizes.  However, Jonas's huge contribution here was in figuring out that you don't need to be perfect, you can get huge gains by thread switching between each texture lookup and reading its results.
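
In illustrative pseudo-assembly (not real QPU syntax), the difference is just one instruction per lookup:

    # Single-threaded: the QPU stalls unless there's unrelated math to
    # schedule between the coordinate write and the sample read.
    mov   tmu0_s, r0       # issue the texture coordinate
    ...                    # ~20 instructions of latency to hide
    load  r4               # collect the sample

    # Jonas's scheme: thread switch across every lookup.
    mov   tmu0_s, r0       # issue the texture coordinate
    thrsw                  # the other fragment thread runs meanwhile
    load  r4               # by now the sample should be ready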

The upshot was a 0-20% performance improvement on glmark2 and a performance hit to only one testcase.  With this we're down to 3 subtests that we're slower on than the closed source driver.  Jonas's kernel patch is out on the list, and I rewrote the Mesa series to expose threading to the register allocator and landed all but the enabling patch (which has to wait on the kernel side getting merged).  Hopefully I'll finish merging it in a week.

In the process of writing multithreading, Jonas noticed that we were scheduling our TLB_Z writes early in the program, which can cut into fragment shader parallelism between the QPUs (of which we have 12) because a TLB_Z write locks the scoreboard for that 2x2 pixel quad.  I merged a patch to Mesa that pushes the TLB_Z down to the bottom, at the cost of a single extra QPU instruction.  Some day we should flip QPU scheduling around so that we pair that TLB_Z up better, and fill our delay slots in thread switches and branching as well.

I also tracked down a major performance issue in Raspbian's desktop using the open driver.  I asked them to use xcompmgr a while back, because readback from the front buffer using the GPU is really slow (the texture unit can't read from raster textures, so we have to reformat them to be readable), which made window dragging unbearable.  However, xcompmgr doesn't unredirect fullscreen windows, so full screen GL apps emit copies (with reformatting!) instead of pageflipping.

It looks like the best choice for Raspbian is going to be using compton, an xcompmgr fork that does pageflipping (you get tear free screen updates!) and unredirection of full screen windows (you get pageflipping directly from GL apps).  I've also opened a bug report on compton with a description of how they could improve their GL drawing for a tiled renderer like the Pi, which could improve its performance for windowed updates significantly.

Simon ran into some trouble with compton, so he hasn't flipped the default yet, but I would encourage anyone running a Raspberry Pi desktop to give it a shot -- the improvement should be massive.

Other things last week: More VCHIQ patch review, merging more to the -next branches for 4.10 (Martin Sperl's thermal driver, at last!), and a shadow texturing bugfix.

This week in vc4 (2016-11-07): glmark2, ETC1, power management, HDMI audio

I spent a little bit of time last week on actual 3D work.  I finally figured out what was going wrong with glmark2's terrain demo.  The RCP (reciprocal) instruction on vc4 is a very rough approximation, and in translating GLSL code doing floating point divides (which we have to convert from a / b to a * (1 / b)) we use a Newton-Raphson step to improve our approximation.  However, we forgot to do this for the implicit perspective divide we have to do at the end of the vertex shader, so the triangles in the distance were unstable and bounced around based on the error in the RCP output.
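
For reference, one Newton-Raphson step for a reciprocal looks like this in C (hw_rcp() stands in for the QPU's rough RCP result):

    /* If x0 approximates 1/a, then x1 = x0 * (2 - a * x0) roughly doubles
     * the number of correct bits. */
    static float rcp_refined(float a)
    {
        float x0 = hw_rcp(a);          /* rough hardware approximation */

        return x0 * (2.0f - a * x0);   /* one Newton-Raphson step */
    }

For example, with a = 3.0 and a rough x0 = 0.33, one step gives 0.33 * (2 - 0.99) = 0.3333, already good to four digits.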

I also finally got around to writing kernel-side validation for ETC1 texture compression.  I should be able to merge support for this to v4.10, which is one of the feature checkboxes we were missing compared to the old driver.  Unfortunately ETC1 doesn't end up being very useful for us in open source stacks, because S3TC has been around a lot longer and is supported on more hardware, so there's not much content using ETC that I know of outside of Android land.

I also spent a while fixing up various regressions that had happened in Mesa while I'd been playing in the kernel.  Some day I should work on CI infrastructure on real hardware, but unfortunately Mesa doesn't have any centralized CI infrastructure to plug into.  Intel has a test farm, but not all proposed patches get submitted to it, and it (understandably) doesn't have non-Intel hardware.

In kernel land, I sent out a patch to reduce V3D's overhead due to power management.  We were immediately shutting off the 3D engine when it went idle, even if userland might be expected to submit a new frame shortly thereafter.  This was taking up 1% of the CPU on profiles I was looking at last week, and I've seen it up to 3.5%.  Now we keep the GPU on for at least 40ms ("about two frames plus some slack").  If I'm reading right, the closed driver does immediate powerdown as well, so we may be using slightly more power now, but given that power state transitions themselves cost power, and the CPU overhead costs power, hopefully this works out fine anyway (even though we've never done power measurements for the platform).
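
A minimal sketch of that approach using the kernel's delayed-work API (the power helper and names here are illustrative, not the actual vc4 code):

    #include <linux/jiffies.h>
    #include <linux/workqueue.h>

    #define V3D_IDLE_POWEROFF_DELAY_MS 40  /* ~two frames plus some slack */

    /* Runs only if no new job arrived during the grace period. */
    static void v3d_idle_work(struct work_struct *work)
    {
        v3d_set_power(false);              /* hypothetical power toggle */
    }

    static DECLARE_DELAYED_WORK(v3d_idle, v3d_idle_work);

    /* On the last queued job completing, start the countdown instead of
     * powering off immediately. */
    static void v3d_last_job_done(void)
    {
        mod_delayed_work(system_wq, &v3d_idle,
                         msecs_to_jiffies(V3D_IDLE_POWEROFF_DELAY_MS));
    }

    /* On a new job arriving, cancel the pending powerdown; if it already
     * fired, power the GPU back up. */
    static void v3d_job_submitted(void)
    {
        if (!cancel_delayed_work_sync(&v3d_idle))
            v3d_set_power(true);
    }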

I also reviewed more patches from Michael Zoran for vchiq.  It should now (once all reviewed patches land) work on arm64.  We still need to get the vchiq-using drivers like camera and vcsm into staging.

Finally, I made more progress on HDMI audio.  One of the ALSA developers, Lars-Peter Clausen, pointed out that ALSA's design is that userspace would provide the IEC958 subframes by using alsalib's iec958 plugin and a bit of asoundrc configuration.  He also provided a ton of help debugging that pipeline.  I rebased and cleaned up into a branch that uses simple-audio-card as my machine driver, so the whole stack is pretty trivial.  Now aplay runs successfully, though no sound has come out.  Next step is to go test that the closed stack actually plays audio on this monitor and capture some register state for debugging.
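
That asoundrc configuration would look something like this (an untested sketch; the device name and status bytes are just example values):

    # Route PCM playback through alsa-lib's iec958 plugin, which packs
    # the samples into IEC958 subframes and fills in channel status bits.
    pcm.vc4_hdmi {
        type iec958
        slave {
            pcm "hw:0,0"                  # whichever device the card exposes
            format IEC958_SUBFRAME_LE
        }
        status [ 0x04 0x00 0x00 0x02 ]    # consumer-mode PCM at 48kHz
    }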

This week in vc4 (2016-10-31): HDMI audio

Last week was spent almost entirely on HDMI audio support.

I had started the VC4-side HDMI setup last week, and got the rest of the stubbed bits filled out this week.

Next was starting to get the audio side connected.  I had originally derived my VC4 side from the Mediatek driver, which makes a child device for the generic "hdmi-codec" DAI codec to attach to.  HDMI-codec is a shared implementation for HDMI encoders that take I2S input: it basically just passes the audio setup params into the DRM HDMI encoder, whose job is to get the I2S signals on the wires routed into HDMI audio packets.  These systems also have a CPU DAI registered through devicetree that moves PCM data from memory onto wires, and a machine driver, also registered through devicetree, that connects the CPU DAI's wires to the HDMI codec DAI's wires.  This is apparently how SoC audio is usually structured.
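
For reference, the Mediatek-style hdmi-codec registration looks roughly like this (a sketch; the ops table contents and the vc4 naming are hypothetical):

    #include <linux/err.h>
    #include <linux/platform_device.h>
    #include <sound/hdmi-codec.h>

    static int vc4_hdmi_register_codec(struct device *dev)
    {
        struct hdmi_codec_pdata codec_data = {
            .ops = &vc4_hdmi_codec_ops,   /* audio_startup/hw_params/etc. */
            .i2s = 1,
            .max_i2s_channels = 8,
        };
        struct platform_device *pdev;

        /* The generic hdmi-codec driver binds to this child device and
         * becomes the DAI codec for our encoder. */
        pdev = platform_device_register_data(dev, HDMI_CODEC_DRV_NAME,
                                             PLATFORM_DEVID_AUTO,
                                             &codec_data, sizeof(codec_data));
        return PTR_ERR_OR_ZERO(pdev);
    }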

VC4, on the other hand, doesn't have a separate hardware block for moving PCM data from memory onto wires; it just has the MAI_DATA register in the HDMI block that you put S/PDIF (IEC958) data into using the SoC's generic DMA engine.  So I made VC4's HDMI driver register a child device with the MAI_DATA register's address passed in, and that child device gets bound by the CPU DAI driver and uses the ASoC generic DMA engine support.  That CPU DAI driver also ends up setting up the machine driver, since I didn't have any other reference to the CPU DAI anywhere.  The glue here is all based on static strings I can't rely on, and is a complete hack for now.

Last, notice how I've used "PCM" when talking about a working implementation's audio data and "S/PDIF" for the data I have to take in?  That's the sticking point.  The MAI_DATA register wants things that look like the AES3 frame description you can find on Wikipedia.  I think the HW is generating my preamble.  In the firmware's HDMI audio implementation, they go through the unpacked PCM data and set the channel status bit in each sample according to the S/PDIF channel status word (the first 2 bytes of which are also documented on Wikipedia).  At the same time that they're doing that, they also set the parity bit.

My first thought was "Hey, look, ALSA has SNDRV_PCM_FORMAT_IEC958_SUBFRAME, maybe my codec can say it only takes these S/PDIF frames and everything will work out."  No.  It turns out aplay won't generate them and neither will gstreamer.  To make a usable HDMI audio driver, I'm going to need to generate my IEC958 myself.

Linux is happy to generate my channel status word in memory.  I think the hardware is happy to generate my parity bit.  That just leaves making the CPU DAI write the bits of the channel status word into the subframes as PCM samples come in to the driver, and making sure that the MSB of the PCM audio data is bit 27.  I haven't figured out how to do that quite yet, which is the project for this week.
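
A sketch of that packing in C for a 16-bit sample (channel_status_bit is the current bit of the 192-bit channel status word; per the above, the hardware is assumed to cover the preamble and parity):

    #include <stdint.h>

    /* IEC958/AES3 subframe layout: bits 0-3 preamble, bits 4-27 audio
     * with the MSB at bit 27, bit 28 validity, bit 29 user data, bit 30
     * channel status, bit 31 parity. */
    static uint32_t pack_iec958_subframe(int16_t sample, int channel_status_bit)
    {
        uint32_t sub = 0;

        sub |= (uint32_t)(uint16_t)sample << 12;          /* MSB at bit 27 */
        sub |= (uint32_t)(channel_status_bit & 1) << 30;  /* status bit */
        /* preamble (bits 0-3) and parity (bit 31) left to the hardware */
        return sub;
    }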

Work-in-progress, completely non-functional code is here.

This week in vc4 (2016-10-24): HDMI audio, DSI, simulator, linux-next

The most interesting work I did last week was starting to look into HDMI audio.  I've been putting this off because really I'm a 3D developer, not a modesetting developer, and certainly not an audio developer.  But the work needs to get done, so here I am.

HDMI audio on SoCs looks like it's not terribly hard.  VC4's HDMI driver needs to instantiate a platform device for audio when it gets loaded, and attach a function table to it for calls to be made when starting/stopping audio.  Then I write a sound driver that will bind to the node that VC4 created, accepting a parameter of which register to DMA into.  That sound driver will tell ALSA about what formats it can accept and actually drive the DMA engine to write audio frames into HDMI.  On the VC4 HDMI side, we do a bit of register setup (clocks, channel mappings, and the HDMI audio infoframe) when the sound driver tells us it wants to start.

It all sounds pretty straightforward so far.  I'm currently at the "lots of typing" stage of development of this.

Second, I did a little poking at the DSI branch again.  I've rebased onto 4.9 and tested an idea I had for how DSI might be failing.  Still no luck.  There's code on drm-vc4-dsi-panel-restart-cleanup-4.9, and the length of that name suggests how long I've been bashing my head against DSI.  I would love for anyone out there with the inclination to do display debug to spend some time looking into it.  I know the history there is terrible, but I haven't prioritized cleanup while trying to get the driver working in the first place.

I also finished the simulator cleanup from last week, found a regression in the major functional improvement in the cleanup, and landed what worked.  Dave Airlie pointed out that he took a different approach for testing with virgl called vtest, and I took a long look at that.  Now that I've got my partial cleanup in place, the vtest-style simulator wrapper looks a lot more doable, and if it doesn't have the major failing that swrast did with visuals then it should actually make vc4 simulation more correct.

I landed a very tiny optimization to Mesa that I wrote while debugging DEQP failures.

Finally, I put together the -next branches for upstream bcm2835 now that 4.9-rc1 is out.  So far I've landed a major cleanup of pinctrl setup in the DT (started by me and finished by Gerd Hoffmann from Red Hat, to whom I'm very grateful) and a little USB setup fix.  I did a lot of review of VCHIQ patches for staging, and sort of rewrote Linus Walleij's RFC patch for getting descriptive names for the GPIOs in lsgpio.

This week in vc4 (2016-10-18): X performance, DEQP, simulation

The most visible change this week is a fix to Mesa for texture upload performance.  A user had reported that selection rectangles in LXDE's file manager were really slow.  I brought up sysprof, and it showed that we were doing uncached reads from the GPU in a situation that should have been entirely write-combined writes.

The bug was that when the size of the texture wasn't aligned to utiles, we were loading the previous contents into the temporary buffer before writing the new texture data in and uploading, even if the full texture was being updated.  I've fixed it to check for when the full texture is being uploaded, and to skip the initial download.  This bug was getting hit on almost any Cairo vector graphics operation with its Xlib backend, so hopefully this helps a lot of people's desktops.
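
The check itself is conceptually tiny (hypothetical names, not the exact Mesa code):

    /* Only read the old contents back when the upload covers part of the
     * texture; a full-texture upload overwrites everything anyway, and
     * reads from the GPU are uncached and painfully slow. */
    bool whole_texture = (x == 0 && y == 0 &&
                          upload_w == tex_w && upload_h == tex_h);

    if (!whole_texture)
        download_utiles(temp, tex);    /* slow uncached GPU read */
    write_new_texels(temp, user_data); /* fast write-combined writes */
    upload_utiles(tex, temp);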

I also worked on a cleanup of the simulator mode.  I use the closed source simulator regularly as part of my work -- it's fairly accurate to hardware behavior, and allows you to trace what's happening when things go wrong, all with the driver code running like "normal" for apps on my x86 desktop.

However, my simulator support is a little invasive to the driver, replacing vc4 ioctl calls with alternative ioctls to the i915 driver.  I started on a series to make vc4_simulator.c take plain VC4 ioctls and translate them, so that the simulator code is entirely contained.  The last step, I think, is figuring out at which level to put the simulator's copying in and out of window-system framebuffers.

Last, I got DEQP's GLES2 tests up and running on Raspberry Pi.  These are approximately equivalent to the official conformance tests.  The initial results were quite good -- 18630/19098 (97.5%) passing when run in the piglit framework.  I found a couple of fixes to be made in glClear() support, one of which will affect all gallium drivers.  Of the remainder, there are a few tests that are failing due to test bugs (we expose extensions that allow features the tests don't expect), but most are failing in register allocation.  For the register allocation failures, I see a relatively quick fix that should reduce register pressure in loops.

This week in vc4 (2016-10-11): QPU user shaders, Processing performance

Last week I started work on making hello_fft work with the vc4 driver loaded.

hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4).  Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.

Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4.  The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful).  That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.

The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory.  For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it.  It also makes sure that you bounds-check all reads through the uniforms or texture sampler.  For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.

If it was just this one demo, we might be willing to lose support for it in the transition to the open driver.  However, some modes of accelerated video decode also use QPU shaders at some stage, so we have to have some sort of solution.

My plan is basically to not do validation and have root-only execution of QPU shaders for now.  The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4.  VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run).

Dom is now off experimenting in what it will take to redirect video decode's QPU usage onto this code.

The other little project last week was fixing Processing's performance on vc4.  It's one of the few apps that is ported to the closed driver, so it's a likely thing to be compared on.

Unfortunately, Processing was clearing its framebuffers really inefficiently.  For non-tiled hardware it was mostly OK, with just a double-clear of the depth buffer, but for vc4 its repeated glClear()s (rather than a single glClear with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full screen quad (and a full screen load before doing so) rather than fast clearing.

The solution was to improve the tracking of what buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears together while only updating the colors to be cleared.  There's always a tension in this sort of optimization: should the GL driver be clever to work around apps behaving badly, or should it push that off to the app developer, as Vulkan does?  In this case, tiled renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it.  For Processing, I saw an improvement of about 10% on the demo I was looking at.
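
A rough sketch of the tracking (field and helper names hypothetical):

    /* In glClear(): if nothing has been drawn since the last clear, fold
     * this clear into it instead of flushing the scene. */
    static void clear_buffers(struct ctx *ctx, unsigned buffers,
                              const float *color, double depth, int stencil)
    {
        if (ctx->draw_calls_queued == 0) {
            ctx->cleared |= buffers;     /* coalesce with earlier clears */
            if (buffers & CLEAR_COLOR)
                update_clear_color(ctx, color);
            if (buffers & CLEAR_DEPTH)
                ctx->clear_depth = depth;
            if (buffers & CLEAR_STENCIL)
                ctx->clear_stencil = stencil;
            return;                      /* no flush, no full-screen quad */
        }

        /* Primitives already drawn: flush and start a new scene that
         * begins with this clear. */
        flush_and_start_clear(ctx, buffers, color, depth, stencil);
    }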

Sadly, Processing won't be a good comparison for open vs. closed drivers even after these fixes.  For the closed driver Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers().  In its default mode, though, Processing uses GLX, which says that the depth buffer is retained on glXSwapBuffers().  To avoid this extra overhead in its GLX mode, you'd need to use glDiscardFramebufferEXT() right before the swap.  Unfortunately, this isn't connected in gallium yet, but it shouldn't be hard to do once we have an app that we can test with.
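
The call in question would look like this, assuming the stack exposes EXT_discard_framebuffer:

    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    /* Tell the tiler the depth buffer won't be needed after the swap, so
     * its tiles never have to be stored out to memory. */
    static void discard_depth_then_swap(void)
    {
        const GLenum discards[] = { GL_DEPTH_EXT };

        glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, discards);
        /* ... followed by glXSwapBuffers() / eglSwapBuffers() ... */
    }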

This week in vc4 (2016-10-03): HDMI, X testing and cleanups, VCHIQ

Last week Adam Jackson landed my X testing series, so now at least 3 of us pushing code are testing glamor with every commit.  I used the series to test a cleanup of the Render-extension-related code in glamor, deleting 550 lines of junk.  I did this while starting to work on using GL sampler objects, which should help reduce our pixmap allocation and per-draw Render overhead.  That patch is failing make check so far.

The big project for the week was tackling some of the HDMI bug reports.  I pulled out my bigger monitor that actually takes HDMI input, and started working toward getting it to display all of its modes correctly.  We had an easy bug in the clock driver that broke changing to low resolutions: trying to drive the PLL too slow, instead of driving the PLL in spec and using the divider to get down to the clock we want.  We needed infoframes support to fix purple borders on the screen.  We were doing our interlaced video mode programming pretty much entirely wrong (it's not perfect yet, but 1 of the 3 interlaced modes on my monitor displays correctly, and they all display something).  Finally, I added support for double-clocked CEA modes (generally the low resolution, low refresh rate ones in your list; these are probably particularly interesting for RPi, due to all the people running emulators).

Finally, one of the more exciting things for me was that I got general agreement from gregkh to merge the VCHIQ driver through the staging tree, and he even put the initial patch together for me.  If we add in a couple more small helper drivers, we should be able to merge the V4L2 camera driver, and hopefully get closer to doing a V4L2 video decode driver for upstream.  I've now tested a messy series that at least executes "vcdbg log msg."  Those other drivers will definitely need some cleanups, though.

This week in vc4 (2016-09-26): XDC 2016, glamor testing, glamor performance

I spent last week at XDC 2016 with other graphics stack developers.

I gave a talk on X Server testing and our development model.  Since playing around in the Servo project (and a few other github-centric communities) recently, I've become increasingly convinced that github + CI + automatic merges is the right way to go about software development now.  I got a warmer reception from the other X Server developers than I expected, and I hope to keep working on building out this infrastructure.

I also spent a while with keithp and ajax (other core X developers) talking about how to fix our rendering performance.  We've got two big things hurting vc4 right now: Excessive flushing, and lack of bounds on rendering operations.

The excessive flushing one is pretty easy.  X currently does some of its drawing directly to the front buffer (I think this is bad and we should stop, but that *is* what we're doing today).  With glamor, we do our rendering with GL, so the driver accumulates a command stream over time.  In order for the drawing to the front buffer to show up on time, both for running animations, and for that last character you typed before rendering went idle, we have to periodically flush GL's command stream so that its rendering to the front buffer (if any) shows up.

Right now we just glFlush() every time we might block sending things back to a client, which is really frequent.  For a tiling architecture like vc4, each glFlush() is quite possibly a load and store of the entire screen (8MB).

The solution I started back in August is to rate-limit how often we flush.  When the server wakes up (possibly to render), I arm a timer.  When it fires, I flush and disarm it until the next wakeup.  I set the timer to 5ms as an arbitrary value that was clearly not pretending to be vsync.  The state machine was a little tricky and seemed to have a bug (or at least a bad interaction with x11perf; it was unclear).  But ajax pointed out that I could easily do better: we have a function in glamor that gets called right before it does any GL operation, meaning that we could use it to arm the timer, instead of arming it every time the server wakes up to process input.
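
A sketch of the improved scheme against the server's timer API (the glue names are hypothetical):

    #include <epoxy/gl.h>
    #include "os.h"   /* the X server's OsTimer API */

    #define FLUSH_DELAY_MS 5  /* arbitrary; clearly not pretending to be vsync */

    static OsTimerPtr flush_timer;
    static Bool flush_timer_armed;

    static CARD32 flush_timer_fired(OsTimerPtr timer, CARD32 now, void *arg)
    {
        glFlush();
        flush_timer_armed = FALSE;
        return 0;   /* one-shot: stay disarmed until new GL rendering */
    }

    /* Called from glamor's hook right before any GL operation, so idle
     * wakeups never arm the timer at all. */
    static void maybe_arm_flush_timer(void)
    {
        if (flush_timer_armed)
            return;
        flush_timer = TimerSet(flush_timer, 0, FLUSH_DELAY_MS,
                               flush_timer_fired, NULL);
        flush_timer_armed = TRUE;
    }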

Bounding our rendering operations is a bit trickier.  We started doing some of this with keithp's glamor rework: We glScissor() all of our rendering to the pCompositeClip (which is usually the bounds of the current window).  For front buffer rendering in X, this is a big win on vc4 because it means that when you update an uncomposited window I know that you're only loading and storing the area covered by the window, not the entire screen.  That's a big bandwidth savings.
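
The bounding itself is as simple as this (a sketch; glamor's real code also has to account for the GL/X Y-axis flip):

    /* Clip all GL rendering to the window's extents so the tiler only
     * loads and stores the tiles the window actually covers. */
    static void scissor_to_composite_clip(RegionPtr pCompositeClip)
    {
        BoxPtr extents = RegionExtents(pCompositeClip);

        glEnable(GL_SCISSOR_TEST);
        glScissor(extents->x1, extents->y1,
                  extents->x2 - extents->x1,
                  extents->y2 - extents->y1);
    }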

However, the pCompositeClip isn't enough.  As I'm typing we should be bounding the GL rendering around each character (or span of them), so that I only load and store one or two tiles rather than the entire window.  It turns out, though, that you've probably already computed the bounds of the operation in the Damage extension, because you have a compositor running!  Wouldn't it be nice if we could just reuse that old damage computation and reference it?

Keith has started on making this possible: First with the idea of const-ifying all the op arguments so that we could reuse their pointers as a cheap cache key, and now with the idea of just passing along a possibly-initialized rectangle with the bounds of the operation.  If you do compute the bounds, then you pass it down to anything you end up calling.

Between these two, we should get much improved performance on general desktop apps on Raspberry Pi.

Other updates: Landed opt_peephole_sel improvement for glmark2 performance (and probably mupen64plus), added regression testing of glamor to the X Server, fixed a couple of glamor rendering bugs.