[Most Recent Entries]
Below are the 20 most recent journal entries recorded in
Eric Anholt's LiveJournal:
|Monday, March 27th, 2017|
|This week in vc4 (2017-03-27): Upstream PRs, more CMA, meson
Last week I sent pull requests for bcm2835 changes for 4.12. We've got some DT updates for HDMI audio, DSI, and SDHOST, and defconfig changes to enable SDHOST. The DT changes to actually enable SDHOST (and get wifi working on Pi3) won't land until 4.13.
I also wrote a patch to enable using more than 256MB of CMA memory (and not require any particular alignment). The 256MB limit was due to a hardware bug: the binner's memory allocations get dereferenced with their top 4 bits set to the top 4 bits of the tile state data array's address. Given that tile state allocations happen after CL setup (while the binner is running and throwing overflow interrupts), there was no way to guarantee that we could find overflow memory with the top bits matching.
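To make the addressing constraint concrete, here is a small Python model of it (a sketch, not driver code). It assumes 32-bit bus addresses, so the top 4 bits pick one of sixteen 256MB regions; the function names are made up for illustration.

```python
TOP4_MASK = 0xF0000000   # top 4 bits of a 32-bit address: which 256MB region
LOW28_MASK = 0x0FFFFFFF  # offset within that 256MB region


def effective_binner_address(tile_state_addr, alloc_addr):
    """Model of the bug: the binner keeps the low 28 bits of the allocation's
    address but takes the top 4 bits from the tile state array's address."""
    return (tile_state_addr & TOP4_MASK) | (alloc_addr & LOW28_MASK)


def usable_for_binner(tile_state_addr, alloc_addr, size):
    """An allocation is only safe if the hardware would dereference the
    address we intended, for both its first and its last byte."""
    last = alloc_addr + size - 1
    return (effective_binner_address(tile_state_addr, alloc_addr) == alloc_addr
            and effective_binner_address(tile_state_addr, last) == last)
```

With tile state at 0x10000000, an overflow buffer at 0x2ABC0000 would actually be accessed at 0x1ABC0000, which is why overflow memory had to be found in the same 256MB region.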
The new solution, suggested by someone from the set top box group, is to allocate a single 16MB to 32MB buffer at HW init time, and return all of those types of allocations out of it, since it turns out you don't need much to complete rendering of any given scene. I've been mulling over the details of a solution for a while, and finally wrote and tested the patch I wanted (tricky parts included freeing the memory when the hardware was idle, and how to track the lifetimes of the sub-allocations). Results look good, and I'll be submitting it this week.
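The sub-allocation idea can be sketched as a fixed pool carved into equal slots and tracked with a bitmask. This is only an illustration under assumptions: the class name, the 512KB slot size, and the bitmask scheme are hypothetical and are not taken from the patch itself.

```python
class BinPool:
    """Hypothetical model: one buffer allocated up front, handed out in
    equal-sized slots. Bit i of `used` is set while slot i is in use."""

    def __init__(self, pool_size=16 * 1024 * 1024, slot_size=512 * 1024):
        self.slot_size = slot_size
        self.num_slots = pool_size // slot_size
        self.used = 0

    def alloc(self):
        """Return the byte offset of a free slot, or None if the pool is full."""
        for i in range(self.num_slots):
            if not self.used & (1 << i):
                self.used |= 1 << i
                return i * self.slot_size
        return None

    def free(self, offset):
        """Release the slot that starts at `offset`."""
        self.used &= ~(1 << (offset // self.slot_size))

    def idle(self):
        """True when no slot is in use, i.e. the whole pool could be freed."""
        return self.used == 0
```

Because every slot lives inside the one buffer, every slot automatically shares its top address bits with everything else in the pool. A full pool returns None here, which corresponds to a job that has to wait or be rejected.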
However, I spent most of the week converting the X Server over to meson.
Meson is a delightful new build system (based around Ninja on Linux) that massively speeds up builds, while also being portable to Windows (unlike autotools generally). If you've ever tried to build the X stack on Raspberry Pi, you know that autotools is painfully slow. It's also been the limiting factor for me in debugging my scripts for CI for the X Server -- something we'd really like to be doing as we hack on glamor or do refactors in the core.
So far all I've landed in this project is code deletion, as I find build options that aren't hooked up to anything, or code that isn't hooked up to build options. This itself will speed up our builds, and ajax has been working in parallel on deleting a bunch of code that makes the build messier than it needs to be. I've also submitted patches for rendercheck converting to meson (as a demo of what the conversion looks like), and I have Xephyr, Xvfb, Xdmx, and Xwayland building in the X Server with meson.
So far the only stumbling block for the meson conversion of the X Server is the X.org sdksyms.c file. It's the ugliest part of the build -- running the C preprocessor on a generated .c that #includes a bunch of .h files, then running the output of that through awk and trying to parse C using regular expressions. This is, as you might guess, somewhat fragile.
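As a rough illustration of why this kind of parsing is fragile, here is a Python stand-in for a line-at-a-time regex scan. It is not the real sdksyms script (which uses awk); the EXPORT marker, the pattern, and the declarations are all invented for the example.

```python
import re

# Hypothetical pattern: export marker, some type words, a name, then "(".
DECL = re.compile(r"^\s*EXPORT\s+[\w \t\*]+?\b(\w+)\s*\(")


def exported_symbols(preprocessed):
    """Scan line by line, the way an awk script would, and collect the
    function names the pattern recognises."""
    names = []
    for line in preprocessed.splitlines():
        match = DECL.match(line)
        if match:
            names.append(match.group(1))
    return names


# Three made-up exported functions, all valid C declarations.
SAMPLE = """
EXPORT int example_init(int argc, char **argv);
EXPORT void
example_free(void *thing);
EXPORT void (*example_get_handler(int kind))(int);
"""

print(exported_symbols(SAMPLE))
```

Only example_init is found. The declaration split across two lines and the one returning a function pointer are both silently missed, so the generated symbol list would be incomplete without any error.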
My hope for a solution to this is to just quit generating sdksyms.c entirely. Using ELF sections, we can convince the linker to not garbage collect symbols that it thinks are unused. Then we get to just decorate symbols with XORG_EXPORT or XORG_EXPORT_VAR (unfortunately have to have separate sections for RO vs RW contents), and Xorg will have the correct set of symbols exported. I started on a branch for this, ajax got it actually building, and now we just need to bash the prototypes so that the same set of symbols are exported before/after the conversion.
|Monday, March 20th, 2017|
|This week in vc4 (2017-03-20): HDMI Audio, ARM CLCD, glamor
The biggest news this week is that I've landed the VC4 HDMI audio driver from Boris in drm-misc-next. The alsa-lib maintainers took the patch today for the automatic configuration file, so now we just need to get that backported to various distributions and we'll have native HDMI audio in 4.12.
My main project, though, was cleaning up the old ARM CLCD controller driver. I've cut the driver from 3500 lines of code to 1000, and successfully tested it with VC4 doing dmabuf sharing to it. It was surprising how easy the dmabuf part actually was, once I had display working.
Still to do for display on this platform is to get CLCD merged, get my kmscube patches merged (letting you do kmscube on a separate display vs gpu platform), and clean up my xserver patches. The xserver changes are to let you have a 3D GPU with no display connectors as the protocol screen and the display controller as its sink. We had infrastructure for most of this for doing PRIME in laptops, except that laptop GPUs all have some display connectors, even if they're not actually connected to anything.
I also had a little time to look into glamor bugs. I found a little one-liner to fix dashed lines (think selection rectangles in the GIMP), and debugged a report of how zero-width lines are broken. I'm afraid for VC4 we're going to need to disable X's zero-width line acceleration. VC4 hardware doesn't implement the diamond-exit rule, and actually has some workarounds in it for the way people naively *try* to draw lines in GL that thoroughly breaks X11.
In non-vc4 news, the SDHOST driver is now merged upstream, so it should be in 4.12. It won't quite be enabled in 4.12, due to the absurd merge policies of the arm-soc world, but at least the Fedora and Debian kernel maintainers should have a much easier time with Pi3 support at that point.
|Monday, March 13th, 2017|
|This month in vc4 (2017-03-13): docs, porting, tiled display
It's been a while since I've posted a status update. Here are a few things that have been going on.
VC4 now generates HTML documentation from the comments in the kernel code. It's just a start, but I'm hoping to add more technical details of how the system works here. I don't think doxygen-style documentation of individual functions would be very useful (if you're calling vc4 functions, you're editing the vc4 driver, unlike library code), so I haven't pursued that.
I've managed to get permission to port my 3D driver to another platform, the 911360 enterprise phone that's already partially upstreamed. It's got a VC4 V3D in it (slightly newer than the one in Raspberry Pi), but an ARM CLCD controller for the display. The 3D came up pretty easily, but for display I've been resurrecting Tom Cooksey's old CLCD KMS driver that never got merged. After 3+ years of DRM core improvements, we get to delete half of the driver he wrote and just use core helpers instead. Completing this work will require that I get the dma-buf fencing to work right in vc4, which is something that Android cares about.
Spurred on by a report from one of the Processing developers, I've also been looking at register allocation troubles again. I've found one good opportunity for reducing register pressure in Processing by delaying FS/VS input loads until they're actually used, and landed and then reverted one patch trying to accomplish it. I also took a look at DEQP's register allocation failures again, and fixed a bunch of its testcases with a little scheduling fix.
I've also started on a fix to allow arbitrary amounts of CMA memory to be used for vc4. The 256MB CMA limit today is to make sure that the tile state buffer and tile alloc/overflow buffers are in the same 256MB region due to a HW addressing bug. If we allocate a single buffer that can contain tile state, tile alloc, and overflow all at once within a 256MB area, then all the other buffers in the system (texture contents, particularly) can go anywhere in physical address space. The set top box group did some testing, and found that of all of their workloads, 16MB was enough to get any job completed, and 28MB was enough to get bin/render parallelism on any job. The downside is that a job that's too big ends up just getting rejected, but we'll surely run out of CMA for all your textures before someone hits the bin limit by having too many vertices per pixel.
Eben recently pointed out to me that the HVS can in fact scan out tiled buffers, which I had thought it wasn't able to. Having our window system buffers be linear means that when we try to texture from them (your compositing manager's drawing and uncomposited window movement, for example) we have to first do a copy to the tiled format. The downside of tiled scanout is that, in my measurements, we lose up to 1.4% of system performance (as measured by memcpy bandwidth tests with the 7" LCD panel for display) due to the HVS being inefficient at reading from tiled buffers (since it no longer gets to do large burst reads). However, this is probably a small price to pay for the massive improvement we get in graphics operations, and that cost could be reduced by X noticing that the display is idle and swapping to a linear buffer instead. Enabling tiled scanout is currently waiting for the new buffer modifiers patches to land, which Ben Widawsky at Intel has been working on.
There's no update on HDMI audio merging -- we're still waiting for any response from the ALSA maintainers, despite having submitted the patch twice and pinged on IRC.
|Monday, February 13th, 2017|
|These weeks in vc4 (2017-02-13): HDMI audio, DSI, 3D stability
Probably the biggest news of the last two weeks is that Boris's native HDMI audio driver is now on the mailing list for review. I'm hoping that we can get this merged for 4.12 (4.10 is about to be released, so we're too late for 4.11). We've tested stereo audio so far, no compressed audio (though I think it should Just Work), and >2 channel audio should be relatively small amounts of work from here. The next step on HDMI audio is to write the alsa-lib configuration snippets necessary to hide the weird details of HDMI audio (stereo IEC958 frames required) so that sound playback works normally for all existing userspace, which Boris should have a bit of time to work on still.
I've also landed the vc4 part of the DSI driver in the drm-misc-next tree, along with a fixup. Working with the new drm-misc-next trees for vc4 submission is delightful -- once I get a review, I can just push code to the tree and it will magically (through Daniel Vetter's pull requests and some behind-the-scenes tools) go upstream at the appropriate time. I am delighted with the work that Daniel has been doing to make the DRM subsystem more welcoming of infrequent contributors and reducing the burden on us all of participating in the Linux kernel process.
In 3D land, the biggest news is that I've fixed a kernel oops (producing a desktop lockup) when CMA returns out of memory. Unfortunately, VC4 doesn't have an MMU, so we require that memory allocations for graphics be contiguous (using the Contiguous Memory Allocator), and to make things worse we have a limit of 256MB for CMA due to an addressing bug of the V3D part, so CMA returning out of memory is a serious and unfortunately frequent problem. I had a bug where a CMA failure would put the failed allocation into the BO cache, so if you asked for a BO of the same size soon enough, you'd get the bad BO back and crash. I've been unable to construct a good minimal testcase for it, but the patch is on the mailing list and in the rpi-4.4.y tree now.
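The shape of that bug can be shown with a toy size-keyed BO cache. This is a simplified Python model under assumptions, not the kernel code: the class names and the guard in release() are hypothetical stand-ins for whatever the real fix does.

```python
class BO:
    """Toy buffer object. paddr is None if contiguous allocation failed."""

    def __init__(self, size, paddr):
        self.size = size
        self.paddr = paddr


class BOCache:
    """Toy cache that recycles freed BOs of the same size."""

    def __init__(self, allocator):
        self.allocator = allocator  # callable: size -> address, or None on OOM
        self.free_by_size = {}

    def alloc(self, size):
        cached = self.free_by_size.get(size)
        if cached:
            return cached.pop()     # reuse a previously released BO
        paddr = self.allocator(size)
        if paddr is None:
            return None             # report out-of-memory to the caller
        return BO(size, paddr)

    def release(self, bo):
        if bo.paddr is None:
            return                  # never cache a BO with no backing memory
        self.free_by_size.setdefault(bo.size, []).append(bo)
```

Without the check in release(), an error path that releases the failed BO would leave it in the cache, and the next alloc() of the same size would hand it back.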
I've also fixed a bug in the downstream tree's "firmware KMS" mode (where I use the closed source firmware for display, but open source 3D) where fullscreen 3D rendering would get a single frame displayed and then freeze.
In userspace, I've fixed a bit of multisample rendering (copies of MSAA depth buffers) that gets more of our regression testing working, and worked on some potential rasterization problems (the texwrap regression tests are failing due to texture filtering troubles, and I'm not sure if we're actually doing anything wrong or not because we're near the cutoff for the "how accurate does the filtering have to be?").
Coming up, I'm looking at fixing a cursor updates bug that Michael Zoran has found, and fixing up the DSI panel driver so that it can hopefully get into 4.12. Also, after a recent discussion with Eben, I've realized that we can actually scan out tiled buffers from the GPU, so we may get big performance wins for un-composited X if I can have Mesa coordinate with the kernel on producing tiled buffers.
|Monday, January 30th, 2017|
|This week in vc4 (2017-01-30): Camera, downstream backport, 3D performance.
Most of last week was spent switching my development environment over to a setup with no SD cards involved at all. This was triggered by yet another card failing, and I spent a couple of days off and on trying to recover it. I now have three scripts that build and swap my test environment between upstream 64-bit Pi3, upstream 32-bit Pi3, and downstream 32-bit Pi3, using just the closed source bootloader without u-boot or an SD card. Previously I was on Pi2 only (much slower for testing), and running downstream kernels was really difficult.
Once I got the new netboot system working, I tested and landed the NEON part of my tiling work (the big 208% download and 41% upload performance improvements). I'm looking forward to fixing up the clever tiling math parts soon.
I also tested and landed a few little compiler improvements. The nicest compiler improvement was turning on a switch that Marek added to gallium: We now run the GLSL compiler optimization loop exactly once (because it's required to avoid GLSL linker regressions), and rely on NIR to do the actual optimization after the GLSL linker has run. The GLSL IR is a terrible IR for doing optimization on (only a bit better than Mesa or TGSI IRs), and it's made worse by the fact that I wrote a bunch of its optimizations back before we had good datastructures available in Mesa and before we had big enough shaders that using good datastructures mattered. I'm horrified by my old code and can't wait to get it deleted (Timothy Arceri has been making progress on that front). Until we can actually delete it, though, cutting down the number of times we execute my old optimization passes should improve our compile times on complicated shaders.
Now that I have a good way to test the downstream kernel, I went ahead and made a giant backport of our current vc4 kernel code to the 4.9 branch. I hope the Foundation can get that branch shipped soon -- backporting to 4.9 is so much easier for me than old 4.4, and the structure of the downstream DT files makes it much clearer what there is left to be moved upstream.
Meanwhile, Michael Zoran has continued hacking on the staging VCHI code, and kernel reviewers were getting bothered by edits to code with no callers. Michael decided to solve that by submitting the old HDMI audio driver that feeds through the closed source firmware (I'm hoping we can delete this soon once Boris's work lands, though), and I pulled the V4L2 camera driver out of rpi-4.9.y and submitted that to staging as well. I unfortunately don't have the camera driver quite working yet, because when I modprobe it the network goes down. There are a ton of variables that have changed since the last time I ran the camera (upstream vs downstream, 4.10 vs 4.4, pi3 vs pi2), so it's going to take a bit of debugging before I have it working again.
Other news: kraxel from RH has resubmitted the SDHOST driver upstream, so maybe we can have wifi by default soon. Baruch Siach has submitted some fixes that I suspect get BT working. I've also passed libepoxy off to Emmanuele Bassi (long time GNOME developer) who has fixed it up to be buildable and usable on Linux again and converted it to the Meson build system, which appears to be really promising.
|Tuesday, January 24th, 2017|
|This week in vc4 (2017-01-24): LCA, DSI, HDMI audio
I spent last week at linux.conf.au 2017 (the best open source conference), mostly talking to people and going to talks instead of actually hacking on vc4. I'd like to highlight a few talks you could go watch if you weren't there.
Jack Moffit gave an overview of what Servo is doing with Rust on parallel processing and process isolation in web engines. I've been hacking on this project in my spare time, and the community is delightful to work with. Check out his talk to see what I've been excited about recently.
I then spent the next session in the hallway chatting with him instead of going to Willy's excellent talk about how the Linux page cache works. One of the subjects of that talk makes me wonder: I've got these contiguous regions of GPU memory on VC4, could I use huge pages when we're mapping them for uploads to the GPU? Would they help enough to matter?
Dave Airlie gave a talk about the structure of the Vulkan drivers and his work with Bas on the Radeon Vulkan driver. It's awesome how quickly these two have been able to turn around a complete Vulkan driver, and it really validates our open source process and the stack we've built in Mesa.
Nathan Egge from Mozilla gave a talk on accelerating JPEG decode using the GPU. I thought it was going to be harder to do than he made it out to be, so it was pretty cool to see the performance improvements he got. The only sad part was that it isn't integrated into WebRender yet as far as I can see.
Karen Sandler talked about the issue of what happens to our copyrights when we die, from experience in working with current project maintainers on relicensing efforts. I think this is a really important talk for anyone who writes copyleft software. It actually softened my stance on CLAs (!) but is mostly a call to action for "Nobody has a good process yet for dealing with our copyrights after death, let's work on this."
And, finally, I think the most important talk of the conference to me was Daniel Vetter's Maintainers Don't Scale (no video yet, just a blog post). The proliferation of boutique trees in the kernel with "overworked" maintainers either ignoring or endlessly bikeshedding the code of outside contributors is why I would never work on the Linux kernel if I wasn't paid to do so. Daniel actually offers a way forward from the current state of maintainership in the kernel, with group maintainership (a familiar model in healthy large projects) to replace DRM's boutique trees and a reinforcement of the cooperative values that I think are why we all work on free software. I would love to be able to shut down my boutique DRM and ARM platform trees.
In VC4, I did still manage to write a couple of kernel fixes in response to a security report, and rebase DSI on 4.10 and respin some of its patches for resubmission. I'm still hoping for DSI in 4.11, but as usual it depends on DT maintainer review (no response for over a month). I also spent a lot of time with Keith talking about ARM cache coherency questions and trying experiments on my NEON tiling code.
While I was mostly gone last week, though, Boris has been working on finishing off my HDMI audio branch. He now has a branch that has a seriously simplified architecture compared to what I'd talked about with larsc originally (*so* much cleaner than what I was modeling on), and he has it actually playing some staticky audio. He's now debugging what's causing the static, including comparing dumps of raw audio between VC4 and the firmware, along with the register settings of the HDMI and DMA engines.
|Monday, January 16th, 2017|
|These weeks in vc4 (2017-01-16): NEON acceleration of tiling
Last week I was on vacation, but the week before that I did some more work on figuring out Intel's Mesa CI system, and Mark Janes has started working on some better documentation for it. I now understand better how the setup is going to work, but haven't made much progress on actually getting a master running yet.
More fun, though, was finally taking a look at optimizing the tiled texture load/store code. This got started with Patrick Walton tweeting a link to a blog post on clever math for implementing texture tiling, given a couple of assumptions.
As with all GPUs these days, VC4 swizzles textures into a tiled layout so that when a cacheline is loaded from memory, it will cover nearby pixels in the Y direction as well as X, so that you are more likely to get cache hits for your neighboring pixels in drawing. And the tiling tends to be multiple levels, so that nearby cachelines are also nearby on the screen, reducing DRAM access latency.
For small textures on VC4, we have a single level of tiling: 4x4@32bpp blocks ("utiles") of raster order data, themselves arranged in up to a 4x4 block, and this mode is called LT. Once things go over that size, we go to T tiling, where we call a 4x4 LT block a subtile (1KB of data), and arrange subtiles in either clockwise or counterclockwise order within a 2x2 block (a tile, which at this point is now 4KB of data), and tiles themselves are arranged left-to-right, then right-to-left, then left-to-right again.
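For the LT case, the address of a pixel can be sketched in a few lines of Python. This assumes 32bpp and that the utiles themselves are stored in raster order (left to right, then top to bottom), which the description above implies but does not spell out; lt_offset is a made-up name, not a function from the driver.

```python
UTILE_W = 4                              # utile width in pixels at 32bpp
UTILE_H = 4                              # utile height in pixels
CPP = 4                                  # bytes per pixel at 32bpp
UTILE_BYTES = UTILE_W * UTILE_H * CPP    # 64 bytes per utile


def lt_offset(x, y, width_px):
    """Byte offset of pixel (x, y) in an LT-tiled 32bpp image."""
    utiles_per_row = (width_px + UTILE_W - 1) // UTILE_W
    # Which utile the pixel falls in (assumed raster order of utiles).
    utile_index = (y // UTILE_H) * utiles_per_row + (x // UTILE_W)
    # Raster-order position of the pixel inside its utile.
    within = (y % UTILE_H) * UTILE_W * CPP + (x % UTILE_W) * CPP
    return utile_index * UTILE_BYTES + within
```

In a 16-pixel-wide image, pixel (4, 0) lands at byte 64 rather than byte 16, because the whole first utile (rows 0 to 3 of columns 0 to 3) is stored before it. That is the locality the tiling is buying.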
The first thing I did was implement the blog post's clever math for LT textures. One of the nice features of the math trick is that it means I can do partial utile updates finally, because it skips around in the GPU's address space based on the CPU-side pixel coordinate instead of trying to go through GPU address space to update a 4x4 utile at a time. The downside of doing things this way is that jumping around in GPU address space means that our writes are unlikely to benefit from write combining, which is usually important for getting full write performance to GPU memory. It turned out, though, that the math was so much more efficient than what I was doing that it was still a win.
However, I found out that the clever math made reads slower. The problem is that, because we're write-combined, reads are uncached -- each load is a separate bus transaction to main memory. Switching from my old utile-at-a-time load using memcpy to the new code meant that instead of doing 4 loads using NEON to grab a row at a time, we were now doing 16 loads of 32 bits at a time, for what added up to a 30% performance hit.
Reads *shouldn't* be something that people do much, except that we still have some software fallbacks in X on VC4 ("core" unantialiased text rendering, for example, which I need to write the GLSL 1.20 shaders for), which involve doing a read/modify/write cycle on a texture. My first attempt at fixing the regression was just adding back a fast path that operates on a utile at a time if things are nicely utile-aligned (they generally are). However, some forced inlining of functions per cpp that I was doing for the unaligned case meant that the glibc memcpy call now got inlined back to being non-NEON, and the "fast" utile code ended up not helping loads.
Relying on details of glibc's implementation (their tradeoff for when to do NEON loads) and of gcc's implementation (when to go from memcpy calls to inlined 32-bits-at-a-time code) seems like a pretty bad idea, so I decided to finally write the NEON acceleration that Eben and I have talked about several times.
My first hope was that I could load a full cacheline with NEON's VLD. VLD1 only loads up to 1 "quadword" (16 bytes) at a time, so that doesn't seem like much help. VLD4 can load 64 bytes like we want, but it also turns AOS data into SOA in the process, and there's no corresponding "SOA-back-to-AOS store 8 or 16 bytes at a time" like we need to do to get things back into the CPU's strided representation. I tried VLD4+VST4 into a temporary, then doing my old untiling path on the cached temporary, but that still left me a few percent slower on loads than not doing any of this work at all.
Finally, I hit on using the VLDM instruction. It seems to be intended for stack loads/stores, but we can also use it to get 64 bytes of data in from memory untouched into NEON registers, and then I can use 4 (32bpp) or 8 (8 or 16bpp) VST1s to store it to the CPU side. With this, we get a 208.256% +/- 7.07029% (n=10) improvement to GetTexImage performance at 1024x1024. Doing the same NEON code for stores gave a 41.2371% +/- 3.52799% (n=10) improvement, probably mostly due to not calling into memcpy and having it go through its size/alignment-based memcpy path choosing process.
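The data movement (not the NEON code itself) can be modelled in Python as one 64-byte read from the tiled side followed by one store per utile row into the strided CPU image. The function name and the row_bytes parameter are invented for this sketch.

```python
UTILE_BYTES = 64


def load_utile(gpu_mem, gpu_offset, cpu_buf, cpu_offset, cpu_stride,
               row_bytes=16):
    """Copy one utile out of tiled memory into a linear, strided image.

    Returns the number of row stores performed."""
    # A single 64-byte read from the GPU side (the role VLDM plays above).
    utile = bytes(gpu_mem[gpu_offset:gpu_offset + UTILE_BYTES])
    stores = 0
    # One store per utile row into the CPU image (the role of each VST1).
    for row in range(UTILE_BYTES // row_bytes):
        src = utile[row * row_bytes:(row + 1) * row_bytes]
        dst = cpu_offset + row * cpu_stride
        cpu_buf[dst:dst + row_bytes] = src
        stores += 1
    return stores
```

With 16-byte rows this does four stores per utile, and with row_bytes=8 it does eight. The point of the model is that the uncached side is touched once per utile, instead of once per pixel or once per row.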
I'm not yet hitting full memory bandwidth, but this should be a noticeable improvement to X, and it'll probably help my piglit test suite runtime as well. Hopefully I'll get the code polished up and landed next week when I get back from LCA.
|Monday, January 2nd, 2017|
|This week in vc4 (2017-01-02): thread switching, CI
The 3DMMES test has been thoroughly instruction count limited, so any wins we can get on code generation translate pretty directly into performance gains. Last week I decided to work on fixing up Jonas's patch to schedule instructions in the delay slots of thread switching, which can save us 3 instructions per texture sample.
Thread switching, you may recall, is a trick in the fragment shader to hide texture fetch latency by cooperatively switching to another fragment shader instance after you request a texture fetch, so that it can make some progress when you'd probably be blocked anyway. This lets us better occupy the ALUs, at the cost of each shader needing to fit into half of the physical register file.
However, VC4 doesn't cancel instructions in the pipeline when we request a switch (same as for within-shader branching), so 3 more instructions from the current shader get executed. For my first implementation of thread switching, I just dumped in 3 NOPs after each THRSW to fill the delay slots. This was still a massive win over not threading, so it was good enough.
Jonas's patch tries to fill the delay slots after we schedule in the thread switch by trying to extract the THRSW signal off of the instruction we scheduled and move it up to 2 instructions before that, and then only add enough NOPs to get us 3 slots filled. There was a little bug (it re-scheduled the thrsw instruction instead of a NOP in trying to pad out to delay slots), but it basically worked and got us a 1.55% instruction count win on shader-db.
The problem was that he was scheduling ALU operations along with the thrsw, and if the thrsw signal was alone without an ALU operation in it, after moving the thrsw up we'd have a NOP left in the old location. I wrote a followon patch to fix that: We now only schedule thrsws on their own without ALU operations, insert the THRSW as early as we can, and then add NOPs as necessary to fill remaining delay slots. This was another 0.41% instruction count win.
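A toy model of the two strategies, in Python, may help. It is only a sketch: instructions are plain strings, "tex" stands for the texture request, and every other op is assumed to be safe to sit in a delay slot, which the real scheduler has to check. The function names are made up.

```python
DELAY_SLOTS = 3   # instructions that still execute after a THRSW
MAX_HOIST = 2     # the thrsw moves up by at most 2 instructions
BARRIERS = {"tex", "thrsw", "nop"}   # ops the thrsw is never moved above


def pad_with_nops(program):
    """First implementation: fill all delay slots with NOPs."""
    out = []
    for op in program:
        out.append(op)
        if op == "thrsw":
            out.extend(["nop"] * DELAY_SLOTS)
    return out


def hoist_thrsw(program):
    """Move each thrsw above up to MAX_HOIST earlier instructions so they
    occupy its delay slots, then pad the remaining slots with NOPs."""
    out = []
    for op in program:
        if op != "thrsw":
            out.append(op)
            continue
        hoisted = 0
        while (hoisted < MAX_HOIST and hoisted < len(out)
               and out[-1 - hoisted] not in BARRIERS):
            hoisted += 1
        out.insert(len(out) - hoisted, "thrsw")
        out.extend(["nop"] * (DELAY_SLOTS - hoisted))
    return out
```

For the program tex, a, b, thrsw, c the padded version is 8 instructions long, while the hoisted version is tex, thrsw, a, b, nop, c: 6 instructions, with a and b doing useful work in the delay slots.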
This isn't as good as it could be. Maybe we don't fill the slots as well as we could before choosing to schedule thrsw, but instructions we choose to schedule after that could be fit in. Those would be tricky because we have to check that they don't write flags or the accumulators (which wouldn't be preserved across the thrsw) or new texture coordinates. We also don't put the THRSW at any particular point in the timeline between sampler request and results collection. We might be able to get wins by trying to put thrsw at the lowest-register-pressure point between them, so that fewer things need to be forced into physical regs instead of accumulators.
Those will be projects for later. It's probably much more important that we figure out how to schedule 2-4 samples with a single thrsw, instead of doing a thrsw per sample like we do today.
The other project last week was starting to build up a plan for vc4 CI. We've had a few regressions to vc4 in Mesa master because developers (including me!) don't do testing on vc4 hardware on every commit. The Intel folks have a lovely CI system that does piglit, DEQP, and performance regression testing for them, both tracking Mesa master and doing pre-commit tests of voluntarily submitted branches. I despaired of the work needed to build something that good, but apparently they've open sourced their configuration, so I'll be trying to replicate that in the next few weeks. This week I worked on getting the hardware and a plan for where to host it. I'm looking forward to a bright future of 30-minute test runs on the actual hardware and long-term tracking of performance metrics. Unfortunately, there are no docs for it and I've never worked with Jenkins before, so this is going to be a challenge.
Other things: I submitted a patch to mm/ to shut up the CMA warning we've been suffering from (and patching away downstream) for years, got exactly the expected response ("this dmesg spew in a common path might be useful for somebody debugging something some day, so it should stay there"), so hopefully we can just get Fedora and Debian to patch it out instead. This is yet another data point in favor of Matthew Garrett's plan of "write kernel patches you need, don't bother submitting them upstream". Started testing a new patch to reduce error return rate on vc4 memory allocations on upstream kernel, haven't confirmed that it's working yet. I also spent a lot of time reviewing tarceri's patches to Mesa preparing for the on-disk shader cache, which should help improve app startup times and reduce memory consumption.
|Monday, December 19th, 2016|
|This week in VC4 (2016-12-19): DSI, SDTV
Two big successes last week.
One is that Dave Airlie has pulled Boris's VEC (SDTV) code for 4.10. We didn't quite make it for getting the DT changes in time to have full support in 4.10, but the DT changes will be a lot easier to backport than the driver code.
The other is that I finally got the DSI panel working, after 9 months of frustrating development. It turns out that my DSI transactions to the Toshiba chip aren't working, but if I use I2C to the undocumented Atmel microcontroller to relay to the Toshiba in one of the 4 possible orderings of the register sequence, the panel comes up. The Toshiba docs say I need to do I2C writes at the beginning of the poweron sequence, but the closed firmware uses DSI transactions for the whole sequence, making me suspicious that the Atmel has already done some of the Toshiba register setup.
I've now submitted the DSI driver, panel, and clock patches after some massive cleanup. It even comes with proper runtime power management for both the panel and the VC4 DSI module.
The next steps in modesetting land will be to write an input driver for the panel, do any reworks we need from review, and get those chunks merged in time for 4.11. While I'm working on this, Boris is now looking at my HDMI audio code so that hopefully we can get that working as well. Eben is going to poke Dom about fixing the VC4 driver interop with media decode. I feel like we're now making rapid progress toward feature parity on the modesetting side of the VC4 driver.
|Monday, December 5th, 2016|
|This week in vc4 (2016-12-05): SDTV, 3DMMES, HDMI audio, DSI
The Raspberry Pi Foundation recently started contracting with Free Electrons to give me some support on the display side of the stack. Last week I got to review and release their first big piece of work: Boris Brezillon's code for SDTV support. I had suggested that we use this as the first project because it should have been small and self contained. It ended up that we had some clock bugs Boris had to fix, and a bug in my core VC4 CRTC code, but he got a working patch series together shockingly quickly. He did one respin for a couple more fixes once I had tested it, and it's now out on the list waiting for devicetree maintainer review. If nothing goes wrong, we should have composite out support in 4.11 (we're probably a week late for 4.10).
On my side, I spent some more time on HDMI audio and the DSI panel. On the audio side, I'm now emitting the GCP packet for audio mute appropriately (I think), and with some more clocking fixes it's now accepting the audio data at the expected rate. On the DSI front, I fixed a bit of sequencing and added debugfs for the registers like we have in our other encoders. There's still no actual audio coming out of HDMI, and only white coming out of the panel.
The DSI situation is making me wish for someone else's panel that I could attach to the connector, so I could see if my bugs are in the Atmel bridge programming or in the DSI driver.
I did some more analysis of 3DMMES's shaders, and improved our code generation, for wins of 0.4%, 1.9%, 1.2%, 2.6%, and 1.2%. I also experimented with changing the UBO (indirect addressed uniform array) upload path, which showed no effect. 3DMMES's uniform arrays are tiny, though, so it may be a win in some other app later.
I also got a couple of new patches from Jonas Pfeil. I went through his thread switch delay slots patch, which is pretty close to ready. He has a similar patch for branching delay slots, though apparently that one isn't showing wins yet in things he's tested. Perhaps most exciting, though, is that he went and implemented an idea I had dropped on github: replacing our shadow copies of raster textures with a bunch of math in the shader and using general memory loads. This could potentially fix X performance without a compositor, which we otherwise really don't have a workaround for other than "use a compositor." It could also improve GL-in-a-window performance: right now all of our DRI surfaces are raster format, because we want to be able to get full screen pageflipping, but that means we do the shadow copy if they aren't fullscreen. Hopefully this week I'll get a chance to test and review it.
|Monday, November 28th, 2016|
|This week in vc4 (2016-11-28): performance tuning
I missed last week's update, but with the holiday it ended up being a short week anyway.
The multithreaded fragment shaders are now in drm-next and Mesa master. I think this was the last big win for raw GL performance and we're now down to the level of making 1-2% improvements in our commits. That is, unless we're supposed to be using double-buffer non-MS mode and the closed driver was just missing that feature. With the glmark2 comparisons I've done, I'm comfortable with this state, though. I'm now working on performance comparisons for 3DMMES Taiji, which the HW team often uses as a benchmark. I spent a day or so trying to get it ported to the closed driver and failed, but I've got it working on the open stack and have written a couple of little performance fixes with it.
The first was just a regression fix from the multithreading patch series, but it was impressive that multithreading hid a 2.9% instruction count penalty and still showed gains.
One of the new fixes I've been working on is folding ALU operations into texture coordinate writes. This came out of frustration from the instruction selection research I had done the last couple of weeks, where all algorithms seemed like they would still need significant peephole optimization after selection. I finally said "well, how hard would it be to just finish the big selection problems I know are easily doable with peepholes?" and it wasn't all that bad. The win came out to about 1% of instructions, with a similar benefit to overall 3DMMES performance (it's shocking how ALU-bound 3DMMES is).
I also started on a branch to jump to the end of the program when all 16 pixels in a thread have been discarded. This had given me a 7.9% win on GLB2.7 on Intel, so I hoped for similar wins here. 3DMMES seemed like a good candidate for testing, too, with a lot of discards that are followed by reams of code that could be skipped, including texturing. Initial results didn't seem to show a win, but I haven't actually done any stats on it yet. I also haven't done the usual "draw red where we were applying the optimization" hack to verify that my code is really working, either.
While I've been working on this, Jonas Pfeil (who originally wrote the multithreading support) has been working on a couple of other projects. He's been trying to get instructions scheduled into the delay slots of thread switches, which should help reduce any regressions those two features might have caused. More exciting, he's just posted a branch for doing nearest-filtered raster textures (the primary operation in X11 compositing) using direct memory lookups instead of our shadow-copy fallback. Hopefully I get a chance to review, test, and merge in the next week or two.
On the kernel side, my branches for 4.10 have been pulled. We've got ETC1 and multithread FS for 4.10, and a performance win in power management. I've also been helping out and reviewing Boris Brezillon's work for SDTV output in vc4. Those patches should be hitting the list this week.
|Monday, November 14th, 2016|
|This week in vc4 (2016-11-14): Multithreaded fragment shaders
I dropped HDMI audio last week because Jonas Pfeil showed up with a pair of branches to do multithreaded fragment shaders.
Some context for multithreaded fragment shaders: Texture lookups are really slow. I think we eyeballed them as having a latency of around 20 QPU instructions. We can hide latency sometimes if there's some unrelated math to be done after the texture coordinate calculation but before using the texture sample. However, in most cases, you need the texture sample results right away for the next bit of work.
To allow programs to hide that latency, there's a cooperative hyperthreading mode that a fragment shader can opt into. The shader stays in the bottom half of register space, and before it collects the results of a texture fetch, it issues a thread switch signal, which the hardware will use to run a second fragment shader thread until that one issues its own thread switch. For the second thread, the top bit of the register addresses gets flipped, so the two threads' states don't overlap (except for the accumulators and the flags, which the shaders have to save and restore).
I had delayed working on this because the full solution was going to be really tricky: queue up as many lookups as we can, then thread switch, then collect all those results before switching again, all while respecting the FIFO sizes. However, Jonas's huge contribution here was in figuring out that you don't need to be perfect, you can get huge gains by thread switching between each texture lookup and reading its results.
The upshot was a 0-20% performance improvement on glmark2 and a performance hit to only one testcase. With this we're down to 3 subtests that we're slower on than the closed source driver. Jonas's kernel patch is out on the list, and I rewrote the Mesa series to expose threading to the register allocator and landed all but the enabling patch (which has to wait on the kernel side getting merged). Hopefully I'll finish merging it in a week.
In the process of writing multithreading, Jonas noticed that we were scheduling our TLB_Z writes early in the program, which can cut into fragment shader parallelism between the QPUs (of which we have 12) because a TLB_Z write locks the scoreboard for that 2x2 pixel quad. I merged a patch to Mesa that pushes the TLB_Z down to the bottom, at the cost of a single extra QPU instruction. Some day we should flip QPU scheduling around so that we pair that TLB_Z up better, and fill our delay slots in thread switches and branching as well.
I also tracked down a major performance issue in Raspbian's desktop using the open driver. I asked them to use xcompmgr a while back, because readback from the front buffer using the GPU is really slow (the texture unit can't read from raster textures, so we have to reformat them to be readable), and made window dragging unbearable. However, xcompmgr doesn't unredirect fullscreen windows, so full screen GL apps emit copies (with reformatting!) instead of pageflipping.
It looks like the best choice for Raspbian is going to be using compton, an xcompmgr fork that does pageflipping (you get tear free screen updates!) and unredirection of full screen windows (you get pageflipping directly from GL apps). I've also opened a bug report on compton with a description of how they could improve their GL drawing for a tiled renderer like the Pi, which could improve its performance for windowed updates significantly.
Simon ran into some trouble with compton, so he hasn't flipped the default yet, but I would encourage anyone running a Raspberry Pi desktop to give it a shot -- the improvement should be massive.
Other things last week: More VCHIQ patch review, merging more to the -next branches for 4.10 (Martin Sperl's thermal driver, at last!), and a shadow texturing bugfix.
|Monday, November 7th, 2016|
|This week in vc4 (2016-11-07): glmark2, ETC1, power management, HDMI audio
I spent a little bit of time last week on actual 3D work. I finally figured out what was going wrong with glmark2's terrain demo. The RCP (reciprocal) instruction on vc4 is a very rough approximation, and in translating GLSL code doing floating point divides (which we have to convert from a / b to a * (1 / b)) we use a Newton-Raphson step to improve our approximation. However, we forgot to do this for the implicit divide we have to do at the end of your vertex shader, so the triangles in the distance were unstable and bounced around based on the error in the RCP output.
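To make the refinement step concrete, here is a small sketch of one Newton-Raphson iteration on a deliberately coarse reciprocal. The `rough_rcp` helper is a hypothetical stand-in for the hardware instruction (it just truncates the float32 result to a few mantissa bits); the precision it models is an assumption, not a measurement of vc4's RCP.

```python
import struct

def rough_rcp(b: float, mantissa_bits: int = 8) -> float:
    """Hypothetical stand-in for a low-precision hardware reciprocal.

    Computes 1/b, then keeps only the top `mantissa_bits` bits of the
    float32 mantissa. The real instruction's error is not modeled here.
    """
    bits = struct.unpack("<I", struct.pack("<f", 1.0 / b))[0]
    mask = ~((1 << (23 - mantissa_bits)) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

def refine_rcp(b: float, x0: float) -> float:
    """One Newton-Raphson step for f(x) = 1/x - b: x1 = x0 * (2 - b * x0)."""
    return x0 * (2.0 - b * x0)

def divide(a: float, b: float) -> float:
    """a / b expressed as a * (1 / b), with one refinement step."""
    return a * refine_rcp(b, rough_rcp(b))

if __name__ == "__main__":
    b = 3.0
    x0 = rough_rcp(b)
    x1 = refine_rcp(b, x0)
    print("rough error:  ", abs(x0 - 1.0 / b))
    print("refined error:", abs(x1 - 1.0 / b))
```

Each step roughly squares the relative error, which is why a single iteration is enough to calm down a coarse estimate.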
I also finally got around to writing kernel-side validation for ETC1 texture compression. I should be able to merge support for this to v4.10, which is one of the feature checkboxes we were missing compared to the old driver. Unfortunately ETC1 doesn't end up being very useful for us in open source stacks, because S3TC has been around a lot longer and supported on more hardware, so there's not much content using ETC that I know of outside of Android land.
I also spent a while fixing up various regressions that had happened in Mesa while I'd been playing in the kernel. Some day I should work on CI infrastructure on real hardware, but unfortunately Mesa doesn't have any centralized CI infrastructure to plug into. Intel has a test farm, but not all proposed patches get submitted to it, and it (understandably) doesn't have non-Intel hardware.
In kernel land, I sent out a patch to reduce V3D's overhead due to power management. We were immediately shutting off the 3D engine when it went idle, even if userland might be expected to submit a new frame shortly thereafter. This was taking up 1% of the CPU on profiles I was looking at last week, and I've seen it up to 3.5%. Now we keep the GPU on for at least 40ms ("about two frames plus some slack"). If I'm reading right, the closed driver does immediate powerdown as well, so we may be using slightly more power now, but given that power state transitions themselves cost power, and the CPU overhead costs power, hopefully this works out fine anyway (even though we've never done power measurements for the platform).
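The effect of the delay can be sketched with a toy model that counts power-ups for a stream of frames. This is only an illustration with a simulated clock; the class and method names are made up for the example and are not the kernel's interfaces.

```python
class IdlePowerGate:
    """Toy model: keep the engine powered for `delay_ms` after it goes idle."""

    def __init__(self, delay_ms: int = 40):
        self.delay_ms = delay_ms
        self.powered = False
        self.off_deadline = None  # time at which we may power down
        self.power_ups = 0

    def tick(self, now_ms: int) -> None:
        """Power down if the idle deadline has passed."""
        if self.powered and self.off_deadline is not None and now_ms >= self.off_deadline:
            self.powered = False
            self.off_deadline = None

    def job_submitted(self, now_ms: int) -> None:
        """A new job arrives: apply any pending power-down, then power up if needed."""
        self.tick(now_ms)
        if not self.powered:
            self.powered = True
            self.power_ups += 1
        self.off_deadline = None  # busy again, cancel the pending power-down

    def went_idle(self, now_ms: int) -> None:
        """The engine finished its work: schedule the delayed power-down."""
        self.off_deadline = now_ms + self.delay_ms

def count_power_ups(delay_ms: int, frames: int = 60, frame_ms: int = 16, busy_ms: int = 4) -> int:
    gate = IdlePowerGate(delay_ms)
    for i in range(frames):
        start = i * frame_ms
        gate.job_submitted(start)
        gate.went_idle(start + busy_ms)
    return gate.power_ups

if __name__ == "__main__":
    print("immediate power-down:", count_power_ups(0))
    print("40 ms delay:         ", count_power_ups(40))
```

With one frame every 16 ms, immediate power-down pays for a power-up on every frame, while the delayed version powers up once and stays on.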
I also reviewed more patches from Michael Zoran for vchiq. It should now (once all reviewed patches land) work on arm64. We still need to get the vchiq-using drivers like camera and vcsm into staging.
Finally, I made more progress on HDMI audio. One of the ALSA developers, Lars-Peter Clausen, pointed out that ALSA's design is that userspace would provide the IEC958 subframes by using alsalib's iec958 plugin and a bit of asoundrc configuration. He also provided a ton of help debugging that pipeline. I rebased and cleaned up into a branch that uses simple-audio-card as my machine driver, so the whole stack is pretty trivial. Now aplay runs successfully, though no sound has come out. Next step is to go test that the closed stack actually plays audio on this monitor and capture some register state for debugging.
|Monday, October 31st, 2016|
|This week in vc4 (2016-10-31): HDMI audio
Last week was spent almost entirely on HDMI audio support.
I had started the VC4-side HDMI setup last week, and got the rest of the stubbed bits filled out this week.
Next was starting to get the audio side connected. I had originally derived my VC4 side from the Mediatek driver, which makes a child device for the generic "hdmi-codec" DAI codec to attach to. HDMI-codec is a shared implementation for HDMI encoders that take I2S input, which basically just passes audio setup params into the DRM HDMI encoder, whose job is to actually make I2S signals on wires get routed into HDMI audio packets. They also have a CPU DAI registered through device tree that moves PCM data from memory onto wires, and a machine driver registered through devicetree that connects the CPU DAI's wires to the HDMI codec DAI's wires. This is apparently how SoC audio is usually structured.
VC4, on the other hand, doesn't have a separate hardware block for moving PCM data from memory onto wires, it just has the MAI_DATA register in the HDMI block that you put S/PDIF (IEC958) data into using the SoC's generic DMA engine. So I made VC4's HDMI driver register another child device with the MAI_DATA register's address passed in, and that child device gets bound by the CPU DAI driver and uses the ASoC generic DMA engine support. That CPU DAI driver also ends up setting up the machine driver, since I didn't have any other reference to the CPU DAI anywhere. The glue here is all based on static strings I can't rely on, and is a complete hack for now.
Last, notice how I've used "PCM" talking about a working implementation's audio data and "S/PDIF" data for the thing I have to take in? That's the sticking point. The MAI_DATA register wants things that look like the AES3 frames description you can find on wikipedia. I think the HW is generating my preamble. In the firmware's HDMI audio implementation, they go through the unpacked PCM data and set the channel status bit in each sample according to the S/PDIF channel status word (the first 2 bytes of which are also documented on wikipedia). At the same time that they're doing that, they also set the parity bit.
My first thought was "Hey, look, ALSA has SNDRV_PCM_FORMAT_IEC958_SUBFRAME, maybe my codec can say it only takes these S/PDIF frames and everything will work out." No. It turns out aplay won't generate them and neither will gstreamer. To make a usable HDMI audio driver, I'm going to need to generate my IEC958 myself.
Linux is happy to generate my channel status word in memory. I think the hardware is happy to generate my parity bit. That just leaves making the CPU DAI write the bits of the channel status word into the subframes as PCM samples come in to the driver, and making sure that the MSB of the PCM audio data is bit 27. I haven't figured out how to do that quite yet, which is the project for this week.
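As a rough illustration of what that packing involves, the sketch below builds a 32-bit subframe word from a 16-bit PCM sample using the usual AES3 layout (bits 0-3 preamble, 4-27 audio, 28 validity, 29 user data, 30 channel status, 31 parity). The helper names are invented for the example, the preamble is left at zero on the assumption that hardware fills it in, and the parity is computed in software only to show the rule; whether MAI_DATA needs it is exactly the open question above.

```python
def channel_status_bit(status: bytes, frame_index: int) -> int:
    """Pick this frame's bit from a 192-bit channel status block.

    One status bit is sent per frame, repeating every 192 frames.
    Bits are taken LSB-first within each byte (assumed ordering).
    """
    bit = frame_index % 192
    return (status[bit // 8] >> (bit % 8)) & 1

def pack_subframe(sample16: int, status_bit: int) -> int:
    """Pack a signed 16-bit PCM sample into a 32-bit IEC958-style subframe."""
    audio24 = (sample16 & 0xFFFF) << 8      # left-justify 16 bits into 24
    word = audio24 << 4                     # audio occupies bits 4..27 (MSB at bit 27)
    word |= (status_bit & 1) << 30          # channel status bit
    # Even parity over bits 4..31: set bit 31 so the total count of ones is even.
    ones = bin(word >> 4).count("1")
    word |= (ones & 1) << 31
    return word

if __name__ == "__main__":
    status = bytes([0x04]) + bytes(23)      # 24 bytes = 192 bits, arbitrary example value
    for frame, sample in enumerate([0, 0x1234, -1]):
        word = pack_subframe(sample, channel_status_bit(status, frame))
        print(f"frame {frame}: 0x{word:08x}")
```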
Work-in-progress, completely non-functional code is here
|Monday, October 24th, 2016|
|This week in vc4 (2016-10-24): HDMI audio, DSI, simulator, linux-next
The most interesting work I did last week was starting to look into HDMI audio. I've been putting this off because really I'm a 3D developer, not modesetting, and I'm certainly not an audio developer. But the work needs to get done, so here I am.
HDMI audio on SOCs looks like it's not terribly hard. VC4's HDMI driver needs to instantiate a platform device for audio when it gets loaded, and attach a function table to it for calls to be made when starting/stopping audio. Then I write a sound driver that will bind to the node that VC4 created, accepting a parameter of which register to DMA into. That sound driver will tell ALSA about what formats it can accept and actually drive the DMA engine to write audio frames into HDMI. On the VC4 HDMI side, we do a bit of register setup (clocks, channel mappings, and the HDMI audio infoframe) when the sound driver tells us it wants to start.
It all sounds pretty straightforward so far. I'm currently at the "lots of typing" stage of development of this.
Second, I did a little poking at the DSI branch again. I've rebased onto 4.9 and tested an idea I had for how DSI might be failing. Still no luck. There's code on drm-vc4-dsi-panel-restart-cleanup-4.9, and the length of that name suggests how long I've been bashing my head against DSI. I would love for anyone out there with the inclination to do display debug to spend some time looking into it. I know the history there is terrible, but I haven't prioritized cleanup while trying to get the driver working in the first place.
I also finished the simulator cleanup from last week, found a regression in the major functional improvement in the cleanup, and landed what worked. Dave Airlie pointed out that he took a different approach for testing with virgl called vtest, and I took a long look at that. Now that I've got my partial cleanup in place, the vtest-style simulator wrapper looks a lot more doable, and if it doesn't have the major failing that swrast did with visuals then it should actually make vc4 simulation more correct.
I landed a very tiny optimization to Mesa that I wrote while debugging DEQP failures.
Finally, I put together the -next branches for upstream bcm2835 now that 4.9-rc1 is out. So far I've landed a major cleanup of pinctrl setup in the DT (started by me and finished by Gerd Hoffmann from Red Hat, to whom I'm very grateful) and a little USB setup fix. I did a lot of review of VCHIQ patches for staging, and sort of rewrote Linus Walleij's RFC patch for getting descriptive names for the GPIOs in lsgpio.
|Tuesday, October 18th, 2016|
|This week in vc4 (2016-10-18): X performance, DEQP, simulation
The most visible change this week is a fix to Mesa for texture upload performance. A user had reported that selection rectangles in LXDE's file manager were really slow. I brought up sysprof, and it showed that we were doing uncached reads from the GPU in a situation that should have been entirely write-combined writes.
The bug was that when the size of the texture wasn't aligned to utiles, we were loading the previous contents into the temporary buffer before writing the new texture data in and uploading, even if the full texture was being updated. I've fixed it to check for when the full texture is being uploaded, and not do the initial download. This bug was getting hit on almost any Cairo vector graphics operation with its Xlib backend, so hopefully this helps a lot of people's desktops.
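The decision being described can be sketched as a small predicate. This is a simplified model rather than the Mesa code: the function name is made up, and the 4x4 utile size is an assumption that depends on the texture format.

```python
def needs_readback(tex_w: int, tex_h: int, box: tuple,
                   utile_w: int = 4, utile_h: int = 4) -> bool:
    """Decide whether old texture contents must be loaded before an upload.

    `box` is (x, y, width, height) of the region being written.
    The utile dimensions are an assumed example value.
    """
    x, y, w, h = box

    # The fix: a write covering the whole texture never needs the old data,
    # even when the texture size is not a multiple of the utile size.
    if x == 0 and y == 0 and w == tex_w and h == tex_h:
        return False

    # Otherwise, partially covered utiles contain texels we must preserve.
    aligned = (x % utile_w == 0 and y % utile_h == 0 and
               (x + w) % utile_w == 0 and (y + h) % utile_h == 0)
    return not aligned

if __name__ == "__main__":
    print(needs_readback(30, 30, (0, 0, 30, 30)))   # full upload, unaligned size
    print(needs_readback(30, 30, (1, 1, 5, 5)))     # partial, unaligned
    print(needs_readback(32, 32, (4, 4, 8, 8)))     # partial, utile-aligned
```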
I also worked on a cleanup of the simulator mode. I use the closed source simulator regularly as part of my work -- it's fairly accurate to hardware behavior, and allows you to trace what's happening when things go wrong, all with the driver code running like "normal" for apps on my x86 desktop.
However, my simulator support is a little invasive to the driver, replacing vc4 ioctl calls with alternative ioctls to the i915 driver. I started on a series to make vc4_simulator.c take plain VC4 ioctls and translate them, so that the simulator code is entirely contained. The last step I think I have is figuring out which level to put the simulator's copying in and out of window system framebuffers at.
Last, I got DEQP's GLES2 tests up and running on Raspberry Pi. These are approximately equivalent to the official conformance tests. The initial results were quite good -- 18630/19098 (97.5%) passing when run in the piglit framework. I found a couple of fixes to be made in glClear() support, one of which will affect all gallium drivers. Of the remainder, there are a few tests that are failing due to test bugs (we expose extensions that allow features the tests don't expect), but most are failing in register allocation. For the register allocation failures, I see a relatively quick fix that should reduce register pressure in loops.
|Tuesday, October 11th, 2016|
|This week in vc4 (2016-10-11): QPU user shaders, Processing performance
Last week I started work on making hello_fft work with the vc4 driver loaded. hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4). Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.
Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4. The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful). That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.
The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory. For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it. It also makes sure that you bounds-check all reads through the uniforms or texture sampler. For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.
If it was just this one demo, we might be willing to lose support for it in the transition to the open driver. However, some modes of accelerated video decode also use QPU shaders at some stage, so we have to have some sort of solution.
My plan is basically to not do validation and have root-only execution of QPU shaders for now. The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4. VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run).
Dom is now off experimenting in what it will take to redirect video decode's QPU usage onto this code.
The other little project last week was fixing Processing's performance on vc4. It's one of the few apps that is ported to the closed driver, so it's a likely thing to be compared on.
Unfortunately, Processing was clearing its framebuffers really inefficiently. For non-tiled hardware it was mostly OK, with just a double-clear of the depth buffer, but for vc4 its repeated glClear()s (rather than a single glClear with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full screen quad (and a full screen load before doing so) rather than fast clearing.
The solution was to improve the tracking of what buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears together while only updating the colors to be cleared. There's always a tension in this sort of optimization: should the GL driver be clever to work around apps behaving badly, or should it push that off to the app developer (more like Vulkan does)? In this case, tiled renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it. For Processing, I saw an improvement of about 10% on the demo I was looking at.
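A minimal sketch of that kind of tracking is below. It is a toy model, not the driver's data structures: the `Job` class and its fields are invented for the example, and a clear that arrives after drawing simply flushes here, which is cruder than what a real driver might do.

```python
COLOR, DEPTH, STENCIL = 1, 2, 4

class Job:
    """Toy model of a pending frame that coalesces clears issued before any draw."""

    def __init__(self):
        self.cleared = 0            # bitmask of buffers with a pending fast clear
        self.clear_values = {}      # buffer bit -> value to clear to
        self.draw_calls = 0
        self.flushes = 0

    def draw(self) -> None:
        self.draw_calls += 1

    def flush(self) -> None:
        """Submit the job and start a fresh one."""
        self.flushes += 1
        self.cleared = 0
        self.clear_values = {}
        self.draw_calls = 0

    def clear(self, buffers: int, color=None, depth=None, stencil=None) -> None:
        if self.draw_calls:
            # Primitives are already queued; this sketch gives up and flushes.
            self.flush()
        # No draws yet: just merge into the pending clear and update the values.
        self.cleared |= buffers
        if buffers & COLOR:
            self.clear_values[COLOR] = color
        if buffers & DEPTH:
            self.clear_values[DEPTH] = depth
        if buffers & STENCIL:
            self.clear_values[STENCIL] = stencil

if __name__ == "__main__":
    job = Job()
    job.clear(COLOR, color=(0, 0, 0, 1))
    job.clear(DEPTH, depth=1.0)
    job.clear(DEPTH, depth=1.0)          # redundant clear is absorbed
    job.clear(COLOR, color=(1, 0, 0, 1)) # later color wins
    print(job.flushes, job.cleared, job.clear_values)
```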
Sadly, Processing won't be a good comparison for open vs closed drivers even after these fixes. For the closed driver Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers. In its default mode, though, Processing uses GLX, which says that the depth buffer is retained on glXSwapBuffers(). To avoid this extra overhead in its GLX mode, you'd need to use glDiscardFramebufferEXT() right before the swap. Unfortunately, this isn't connected in gallium yet but shouldn't be hard to do once we have an app that we can test.
|Monday, October 3rd, 2016|
|This week in vc4 (2016-10-03): HDMI, X testing and cleanups, VCHIQ
Last week Adam Jackson landed my X testing series, so now at least 3 of us pushing code are testing glamor with every commit. I used the series to test a cleanup of the Render extension related code in glamor, deleting 550 lines of junk. I did this while starting to work on using GL sampler objects, which should help reduce our pixmap allocation and per-draw Render overhead. That patch is failing make check so far.
The big project for the week was tackling some of the HDMI bug reports. I pulled out my bigger monitor that actually takes HDMI input, and started working toward getting it to display all of its modes correctly. We had an easy bug in the clock driver that broke changing to low resolutions: trying to drive the PLL too slow, instead of driving the PLL in spec and using the divider to get down to the clock we want. We needed infoframes support to fix purple borders on the screen. We were doing our interlaced video mode programming pretty much entirely wrong (it's not perfect yet, but 1 of the 3 interlaced modes on my monitor displays correctly and they all display something). Finally, I added support for double-clocked CEA modes (generally the low resolution, low refresh ones in your list. These are probably particularly interesting for RPi due to all the people running emulators).
Finally, one of the more exciting things for me was that I got general agreement from gregkh to merge the VCHIQ driver through the staging tree, and he even put the initial patch together for me. If we add in a couple more small helper drivers, we should be able to merge the V4L2 camera driver, and hopefully get closer to doing a V4L2 video decode driver for upstream. I've now tested a messy series that at least executes "vcdbg log msg." Those other drivers will definitely need some cleanups, though.
|Monday, September 26th, 2016|
|This week in vc4 (2016-09-26): XDC 2016, glamor testing, glamor performance
Last week I spent at XDC 2016 with other graphics stack developers.
I gave a talk on X Server testing and our development model. Since playing around in the Servo project (and a few other github-centric communities) recently, I've become increasingly convinced that github + CI + automatic merges is the right way to go about software development now. I got a warmer reception from the other X Server developers than I expected, and I hope to keep working on building out this infrastructure.
I also spent a while with keithp and ajax (other core X developers) talking about how to fix our rendering performance. We've got two big things hurting vc4 right now: Excessive flushing, and lack of bounds on rendering operations.
The excessive flushing one is pretty easy. X currently does some of its drawing directly to the front buffer (I think this is bad and we should stop, but that *is* what we're doing today). With glamor, we do our rendering with GL, so the driver accumulates a command stream over time. In order for the drawing to the front buffer to show up on time, both for running animations, and for that last character you typed before rendering went idle, we have to periodically flush GL's command stream so that its rendering to the front buffer (if any) shows up.
Right now we just glFlush() every time we might block sending things back to a client, which is really frequent. For a tiling architecture like vc4, each glFlush() is quite possibly a load and store of the entire screen (8MB).
The solution I started back in August is to rate-limit how often we flush. When the server wakes up (possibly to render), I arm a timer. When it fires, I flush and disarm it until the next wakeup. I set the timer to 5ms as an arbitrary value that was clearly not pretending to be vsync. The state machine was a little tricky and had a bug it seemed (or at least a bad interaction with x11perf -- it was unclear). But ajax pointed out that I could easily do better: we have a function in glamor that gets called right before it does any GL operation, meaning that we could use that to arm the timer and not do the arming every time the server wakes up to process input.
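The timer logic, including ajax's suggestion of arming it from the pre-GL-operation hook, can be sketched with a simulated clock. This is an illustration only; the class and method names are made up and do not correspond to functions in the X Server or glamor.

```python
class FlushLimiter:
    """Toy model: flush at most once per interval, and only if GL work happened."""

    def __init__(self, interval_us: int = 5000):
        self.interval_us = interval_us
        self.deadline = None      # None means the timer is disarmed
        self.flushes = 0

    def before_gl_operation(self, now_us: int) -> None:
        """Called right before any GL operation: arm the timer if it is idle."""
        if self.deadline is None:
            self.deadline = now_us + self.interval_us

    def poll(self, now_us: int) -> None:
        """Called from the main loop: flush and disarm once the timer expires."""
        if self.deadline is not None and now_us >= self.deadline:
            self.flushes += 1     # stands in for glFlush()
            self.deadline = None

def simulate(ops: int = 100, spacing_us: int = 100) -> FlushLimiter:
    limiter = FlushLimiter()
    for i in range(ops):
        now = i * spacing_us
        limiter.before_gl_operation(now)
        limiter.poll(now)
    return limiter

if __name__ == "__main__":
    limiter = simulate()
    print("flushes during 100 operations:", limiter.flushes)
```

A burst of 100 operations over roughly 10 ms costs one flush during the burst plus one trailing flush when the last timer fires, instead of a flush per potential block.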
Bounding our rendering operations is a bit trickier. We started doing some of this with keithp's glamor rework: We glScissor() all of our rendering to the pCompositeClip (which is usually the bounds of the current window). For front buffer rendering in X, this is a big win on vc4 because it means that when you update an uncomposited window I know that you're only loading and storing the area covered by the window, not the entire screen. That's a big bandwidth savings.
However, the pCompositeClip isn't enough. As I'm typing we should be bounding the GL rendering around each character (or span of them), so that I only load and store one or two tiles rather than the entire window. It turns out, though, that you've probably already computed the bounds of the operation in the Damage extension, because you have a compositor running! Wouldn't it be nice if we could just reuse that old damage computation and reference it?
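To get a feel for the bandwidth at stake, here is a back-of-the-envelope sketch that counts the tiles a rectangle touches and the bytes loaded and stored for them. The 64x64 tile size and 4 bytes per pixel are assumptions for the example, and the rectangles are made-up values.

```python
def tiles_touched(rect: tuple, tile: int = 64) -> int:
    """Number of tiles overlapped by rect = (x, y, width, height)."""
    x, y, w, h = rect
    tx0, tx1 = x // tile, (x + w - 1) // tile
    ty0, ty1 = y // tile, (y + h - 1) // tile
    return (tx1 - tx0 + 1) * (ty1 - ty0 + 1)

def traffic_bytes(rect: tuple, tile: int = 64, bytes_per_pixel: int = 4) -> int:
    """Bytes moved if every touched tile is loaded once and stored once."""
    return tiles_touched(rect, tile) * tile * tile * bytes_per_pixel * 2

if __name__ == "__main__":
    screen = (0, 0, 1920, 1080)
    window = (200, 150, 800, 600)   # hypothetical terminal window
    glyph = (260, 410, 10, 16)      # hypothetical single character
    for name, rect in [("screen", screen), ("window", window), ("glyph", glyph)]:
        print(f"{name:6s} tiles={tiles_touched(rect):4d} bytes={traffic_bytes(rect)}")
```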
Keith has started on making this possible: First with the idea of const-ifying all the op arguments so that we could reuse their pointers as a cheap cache key, and now with the idea of just passing along a possibly-initialized rectangle with the bounds of the operation. If you do compute the bounds, then you pass it down to anything you end up calling.
Between these two, we should get much improved performance on general desktop apps on Raspberry Pi.
Other updates: Landed opt_peephole_sel improvement for glmark2 performance (and probably mupen64plus), added regression testing of glamor to the X Server, fixed a couple of glamor rendering bugs.
|Monday, September 19th, 2016|
|This week in vc4 (2016-09-19): firmware KMS, apitrace testing
While stuck on an airplane, I put together a repository for apitraces with confirmed good images for driver output. Combined with the piglit patches I pushed, I now have regression testing of actual apps on vc4 (particularly relevant given that I'm working on optimizing one of those apps!)
The flight was to visit the Raspberry Pi Foundation, with the goal of getting something usable for their distro to switch to the open 3D stack. There's still a giant pile of KMS work to do (HDMI audio, DSI power management, SDTV support, etc.), and waiting for all of that to be regression-free will be a long time. The question is: what could we do that would get us 3D, even if KMS isn't ready?
So, I put together a quick branch to expose the firmware's display stack as the KMS display pipeline. It's a filthy hack, and loses us a lot of the important new features that the open stack was going to bring (changing video modes in X, vblank timestamping, power management), but it gets us much closer to the featureset of the previous stack. Hopefully they'll be switching to it as the default in new installs soon.
In debugging while I was here, Simon found that on his HDMI display the color ramps didn't quite match between closed and open drivers. After a bit of worrying about gamma ramp behavior, I figured out that it was actually that the monitor was using a CEA mode that requires limited range RGB input. A patch is now on the list.
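For reference, the difference between the two ranges is a simple scale and offset: limited range RGB maps 0-255 onto 16-235. The sketch below shows that arithmetic on a per-component basis; it only illustrates the conversion and says nothing about how the patch programs the hardware.

```python
def full_to_limited(v: int) -> int:
    """Map a full-range 8-bit component (0..255) to limited range (16..235)."""
    return 16 + (v * 219 + 127) // 255   # rounded integer scale by 219/255

def ramp(convert=lambda v: v, steps: int = 5) -> list:
    """A short gray ramp, optionally passed through a range conversion."""
    return [convert(i * 255 // (steps - 1)) for i in range(steps)]

if __name__ == "__main__":
    print("full range:   ", ramp())
    print("limited range:", ramp(full_to_limited))
```

Sending full-range values to a monitor that expects limited range crushes the darkest and brightest steps of the ramp, which matches the kind of mismatch Simon noticed.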