This week in vc4 (2017-04-10): dmabuf fencing, meson

The big project for the last two weeks has been developing dmabuf fencing support for vc4.  Without dmabuf fences, when passing buffers between devices the user needs to manually wait for the job to finish on one (say, camera snapshot) before letting the other device get started (accumulating GL commands to texture from the camera snapshot).  That means leaving both devices idle for a moment while the CPU accumulates the command stream for the consumer, but the bigger pain is that it requires that the end user manage the synchronization.

With dma-buf fencing in the kernel, a "reservation object" generated by the dma-buf exporter tracks the fences of the various devices using the shared object, and then the device drivers get to look at that list and wait on each other's fences when using it.
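
Roughly, the two halves of that look like this in 4.x kernel terms (a minimal sketch against the reservation_object API; the helper names here are made up):

    #include <linux/dma-fence.h>
    #include <linux/jiffies.h>
    #include <linux/reservation.h>

    /* Producer (vc4 job submit): attach a fence that signals when the
     * render job completes, so importers know what to wait for.  The
     * caller holds the reservation object's lock.
     */
    static void publish_job_fence(struct reservation_object *resv,
                                  struct dma_fence *job_done)
    {
            reservation_object_add_excl_fence(resv, job_done);
    }

    /* Consumer (pl111 pageflip): before scanning out the shared dmabuf,
     * wait on whatever fences other devices have attached to it.
     */
    static long wait_for_producers(struct reservation_object *resv)
    {
            return reservation_object_wait_timeout_rcu(resv,
                                                       true, /* wait_all */
                                                       true, /* interruptible */
                                                       msecs_to_jiffies(1000));
    }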

So far, I've got my reservations and fences being exported from vc4, so that pl111 display can wait for vc4 to be done before actually putting a new pageflip up on the screen.  I haven't quite hooked up the other direction, for camera capture into vc4 display or GL texturing (I don't have a testcase for this, as the current camera driver doesn't expose dmabufs), but it shouldn't be hard.

On the meson front, rendercheck is now converted to meson upstream.  I've made more progress on the X Server:  Xorg is now building, and even successfully executes Xorg -pogo with the previous modesetting driver in place.  The new modesetting driver is failing mysteriously.  With a build hack I got from the meson folks and some work from ajax, the sdksyms script I complained about in my last post isn't used at all on the meson build.  And, best of all, the meson devs have written the code needed for us to not even need the build hack I'm using.

It's so nice to be using a build system that's an actual living software project.

This week in vc4 (2017-03-27): Upstream PRs, more CMA, meson

Last week I sent pull requests for bcm2835 changes for 4.12.  We've got some DT updates for HDMI audio, DSI, and SDHOST, and defconfig changes to enable SDHOST.  The DT changes to actually enable SDHOST (and get wifi working on Pi3) won't land until 4.13.

I also wrote a patch to enable using more than 256MB of CMA memory (and not require any particular alignment).  The 256MB limit was due to a hardware bug: the binner's memory allocations get dereferenced with their top 4 bits set to the top 4 bits of the tile state data array's address.  Given that tile state allocations happen after CL setup (while the binner is running and throwing overflow interrupts), there was no way to guarantee that we could find overflow memory with the top bits matching.

The new solution, suggested by someone from the set top box group, is to allocate a single 16MB to 32MB buffer at HW init time, and return all of those types of allocations out of it, since it turns out you don't need much to complete rendering of any given scene.  I've been mulling over the details of a solution for a while, and finally wrote and tested the patch I wanted (tricky parts included freeing the memory when the hardware was idle, and how to track the lifetimes of the sub-allocations).  Results look good, and I'll be submitting it this week.
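
The shape of the thing is roughly this (hypothetical names; the real patch also has to cope with jobs waiting for a slice to free up):

    #include <linux/bitmap.h>
    #include <linux/bitops.h>
    #include <linux/errno.h>
    #include <linux/mutex.h>
    #include <linux/types.h>

    #define BIN_POOL_SIZE   (16 * 1024 * 1024)
    #define BIN_SLICE_SIZE  (128 * 1024)
    #define BIN_NUM_SLICES  (BIN_POOL_SIZE / BIN_SLICE_SIZE)

    struct bin_pool {
            struct mutex lock;
            dma_addr_t paddr;       /* one CMA allocation made at HW init, so
                                     * every slice shares the same top 4
                                     * address bits */
            DECLARE_BITMAP(used, BIN_NUM_SLICES);
    };

    /* Hand out binner memory (tile state, tile alloc, overflow) from the
     * shared pool instead of from fresh CMA allocations.
     */
    static int bin_pool_alloc_slice(struct bin_pool *pool, dma_addr_t *paddr)
    {
            int slice;

            mutex_lock(&pool->lock);
            slice = find_first_zero_bit(pool->used, BIN_NUM_SLICES);
            if (slice >= BIN_NUM_SLICES) {
                    mutex_unlock(&pool->lock);
                    return -ENOMEM;  /* the too-big job gets rejected */
            }
            set_bit(slice, pool->used);
            mutex_unlock(&pool->lock);

            *paddr = pool->paddr + slice * BIN_SLICE_SIZE;
            return 0;
    }

    /* Called once the job using the slice has completed. */
    static void bin_pool_free_slice(struct bin_pool *pool, dma_addr_t paddr)
    {
            clear_bit((paddr - pool->paddr) / BIN_SLICE_SIZE, pool->used);
    }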

However, I spent most of the week on converting the X Server over to meson.

Meson is a delightful new build system (based around Ninja on Linux) that massively speeds up builds, while also being portable to Windows (unlike autotools generally).  If you've ever tried to build the X stack on Raspberry Pi, you know that autotools is painfully slow.  It's also been the limiting factor for me in debugging my CI scripts for the X Server -- something we'd really like to be doing as we hack on glamor or do refactors in the core.

So far all I've landed in this project is code deletion, as I find build options that aren't hooked up to anything, or code that isn't hooked up to build options.  This itself will speed up our builds, and ajax has been working in parallel on deleting a bunch of code that makes the build messier than it needs to be.  I've also submitted patches for rendercheck converting to meson (as a demo of what the conversion looks like), and I have Xephyr, Xvfb, Xdmx, and Xwayland building in the X Server with meson.

So far the only stumbling block for the meson conversion of the X Server is the X.org sdksyms.c file.  It's the ugliest part of the build -- running the C preprocessor on a generated .c that #includes a bunch of .h files, then running the output of that through awk and trying to parse C using regular expressions.  This is, as you might guess, somewhat fragile.

My hope for a solution to this is to just quit generating sdksyms.c entirely.  Using ELF sections, we can convince the linker to not garbage collect symbols that it thinks are unused.  Then we get to just decorate symbols with XORG_EXPORT or XORG_EXPORT_VAR (unfortunately have to have separate sections for RO vs RW contents), and Xorg will have the correct set of symbols exported.  I started on a branch for this, ajax got it actually building, and now we just need to bash the prototypes so that the same set of symbols are exported before/after the conversion.
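
The core of the trick is just a pair of attribute macros (a sketch of the idea, not the final X server patch):

    /* Mark an SDK symbol so it survives linker garbage collection: the
     * "used" attribute keeps the compiler from dropping it, and parking
     * it in a dedicated ELF section gives the linker something to KEEP()
     * wholesale.  Read-only and writable contents can't share a section,
     * hence the second macro.
     */
    #define XORG_EXPORT \
            __attribute__((used, section("xorg_sdk_text"), visibility("default")))
    #define XORG_EXPORT_VAR \
            __attribute__((used, section("xorg_sdk_data"), visibility("default")))

    XORG_EXPORT int
    xf86SomeSdkEntryPoint(int arg)      /* hypothetical SDK function */
    {
            return arg;
    }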

This week in vc4 (2017-03-20): HDMI Audio, ARM CLCD, glamor

The biggest news this week is that I've landed the VC4 HDMI audio driver from Boris in drm-misc-next.  The alsa-lib maintainers took the patch today for the automatic configuration file, so now we just need to get that backported to various distributions and we'll have native HDMI audio in 4.12.

My main project, though, was cleaning up the old ARM CLCD controller driver.  I've cut the driver from 3500 lines of code to 1000, and successfully tested it with VC4 doing dmabuf sharing to it.  It was surprising how easy the dmabuf part actually was, once I had display working.

Still to do for display on this platform is to get CLCD merged, get my kmscube patches merged (letting you run kmscube when your display and GPU are separate devices), and clean up my xserver patches.  The xserver changes are to let you have a 3D GPU with no display connectors as the protocol screen and the display controller as its sink.  We had infrastructure for most of this for doing PRIME in laptops, except that laptop GPUs all have some display connectors, even if they're not actually connected to anything.

I also had a little time to look into glamor bugs.  I found a little one-liner to fix dashed lines (think selection rectangles in the GIMP), and debugged a report of how zero-width lines are broken.  I'm afraid for VC4 we're going to need to disable X's zero-width line acceleration.  VC4 hardware doesn't implement the diamond-exit rule, and actually has some workarounds in it for the way people naively *try* to draw lines in GL, which thoroughly break X11's line rasterization rules.

In non-vc4 news, the SDHOST driver is now merged upstream, so it should be in 4.12.  It won't quite be enabled in 4.12, due to the absurd merge policies of the arm-soc world, but at least the Fedora and Debian kernel maintainers should have a much easier time with Pi3 support at that point.

This month in vc4 (2017-03-13): docs, porting, tiled display

It's been a while since I've posted a status update.  Here are a few things that have been going on.

VC4 now generates HTML documentation from the comments in the kernel code.  It's just a start, but I'm hoping to add more technical details of how the system works here.  I don't think doxygen-style documentation of individual functions would be very useful (if you're calling vc4 functions, you're editing the vc4 driver, unlike library code), so I haven't pursued that.
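
Adding to the docs is just a matter of writing kernel-doc comments and pulling them into the rst build -- free-form DOC: sections fit this kind of architectural prose better than per-function comments (hypothetical example text):

    /**
     * DOC: VC4 binner memory management
     *
     * A free-form section like this one gets pulled into the HTML docs
     * by a kernel-doc directive in Documentation/gpu/vc4.rst, which is
     * where the "how the system works" prose can live.
     */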

I've managed to get permission to port my 3D driver to another platform, the 911360 enterprise phone that's already partially upstreamed.  It's got a VC4 V3D in it (slightly newer than the one in Raspberry Pi), but an ARM CLCD controller for the display.  The 3D came up pretty easily, but for display I've been resurrecting Tom Cooksey's old CLCD KMS driver that never got merged.  After 3+ years of DRM core improvements, we get to delete half of the driver he wrote and just use core helpers instead.  Completing this work will require that I get the dma-buf fencing to work right in vc4, which is something that Android cares about.

Spurred on by a report from one of the Processing developers, I've also been looking at register allocation troubles again.  I've found one good opportunity for reducing register pressure in Processing by delaying FS/VS input loads until they're actually used, and landed and then reverted one patch trying to accomplish it.  I also took a look at DEQP's register allocation failures again, and fixed a bunch of its testcases with a little scheduling fix.

I've also started on a fix to allow arbitrary amounts of CMA memory to be used for vc4.  The 256MB CMA limit today is to make sure that the tile state buffer and tile alloc/overflow buffers are in the same 256MB region due to a HW addressing bug.  If we allocate a single buffer that can contain tile state, tile alloc, and overflow all at once within a 256MB area, then all the other buffers in the system (texture contents, particularly) can go anywhere in physical address space.  The set top box group did some testing, and found that of all of their workloads, 16MB was enough to get any job completed, and 28MB was enough to get bin/render parallelism on any job.  The downside is that a job that's too big ends up just getting rejected, but we'll surely run out of CMA for all your textures before someone hits the bin limit by having too many vertices per pixel.

Eben recently pointed out to me that the HVS can in fact scan out tiled buffers, which I had thought it wasn't able to.  Having our window system buffers be linear means that when we try to texture from them (your compositing manager's drawing and uncomposited window movement, for example) we have to first do a copy to the tiled format.  The downside is that, in my measurements, we lose up to 1.4% of system performance (as measured by memcpy bandwidth tests with the 7" LCD panel for display) due to the HVS being inefficient at reading from tiled buffers (since it no longer gets to do large burst reads).  However, this is probably a small price to pay for the massive improvement we get in graphics operations, and that cost could be reduced by X noticing that the display is idle and swapping to a linear buffer instead.  Enabling tiled scanout is currently waiting for the new buffer modifiers patches to land, which Ben Widawsky at Intel has been working on.
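
Once the modifier patches land, userspace will be able to say a framebuffer is tiled when creating it, along these lines (a sketch; the vc4 modifier's final name is whatever the patches end up defining):

    #include <stdint.h>
    #include <drm_fourcc.h>
    #include <xf86drmMode.h>

    /* Create a KMS framebuffer while telling the kernel the BO holds
     * T-format tiles rather than linear pixels.  The modifier name here
     * is an assumption about what the patches will define.
     */
    static int add_tiled_fb(int fd, uint32_t w, uint32_t h,
                            uint32_t bo_handle, uint32_t pitch,
                            uint32_t *fb_id)
    {
            uint32_t handles[4] = { bo_handle };
            uint32_t pitches[4] = { pitch };
            uint32_t offsets[4] = { 0 };
            uint64_t modifiers[4] = { DRM_FORMAT_MOD_BROADCOM_VC4_T_TILED };

            return drmModeAddFB2WithModifiers(fd, w, h, DRM_FORMAT_XRGB8888,
                                              handles, pitches, offsets,
                                              modifiers, fb_id,
                                              DRM_MODE_FB_MODIFIERS);
    }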

There's no update on HDMI audio merging -- we're still waiting for any response from the ALSA maintainers, despite having submitted the patch twice and pinged on IRC.

These weeks in vc4 (2017-02-13): HDMI audio, DSI, 3D stability

Probably the biggest news of the last two weeks is that Boris's native HDMI audio driver is now on the mailing list for review.  I'm hoping that we can get this merged for 4.12 (4.10 is about to be released, so we're too late for 4.11).  We've tested stereo audio so far, no compressed audio (though I think it should Just Work), and >2 channel audio should be relatively small amounts of work from here.  The next step on HDMI audio is to write the alsa-lib configuration snippets necessary to hide the weird details of HDMI audio (stereo IEC958 frames required) so that sound playback works normally for all existing userspace, which Boris should have a bit of time to work on still.

I've also landed the vc4 part of the DSI driver in the drm-misc-next tree, along with a fixup.  Working with the new drm-misc-next trees for vc4 submission is delightful -- once I get a review, I can just push code to the tree and it will magically (through Daniel Vetter's pull requests and some behind-the-scenes tools) go upstream at the appropriate time.  I am delighted with the work that Daniel has been doing to make the DRM subsystem more welcoming of infrequent contributors and reducing the burden on us all of participating in the Linux kernel process.

In 3D land, the biggest news is that I've fixed a kernel oops (producing a desktop lockup) when CMA returns out of memory.  Unfortunately, VC4 doesn't have an MMU, so we require that memory allocations for graphics be contiguous (using the Contiguous Memory Allocator), and to make things worse we have a limit of 256MB for CMA due to an addressing bug of the V3D part, so CMA returning out of memory is a serious and unfortunately frequent problem.  I had a bug where a CMA failure would put the failed allocation into the BO cache, so if you asked for a BO of the same size soon enough, you'd get the bad BO back and crash.  I've been unable to construct a good minimal testcase for it, but the patch is on the mailing list and in the rpi-4.4.y tree now.
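
The fix itself is simple once spotted -- a failed allocation must never reach the cache (hypothetical shape of the code, not the exact driver functions):

    /* A BO may only enter the size-bucketed cache if its CMA backing
     * actually exists; on allocation failure we now just fail the
     * request instead of caching the dead BO.
     */
    static struct vc4_bo *vc4_bo_alloc(struct vc4_dev *vc4, size_t size)
    {
            struct vc4_bo *bo = vc4_bo_cache_lookup(vc4, size);

            if (bo)
                    return bo;      /* reuse an idle, correctly-backed BO */

            bo = vc4_bo_cma_alloc(vc4, size);
            if (!bo) {
                    /* The bug: this path used to stash the failed BO in
                     * the cache, so the next same-sized request got a BO
                     * with no memory behind it and oopsed. */
                    return NULL;
            }
            return bo;
    }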

I've also fixed a bug in the downstream tree's "firmware KMS" mode (where I use the closed source firmware for display, but open source 3D) where fullscreen 3D rendering would get a single frame displayed and then freeze.

In userspace, I've fixed a bit of multisample rendering (copies of MSAA depth buffers) that gets more of our regression testing working, and worked on some potential rasterization problems (the texwrap regression tests are failing due to texture filtering troubles, and I'm not sure whether we're actually doing anything wrong, because we're near the cutoff for how accurate the filtering has to be).

Coming up, I'm looking at fixing a cursor updates bug that Michael Zoran has found, and fixing up the DSI panel driver so that it can hopefully get into 4.12.  Also, after a recent discussion with Eben, I've realized that we can actually scan out tiled buffers from the GPU, so we may get big performance wins for un-composited X if I can have Mesa coordinate with the kernel on producing tiled buffers.

This week in vc4 (2017-01-30): Camera, downstream backport, 3D performance

Most of last week was spent switching my development environment over to a setup with no SD cards involved at all.  This was triggered by yet another card failing, and I spent a couple of days off and on trying to recover it.  I now have three scripts that build and swap my test environment between upstream 64-bit Pi3, upstream 32-bit Pi3, and downstream 32-bit Pi3, using just the closed source bootloader without u-boot or an SD card.  Previously I was on Pi2 only (much slower for testing), and running downstream kernels was really difficult.

Once I got the new netboot system working, I tested and landed the NEON part of my tiling work (the big 208% download and 41% upload performance improvements).  I'm looking forward to fixing up the clever tiling math parts soon.

I also tested and landed a few little compiler improvements. The nicest compiler improvement was turning on a switch that Marek added to gallium: We now run the GLSL compiler optimization loop exactly once (because it's required to avoid GLSL linker regressions), and rely on NIR to do the actual optimization after the GLSL linker has run.  The GLSL IR is a terrible IR for doing optimization on (only a bit better than Mesa or TGSI IRs), and it's made worse by the fact that I wrote a bunch of its optimizations back before we had good datastructures available in Mesa and before we had big enough shaders that using good datastructures mattered.  I'm horrified by my old code and can't wait to get it deleted (Timothy Arceri has been making progress on that front).  Until we can actually delete it, though, cutting down the number of times we execute my old optimization passes should improve our compile times on complicated shaders.
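
The switch itself is just a gallium capability bit, so the driver side is a one-liner (I believe the cap is named PIPE_CAP_GLSL_OPTIMIZE_CONSERVATIVELY, but treat that name as an assumption):

    #include "pipe/p_defines.h"
    #include "pipe/p_screen.h"

    static int
    vc4_screen_get_param(struct pipe_screen *pscreen, enum pipe_cap param)
    {
            switch (param) {
            /* Run the GLSL IR optimization loop only the one time the
             * linker requires, and lean on NIR for real optimization. */
            case PIPE_CAP_GLSL_OPTIMIZE_CONSERVATIVELY:
                    return 0;
            /* ... all the other caps ... */
            default:
                    return 0;
            }
    }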

Now that I have a good way to test the downstream kernel, I went ahead and made a giant backport of our current vc4 kernel code to the 4.9 branch.  I hope the Foundation can get that branch shipped soon -- backporting to 4.9 is so much easier for me than old 4.4, and the structure of the downstream DT files makes it much clearer what there is left to be moved upstream.

Meanwhile, Michael Zoran has continued hacking on the staging VCHI code, and kernel reviewers were getting bothered by edits to code with no callers.  Michael decided to solve that by submitting the old HDMI audio driver that feeds through the closed source firmware (I'm hoping we can delete this soon once Boris's work lands, though), and I pulled the V4L2 camera driver out of rpi-4.9.y and submitted that to staging as well.  I unfortunately don't have the camera driver quite working yet, because when I modprobe it the network goes down.  There are a ton of variables that have changed since the last time I ran the camera (upstream vs downstream, 4.10 vs 4.4, pi3 vs pi2), so it's going to take a bit of debugging before I have it working again.

Other news: kraxel from RH has resubmitted the SDHOST driver upstream, so maybe we can have wifi by default soon.  Baruch Siach has submitted some fixes that I suspect get BT working.  I've also passed libepoxy off to Emmanuele Bassi (long time GNOME developer) who has fixed it up to be buildable and usable on Linux again and converted it to the Meson build system, which appears to be really promising.

This week in vc4 (2017-01-24): LCA, DSI, HDMI audio

I spent last week at linux.conf.au 2017 (the best open source conference), mostly talking to people and going to talks instead of actually hacking on vc4.  I'd like to highlight a few talks you could go watch if you weren't there.

Jack Moffitt gave an overview of what Servo is doing with Rust on parallel processing and process isolation in web engines.  I've been hacking on this project in my spare time, and the community is delightful to work with.  Check out his talk to see what I've been excited about recently.

I then spent the next session in the hallway chatting with him instead of going to Willy's excellent talk about how the Linux page cache works.  One of the subjects of that talk makes me wonder: I've got these contiguous regions of GPU memory on VC4, could I use huge pages when we're mapping them for uploads to the GPU?  Would they help enough to matter?

Dave Airlie gave a talk about the structure of the Vulkan drivers and his work with Bas on the Radeon Vulkan driver.  It's awesome how quickly these two have been able to turn around a complete Vulkan driver, and it really validates our open source process and the stack we've built in Mesa.

Nathan Egge from Mozilla gave a talk on accelerating JPEG decode using the GPU.  I thought it was going to be harder to do than he made it out to be, so it was pretty cool to see the performance improvements he got.  The only sad part was that it isn't integrated into WebRender yet as far as I can see.

Karen Sandler talked about the issue of what happens to our copyrights when we die, from experience in working with current project maintainers on relicensing efforts.  I think this is a really important talk for anyone who writes copyleft software.  It actually softened my stance on CLAs (!) but is mostly a call to action for "Nobody has a good process yet for dealing with our copyrights after death, let's work on this."

And, finally, I think the most important talk of the conference to me was Daniel Vetter's Maintainers Don't Scale (no video yet, just a blog post).  The proliferation of boutique trees in the kernel with "overworked" maintainers either ignoring or endlessly bikeshedding the code of outside contributors is why I would never work on the Linux kernel if I wasn't paid to do so.  Daniel actually offers a way forward from the current state of maintainership in the kernel, with group maintainership (a familiar model in healthy large projects) to replace DRM's boutique trees and a reinforcement of the cooperative values that I think are why we all work on free software.  I would love to be able to shut down my boutique DRM and ARM platform trees.

In VC4, I did still manage to write a couple of kernel fixes in response to a security report, and rebase DSI on 4.10 and respin some of its patches for resubmission.  I'm still hoping for DSI in 4.11, but as usual it depends on DT maintainer review (no response for over a month).  I also spent a lot of time with Keith talking about ARM cache coherency questions and trying experiments on my NEON tiling code.

While I was mostly gone last week, though, Boris has been working on finishing off my HDMI audio branch.  He now has a branch that has a seriously simplified architecture compared to what I'd talked about with larsc originally (*so* much cleaner than what I was modeling on), and he has it actually playing some staticky audio.  He's now debugging what's causing the static, including comparing dumps of raw audio between VC4 and the firmware, along with the register settings of the HDMI and DMA engines.

These weeks in vc4 (2017-01-16): NEON acceleration of tiling

Last week I was on vacation, but the week before that I did some more work on figuring out Intel's Mesa CI system, and Mark Janes has started working on some better documentation for it.  I now understand better how the setup is going to work, but haven't made much progress on actually getting a master running yet.

More fun, though, was finally taking a look at optimizing the tiled texture load/store code.  This got started with Patrick Walton tweeting a link to a blog post on clever math for implementing texture tiling, given a couple of assumptions.

As with all GPUs these days, VC4 swizzles textures into a tiled layout so that when a cacheline is loaded from memory, it will cover nearby pixels in the Y direction as well as X, so that you are more likely to get cache hits for your neighboring pixels in drawing.  And the tiling tends to be multiple levels, so that nearby cachelines are also nearby on the screen, reducing DRAM access latency.

For small textures on VC4, we have a single level of tiling: 4x4@32bpp blocks ("utiles") of raster order data, themselves arranged in up to a 4x4 block, and this mode is called LT.  Once things go over that size, we go to T tiling, where we call a 4x4 LT block a subtile, and arrange subtiles in either clockwise or counterclockwise order within a 2x2 block (a tile, which at this point is now 1KB of data), and tiles themselves are arranged left-to-right, then right-to-left, then left-to-right again.

The first thing I did was implement the blog post's clever math for LT textures.  One of the nice features of the math trick is that it means I can do partial utile updates finally, because it skips around in the GPU's address space based on the CPU-side pixel coordinate instead of trying to go through GPU address space to update a 4x4 utile at a time.  The downside of doing things this way is that jumping around in GPU address space means that our writes are unlikely to benefit from write combining, which is usually important for getting full write performance to GPU memory.  It turned out, though, that the math was so much more efficient than what I was doing that it was still a win.
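
For reference, the LT addressing works out to simple bit math (the straightforward form here, not the blog post's incremental trick):

    #include <stdint.h>

    /* Byte offset of pixel (x, y) in an LT-layout 32bpp miplevel: utiles
     * are 4x4 pixels (16-byte rows, 64 bytes total) of raster-order
     * data, and the utiles themselves are raster-ordered.
     */
    static uint32_t lt_offset_32bpp(uint32_t x, uint32_t y,
                                    uint32_t width_in_utiles)
    {
            uint32_t utile = (y >> 2) * width_in_utiles + (x >> 2);
            uint32_t within = (y & 3) * 16 + (x & 3) * 4;

            return utile * 64 + within;
    }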

However, I found out that the clever math made reads slower.  The problem is that, because we're write-combined, reads are uncached -- each load is a separate bus transaction to main memory.  Switching from my old utile-at-a-time load using memcpy to the new code meant that instead of doing 4 loads using NEON to grab a row at a time, we were now doing 16 loads of 32 bits at a time, for what added up to a 30% performance hit.

Reads *shouldn't* be something that people do much, except that we still have some software fallbacks in X on VC4 ("core" unantialiased text rendering, for example, which I need to write the GLSL 1.20 shaders for), which involve doing a read/modify/write cycle on a texture.  My first attempt at fixing the regression was just adding back a fast path that operates on a utile at a time if things are nicely utile-aligned (they generally are).  However, some forced inlining of functions per cpp that I was doing for the unaligned case meant that the glibc memcpy call now got inlined back to being non-NEON, and the "fast" utile code ended up not helping loads.

Relying on details of glibc's implementation (their tradeoff for when to do NEON loads) and of gcc's implementation (when to go from memcpy calls to inlined 32-bits-at-a-time code) seems like a pretty bad idea, so I decided to finally write the NEON acceleration that Eben and I have talked about several times.

My first hope was that I could load a full cacheline with NEON's VLD.  VLD1 only loads up to 1 "quadword" (16 bytes) at a time, so that doesn't seem like much help.  VLD4 can load 64 bytes like we want, but it also turns AOS data into SOA in the process, and there's no corresponding "SOA-back-to-AOS store 8 or 16 bytes at a time" like we need to do to get things back into the CPU's strided representation.  I tried VLD4+VST4 into a temporary, then doing my old untiling path on the cached temporary, but that still left me a few percent slower on loads than not doing any of this work at all.

Finally, I hit on using the VLDM instruction.  It seems to be intended for stack loads/stores, but we can also use it to get 64 bytes of data in from memory untouched into NEON registers, and then I can use 4 (32bpp) or 8 (8 or 16bpp) VST1s to store it to the CPU side.  With this, we get a 208.256% +/- 7.07029% (n=10) improvement to GetTexImage performance at 1024x1024.  Doing the same NEON code for stores gave a 41.2371% +/- 3.52799% (n=10) improvement, probably mostly due to not calling into memcpy and having it go through its size/alignment-based memcpy path choosing process.
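
The load side comes out looking about like this (a 32bpp-only sketch of the approach, not the exact Mesa code; there are 8 and 16bpp variants and a plain C fallback too):

    #include <stdint.h>

    /* Pull a whole 64-byte utile out of write-combined GPU memory with a
     * single VLDM burst, then scatter its four 16-byte rows into the
     * linear CPU buffer with strided VST1 stores.
     */
    static inline void load_utile_32bpp(void *cpu, const void *gpu,
                                        uint32_t cpu_stride)
    {
            __asm__ volatile (
                    /* One big load: q0-q3 now hold the utile. */
                    "vldm %[gpu], {q0, q1, q2, q3}\n"
                    /* Four row stores, stepping by the CPU stride. */
                    "vst1.8 {d0, d1}, [%[cpu]], %[stride]\n"
                    "vst1.8 {d2, d3}, [%[cpu]], %[stride]\n"
                    "vst1.8 {d4, d5}, [%[cpu]], %[stride]\n"
                    "vst1.8 {d6, d7}, [%[cpu]], %[stride]\n"
                    : [cpu] "+r"(cpu)
                    : [gpu] "r"(gpu), [stride] "r"(cpu_stride)
                    : "q0", "q1", "q2", "q3", "memory");
    }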

I'm not yet hitting full memory bandwidth, but this should be a noticeable improvement to X, and it'll probably help my piglit test suite runtime as well.  Hopefully I'll get the code polished up and landed next week when I get back from LCA.

This week in vc4 (2017-01-02): thread switching, CI

The 3DMMES test has been thoroughly instruction count limited, so any wins we can get on code generation translate pretty directly into performance gains.  Last week I decided to work on fixing up Jonas's patch to schedule instructions in the delay slots of thread switching, which can save us 3 instructions per texture sample.

Thread switching, you may recall, is a trick in the fragment shader to hide texture fetch latency by cooperatively switching to another fragment shader instance after you request a texture fetch, so that it can make some progress when you'd probably be blocked anyway.  This lets us better occupy the ALUs, at the cost of each shader needing to fit into half of the physical register file.

However, VC4 doesn't cancel instructions in the pipeline when we request a switch (same as for within-shader branching), so 3 more instructions from the current shader get executed.  For my first implementation of thread switching, I just dumped in 3 NOPs after each THRSW to fill the delay slots.  This was still a massive win over not threading, so it was good enough.

Jonas's patch tries to fill the delay slots after we schedule in the thread switch by trying to extract the THRSW signal off of the instruction we scheduled and move it up to 2 instructions before that, and then only add enough NOPs to get us 3 slots filled.  There was a little bug (it re-scheduled the thrsw instruction instead of a NOP in trying to pad out the delay slots), but it basically worked and got us a 1.55% instruction count win on shader-db.

The problem was that he was scheduling ALU operations along with the thrsw, and if the thrsw signal was alone without an ALU operation in it, after moving the thrsw up we'd have a NOP left in the old location.  I wrote a followon patch to fix that: We now only schedule thrsws on their own without ALU operations, insert the THRSW as early as we can, and then add NOPs as necessary to fill remaining delay slots.  This was another 0.41% instruction count win.

This isn't as good as it could be.  We only try to fill the slots with instructions scheduled before the thrsw; instructions we choose to schedule after it might fit as well.  Those would be tricky because we have to check that they don't write flags or the accumulators (which wouldn't be preserved across the thrsw) or new texture coordinates.  We also don't put the THRSW at any particular point in the timeline between sampler request and results collection.  We might be able to get wins by trying to put thrsw at the lowest-register-pressure point between them, so that fewer things need to be forced into physical regs instead of accumulators.

Those will be projects for later.  It's probably much more important that we figure out how to schedule 2-4 samples with a single thrsw, instead of doing a thrsw per sample like we do today.

The other project last week was starting to build up a plan for vc4 CI.  We've had a few regressions to vc4 in Mesa master because developers (including me!) don't do testing on vc4 hardware on every commit.  The Intel folks have a lovely CI system that does piglit, DEQP, and performance regression testing for them, both tracking Mesa master and doing pre-commit tests of voluntarily submitted branches.  I despaired of the work needed to build something that good, but apparently they've open sourced their configuration, so I'll be trying to replicate that in the next few weeks.  This week I worked on getting the hardware and a plan for where to host it.  I'm looking forward to a bright future of 30-minute test runs on the actual hardware and long-term tracking of performance metrics.  Unfortunately, there are no docs for it and I've never worked with Jenkins before, so this is going to be a challenge.

Other things: I submitted a patch to mm/ to shut up the CMA warning we've been suffering from (and patching away downstream) for years, and got exactly the expected response ("this dmesg spew in a common path might be useful for somebody debugging something some day, so it should stay there"), so hopefully we can just get Fedora and Debian to patch it out instead.  This is yet another data point in favor of Matthew Garrett's plan of "write kernel patches you need, don't bother submitting them upstream".  I started testing a new patch to reduce the error return rate on vc4 memory allocations on the upstream kernel, but haven't confirmed that it's working yet.  I also spent a lot of time reviewing tarceri's patches to Mesa preparing for the on-disk shader cache, which should help improve app startup times and reduce memory consumption.

This week in VC4 (2016-12-19): DSI, SDTV

Two big successes last week.

One is that Dave Airlie has pulled Boris's VEC (SDTV) code for 4.10.  We didn't quite make it for getting the DT changes in time to have full support in 4.10, but the DT changes will be a lot easier to backport than the driver code.

The other is that I finally got the DSI panel working, after 9 months of frustrating development.  It turns out that my DSI transactions to the Toshiba chip aren't working, but if I use I2C to the undocumented Atmel microcontroller to relay to the Toshiba in one of the 4 possible orderings of the register sequence, the panel comes up.  The Toshiba docs say I need to do I2C writes at the beginning of the poweron sequence, but the closed firmware uses DSI transactions for the whole sequence, making me suspicious that the Atmel has already done some of the Toshiba register setup.

I've now submitted the DSI driver, panel, and clock patches after some massive cleanup.  It even comes with proper runtime power management for both the panel and the VC4 DSI module.
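
The runtime PM part boils down to bracketing the encoder hooks with the stock pm_runtime helpers (the structure and helper names here are hypothetical):

    #include <linux/pm_runtime.h>
    #include <drm/drm_encoder.h>

    /* Power the DSI module up only while the encoder is in use. */
    static void vc4_dsi_encoder_enable(struct drm_encoder *encoder)
    {
            struct vc4_dsi *dsi = to_vc4_dsi(encoder);  /* hypothetical */

            pm_runtime_get_sync(dsi->dev);  /* clocks and power domain on */
            /* ... program the DSI host, start video ... */
    }

    static void vc4_dsi_encoder_disable(struct drm_encoder *encoder)
    {
            struct vc4_dsi *dsi = to_vc4_dsi(encoder);

            /* ... stop video transfer ... */
            pm_runtime_put(dsi->dev);       /* allow the module to power off */
    }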

The next steps in modesetting land will be to write an input driver for the panel, do any reworks we need from review, and get those chunks merged in time for 4.11.  While I'm working on this, Boris is now looking at my HDMI audio code so that hopefully we can get that working as well.  Eben is going to poke Dom about fixing the VC4 driver's interop with media decode.  I feel like we're now making rapid progress toward feature parity on the modesetting side of the VC4 driver.