|Monday, November 28th, 2016|
|This week in vc4 (2016-11-28): performance tuning
I missed last week's update, but with the holiday it ended up being a short week anyway.
The multithreaded fragment shaders are now in drm-next and Mesa master. I think this was the last big win for raw GL performance and we're now down to the level of making 1-2% improvements in our commits. That is, unless we're supposed to be using double-buffer non-MS mode
and the closed driver was just missing that feature. With the glmark2 comparisons I've done, I'm comfortable with this state, though. I'm now working on performance comparisons for 3DMMES Taiji, which the HW team often uses as a benchmark. I spent a day or so trying to get it ported to the closed driver and failed, but I've got it working on the open stack and have written a couple of little performance fixes with it.
The first was just a regression fix from the multithreading patch series, but it was impressive that multithreading hid a 2.9% instruction count penalty and still showed gains.
One of the new fixes I've been working on is folding ALU operations into texture coordinate writes. This came out of frustration with the instruction selection research I had done over the last couple of weeks, where all the algorithms seemed like they would still need significant peephole optimization after selection. I finally asked "well, how hard would it be to just finish the big selection problems I know are easily doable with peepholes?", and it wasn't all that bad. The win came out to about 1% of instructions, with a similar benefit to overall 3DMMES performance (it's shocking how ALU-bound 3DMMES is).
I also started on a branch to jump to the end of the program when all 16 pixels in a thread have been discarded. This had given me a 7.9% win on GLB2.7 on Intel, so I hoped for similar wins here. 3DMMES seemed like a good candidate for testing, too, with a lot of discards that are followed by reams of code that could be skipped, including texturing. Initial results didn't seem to show a win, but I haven't actually done any stats on it yet. I also haven't done the usual "draw red where we were applying the optimization" hack to verify that my code is really working, either.
While I've been working on this, Jonas Pfeil (who originally wrote the multithreading support) has been working on a couple of other projects. He's been trying to get instructions scheduled into the delay slots of thread switches, which should help reduce any regressions those two features might have caused. More exciting, he's just posted a branch for doing nearest-filtered raster textures (the primary operation in X11 compositing) using direct memory lookups instead of our shadow-copy fallback. Hopefully I get a chance to review, test, and merge it in the next week or two.
On the kernel side, my branches for 4.10 have been pulled. We've got ETC1 and multithread FS for 4.10, and a performance win in power management. I've also been helping out and reviewing Boris Brezillon's work for SDTV output in vc4. Those patches should be hitting the list this week.
|Monday, November 14th, 2016|
|This week in vc4 (2016-11-14): Multithreaded fragment shaders
I dropped HDMI audio last week because Jonas Pfeil showed up with a pair of branches to do multithreaded fragment shaders.
Some context for multithreaded fragment shaders: Texture lookups are really slow. I think we eyeballed them as having a latency of around 20 QPU instructions. We can hide latency sometimes if there's some unrelated math to be done after the texture coordinate calculation but before using the texture sample. However, in most cases, you need the texture sample results right away for the next bit of work.
To allow programs to hide that latency, there's a cooperative hyperthreading mode that a fragment shader can opt into. The shader stays in the bottom half of register space, and before it collects the results of a texture fetch, it issues a thread switch signal, which the hardware will use to run a second fragment shader thread until that one issues its own thread switch. For the second thread, the top bit of the register addresses gets flipped, so the two threads' states don't overlap (except for the accumulators and the flags, which the shaders have to save and restore).
I had delayed working on this because the full solution was going to be really tricky: queue up as many lookups as we can, then thread switch, then collect all those results before switching again, all while respecting the FIFO sizes. However, Jonas's huge contribution here was in figuring out that you don't need to be perfect, you can get huge gains by thread switching between each texture lookup and reading its results.
The upshot was a 0-20% performance improvement on glmark2 and a performance hit to only one testcase. With this we're down to 3 subtests that we're slower on than the closed source driver. Jonas's kernel patch is out on the list, and I rewrote the Mesa series to expose threading to the register allocator and landed all but the enabling patch (which has to wait on the kernel side getting merged). Hopefully I'll finish merging it in a week.
In the process of writing multithreading, Jonas noticed that we were scheduling our TLB_Z writes early in the program, which can cut into fragment shader parallelism between the QPUs (of which we have 12) because a TLB_Z write locks the scoreboard for that 2x2 pixel quad. I merged a patch to Mesa that pushes the TLB_Z down to the bottom, at the cost of a single extra QPU instruction. Some day we should flip QPU scheduling around so that we pair that TLB_Z up better, and fill our delay slots in thread switches and branching as well.
I also tracked down a major performance issue in Raspbian's desktop using the open driver. I asked them to use xcompmgr a while back, because readback from the front buffer using the GPU is really slow (The texture unit can't read from raster textures, so we have to reformat them to be readable), and made window dragging unbearable. However, xcompmgr doesn't unredirect fullscreen windows, so full screen GL apps emit copies (with reformatting!) instead of pageflipping.
It looks like the best choice for Raspbian is going to be using compton, an xcompmgr fork that does pageflipping (you get tear free screen updates!) and unredirection of full screen windows (you get pageflipping directly from GL apps). I've also opened a bug report on compton with a description of how they could improve their GL drawing for a tiled renderer like the Pi, which could improve its performance for windowed updates significantly.
Simon ran into some trouble with compton, so he hasn't flipped the default yet, but I would encourage anyone running a Raspberry Pi desktop to give it a shot -- the improvement should be massive.
Other things last week: More VCHIQ patch review, merging more to the -next branches for 4.10 (Martin Sperl's thermal driver, at last!), and a shadow texturing bugfix.
|Monday, November 7th, 2016|
|This week in vc4 (2016-11-07): glmark2, ETC1, power management, HDMI audio
I spent a little bit of time last week on actual 3D work. I finally figured out what was going wrong with glmark2's terrain demo. The RCP (reciprocal) instruction on vc4 is a very rough approximation, and in translating GLSL code doing floating point divides (which we have to convert from a / b to a * (1 / b)) we use a Newton-Raphson step to improve our approximation. However, we forgot to do this for the implicit divide we have to do at the end of your vertex shader, so the triangles in the distance were unstable and bounced around based on the error in the RCP output.
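The refinement is the standard single Newton-Raphson step for reciprocals. A minimal sketch of the idea, with hw_rcp() standing in for the hardware's RCP instruction (this is the textbook formula, not the literal compiler code):

    /* Refine the RCP unit's rough estimate x0 of 1/b with one
     * Newton-Raphson step; the relative error roughly squares. */
    float hw_rcp(float b); /* stand-in for the hardware RCP result */

    float refined_rcp(float b)
    {
        float x0 = hw_rcp(b);        /* rough approximation */
        return x0 * (2.0f - b * x0); /* x1 = x0 * (2 - b*x0) */
    }

A GLSL a / b then becomes a * refined_rcp(b); the missing piece was applying the same step to the implicit perspective divide.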
I also finally got around to writing kernel-side validation for ETC1 texture compression. I should be able to merge support for this to v4.10, which is one of the feature checkboxes we were missing compared to the old driver. Unfortunately ETC1 doesn't end up being very useful for us in open source stacks, because S3TC has been around a lot longer and supported on more hardware, so there's not much content using ETC that I know of outside of Android land.
I also spent a while fixing up various regressions that had happened in Mesa while I'd been playing in the kernel. Some day I should work on CI infrastructure on real hardware, but unfortunately Mesa doesn't have any centralized CI infrastructure to plug into. Intel has a test farm, but not all proposed patches get submitted to it, and it (understandably) doesn't have non-Intel hardware.
In kernel land, I sent out a patch to reduce V3D's overhead due to power management. We were immediately shutting off the 3D engine when it went idle, even if userland might be expected to submit a new frame shortly thereafter. This was taking up 1% of the CPU on profiles I was looking at last week, and I've seen it up to 3.5%. Now we keep the GPU on for at least 40ms ("about two frames plus some slack"). If I'm reading right, the closed driver does immediate powerdown as well, so we may be using slightly more power now, but given that power state transitions themselves cost power, and the CPU overhead costs power, hopefully this works out fine anyway (even though we've never done power measurements for the platform).
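I believe the kernel's runtime-PM autosuspend machinery is the natural way to express this kind of delayed powerdown; a minimal sketch assuming that mechanism (only the 40ms figure comes from the actual change):

    /* At bind time: only power down after 40ms of idleness. */
    pm_runtime_set_autosuspend_delay(dev, 40); /* "two frames plus slack" */
    pm_runtime_use_autosuspend(dev);
    pm_runtime_enable(dev);

    /* When a render job completes, instead of an immediate put: */
    pm_runtime_mark_last_busy(dev);
    pm_runtime_put_autosuspend(dev);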
I also reviewed more patches from Michael Zoran for vchiq. It should now (once all reviewed patches land) work on arm64. We still need to get the vchiq-using drivers like camera and vcsm into staging.
Finally, I made more progress on HDMI audio. One of the ALSA developers, Lars-Peter Clausen, pointed out that ALSA's design is that userspace would provide the IEC958 subframes by using alsalib's iec958 plugin and a bit of asoundrc configuration. He also provided a ton of help debugging that pipeline. I rebased and cleaned up into a branch that uses simple-audio-card as my machine driver, so the whole stack is pretty trivial. Now aplay runs successfully, though no sound has come out. Next step is to go test that the closed stack actually plays audio on this monitor and capture some register state for debugging.
|Monday, October 31st, 2016|
|This week in vc4 (2016-10-31): HDMI audio
Last week was spent almost entirely on HDMI audio support.
I had started the VC4-side HDMI setup last week, and got the rest of the stubbed bits filled out this week.
Next was starting to get the audio side connected. I had originally derived my VC4 side from the Mediatek driver, which makes a child device for the generic "hdmi-codec" DAI codec to attach to. HDMI-codec is a shared implementation for HDMI encoders that take I2S input, which basically just passes audio setup params into the DRM HDMI encoder, whose job is to actually make I2S signals on wires get routed into HDMI audio packets. They also have a CPU DAI registered through devicetree that moves PCM data from memory onto wires, and a machine driver registered through devicetree that connects the CPU DAI's wires to the HDMI codec DAI's wires. This is apparently how SoC audio is usually structured.
VC4, on the other hand, doesn't have a separate hardware block for moving PCM data from memory onto wires; it just has the MAI_DATA register in the HDMI block that you put S/PDIF (IEC958) data into using the SoC's generic DMA engine. So I made VC4's HDMI driver register a child device with the MAI_DATA register's address passed in, and that child device gets bound by the CPU DAI driver and uses the ASoC generic DMA engine support. That CPU DAI driver also ends up setting up the machine driver, since I didn't have any other reference to the CPU DAI anywhere. The glue here is all based on static strings I can't rely on, and is a complete hack for now.
Last, notice how I've used "PCM" when talking about a working implementation's audio data and "S/PDIF" for the thing I have to take in? That's the sticking point. The MAI_DATA register wants things that look like the AES3 frames description you can find on Wikipedia. I think the HW is generating my preamble. In the firmware's HDMI audio implementation, they go through the unpacked PCM data and set the channel status bit in each sample according to the S/PDIF channel status word (the first 2 bytes of which are also documented on Wikipedia). At the same time that they're doing that, they also set the parity bit.
My first thought was "Hey, look, ALSA has SNDRV_PCM_FORMAT_IEC958_SUBFRAME, maybe my codec can say it only takes these S/PDIF frames and everything will work out." No. It turns out aplay won't generate them and neither will gstreamer. To make a usable HDMI audio driver, I'm going to need to generate my IEC958 myself.
Linux is happy to generate my channel status word in memory. I think the hardware is happy to generate my parity bit. That just leaves making the CPU DAI write the bits of the channel status word into the subframes as PCM samples come in to the driver, and making sure that the MSB of the PCM audio data is bit 27. I haven't figured out how to do that quite yet, which is the project for this week.
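To make the subframe layout concrete, here's a hedged sketch of packing one 16-bit PCM sample; the bit positions follow the IEC958/AES3 layout described above, but the helper is mine, not driver code:

    #include <stdint.h>

    #define IEC958_CSTATUS_BIT (1u << 30) /* channel status bit of a subframe */

    /* frame counts 0..191 through the 192-bit channel status word. */
    static uint32_t pack_subframe(int16_t sample, const uint8_t *cstatus,
                                  unsigned frame)
    {
        /* Left-justify the sample so its MSB lands at bit 27. */
        uint32_t sub = (uint32_t)(uint16_t)sample << 12;

        if (cstatus[frame / 8] & (1 << (frame % 8)))
            sub |= IEC958_CSTATUS_BIT;

        /* Preamble and parity are left to the hardware, if I'm reading
         * MAI_DATA right. */
        return sub;
    }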
Work-in-progress, completely non-functional code is here.
|Monday, October 24th, 2016|
|This week in vc4 (2016-10-24): HDMI audio, DSI, simulator, linux-next
The most interesting work I did last week was starting to look into HDMI audio. I've been putting this off because really I'm a 3D developer, not modesetting, and I'm certainly not an audio developer. But the work needs to get done, so here I am.
HDMI audio on SoCs looks like it's not terribly hard. VC4's HDMI driver needs to instantiate a platform device for audio when it gets loaded, and attach a function table to it for calls to be made when starting/stopping audio. Then I write a sound driver that will bind to the node that VC4 created, accepting a parameter of which register to DMA into. That sound driver will tell ALSA about what formats it can accept and actually drive the DMA engine to write audio frames into HDMI. On the VC4 HDMI side, we do a bit of register setup (clocks, channel mappings, and the HDMI audio infoframe) when the sound driver tells us it wants to start.
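The platform-device dance might look something like this (a sketch; the device name, register name, and pdata struct are all made up for illustration):

    /* Hypothetical pdata telling the sound driver where to DMA. */
    struct vc4_hdmi_audio_pdata {
        phys_addr_t audio_fifo_reg;
    };

    /* In the VC4 HDMI driver's bind path: */
    struct vc4_hdmi_audio_pdata pdata = {
        .audio_fifo_reg = hdmi_base + AUDIO_FIFO_OFFSET, /* made-up names */
    };
    struct platform_device *audio_pdev =
        platform_device_register_data(dev, "vc4-hdmi-audio", /* made-up */
                                      PLATFORM_DEVID_AUTO,
                                      &pdata, sizeof(pdata));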
It all sounds pretty straightforward so far. I'm currently at the "lots of typing" stage of development of this.
Second, I did a little poking at the DSI branch again. I've rebased onto 4.9 and tested an idea I had for how DSI might be failing. Still no luck. There's code on drm-vc4-dsi-panel-restart-cleanup-4.9, and the length of that name suggests how long I've been bashing my head against DSI. I would love for anyone out there with the inclination to do display debug to spend some time looking into it. I know the history there is terrible, but I haven't prioritized cleanup while trying to get the driver working in the first place.
I also finished the simulator cleanup from last week, found a regression in the major functional improvement in the cleanup, and landed what worked. Dave Airlie pointed out that he took a different approach for testing with virgl called vtest, and I took a long look at that. Now that I've got my partial cleanup in place, the vtest-style simulator wrapper looks a lot more doable, and if it doesn't have the major failing that swrast did with visuals then it should actually make vc4 simulation more correct.
I landed a very tiny optimization to Mesa that I wrote while debugging DEQP failures.
Finally, I put together the -next branches for upstream bcm2835 now that 4.9-rc1 is out. So far I've landed a major cleanup of pinctrl setup in the DT (started by me and finished by Gerd Hoffmann from Red Hat, to whom I'm very grateful) and a little USB setup fix. I did a lot of review of VCHIQ patches for staging, and sort of rewrote Linus Walleij's RFC patch for getting descriptive names for the GPIOs in lsgpio.
|Tuesday, October 18th, 2016|
|This week in vc4 (2016-10-18): X performance, DEQP, simulation
The most visible change this week is a fix to Mesa for texture upload performance. A user had reported that selection rectangles in LXDE's file manager were really slow. I brought up sysprof, and it showed that we were doing uncached reads from the GPU in a situation that should have been entirely write-combined writes.
The bug was that when the size of the texture wasn't aligned to utiles, we were loading the previous contents into the temporary buffer before writing the new texture data in and uploading, even if the full texture was being updated. I've fixed it to check for when the full texture is being uploaded, and not do the initial download. This bug was getting hit on almost any Cairo vector graphics operation with its Xlib backend, so hopefully this helps a lot of people's desktops.
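The shape of the fix, as a sketch using gallium types (illustrative, not the exact Mesa change): skip the slow, uncached read-back into the temporary buffer whenever a helper like this returns true.

    /* A full-level upload overwrites everything, so the old contents
     * only need to be read back when the upload is partial. */
    static bool covers_whole_level(const struct pipe_box *box,
                                   const struct pipe_resource *tex,
                                   unsigned level)
    {
        return box->x == 0 && box->y == 0 &&
               box->width == u_minify(tex->width0, level) &&
               box->height == u_minify(tex->height0, level);
    }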
I also worked on a cleanup of the simulator mode. I use the closed source simulator regularly as part of my work -- it's fairly accurate to hardware behavior, and allows you to trace what's happening when things go wrong, all with the driver code running like "normal" for apps on my x86 desktop.
However, my simulator support is a little invasive to the driver, replacing vc4 ioctl calls with alternative ioctls to the i915 driver. I started on a series to make vc4_simulator.c take plain VC4 ioctls and translate them, so that the simulator code is entirely contained. The last step, I think, is figuring out at which level to put the simulator's copying in and out of window-system framebuffers.
Last, I got DEQP's GLES2 tests up and running on Raspberry Pi. These are approximately equivalent to the official conformance tests. The initial results were quite good -- 18630/19098 (97.5%) passing when run in the piglit framework. I found a couple of fixes to be made in glClear() support, one of which will affect all gallium drivers. Of the remainder, there are a few tests that are failing due to test bugs (we expose extensions that allow features the tests don't expect), but most are failing in register allocation. For the register allocation failures, I see a relatively quick fix that should reduce register pressure in loops.
|Tuesday, October 11th, 2016|
|This week in vc4 (2016-10-11): QPU user shaders, Processing performance
Last week I started work on making hello_fft work with the vc4 driver loaded. hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4). Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.
Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4. The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful). That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.
The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory. For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it. It also makes sure that you bounds-check all reads through the uniforms or texture sampler. For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.
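For flavor, the 3D validation is shaped something like this (hypothetical helper names, not the actual vc4 validator):

    /* Walk each 64-bit QPU instruction and reject anything that could
     * perform a general store; GL 2.1 shaders never need one. */
    bool inst_is_general_vpm_store(uint64_t inst); /* hypothetical decoder */

    static int validate_shader(const uint64_t *insts, unsigned count)
    {
        unsigned i;

        for (i = 0; i < count; i++) {
            if (inst_is_general_vpm_store(insts[i]))
                return -EINVAL;
        }
        return 0;
    }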
If it was just this one demo, we might be willing to lose support for it in the transition to the open driver. However, some modes of accelerated video decode also use QPU shaders at some stage, so we have to have some sort of solution.
My plan is basically to not do validation and have root-only execution of QPU shaders for now. The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4. VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run).
Dom is now off experimenting in what it will take to redirect video decode's QPU usage onto this code.
The other little project last week was fixing Processing's performance on vc4. It's one of the few apps that is ported to the closed driver, so it's a likely thing to be compared on.
Unfortunately, Processing was clearing its framebuffers really inefficiently. For non-tiled hardware it was mostly OK, with just a double-clear of the depth buffer, but for vc4 its repeated glClear()s (rather than a single glClear with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full screen quad (and a full screen load before doing so) rather than fast clearing.
The solution was to improve the tracking of which buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears together while only updating the colors to be cleared. There's always a tension in this sort of optimization: should the GL driver be clever to work around apps behaving badly, or should it push that off to the app developer (more like Vulkan does)? In this case, tiled renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it. For Processing, I saw an improvement of about 10% on the demo I was looking at.
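A sketch of what the coalescing looks like in the driver (field names are illustrative):

    /* In the clear entrypoint: if nothing has been drawn yet, fold this
     * glClear() into the pending one instead of flushing the scene. */
    if (!vc4->draw_calls_queued) {
        vc4->cleared |= buffers;       /* accumulate color/depth/stencil bits */
        if (buffers & PIPE_CLEAR_COLOR)
            vc4->clear_color = *color; /* the last clear color wins */
        return;                        /* no flush, no fullscreen quad */
    }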
Sadly, Processing won't be a good comparison for open vs closed drivers even after these fixes. For the closed driver Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers. In its default mode, though, Processing uses GLX, which says that the depth buffer is retained on glXSwapBuffers(). To avoid this extra overhead in its GLX mode, you'd need to use glDiscardFramebufferEXT() right before the swap. Unfortunately, this isn't connected in gallium yet, but it shouldn't be hard to do once we have an app that we can test.
|Monday, October 3rd, 2016|
|This week in vc4 (2016-10-03): HDMI, X testing and cleanups, VCHIQ
Last week Adam Jackson landed my X testing series, so now at least 3 of us pushing code are testing glamor with every commit. I used the series to test a cleanup of the Render extension related code in glamor, deleting 550 lines of junk. I did this while starting to work on using GL sampler objects, which should help reduce our pixmap allocation and per-draw Render overhead. That patch is failing make check so far.
The big project for the week was tackling some of the HDMI bug reports. I pulled out my bigger monitor that actually takes HDMI input, and started working toward getting it to display all of its modes correctly. We had an easy bug in the clock driver that broke changing to low resolutions: trying to drive the PLL too slow, instead of driving the PLL in spec and using the divider to get down to the clock we want. We needed infoframes support to fix purple borders on the screen. We were doing our interlaced video mode programming pretty much entirely wrong (it's not perfect yet, but 1 of the 3 interlaced modes on my monitor displays correctly and they all display something). Finally, I added support for double-clocked CEA modes (generally the low resolution, low refresh ones in your list; these are probably particularly interesting for RPi due to all the people running emulators).
Finally, one of the more exciting things for me was that I got general agreement from gregkh to merge the VCHIQ driver through the staging tree, and he even put the initial patch together for me. If we add in a couple more small helper drivers, we should be able to merge the V4L2 camera driver, and hopefully get closer to doing a V4L2 video decode driver for upstream. I've now tested a messy series that at least executes "vcdbg log msg." Those other drivers will definitely need some cleanups, though.
|Monday, September 26th, 2016|
|This week in vc4 (2016-09-26): XDC 2016, glamor testing, glamor performance
I spent last week at XDC 2016 with other graphics stack developers.
I gave a talk
on X Server testing and our development model. Since playing around in the Servo project (and a few other github-centric communities) recently, I've become increasingly convinced that github + CI + automatic merges is the right way to go about software development now. I got a warmer reception from the other X Server developers than I expected, and I hope to keep working on building out this infrastructure.
I also spent a while with keithp and ajax (other core X developers) talking about how to fix our rendering performance. We've got two big things hurting vc4 right now: Excessive flushing, and lack of bounds on rendering operations.
The excessive flushing one is pretty easy. X currently does some of its drawing directly to the front buffer (I think this is bad and we should stop, but that *is* what we're doing today). With glamor, we do our rendering with GL, so the driver accumulates a command stream over time. In order for the drawing to the front buffer to show up on time, both for running animations, and for that last character you typed before rendering went idle, we have to periodically flush GL's command stream so that its rendering to the front buffer (if any) shows up.
Right now we just glFlush() every time we might block sending things back to a client, which is really frequent. For a tiling architecture like vc4, each glFlush() is quite possibly a load and store of the entire screen (8MB).
The solution I started back in August is to rate-limit how often we flush. When the server wakes up (possibly to render), I arm a timer. When it fires, I flush and disarm it until the next wakeup. I set the timer to 5ms as an arbitrary value that was clearly not pretending to be vsync. The state machine was a little tricky and seemed to have a bug (or at least a bad interaction with x11perf -- it was unclear). But ajax pointed out that I could easily do better: we have a function in glamor that gets called right before it does any GL operation, meaning that we could use that to arm the timer and not do the arming every time the server wakes up to process input.
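In pseudo-C, the improved scheme is roughly this (names are hypothetical; the X Server's actual timer API differs in detail):

    #define FLUSH_DELAY_MS 5 /* arbitrary; clearly not pretending to be vsync */

    struct glamor_flush_state {
        bool timer_armed;
    };

    static void flush_timer_fired(void *priv)
    {
        struct glamor_flush_state *state = priv;

        glFlush();
        state->timer_armed = false; /* re-armed by the next GL operation */
    }

    /* Called right before glamor performs any GL operation. */
    static void glamor_about_to_render(struct glamor_flush_state *state)
    {
        if (!state->timer_armed) {
            arm_timer_ms(FLUSH_DELAY_MS, flush_timer_fired, state); /* hypothetical */
            state->timer_armed = true;
        }
    }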
Bounding our rendering operations is a bit trickier. We started doing some of this with keithp's glamor rework: We glScissor() all of our rendering to the pCompositeClip (which is usually the bounds of the current window). For front buffer rendering in X, this is a big win on vc4 because it means that when you update an uncomposited window I know that you're only loading and storing the area covered by the window, not the entire screen. That's a big bandwidth savings.
However, the pCompositeClip isn't enough. As I'm typing we should be bounding the GL rendering around each character (or span of them), so that I only load and store one or two tiles rather than the entire window. It turns out, though, that you've probably already computed the bounds of the operation in the Damage extension, because you have a compositor running! Wouldn't it be nice if we could just reuse that old damage computation and reference it?
Keith has started on making this possible: first with the idea of const-ifying all the op arguments so that we could reuse their pointers as a cheap cache key, and now with the idea of just passing along a possibly-initialized rectangle with the bounds of the operation. If you do compute the bounds, then you pass it down to anything you end up calling.
Between these two, we should get much improved performance on general desktop apps on Raspberry Pi.
Other updates: landed an opt_peephole_sel improvement for glmark2 performance (and probably mupen64plus), added regression testing of glamor to the X Server, and fixed a couple of glamor rendering bugs.
|Monday, September 19th, 2016|
|This week in vc4 (2016-09-19): firmware KMS, apitrace testing
While stuck on an airplane, I put together a repository for apitraces with confirmed good images for driver output. Combined with the piglit patches I pushed, I now have regression testing of actual apps on vc4 (particularly relevant given that I'm working on optimizing one of those apps!)
The flight was to visit the Raspberry Pi Foundation, with the goal of getting something usable for their distro to switch to the open 3D stack. There's still a giant pile of KMS work to do (HDMI audio, DSI power management, SDTV support, etc.), and waiting for all of that to be regression-free will be a long time. The question is: what could we do that would get us 3D, even if KMS isn't ready?
So, I put together a quick branch to expose the firmware's display stack as the KMS display pipeline. It's a filthy hack, and loses us a lot of the important new features that the open stack was going to bring (changing video modes in X, vblank timestamping, power management), but it gets us much closer to the featureset of the previous stack. Hopefully they'll be switching to it as the default in new installs soon.
In debugging while I was here, Simon found that on his HDMI display the color ramps didn't quite match between closed and open drivers. After a bit of worrying about gamma ramp behavior, I figured out that it was actually that the monitor was using a CEA mode that requires limited range RGB input. A patch is now on the list.
|Wednesday, September 14th, 2016|
|This week in vc4 (2016-09-13): glmark2 performance work
Last week I worked on the glmark2 performance issues. I now have a NIR patch out for the pathological conditionals test (it's now faster than on the old driver), and a branch for job shuffling (+17% and +27% on the two desktop tests).
Here's the basic idea of job shuffling:
We're a tiled renderer, and tiled renderers get their wins from having a Clear at the start of the frame (indicating we don't need to load any previous contents into the tile buffer). When your frame is done, we flush each tile out to memory. If you do your clear, start rendering some primitives, and then switch to some other FBO (because you're rendering to a texture that you're planning on texturing from in your next draw to the main FBO), we have to flush out all of those tiles, start rendering to the new FBO, and flush its rendering, and then when you come back to the main FBO we have to reload your old cleared-and-a-few-draws tiles.
Job shuffling deals with this by separating the single GL command stream into separate jobs per FBO. When you switch to your temporary FBO, we don't flush the old job, we just set it aside. To make this work we have to add tracking for which buffers have jobs writing into them (so that if you try to read those from another job, we can go flush the job that wrote it), and which buffers have jobs reading from them (so that if you try to write to them, they can get flushed so that they don't get incorrectly updated contents).
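The hazard rules reduce to two small helpers, roughly like this (illustrative names; the real driver keeps hash tables mapping resources to jobs):

    /* Reading a resource: any job writing it must be submitted first. */
    static void flush_for_read(struct vc4_context *vc4, struct resource *r)
    {
        struct vc4_job *writer = lookup_write_job(vc4, r); /* hypothetical */
        if (writer)
            submit_job(vc4, writer);
    }

    /* Writing a resource: jobs reading it must also go, so they don't
     * pick up incorrectly updated contents. */
    static void flush_for_write(struct vc4_context *vc4, struct resource *r)
    {
        flush_for_read(vc4, r);
        submit_jobs_reading(vc4, r); /* hypothetical iterate-and-submit */
    }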
This felt like it should have been harder than it was, and there's a spot where I'm using a really bad data structure I had laying around, but that data structure has been bad news since the driver was imported and it hasn't been high in any profiles yet. The other tests don't seem to have any problem with the possible increased CPU overhead.
The shuffling branch also unearthed a few bugs related to clearing and blitting in the multisample tests. Some of the piglit cases involved are fixed, but some will be reporting new piglit "regressions" because the tests are now closer to working correctly (sigh, reftests).
I also started writing documentation for updating the system's X and Mesa stack on Raspbian for one of the Foundation's developers. It's not polished, and if I was rewriting it I would use modular's build.sh instead of some of what I did there. But it's there for reference.
|Tuesday, September 6th, 2016|
|This week in vc4 (2016-09-06): glmark2, X testing, kernel maintaining
Last week I was tasked with making performance comparisons between vc4 and the closed driver possible. I decided to take glmark2 and port it to dispmanx, and submitted a pull request upstream. It already worked on X11 on the vc4 driver, and I fixed the drm backend to work as well (though the drm backend has performance problems specific to the glmark2 code).
Looking at glmark2, vc4 has a few bugs. Terrain has some rendering bugs. The driver on master took a big performance hit on one of the conditionals tests since the loops support was added, because NIR isn't aggressive enough in flattening if statements. Some of the tests require that we shuffle rendering jobs to avoid extra frame stores and loads. Finally, we need to use the multithreaded fragment shader mode to hide texture fetching latency on a bunch of tests. Despite the bugs, results looked good.
(Note: I won't be posting comparisons here. Comparisons will be left up to the reader on their particular hardware and software stacks).
I'm expecting to get to do some glamor work on vc4 again soon, so I spent some of the time while I was waiting for Raspberry Pi builds working on the X Server's testing infrastructure. I've previously talked about Travis CI, but for it to be really useful it needs to run our integration tests. I fixed up the piglit XTS test wrapper to not spuriously fail, made the X Test suite spuriously fail less, and worked with Adam Jackson at Red Hat to fix build issues in XTS. Finally, I wrote scripts that will, if you have an XTS tree and a piglit tree and build Xvfb, actually run the XTS rendering tests at xserver make check time.
Next steps for xserver testing are to test glamor in a similar fashion, and to integrate this testing into travis-ci and land the travis-ci branch.
Finally, I submitted pull requests to the upstream kernel. 4.8 got some fixes for VC4 3D (merged by Dave), and 4.9 got interlaced vblank timing patches from Mario Kleiner (not yet merged by Dave) and Raspberry Pi Zero support (merged by Florian).
|Monday, August 29th, 2016|
|This week in vc4 (2016-08-29): derivatives, GPU hangs, testing, kernel maintaining
I spent a day or so last week cleaning up @jonasarrow's demo patch for derivatives on vc4. It had been hanging around on the github issue waiting for a rework due to feedback, and I decided to finally just go and do it. It unfortunately involved totally rewriting their patches (which I dislike doing, it's always more awesome to have the original submitter get credit), but we now have dFdx()/dFdy() on Mesa master.
I also landed a fix for GPU hangs with 16 vertex attributes (4 or more vec4s, aka glsl-routing in piglit). I'd been debugging this one for a while, and finally came up with an idea ("what if this FIFO here is a bad idea to use and we should be synchronous with this external unit?"), it worked, and a hardware developer confirmed that the fix was correct. This one got a huge explanation comment. I also fixed discards inside of if/loop statements -- generally discards get lowered out of ifs, but if one was in a non-unrolled loop we were executing the discard without checking whether the channel was still in the loop.
Thanks to review from Rhys, I landed Mesa's Travis build fixes. Rhys then used Travis to test out a couple of fixes to i915. This is pretty cool, but it just makes me really want to get piglit into Travis so that we can get some actual integration testing in this process.
I got xserver's Travis to the point of running the unit tests, and one of them crashes on CI but not locally. That's interesting.
The last GPU hang I have in piglit is in glsl-vs-loops. This week I figured out what's going on, and I hope I'll be able to write about a fix next week.
Finally, I landed Stefan Wahren's Raspberry Pi Zero devicetree for upstream. If nothing goes wrong, the Zero should be supported in 4.9.
|Monday, August 22nd, 2016|
|vc4 status update for 2016-08-22: camera, NIR, testing
Last week I finally plugged in the camera module I got a while ago to go take a look at what vc4 needs for displaying camera output.
The surprising answer was "nothing." vc4 could successfully import RGB dmabufs and display them as planes, even though I had been expecting to need fixes on that front.
However, the bcm2835 v4l camera driver needs a lot of work. First of all, it doesn't use the proper contiguous memory support in v4l (vb2-dma-contig), and instead asks the firmware to copy from the firmware's contiguous memory into vmalloced kernel memory. This wastes memory and wastes memory bandwidth, and doesn't give us dma-buf support.
Even more, MMAL (the v4l equivalent that the firmware exposes for driving the hardware) wants to output planar buffers with specific padding. However, instead of using the multi-plane format support in v4l to expose buffers with that padding, the bcm2835 driver asks the firmware to do another copy from the firmware's planar layout into the old no-padding V4L planar format.
As a user of the V4L api, you're also in trouble because none of these formats have any priority information that I can see: The camera driver says it's equally happy to give you RGB or planar, even though RGB costs an extra copy. I think properly done today, the camera driver would be exposing multi-plane planar YUV, and giving you a mem2mem adapter that could use MMAL calls to turn the planar YUV into RGB.
For now, I've updated the bug report with links to the demo code and instructions.
I also spent a little bit of time last week finishing off the series to use st/nir in vc4. I managed to get to no regressions, and landed it today. It doesn't eliminate TGSI, but it does mean TGSI is gone from the normal GLSL path.
Finally, I got inspired to do some work on testing. I've been doing some free time work on servo, Mozilla's Rust-based web browser, and their development environment has been a delight as a new developer. All patch submissions, from core developers or from newbies, go through github pull requests. When you generate a PR, Travis builds and runs the unit tests on the PR. Then a core developer reviews the code by adding an "r" comment in the PR or provides feedback. Once it's reviewed, a bot picks up the pull request, tries merging it to master, then runs the full integration test suite on it. If the test suite passes, the bot merges it to master, otherwise the bot writes a comment with a link to the build/test logs.
Compare this to Mesa's development process. You make a patch. You file it in the issue tracker and it gets utterly ignored. You complain, and someone tells you you got the process wrong, so you join the mailing list and send your patch (and then get a flood of email until you unsubscribe). It gets mangled by your email client, and you get told to use git-send-email, so you screw around with that for a while before you get an email that will actually show up in people's inboxes. Then someone reviews it (hopefully) before it scrolls off the end of their inbox, and then it doesn't get committed anyway because your name was familiar enough that the reviewer thought maybe you had commit access. Or they do land your patch, and it turns out you hadn't run the integration tests, and then people complain at you for not testing.
So, as a first step toward making a process like Mozilla's possible, I put some time into fixing up Travis on Mesa, and building Travis support for the X Server. If I can get Travis to run piglit and ensure that expected-pass tests don't regress, that at least gives us a documentable path for new developers in these two projects to put their code up on github and get automated testing of the branches they're proposing on the mailing lists.
|Monday, August 15th, 2016|
|vc4 status update for 2016-08-15: DSI panel, Raspbian updates, and docs
Last week I mostly worked on getting the upstream work I and others have done into downstream Raspbian (most of that time unfortunately in setting up another Raspbian development environment, after yet another SD card failed).
However, the most exciting thing for most users is that with the merge of the rpi-4.4.y-dsi-stub-squash branch, the DSI display should now come up by default with the open source driver. This is unfortunately not a full upstreamable DSI driver, because the closed-source firmware is getting in the way of Linux by stealing our interrupts and then talking to the hardware behind our backs. To work around the firmware, I never talk to the DSI hardware, and we just replace the HVS display plane configuration on the DSI's output pipe. This means your display backlight is always on and the DSI link is always running, but better that than no display.
I also transferred the wiki I had made for VC4 over to github. In doing so, I was pleasantly surprised at how much documentation I wanted to write once I got off of the awful wiki software at freedesktop. You can find more information on VC4 at my mesa wiki.
(Side note, wikis on github are interesting. When you make your fork, you inherit the wiki of whoever you fork from, and you can do PRs back to their wiki similarly to how you would for the main repo. So my linux tree has Raspberry Pi's wiki too, and I'm wondering if I want to move all of my wiki over to their tree. I'm not sure.)
Is there anything that people think should be documented for the vc4 project that isn't there?
|Monday, August 8th, 2016|
|vc4 status update for 2016-08-08: cutting memory usage
Last week's project for vc4 was to take a look at memory usage. Eben had expressed concern that the new driver stack would use more memory than the closed stack, and so I figured I would spend a little while working on that.
I first pulled out valgrind's massif tool on piglit's glsl-algebraic-add-add-1.shader_test. This works as a minimum "how much memory does it take to render *anything* with this driver?" test. We were consuming 1605k of heap at the peak, and there were some obvious fixes to be made.
First, the gallium state_tracker was allocating 659kb of space at context creation so that it could bytecode-interpret TGSI if needed for glRasterPos() and glRenderMode(GL_FEEDBACK). Given that nobody should ever use those features, and luckily they rarely do, I delayed the allocation of the somewhat misleadingly-named "draw" context until the fallbacks were needed.
Second, Mesa was allocating the memory for the GL 1.x matrix stacks up front at context creation. We advertise 32 matrices for modelview/projection, 10 per texture unit (32 of those), and 4 for programs. I instead implemented a typical doubling array reallocation scheme for storing the matrices, so that only the top matrix per stack is allocated at context creation. This saved 63kb of dirty memory per context.
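The storage scheme is the standard lazily-grown doubling array; a simplified sketch (MAX2 is Mesa's max macro, init_matrix a hypothetical initializer):

    /* Grow the stack only when a glPushMatrix() goes past what we've
     * allocated so far, doubling each time to keep realloc rare. */
    static void ensure_stack_space(struct matrix_stack *stack, unsigned depth)
    {
        unsigned i;

        if (depth <= stack->alloced)
            return;

        stack->alloced = MAX2(2 * stack->alloced, depth);
        stack->entries = realloc(stack->entries,
                                 stack->alloced * sizeof(*stack->entries));
        for (i = stack->depth + 1; i < stack->alloced; i++)
            init_matrix(&stack->entries[i]);
    }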
722KB for these two fixes may not seem like a whole lot of memory to readers on fancy desktop hardware with 8GB of RAM, but the Raspberry Pi has only 1GB of RAM, and when you exhaust that you're swapping to an SD card. You should also expect a desktop to have several GL contexts created: the X Server uses one to do its rendering, you have a GL-based compositor with its own context, and your web browser and LibreOffice may each have one or more. Additionally, trying to complete our piglit testsuite on the Raspberry Pi is currently taking me 6.5 hours (when it even succeeds and doesn't see piglit's python runner get shot by the OOM killer), so I could use any help I can get in reducing context initialization time.
However, malloc()-based memory isn't all that's involved. The GPU buffer objects that get allocated don't get counted by massif in my analysis above. To try to approximately fix this, I added in valgrind macro calls to mark the mmap()ed space in a buffer object as being a malloc-like operation until the point that the BO is freed. This doesn't get at allocations for things like the window-system renderbuffers or the kernel's overflow BO (valgrind requires that you have a pointer involved to report it to massif), but it does help.
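The client requests in question are valgrind's malloc-like block macros; the usage amounts to this sketch:

    #include <valgrind/valgrind.h>

    /* After mmap()ing the BO: account for it as if it were a malloc
     * (no redzones, contents not zeroed). */
    if (RUNNING_ON_VALGRIND)
        VALGRIND_MALLOCLIKE_BLOCK(bo->map, bo->size, 0, 0);

    /* ...and balance it when the BO is freed: */
    if (RUNNING_ON_VALGRIND)
        VALGRIND_FREELIKE_BLOCK(bo->map, 0);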
Once I had massif reporting more, I noticed that glmark2 -b terrain was allocating a *lot* of memory for shader BOs. Going through them, an obvious problem was that we were generating a lot of shaders for glGenerateMipmap(). A few weeks ago I improved performance on the benchmark by fixing glGenerateMipmap()'s fallback blits that we were doing because vc4 doesn't support the GL_TEXTURE_BASE_LEVEL that the gallium aux code uses. I had fixed the fallback by making the shader do an explicit-LOD lookup of the base level if GL_TEXTURE_BASE_LEVEL == GL_TEXTURE_MAX_LEVEL. However, in the process I made the shader depend on that base level, so we would compile a new shader variant per level of the texture. The fix was to make the base level into a uniform value that's uploaded per draw call, and with that change I dropped 572 shader variants from my shader-db results.
Reducing extra shaders was fun, so I set off on another project I had thought of before. VC4's vertex shader to fragment shader IO system is a bit unusual in that it's just a FIFO of floats (effectively), with none of these silly "vec4"s that permeate GLSL. Since I can take my inputs in any order, and more flexibility in the FS means avoiding register allocation failures sometimes, I have the FS compiler tell the VS what order it would like its inputs in. However, the list of all the inputs in their arbitrary orders would be expensive to hash at draw time, so I had just been using the identity of the compiled fragment shader variant in the VS and CS's key to decide when to recompile it in case output order changed. The trick was that, while the set of all possible orders is huge, the number that any particular application will use is quite small. I take the FS's input order array, keep it in a set, and use the pointer to the data in the set as the key. This cut 712 shaders from shader-db.
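A sketch of the interning trick using Mesa's util/set (treat the details as approximate; the set's equality callback is assumed to compare array contents):

    /* Return a canonical pointer for this input-order array: identical
     * contents always yield the same pointer, so the pointer itself can
     * serve as the cheap compile-key field. */
    static const void *intern_input_order(struct set *set,
                                          const void *order, size_t size)
    {
        uint32_t hash = hash_bytes(order, size); /* e.g. _mesa_hash_data */
        struct set_entry *entry =
            _mesa_set_search_pre_hashed(set, hash, order);

        if (!entry) {
            void *copy = malloc(size);
            memcpy(copy, order, size);
            entry = _mesa_set_add_pre_hashed(set, hash, copy);
        }
        return entry->key;
    }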
Also, embarrassingly, remember when I mentioned tracking the FS in the CS's key above? Coordinate shaders don't output anything to the fragment shader. Like the name says, they just generate coordinates, which get consumed by the binner. So, by removing the FS from the CS key, I trivially cut 754 shaders from shader-db. Between the two, piglit's gl-1.0-blend-func test now passes instead of OOMing, so we get test coverage on blending.
Relatedly, while working on fixing a kernel oops recently, I had noticed that we were still reallocating the overflow BO on every draw call. This was old debug code from when I was first figuring out how overflow worked. Since each client can have up to 5 outstanding jobs (limited by Mesa) and each job was allocating a 256KB BO, we could be saving a MB or so per client assuming they weren't using much of their overflow (likely true for the X Server). The solution, now that I understand the overflow system better, was just to not reallocate and let the new job fill out the previous overflow area.
Other projects for the week that I won't expand on here: Debugging GPU hang in piglit glsl-routing (generated fixes for vc4-gpu-tools parser, tried writing a GFXH30 workaround patch, still not fixed) and working on supporting direct GLSL IR to NIR translation (lots of cleanups, a couple fixes, patches on the Mesa list).
|Thursday, March 3rd, 2016|
|VC4/RPi3 status update
It's been a busy month. I spent most of it working on the Raspberry Pi 3 support so I could have a working branch for upstream day 1. That involved cleaning up the SDHOST driver for submission, cleaning up pinctrl DT, writing an I2C GPIO expander driver, debugging the I2C controller, fixing HDMI hotplug handling, debugging EMMC (not quite done!), scraping together some wireless firmware, and a bunch of work trying to get BT working on the UART. I'm happy to say that on day 1 I published a branch that worked the same as a RPi2, and by the end of the day I had wireless working. Some of the patches are now out for review, and I'll be working on cleaning up the rest in the near future.
For VC4, my big push recently has been to support some sort of panel. Panels are really big with Raspberry Pi users, and it's the primary complaint I hear about the open driver. The official 7" DSI touchscreen seems like the most promising device to support, since it doesn't hog all your GPIOs (letting me use my serial console) and it's quite popular.
Unfortunately, DSI isn't going well. The DSI0 peripheral is responding to me, but while I can read from DSI1 it won't respond to any writes. DSI1 is, unfortunately, the one that the RPi exposes on its DSI connector. (Note: this debugging is with the panel up and running from the firmware's boot configuration). Debug code is at drm-vc4-dsi-boot
So, since DSI1's not cooperating, I switched tasks. I had also picked up a DPI panel using the Adafruit Kippah, and a little SPI-driven panel. I hadn't started with DPI because hogging all the GPIOs makes kernel debugging a mostly black box experience. The upside is that DPI is crazy simple -- set the GPIO muxes to output from DPI, set one register in DPI, and use the same pixelvalve setup from before. I was surprised when 2 days in I got display output. Here it is, running HDMI and DPI at the same time:
Expect patches soon on a mailing list near you. Until then, it's at drm-vc4-dpi-boot
|Monday, December 14th, 2015|
|vc4 status update 2015-12-14
Big news for the VC4 project today:

commit 21de54b3c4d08d2b20e80876c6def0b421dfec2e
Merge: 870a171 2146136
Author: Dave Airlie
Date: Tue Dec 15 10:43:27 2015 +1000

    Merge tag 'drm-vc4-next-2015-12-11' of http://github.com/anholt/linux into drm-next
This is the last step for getting the VC4 driver upstream: Dave's pulled my code for inclusion in kernel 4.5 (probably to be released around mid-March). The ABI is now stable, so I'm working on getting that ABI usage into the Mesa 11.1 release. Hopefully I'll land that in the next couple of days.
As far as using it out of the box, we're not there yet. I've been getting my code included in some builds for the Raspberry Pi Foundation developers. They've been working on switching to kernel 4.2, and their tree has VC4 support up to the previous ABI. Once the Mesa 11.1 merge happens, I'll ask them to pull the new kernel ABI and rebuild userspace using Mesa 11.1-rc4. Hopefully this then appears in the next Raspbian Jessie build they produce. Until that release happens, there are instructions for the development environment on the DRI wiki, and I'd recommend trying out the continuous integration builds linked from there.
The Raspberry Pi folks aren't ready to swap everyone over to the vc4 driver quite yet, though. They want to make sure we don't regress functionality, obviously, and there are some big chunks of work left to do: HDMI audio support, video overlays, and integration of the vc4 driver with the camera and video decode support come to mind. And then there's the matter of making sure that 3D performance doesn't suffer. That's a bit hard to test, since only a few apps work with the existing GLES2 support, while the vc4 driver gives GLX, EGL-on-X11, EGL-on-gbm, most of GL2.1, and all of GLES2, but doesn't support the EGL-on-Dispmanx interface that the previous driver used. So, until then, they're going to have a devicetree overlay that determines whether the firmware sets itself up for Linux using the vc4 driver or the closed driver.
Part of what's taken so long to get to this point has been trying to get my dependencies merged to the kernel. To turn on V3D, I need to turn on power, which means a power domain driver. Turning on the power required talking to the firmware, which required resurrecting an old patchset for talking to the firmware, which got bikeshedded harder than I've ever had happen to my code before. Programming video modes required a clock driver. Every step of the way is misery to get the code merged, and I would give up a lot to never work on the Linux kernel again.
Until then, though, I've become a Raspberry Pi kernel maintainer, so that I can ack other people's patches and help shepherd them into the kernel. Hopefully for 4.5 I can get the aux clock driver bikeshedding dealt with and resubmit it, at which point people can use UART1 and SPI1/2. I have a third rework to do of my power domain driver so that, if we're lucky, we can get it merged and actually turn on the 3D core (and manage power of many other devices, too!). Martin Sperl is doing a major rewrite of the SPI driver (an area I know basically nothing about), and his recent patch split may deal with the subsystem maintainer's concerns. I want to pull in feedback and merge Lubomir's thermal driver. There's also a cpufreq driver (for actually doing the overclocking you can set with config.txt) from Lubomir, which I expect to be harder to deal with the feedback on.
So, while I've been quiet on the blogging front, there's been a lot going on for vc4, and it's in pretty good shape now. Hopefully more folks can give it a try as it becomes more upstreamed and accessible.
|Sunday, October 19th, 2014|
|VC4 driver status update
I've just spent another week hanging out with my Broadcom and Raspberry Pi teammates, and it's unblocked a lot of my work.
Notably, I learned some unstated rules about how loading and storing from the tilebuffer work, which has significantly improved stability on the Pi (as opposed to simulation, which only asserted about following half of these rules).
I got an intro on the debug process for GPU hangs, which ultimately just looks like "run it through simpenrose (the simulator) directly. If that doesn't catch the problem, you capture a .CLIF file of all the buffers involved and feed it into RTL simulation, at which point you can confirm for yourself that yes, it's hanging, and then you hand it to somebody who understands the RTL and they tell you what the deal is." There's also the opportunity to use JTAG to look at the GPU's perspective of memory, which might be useful for some classes of problems. I've started on .CLIF generation (currently simulation-environment-only), but I've got some bugs in my generated files because I'm using packets that the .CLIF generator wasn't prepared for.
I got an overview of the cache hierarchy, which pointed out that I wasn't flushing the ARM dcache to get my writes out into system L2 (more like an L3) so that the GPU could see it. This should also improve stability, since before we were only getting lucky that the GPU would actually see our command stream.
Most importantly, I ended up fixing a mistake in my attempt at reset using the mailbox commands, and now I've got working reset. Testing cycles for GPU hangs have dropped from about 5 minutes to 2-30 seconds. Between working reset and improved stability from loads/stores, we're at the point that X is almost stable. I can now run piglit on actual hardware! (it takes hours, though)
On the X front, the modesetting driver is now merged to the X Server with glamor-based X rendering acceleration. It also happens to support DRI3 buffer passing, but not Present's pageflipping/vblank synchronization. I've submitted a patch series for DRI2 support with vblank synchronization (again, no pageflipping), which will get us more complete GLX extension support, including things like GLX_INTEL_swap_event that gnome-shell really wants.
In other news, I've been talking to a developer at Raspberry Pi who's building the KMS support. Combined with the discussions with keithp and ajax last week about compositing inside the X Server, I think we've got a pretty solid plan for what we want our display stack to look like, so that we can get GL swaps and video presentation into HVS planes, and avoid copies on our very bandwidth-limited hardware. Baby steps first, though -- he's still working on putting giant piles of clock management code into the kernel module so we can even turn on the GPU and displays on our own without using the firmware blob.
- 93.8% passrate on piglit on simulation
- 86.3% passrate on piglit gpu.py on Raspberry Pi
All those opcodes I mentioned in the previous post are now completed -- sadly, I didn't get people up to speed fast enough to contribute before those projects were the biggest things holding back the passrate. I've started a page at http://dri.freedesktop.org/wiki/VC4/ for documenting the setup process and status.
And now, next steps. Now that I've got GPU reset, a high priority is switching to interrupt-based render job tracking and putting an actual command queue in the kernel so we can have multiple GPU jobs queued up by userland at the same time (the VC4 sadly has no ringbuffer like other GPUs have). Then I need to clean up user <-> kernel ABI so that I can start pushing my linux code upstream, and probably work on building userspace BO caching.
|Friday, August 29th, 2014|
|helping out with VC4
I've had a couple of questions about whether there's a way for others to contribute to the VC4 driver project. There is! I haven't posted about it before because things aren't as ready as I'd like for others to do development (it has a tendency to lock up, and the X implementation isn't really ready yet so you don't get to see your results), but that shouldn't actually stop anyone.
To get your environment set up, build the kernel (https://github.com/anholt/linux.git vc4 branch), Mesa (git://anongit.freedesktop.org/mesa/mesa), and piglit (git://anongit.freedesktop.org/git/piglit). For working on the Pi, I highly recommend having a serial cable and doing NFS root so that you don't have to write things to slow, unreliable SD cards.
You can run an existing piglit test that should work, to check your environment: env PIGLIT_PLATFORM=gbm VC4_DEBUG=qir ./bin/shader_runner tests/shaders/glsl-algebraic-add-add-1.shader_test -auto -fbo -- you should see a dump of the IR for this shader, and a pass report. The kernel will make some noise about how it's rendered a frame.
Now the actual work: I've left some of the TGSI opcodes unfinished (SCS, DST, DPH, and XPD, for example), so the driver just aborts when a shader tries to use them. How they work is described in src/gallium/docs/source/tgsi.rst. The TGSI-to-QIR code is in vc4_program.c (where you'll find all the opcodes that are implemented currently), and vc4_qir.h has all the opcodes that are available to you and helpers for generating them. Once it's in QIR (which I think should have all the opcodes you need for this work), vc4_qpu_emit.c will turn the QIR into actual QPU code like you find described in the chip specs.
You can dump the shaders being generated by the driver using VC4_DEBUG=tgsi,qir,qpu in the environment (that gets you 3/4 stages of code dumped -- at times you might want some subset of that just to quiet things down).
Since we've still got a lot of GPU hangs, and I don't have reset working, you can't even complete a piglit run to find all the problems or to test whether your changes are good. What I can offer currently is that running PIGLIT_PLATFORM=gbm VC4_DEBUG=norast ./piglit-run.py tests/quick.py results/vc4-norast; piglit-summary-html.py --overwrite summary/mysum results/vc4-norast will get you a list of all the tests (which mostly failed, since we didn't render anything), some of which will have assertion failures. Once you know which tests assertion-fail on the opcode you worked on, you can run them manually, like PIGLIT_PLATFORM=gbm /home/anholt/src/piglit/bin/shader_runner /home/anholt/src/piglit/generated_tests/spec/glsl-1.10/execution/built-in-functions/vs-asin-vec4.shader_test -auto (copy-and-pasted from the results) or PIGLIT_PLATFORM=gbm PIGLIT_TEST="XPD test 2 (same src and dst arg)" ./bin/glean -o -v -v -v -t +vertProg1 --quick (also copy-and-pasted from the results, but note that you need the other env var for glean to pick out the subtest to run).
Other things you might want eventually: I do my development using cross-builds instead of on the Pi, install to a prefix in my homedir, then rsync that into my NFS root and use LD_LIBRARY_PATH on the Pi to point my tests at the driver in the homedir prefix. Cross-builds were a *huge* pain to set up (debian's multiarch doesn't ship the .so symlink with the library, and the -dev packages that do install them don't install simultaneously for multiple arches), but it's worth it in the end. If you look into cross-building, what I'm using is rpi-tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64/bin/arm-linux-gnueabihf-gcc, and you'll want --enable-malloc0returnsnull if you cross-build a bunch of X-related packages.