Tuesday, February 24th, 2009

Things have been pretty quiet on the Intel graphics driver front for a while. After all the churn of FBOs, GEM, DRI2, and KMS, we've been settling down to seriously stabilizing the driver. Things are looking pretty good -- KMS is now suspending/resuming, resizing framebuffers, supporting rotation, and generally looking relatively correct on the machines we're testing on. A number of DRI2 performance regressions are fixed. We've fixed some tricky little bugs in GEM cache handling. Big thanks go out to the RH guys (airlied and krh) for pushing so hard on this code and fixing a bunch of the sharp edges instead of just filing bugs, and to ickle for reviewing and fixing piles of error path problems in the DRM. I love working with this community.

We're now ramping up for the Q1 release, which means we're switching from what has been nearly 100% time on bug and regression fixing to 100% bug and regression fixing. I just pushed out xf86-video-intel 2.6.2 to get some of our stable fixes into the world so people have a better base for testing KMS. By the end of March we should have a new release out with current 2D, a new Mesa tarball with the latest fixes since 7.4, and hopefully a kernel 2.6.29 with KMS.

In my spare time I've been working on a GL backend for cairo. My assertion for some time has been that anything we're doing with accelerating Render, we could do with less developer time and faster runtime with GL. It took about 2 weeks of me flailing around learning how to do GL, and I got cairogears up and running at about 4 times the speed of Render, on various 965s.

The code's still pretty rough -- it doesn't check for extensions it needs, relies on NPOT textures when it could use rectangular textures, doesn't do shaders for gradients, etc. But actually most of the time on the code has been spent fixing our 3D driver. There were some texture upload issues, BO cache explosion issues, drawpixels issues, DRI2 viewport overhead issues, and my current one is that the latest commit in cairo-gl ends up hanging both 915s and 965s after a while in cairogears.

I'm hoping some other people interested in this stuff might feel like picking it up and playing with it. It's not terribly complicated code, I don't think, there are other backends to compare to, and cairo's test suite is wonderful for validating what you're trying out. The code is in:

git://people.freedesktop.org/~anholt/cairo on the gl branch
git://people.freedesktop.org/~anholt/cairogears on the gl branch

Tuesday, October 21st, 2008

doing it right

Things are looking up in the Intel driver. The GEM work has landed in Linus's master, meaning that at this point we can worry about just fixing our bugs, and know that it's going to end up in 2.6.28.

But what gets me even more excited than getting our first major kernel merge done is that we're starting to create a culture of review surrounding our driver. Keith started by insisting on posting patches instead of git trees, which meant that I read his patches and found issues before they were queued for upstream. I started posting patches in retaliation, and he kept NAKing them for needing improvements (which I've usually got around to) or being obsoleted by better work he did. Now we've got the rest of the team joining the party. While it has slowed things down a bit, it's also nice -- the internal TODO list of "someone committed some junk and I need to go fix it up when I get some time" is no longer growing out of control, and stability is definitely increasing.

The majority of the improvements in the last week have been in vblank handling. The vblank-rework changes had gone through a series of QA cycles before I merged them, but there were some issues that cropped up in integrating them with the GEM changes, and there were some plain old bugs we found when we started trying to use them on our desktops. We also fixed some VT switching bugs that would have hurt suspend/resume, and a couple more G4X issues.

My highest priority now is sorting out our failures with batchbuffers being too large. Our testcases up until now have had texture load below the size of the aperture. However, with more interesting apps like sauerbraten or Virtual Forbidden City, we're running into a problem where we accumulate a batchbuffer for execution that can't be loaded into the aperture all at once, and you get an error message and no rendering. Dave Airlie had fixed this in classic mode with his check_aperture changes, but we never did them for GEM, since we figured we'd have the whole aperture available instead of just 32MB. But when a single mipmapped texture is 24MB, we run out of aperture quickly anyway. There are a few things we're planning on doing to resolve the problem:

  • Implement check_aperture on GEM so that we can flush the batch when it's likely going to be too big for the aperture. This will avoid having rendering dropped on the floor for being too big.
  • Fix counting of aperture space consumed in check_aperture. Right now we just sum up all the sizes of buffers targeted by relocations, but if you keep referencing the same big buffer over and over you'll flush too often.
  • Implement PPGTT support for new chipsets. This gives us a 2GB virtual address space for your batchbuffers to play in instead of 256 or 512MB. I'm trying to avoid saying "that'll be enough for anybody", but it'll certainly make a lot of apps happier. This'll be a bit of work, as we don't have a PCI aperture mapping to that address space, so we can't use some of our old tricks the same way. However, it should be a pretty significant win on any serious 3D workload.
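
To make the second item concrete, here's a rough sketch of the difference between summing relocation targets and counting each buffer once. This is not the libdrm or Mesa code: `sketch_bo`, `sketch_batch`, `batch_reference` and friends are names made up for illustration, and the fixed-size array and linear search are simplifying assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BATCH_BUFFERS 64

/* Hypothetical stand-in for a buffer object; not the real libdrm type. */
struct sketch_bo {
    uint32_t handle;
    size_t size;
};

/* Tracks the unique buffers referenced by the batch being built. */
struct sketch_batch {
    const struct sketch_bo *bos[MAX_BATCH_BUFFERS];
    size_t count;
    size_t aperture_bytes; /* sum of the sizes of the unique buffers */
};

/* Naive accounting: add the target's size once per relocation. */
static size_t naive_aperture_estimate(const struct sketch_bo *const *targets,
                                      size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += targets[i]->size;
    return total;
}

/*
 * Reference a buffer from the batch, counting it only the first time.
 * Returns false if the buffer does not fit under aperture_limit, in
 * which case the caller would flush the batch and try again.
 */
static bool batch_reference(struct sketch_batch *batch,
                            const struct sketch_bo *bo,
                            size_t aperture_limit)
{
    for (size_t i = 0; i < batch->count; i++)
        if (batch->bos[i] == bo)
            return true; /* already counted, costs nothing more */

    if (batch->count == MAX_BATCH_BUFFERS ||
        batch->aperture_bytes > aperture_limit ||
        bo->size > aperture_limit - batch->aperture_bytes)
        return false;

    batch->bos[batch->count++] = bo;
    batch->aperture_bytes += bo->size;
    return true;
}

/* Called after a flush: the next batch starts with nothing referenced. */
static void batch_reset(struct sketch_batch *batch)
{
    batch->count = 0;
    batch->aperture_bytes = 0;
}
```

With a 24MB texture referenced three times plus a 1MB vertex buffer, the naive sum comes to 73MB and would force a flush against a 32MB limit, while the per-buffer count is 25MB and fits.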


We're also looking into getting an appropriate API into the kernel for our transient mappings of the aperture. One of the sticking points in getting GEM merged was a hack we were doing with the kernel mapping APIs on CONFIG_HIGHMEM x86. By actually sitting down and writing the API we need, we should get improved performance in PAT non-MTRR environments (like the G[M]45 where we don't get an MTRR slot), and improved performance on 64-bit where we couldn't do the CONFIG_HIGHMEM hack, for a 20% to 200% improvement depending on which case of failure occurred. By being in a kernel tree, we can submit a single patch to the kernel community showing the API we want and how we plan to use it, and get much better feedback than we've been able to in the past. In this case, Ingo came up with fun ideas for how we can get the advantages of our atomic mapping path on CONFIG_HIGHMEM x86 without the scheduling restrictions of actually being atomic, which would be awfully convenient for one of our code paths.

For those looking to run the latest hotness, here's the list:
kernel: drm-intel-next in my tree (still has some bugfixes to be merged)
libdrm: 2.4.0 (nothing major has happened since then)
xf86-video-intel: 2.5.0 (nothing major has happened since then)
dri2proto master
inputproto master
xserver: master (be sure to update your input drivers too!)
mesa: master

That's 2 things with version numbers on them compared to 2 weeks ago. The X server would be in the list too, but we missed that the glyph cache didn't make it into the 1.5 series, which is critical for 2D performance.

Friday, September 12th, 2008

taste of success

As part of the work we're doing for Moblin, I got our head trees in shape for GL compositing on GEM. We had a nice demo of the technology at XDS from krh, and today I got the same setup working: glxgears and totem on a cube all playing nicely together. It feels like things are finally coming together, and we'll be ready to support this stuff in releases soon, even if it won't be at the end of this release cycle.

All the code's in master, and available to test. The usual warnings about master apply, and just to show how serious we are about our in-development code, among the known issues are:

Can't do UXA (and therefore DRI2) with tiling.
It's kind of hard to teach X about tiled buffers. NVIDIA went with wrapping all pixel access from libfb to produce libwfb. Our plan is to return to GTT-managed access of buffers in the X Server so we don't have to teach it about this -- we keep the nice write-combined performance we're used to with framebuffer access, though we also suffer from the painful read performance we're used to if we have fallbacks that read (see also: Render gradients and convolutions).

2D performance has tanked when you use UXA.
If you fall back on front buffer rendering with the X Server, GEM goes wild with cache flushing because the X Server isn't telling it just what memory it touched, yet we're telling GEM that the results need to land in the front buffer "soon". This means that compositing is fast, but non-composited isn't if you run emacs, gitk, or anything else that causes fallbacks. There are a couple of potential fixes for this, but the current plan is to avoid it using GTT mapping, which resolves the coherency issue.

Long term, we're going to be adding support for telling GEM what pages we touched after the fact (or maybe use fault-based clflushing) for OpenGL, and at that point we may reconsider how X maps its buffers.

3D performance has tanked in some cases
Haven't debugged this one, and it's next on the list. Can't reproduce on the test systems here.

Applications hang on to a lot of buffer objects
In TTM we would allocate giant buffer objects and then suballocate out of them, because allocation performance was so slow. This meant that userland had to pay attention to a lot of fencing issues (and you had to expose the idea of a fence) so that you could know which pieces were still used.

We went with a simpler model for GEM, where the userland caches buffer objects of similar size, and reuses them when the kernel tells us they're no longer used by the GPU. By returning "freed" buffers to the cache, we get wonderful performance when an app is running flat out and allocating and freeing buffers like mad. However, as we think about cairo moving to a GL backend, and all your apps sitting waiting for input hanging onto these cached buffers, the memory usage is likely to become an issue. They're pageable, but that's slow and we're looking at systems with limited memory+swap anyway. Instead, we need to free buffers when they're not serving any use in the cache. One approach we're thinking about: on allocate, when a cached buffer is "really old" (seconds), actually free it.
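
Here's a minimal sketch of that idea, a size-bucketed cache that throws out old entries whenever something allocates. It is not the libdrm buffer manager: the names (`bo_cache`, `cache_alloc`, `cache_release`, `cache_expire`), the power-of-two buckets, the 2-second threshold, and `malloc` standing in for real buffer objects are all assumptions. It also skips the "is the GPU still using this?" check and takes the current time as a parameter so the behavior is easy to follow.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_BUCKETS 8          /* 4 KiB, 8 KiB, ... 512 KiB (assumed) */
#define MIN_BUCKET_SIZE 4096u
#define MAX_IDLE_SECONDS 2     /* assumed threshold for "really old" */

struct cached_bo {
    struct cached_bo *next;
    size_t size;
    long freed_at;             /* when it was returned to the cache */
    void *storage;             /* malloc stand-in for a real buffer */
};

struct bo_cache {
    struct cached_bo *head[NUM_BUCKETS]; /* newest first in each bucket */
    size_t cached_bytes;
};

/* Smallest bucket that fits size, or -1 if it is too big to cache. */
static int bucket_for(size_t size)
{
    size_t bucket_size = MIN_BUCKET_SIZE;
    for (int i = 0; i < NUM_BUCKETS; i++, bucket_size <<= 1)
        if (size <= bucket_size)
            return i;
    return -1;
}

/* Really free every cached buffer idle for more than MAX_IDLE_SECONDS. */
static void cache_expire(struct bo_cache *cache, long now)
{
    for (int i = 0; i < NUM_BUCKETS; i++) {
        struct cached_bo **link = &cache->head[i];
        /* Lists are newest first, so skip ahead past the young entries. */
        while (*link && now - (*link)->freed_at <= MAX_IDLE_SECONDS)
            link = &(*link)->next;
        struct cached_bo *old = *link;
        *link = NULL;
        while (old) {
            struct cached_bo *next = old->next;
            cache->cached_bytes -= old->size;
            free(old->storage);
            free(old);
            old = next;
        }
    }
}

/* Allocate, reusing the most recently released buffer of that bucket. */
static struct cached_bo *cache_alloc(struct bo_cache *cache, size_t size,
                                     long now)
{
    int b = bucket_for(size);
    cache_expire(cache, now);
    if (b >= 0 && cache->head[b]) {
        struct cached_bo *bo = cache->head[b];
        cache->head[b] = bo->next;
        cache->cached_bytes -= bo->size;
        bo->next = NULL;
        return bo;
    }
    struct cached_bo *bo = malloc(sizeof(*bo));
    if (!bo)
        return NULL;
    bo->size = b >= 0 ? (size_t)MIN_BUCKET_SIZE << b : size;
    bo->storage = malloc(bo->size);
    if (!bo->storage) {
        free(bo);
        return NULL;
    }
    bo->next = NULL;
    bo->freed_at = 0;
    return bo;
}

/* "Free" a buffer by parking it in the cache for later reuse. */
static void cache_release(struct bo_cache *cache, struct cached_bo *bo,
                          long now)
{
    int b = bucket_for(bo->size);
    if (b < 0) {               /* too big to cache, free it for real */
        free(bo->storage);
        free(bo);
        return;
    }
    bo->freed_at = now;
    bo->next = cache->head[b];
    cache->head[b] = bo;
    cache->cached_bytes += bo->size;
}
```

Reusing the newest entry first means the older ones in a bucket get a chance to age out, but nothing is freed unless the app allocates again.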

That still leaves some excess memory lying around when an app does some rendering and then stops for input. The long-term solution would be for userland to tell the kernel when it wasn't going to actively use a buffer and the contents could be thrown out. Then, in the memory pressure callback from the kernel, we can throw out cached buffers and nuke mappings, and userland has to allocate new ones when it needs them.