All the code's in master, and available to test. The usual warnings about master apply, and just to show how serious we are about our in-development code, among the known issues are:
Can't do UXA (and therefore DRI2) with tiling.
It's kind of hard to teach X about tiled buffers. NVIDIA went with wrapping all pixel access from libfb to produce libwfb. Our plan is to return to GTT-mapped access of buffers in the X Server so we don't have to teach it about tiling at all -- we keep the nice write-combined write performance we're used to with framebuffer access, though we also keep the painful read performance we're used to whenever a fallback reads (see also: Render gradients and convolutions).
2D performance has tanked when you use UXA.
If you fall back to front buffer rendering in the X Server, GEM goes wild with cache flushing, because the X Server isn't telling it which memory it actually touched, yet we're telling GEM that the results need to land in the front buffer "soon". This means that compositing is fast, but non-composited isn't if you run emacs or gitk or anything else that causes fallbacks. There are a couple of potential fixes for this, but the current plan is to avoid it by using GTT mapping, which resolves the coherency issue.
Long term, we're going to be adding support for telling GEM what pages we touched after the fact (or maybe use fault-based clflushing) for OpenGL, and at that point we may reconsider how X maps its buffers.
3D performance has tanked in some cases
Haven't debugged this one, and it's next on the list. Can't reproduce on the test systems here.
Applications hang on to a lot of buffer objects
In TTM we would allocate giant buffer objects and then suballocate out of them, because allocation performance was so slow. This meant that userland had to track a lot of fencing issues (and the API had to expose the idea of a fence) so that it could know which pieces were still in use.
We went with a simpler model for GEM, where userland caches buffer objects of similar size and reuses them once the kernel says they're no longer in use by the GPU. By returning "freed" buffers to the cache, we get wonderful performance when an app is running flat out, allocating and freeing buffers like mad. However, as we think about cairo moving to a GL backend, with all your apps sitting idle waiting for input while hanging on to these cached buffers, the memory usage is likely to become an issue. The buffers are pageable, but paging is slow and we're looking at systems with limited memory+swap anyway. Instead, we need to free buffers when they're not serving any purpose in the cache. One idea we're considering: at allocate time, when a cached buffer is "really old" (seconds), actually free it rather than reuse it.
That still leaves some excess memory lying around when an app does some rendering and then stops for input. The long-term solution would be for userland to tell the kernel when it isn't going to actively use a buffer and the contents can be thrown out. Then, in the memory pressure callback, the kernel can throw out cached buffers and nuke their mappings, and userland allocates new ones when it needs them.