Interactive Ray-Tracing Thoughts

Submitted Wed Mar 19 19:19:12 2008 under tags "ray-tracing, graphics, computers, technology"

Here's a message that I sent to Dan Collens and Lawrence Kesteloot regarding real-time ray-tracing. (Probably better called "interactive ray-tracing" since real-time more specifically means having deterministic timing behavior.)

Ray-tracing rates for fairly complicated scenes with a few bounces but only one ray per pixel and fairly simple shaders are running in the tens of frames per second on a single computer. With even more bounces and some shadow tests and more complicated shaders, scenes run interactively on PlayStation3 and multi-core systems.

Jim and I still work on our ray-tracers from time to time. To my knowledge, mine still beats his by 50% or so on a pixels-per-core-per-second basis. I think that mostly comes down to him using kd-trees and me using a two-level uniform grid. Jim also has a very comprehensive scene graph and object model, which may be adding overhead.
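
For anyone who hasn't written one, grid traversal is just a 3D DDA: walk the ray from cell to cell and test whatever triangles each cell holds. Here's a rough single-level sketch (not my actual code; the types and the per-cell intersection stub are placeholders, and the two-level version just recurses into a finer grid inside occupied coarse cells):

    // Rough sketch of single-level uniform grid traversal with a 3D DDA.
    // Grid, Ray, and intersectCell are placeholders, not real code.
    #include <cmath>

    struct Ray  { float ox, oy, oz, dx, dy, dz; };

    struct Grid {
        int   nx, ny, nz;          // cell counts per axis
        float minx, miny, minz;    // world-space origin of the grid
        float cell;                // cell size (cubical cells to keep this short)

        // Stub: a real version tests the triangles stored in cell (x,y,z).
        bool intersectCell(int, int, int, const Ray&, float&) const { return false; }
    };

    // Returns true and sets tHit if any cell along the ray reports a hit.
    // Assumes the ray origin is already inside the grid; zero direction
    // components rely on IEEE infinities to never win the "closest axis" test.
    bool traceGrid(const Grid& g, const Ray& r, float& tHit)
    {
        // Starting cell.
        int ix = (int)((r.ox - g.minx) / g.cell);
        int iy = (int)((r.oy - g.miny) / g.cell);
        int iz = (int)((r.oz - g.minz) / g.cell);

        // Step direction, distance along the ray between cell boundaries,
        // and distance to the first boundary, per axis.
        int   sx = r.dx >= 0 ? 1 : -1,  sy = r.dy >= 0 ? 1 : -1,  sz = r.dz >= 0 ? 1 : -1;
        float ddx = std::fabs(g.cell / r.dx);
        float ddy = std::fabs(g.cell / r.dy);
        float ddz = std::fabs(g.cell / r.dz);
        float tx = ((ix + (sx > 0)) * g.cell + g.minx - r.ox) / r.dx;
        float ty = ((iy + (sy > 0)) * g.cell + g.miny - r.oy) / r.dy;
        float tz = ((iz + (sz > 0)) * g.cell + g.minz - r.oz) / r.dz;

        while (ix >= 0 && ix < g.nx && iy >= 0 && iy < g.ny && iz >= 0 && iz < g.nz) {
            if (g.intersectCell(ix, iy, iz, r, tHit))
                return true;
            // Step into whichever neighboring cell the ray enters first.
            if (tx <= ty && tx <= tz) { ix += sx; tx += ddx; }
            else if (ty <= tz)        { iy += sy; ty += ddy; }
            else                      { iz += sz; tz += ddz; }
        }
        return false;   // walked out of the grid without hitting anything
    }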

Jim's ray-tracer can view a Prius model with 1.5 million primitives, with a single shadow sample and several bounces, at nearly 6 fps on an 8-core x86 (Mac Pro). By extrapolation, mine should run at 8 to 9 fps, but we haven't actually tried it. (Mine runs around 2.2 fps on a dual-core.) Some of Jim's pictures after minutes to hours of refinement can be seen at his Flickr account.

Nowadays people seem to concur that ray-tracing is largely memory-bandwidth-bound. The memory access patterns are hard to make coherent, both within the acceleration structure itself and because adjacent pixels can have widely diverging secondary and tertiary rays.

There seem to be two groups doing most of the real-time ray-tracing work these days: Stanford seems to be writing more of the papers about GPU ray-tracing, and Utah more of the papers about CPUs. I haven't researched this much, so I could be talking out of my butt.

The best GPU approach so far seems to be a kd-tree. Because fragment programs don't have unlimited branching and recursion, the kd-tree traversal is implemented iteratively: a short fixed-size stack handles backing up from a leaf to the next subtree to visit, and when that stack runs out, the traversal restarts from the root and descends to the leaves again.
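
Roughly, the short-stack idea looks like the following. This is just an illustrative sketch, not any particular paper's GPU kernel, and the node layout and leaf stub are placeholders:

    // Descend the kd-tree iteratively, remember postponed far children in a
    // tiny fixed-size stack, and when an entry had to be thrown away, fall
    // back to restarting from the root with the t-interval advanced past
    // everything already visited.
    struct Ray { float o[3], d[3], invd[3]; };   // origin, direction, reciprocal direction

    struct KdNode {
        bool  leaf;
        int   axis;                // split axis (interior nodes)
        float split;               // split position
        const KdNode* child[2];    // child[0] below the plane, child[1] above

        // Stub: a real version tests the leaf's triangles against [tMin, tMax].
        bool intersectLeaf(const Ray&, float, float, float&) const { return false; }
    };

    static const int SHORT_STACK = 8;   // a handful of entries, sized to fit fast memory

    bool traverse(const KdNode* root, const Ray& r,
                  float sceneTMin, float sceneTMax, float& tHit)
    {
        struct Entry { const KdNode* node; float tMin, tMax; };
        Entry stack[SHORT_STACK];
        int   top = 0;
        bool  dropped = false;           // true if the stack ever overflowed

        const KdNode* node = root;
        float tMin = sceneTMin, tMax = sceneTMax;

        for (;;) {
            while (!node->leaf) {
                // Distance along the ray to the splitting plane.
                float tSplit = (node->split - r.o[node->axis]) * r.invd[node->axis];
                int   nearIdx = r.d[node->axis] >= 0 ? 0 : 1;
                const KdNode* nearChild = node->child[nearIdx];
                const KdNode* farChild  = node->child[1 - nearIdx];

                if (tSplit >= tMax)      node = nearChild;   // only the near side matters
                else if (tSplit <= tMin) node = farChild;    // only the far side matters
                else {
                    // Both sides matter: visit near now, remember far for later.
                    if (top == SHORT_STACK) {
                        // Full: discard the oldest (farthest) entry; a later
                        // restart from the root will find that region again.
                        for (int i = 1; i < SHORT_STACK; ++i) stack[i - 1] = stack[i];
                        --top;
                        dropped = true;
                    }
                    stack[top].node = farChild;
                    stack[top].tMin = tSplit;
                    stack[top].tMax = tMax;
                    ++top;
                    node = nearChild;
                    tMax = tSplit;
                }
            }

            if (node->intersectLeaf(r, tMin, tMax, tHit))
                return true;

            if (top > 0) {
                // Fast path: pop the nearest postponed subtree.
                --top;
                node = stack[top].node;
                tMin = stack[top].tMin;
                tMax = stack[top].tMax;
            } else if (dropped && tMax < sceneTMax) {
                // Slow path: we lost entries, so restart from the root with the
                // interval advanced past everything already processed.
                node = root;
                tMin = tMax;
                tMax = sceneTMax;
                dropped = false;
            } else {
                return false;            // nothing left to visit
            }
        }
    }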

The best CPU approach seems to be a tie between a kd-tree and a uniform grid, coupled with SIMD instructions (SSE on x86) and "packetization". Packetization is basically tracing a "packet" of a small number of rays (say, 4 to 64) all at once, with all the implied complication of them diverging inside the acceleration structure and such. The overhead of testing 4 rays against a bunch of triangles at once, even when some of those rays would normally have been culled during traversal, has been shown to be smaller than the win from the improved memory coherence.
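
The core trick is laying the packet out structure-of-arrays, so one SSE instruction works on the same component of four rays at once. Here's a minimal sketch, with a made-up packet type, of four rays against one bounding box via a slab test:

    // Four rays tested against one axis-aligned box with SSE. RayPacket4 is a
    // hypothetical stand-in; real packet tracers extend this to whole traversal
    // and triangle tests, with masks for rays that have diverged.
    #include <xmmintrin.h>

    struct RayPacket4 {
        // Structure-of-arrays: component of ray j lives in SSE lane j.
        __m128 ox, oy, oz;          // origins
        __m128 idx, idy, idz;       // reciprocal directions (precomputed)
        __m128 tMin, tMax;          // current parametric interval per ray
    };

    struct Box { float min[3], max[3]; };

    // Returns a 4-bit mask: bit j is set if ray j's interval overlaps the box.
    int intersectBox4(const RayPacket4& p, const Box& b)
    {
        __m128 t0, t1, tNear = p.tMin, tFar = p.tMax;

        // X slabs
        t0 = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(b.min[0]), p.ox), p.idx);
        t1 = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(b.max[0]), p.ox), p.idx);
        tNear = _mm_max_ps(tNear, _mm_min_ps(t0, t1));
        tFar  = _mm_min_ps(tFar,  _mm_max_ps(t0, t1));

        // Y slabs
        t0 = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(b.min[1]), p.oy), p.idy);
        t1 = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(b.max[1]), p.oy), p.idy);
        tNear = _mm_max_ps(tNear, _mm_min_ps(t0, t1));
        tFar  = _mm_min_ps(tFar,  _mm_max_ps(t0, t1));

        // Z slabs
        t0 = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(b.min[2]), p.oz), p.idz);
        t1 = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(b.max[2]), p.oz), p.idz);
        tNear = _mm_max_ps(tNear, _mm_min_ps(t0, t1));
        tFar  = _mm_min_ps(tFar,  _mm_max_ps(t0, t1));

        // A ray hits if its entry distance is not past its exit distance.
        return _mm_movemask_ps(_mm_cmple_ps(tNear, tFar));
    }

If the returned mask is zero, the whole packet skips that node; otherwise the node is visited and the rays whose bits are clear just ride along, which is exactly the overhead described above.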

There seems to be some disagreement in the community about whether GPUs or CPUs are faster. I get the feeling that GPUs win on pure floating-point throughput but suffer even more from incoherent memory access.

Neither Jim nor I have gotten into packetization yet, mostly because it seems to complicate the crap out of everything. (We both use what small facility GCC4 has for automatically using SSE.)
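
To be clear about what "small facility" means: GCC 4's vectorizer (-ftree-vectorize plus -msse2 or similar, depending on the version) only picks up very plain loops over flat arrays, like this made-up example, not anything resembling real packet traversal:

    // The sort of loop GCC 4 can turn into SSE on its own: flat arrays,
    // unit stride, no branches. dot3 is just an illustrative name.
    void dot3(const float* ax, const float* ay, const float* az,
              const float* bx, const float* by, const float* bz,
              float* out, int n)
    {
        for (int i = 0; i < n; ++i)
            out[i] = ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i];
    }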

The Utah guys recently published a paper describing their shading language for ray-tracing, based on the syntax and some of the primitives of the OpenGL Shading Language. The Ray-Tracing Shading Language, or RTSL, is notable because they can compile it to SSE instructions, mechanically extracting and packetizing the inner, ray-specific parts of intersection and material shaders.

There's even a packetized PlayStation3 ray-tracer that traces on the Cell's six available SPEs while the main PPE core schedules tiling and such.

I'm not aware of a really good RTRT API, but Autodesk mentioned that Mental Ray has some kind of API for its accelerated ray-tracer. I know ATI/AMD and Intel are both looking at ray-tracing heavily, and I assume that NVIDIA is doing the same.

I get the feeling these days that the field is ripe for a ray-tracing API based on a lightweight scene graph. The underlying implementation would probably have to be tightly coupled both to the CPU, using SSE and packetization, and to the GPU when one is available.
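
To make that concrete, here's a purely hypothetical sketch of the level of abstraction I'm imagining; none of these names exist anywhere, and the bodies are omitted:

    // The application only describes geometry, materials, and a camera; the
    // implementation is free to build grids or kd-trees, packetize, or push
    // work to a GPU behind the curtain.
    struct RtMaterial {
        float color[3];
        float reflectivity;
    };

    class RtScene {
    public:
        // Geometry is handed over as flat triangle soup; the library owns
        // whatever acceleration structure it builds from it.
        int  addMesh(const float* positions, int vertexCount,
                     const int* indices, int triangleCount);
        void setMaterial(int meshId, const RtMaterial& m);
        void setTransform(int meshId, const float matrix[16]);

        void commit();    // (re)build or refit acceleration structures

        // Trace one frame into an RGBA buffer.
        void render(float* rgba, int width, int height,
                    const float cameraToWorld[16]);
    };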

