c – Once upon a time, when > was faster than < ... Wait, what?

The Question :

281 people think this question is useful

I am reading an awesome OpenGL tutorial. It’s really great, trust me. The topic I am currently on is the Z-buffer. Aside from explaining what it’s all about, the author mentions that we can perform custom depth tests, such as GL_LESS, GL_ALWAYS, etc. He also explains that the actual meaning of depth values (which is top and which isn’t) can be customized as well. So far, I understand. And then the author says something unbelievable:

The range zNear can be greater than the range zFar; if it is, then the window-space values will be reversed, in terms of what constitutes closest or farthest from the viewer.

Earlier, it was said that the window-space Z value of 0 is closest and 1 is farthest. However, if our clip-space Z values were negated, the depth of 1 would be closest to the view and the depth of 0 would be farthest. Yet, if we flip the direction of the depth test (GL_LESS to GL_GREATER, etc), we get the exact same result. So it’s really just a convention. Indeed, flipping the sign of Z and the depth test was once a vital performance optimization for many games.
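To make the "it's just a convention" claim concrete, here is a small C sketch (not from the tutorial) that depth-tests the same fragments twice: once with the usual mapping (range 0 to 1, "less" test, like GL_LESS) and once with the range reversed the way glDepthRange(1.0, 0.0) would reverse it ("greater" test, like GL_GREATER). The depth test is done in software and the helper names are hypothetical; only the NDC-to-window formula follows the one OpenGL uses for glDepthRange.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Window-space depth for an NDC z in [-1, 1], mapped as glDepthRange(n, f) does. */
static float window_z(float zndc, float n, float f) {
    return (f - n) * 0.5f * zndc + (f + n) * 0.5f;
}

/* Index of the fragment that survives the depth test at one pixel.
   reversed == false: range (0, 1) with a "less" test.
   reversed == true:  range (1, 0) with a "greater" test.
   Returns count if no fragment passes. */
static size_t visible_fragment(const float *zndc, size_t count, bool reversed) {
    assert(count > 0);
    float n = reversed ? 1.0f : 0.0f;
    float f = reversed ? 0.0f : 1.0f;
    float stored = f;      /* depth buffer starts cleared to the far value */
    size_t winner = count; /* nothing has passed yet */
    for (size_t i = 0; i < count; i++) {
        float zw = window_z(zndc[i], n, f);
        bool pass = reversed ? (zw > stored) : (zw < stored);
        if (pass) {
            stored = zw;
            winner = i;
        }
    }
    return winner;
}
```

For fragments at NDC z = {0.3, -0.6, 0.9}, both conventions pick index 1, the nearest one: the stored numbers differ, but the visibility result is identical.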

If I understand correctly, performance-wise, flipping the sign of Z and the depth test is nothing but changing a < comparison to a > comparison. So, if I understand correctly and the author isn’t lying or making things up, then changing < to > used to be a vital optimization for many games.

Is the author making things up, am I misunderstanding something, or is it indeed the case that once < was slower (vitally, as the author says) than >?

Thanks for clarifying this quite curious matter!

Disclaimer: I am fully aware that algorithm complexity is the primary source for optimizations. Furthermore, I suspect that nowadays it definitely wouldn’t make any difference and I am not asking this to optimize anything. I am just extremely, painfully, maybe prohibitively curious.

The Question Comments :
  • The link to this tutorial seems to have (recently) gone dead. 🙁
  • @TZHX: Since the accepted answer is authored by the author of the tutorial, we have hope to find it again. See my last comment to his answer 🙂
  • The referenced OpenGL tutorial is available here.
  • (a < b) is identical to (b > a), so there is absolutely no need to implement both compare operations in hardware. The difference in performance comes from what happens as a consequence of the comparison. It is a long and winding road to explain all of the side effects, but here are a few pointers. Games used to fill the depth buffer first to avoid more expensive fragment processing for fragments that failed the depth test. Quake used to split the depth range into two halves to avoid clearing the frame buffer, because the game always filled every pixel on screen, and so on.
  • @Fons looks like the link is dead again 🙁
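The first point in the comment about (a < b) versus (b > a) can be sketched in C: every depth function in this small subset is one "less than" comparison with the operands swapped and/or the result negated. The enum and function names are hypothetical; they are modeled on GL_LESS, GL_GREATER, GL_LEQUAL and GL_GEQUAL but are not the OpenGL API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of depth functions, named after GL_LESS, GL_GREATER, etc. */
typedef enum { DEPTH_LESS, DEPTH_GREATER, DEPTH_LEQUAL, DEPTH_GEQUAL } DepthFunc;

/* The single comparator the "hardware" provides. */
static bool less_than(unsigned a, unsigned b) { return a < b; }

/* Every other test reuses that comparator with swapped operands and/or negation. */
static bool depth_test(DepthFunc func, unsigned frag, unsigned stored) {
    switch (func) {
    case DEPTH_LESS:    return  less_than(frag, stored); /* frag <  stored */
    case DEPTH_GREATER: return  less_than(stored, frag); /* frag >  stored  ==  stored < frag */
    case DEPTH_LEQUAL:  return !less_than(stored, frag); /* frag <= stored  == !(stored < frag) */
    case DEPTH_GEQUAL:  return !less_than(frag, stored); /* frag >= stored  == !(frag < stored) */
    }
    assert(!"unknown depth function");
    return false;
}
```

This only covers the comparison itself; as the comment says, any performance difference came from what the test was used for, not from < versus >.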

The Answer 1

350 people think this answer is useful

If I understand correctly, performance-wise, flipping the sign of Z and the depth test is nothing but changing a < comparison to a > comparison. So, if I understand correctly and the author isn’t lying or making things up, then changing < to > used to be a vital optimization for many games.

I didn’t explain that particularly well, because it wasn’t important. I just felt it was an interesting bit of trivia to add. I didn’t intend to go over the algorithm specifically.

However, context is key. I never said that a < comparison was faster than a > comparison. Remember: we’re talking about graphics hardware depth tests, not your CPU. Not operator<.

What I was referring to was a specific old optimization where on one frame you would use GL_LESS with a range of [0, 0.5]. On the next frame, you render with GL_GREATER with a range of [1.0, 0.5]. You go back and forth, literally “flipping the sign of Z and the depth test” every frame.
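Here is a software sketch of that back-and-forth, written in plain C instead of against a real GL context so that it runs anywhere. It assumes a hypothetical one-row "screen" of W pixels, applies the glDepthRange mapping by hand, and emulates GL_LESS with range [0, 0.5] on even frames and GL_GREATER with range [1.0, 0.5] on odd frames. It also assumes that every pixel is covered by a background fragment every frame: if a pixel were skipped, its stale depth from two frames earlier would still lie in the active half of the range and could wrongly hide new fragments.

```c
#include <assert.h>

#define W 4 /* hypothetical tiny screen: one row of W pixels */

static float depth[W];   /* simulated depth buffer, cleared only once */
static float visible[W]; /* NDC z of the fragment currently shown at each pixel */

/* Window-space z as glDepthRange(n, f) produces it. */
static float window_z(float zndc, float n, float f) {
    return (f - n) * 0.5f * zndc + (f + n) * 0.5f;
}

/* Even frames: like glDepthFunc(GL_LESS) with glDepthRange(0.0, 0.5).
   Odd frames:  like glDepthFunc(GL_GREATER) with glDepthRange(1.0, 0.5). */
static void fragment(int frame, int x, float zndc) {
    assert(x >= 0 && x < W);
    int even = (frame % 2 == 0);
    float zw = even ? window_z(zndc, 0.0f, 0.5f) : window_z(zndc, 1.0f, 0.5f);
    int pass = even ? (zw < depth[x]) : (zw > depth[x]);
    if (pass) {
        depth[x] = zw;
        visible[x] = zndc;
    }
}

/* The one and only clear, before frame 0: 1.0 loses to anything frame 0 writes. */
static void clear_depth_once(void) {
    for (int x = 0; x < W; x++) depth[x] = 1.0f;
}

/* Draw one frame with NO depth clear. A background at NDC z = 0.9 covers every
   pixel; objects[] are drawn before or after it so both test outcomes occur. */
static void render_frame(int frame, const float objects[W]) {
    const float background = 0.9f;
    for (int x = 0; x < W; x++) {
        visible[x] = 2.0f; /* sentinel, not part of the trick: shows each pixel was redrawn */
        if (x % 2 == 0) {
            fragment(frame, x, objects[x]);
            fragment(frame, x, background);
        } else {
            fragment(frame, x, background);
            fragment(frame, x, objects[x]);
        }
    }
}
```

After one call to clear_depth_once(), calling render_frame(0, objects), render_frame(1, objects), and so on keeps showing the nearest fragment at every pixel, even though the depth buffer is never cleared again: each frame's values land in the half of the range that always wins against the previous frame's half.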

This loses one bit of depth precision, but you didn’t have to clear the depth buffer, which once upon a time was a rather slow operation. Since depth clearing is not only free these days but actually faster than this technique, people don’t do it anymore.
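The lost bit can be checked with a little arithmetic. Assuming a fixed-point depth buffer that stores window z in [0, 1] as round(z * (2^bits - 1)) (an assumption for illustration; real formats and rounding vary by hardware), a frame confined to half the range can only reach about half of the codes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point depth encoding: round-to-nearest of z * (2^bits - 1). */
static uint32_t depth_code(double zw, int bits) {
    assert(bits > 0 && bits < 32);
    assert(zw >= 0.0 && zw <= 1.0);
    double max = (double)((UINT32_C(1) << bits) - 1u);
    return (uint32_t)(zw * max + 0.5); /* zw >= 0, so add 0.5 and truncate to round */
}

/* Number of distinct codes a frame can produce when its depth spans [lo, hi], in either order. */
static uint32_t codes_in_range(double lo, double hi, int bits) {
    uint32_t a = depth_code(lo, bits);
    uint32_t b = depth_code(hi, bits);
    return (a > b ? a - b : b - a) + 1u;
}
```

With bits = 16, the full range gives 65536 codes, [0, 0.5] gives 32769 and [1.0, 0.5] gives 32768 (the code for 0.5 is shared), so each frame effectively has about 15 bits of depth to work with.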

The Answer 2

3 people think this answer is useful

The answer is almost certainly that for whatever incarnation of chip+driver was used, the Hierarchical Z only worked in the one direction – this was a fairly common issue back in the day. Low-level assembly/branching has nothing to do with it – Z-buffering is done in fixed-function hardware, and is pipelined – there is no speculation and hence, no branch prediction.
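A simplified model of how a hierarchical Z unit can end up working in only one direction: assume (as a hypothetical design, not any specific chip) that each coarse tile stores only the farthest, i.e. largest, depth among its pixels. That single value is enough to reject whole primitives conservatively under a "less" test, but proves nothing under a "greater" test, which would need the tile's minimum instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical coarse Hi-Z tile that records only the largest depth of its pixels. */
typedef struct { float max_depth; } HizTile;

typedef enum { HIZ_LESS, HIZ_GREATER } HizFunc;

/* True only if every fragment of a primitive spanning [zmin, zmax] over this
   tile is guaranteed to fail the depth test, so the tile can be skipped. */
static bool hiz_can_reject(HizTile tile, HizFunc func, float zmin, float zmax) {
    (void)zmax; /* only a tile that also stored its minimum depth could use this */
    switch (func) {
    case HIZ_LESS:
        /* Every stored depth d <= max_depth and every fragment z >= zmin, so if
           zmin >= max_depth then z >= d everywhere and z < d can never pass. */
        return zmin >= tile.max_depth;
    case HIZ_GREATER:
        /* Proving that z > d fails everywhere needs zmax <= min(d), which this
           tile does not record, so it can never safely reject. */
        return false;
    }
    assert(!"unknown depth function");
    return false;
}
```

Under this model, switching to GL_GREATER would silently disable early rejection; a design that also tracked a per-tile minimum would not have this asymmetry.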

The Answer 3

0 people think this answer is useful

An optimization like that will hurt performance on many embedded graphics solutions, because it makes the framebuffer resolve less efficient. Clearing a buffer is a clear signal to the driver that it does not need to store and restore the buffer when binning.

A little background information: a tiling/binning rasterizer processes the screen as a number of very small tiles that fit into on-chip memory. This reduces reads and writes to external memory, which reduces traffic on the memory bus. When a frame is complete (swap is called, FIFOs are flushed because they are full, framebuffer bindings change, etc.), the framebuffer must be resolved; this means every bin is processed in turn.

The driver must assume that the previous contents have to be preserved. Preserving them means that the bin has to be written out to external memory and later restored from external memory when the bin is processed again. A clear operation tells the driver that the contents of the bin are well defined: the clear color. This situation is trivial to optimize. There are also extensions to “discard” the buffer contents.
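The load/store decision can be sketched as a traffic count per bin. This is a hypothetical model, not any driver's actual code: a bin whose previous contents are unknown must be restored from external memory before rendering, and a bin whose contents are needed afterwards must be written back. Clearing removes the restore; discarding the buffer at the end of the frame (for example with glDiscardFramebufferEXT from EXT_discard_framebuffer, or glInvalidateFramebuffer in OpenGL ES 3.0) removes the store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the driver knows about a bin's depth contents when the frame starts. */
typedef enum {
    BIN_START_UNKNOWN, /* no clear: old contents must be assumed live */
    BIN_START_CLEARED  /* cleared: contents are just the clear value, nothing to load */
} BinStart;

/* External-memory bytes moved while resolving one bin of bin_bytes depth data. */
static size_t bin_traffic(BinStart start, bool needed_after_frame, size_t bin_bytes) {
    assert(bin_bytes > 0);
    size_t bytes = 0;
    if (start == BIN_START_UNKNOWN)
        bytes += bin_bytes; /* restore: read old contents into on-chip tile memory */
    if (needed_after_frame)
        bytes += bin_bytes; /* store: write the tile back out to external memory */
    return bytes;
}
```

The never-clear trick from Answer 1 is the worst case in this model: the depth buffer is neither cleared nor disposable, because the next frame depends on it, so every bin pays for both a restore and a store.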

The Answer 4

-9 people think this answer is useful

It has to do with flag bits in highly tuned assembly.

x86 has both jl and jg instructions, but most RISC processors only have jl and jz (no jg).
