Eternal Lands Official Forums
KarenRei

Great code snippet: Fast inverse sqrt


I just ran into this code snippet courtesy of slashdot:

 

float invsqrt(float f)
{
    float half = 0.5f * f;
    int i = *(int*)&f;             /* reinterpret the float's bits as an int */
    i = 0x5f3759df - (i >> 1);     /* magic-constant initial guess */
    f = *(float*)&i;
    f = f * (1.5f - half * f * f); /* one Newton-Raphson refinement step */
    return f;
}

 

It calculates inverse square roots (that is, 1.0 / sqrt(f)). I benchmarked it, and it's four times faster than using the built-in sqrt function, and has a very small error margin. It gave perhaps a 5% boost in speed to my particle system.

 

Now if I could just stumble onto a pow() function this clever. I made a pow cache that does pow(0.5, x) in 40% of the time of the regular pow function, and pow(x, y) in 75%, but it should be possible to go faster.


The problem with tricks like that is that when CPUs eventually implement those functions in hardware, software using such tricks will be slower than the native instructions.

But of course, for the time being most of the CPUs don't implement them, so..


Worst case, you rewrite the function as:

float invsqrt(float f)
{
#if FAST_SQRT
    /* the bit-trick code above */
#else
    return 1.0f / sqrt(f);
#endif
}

Then you have FAST_SQRT defined at configure time. The compiler will automatically inline a function that short, so there would be no function call overhead.


This is an interesting trick for (IEEE 754) floats; but remember the mantra: avoid premature optimisation.

Only look for optimisations like this if profiling reveals them to be a bottleneck. Obscure code can be more costly to maintain than its potential optimisation is worth. I seem to spend a lot of my time unpicking "optimisations"...

 

This is not generally portable code (it relies on IEEE 754 floats), so you need to check the float format in the code at compilation time (don't rely on externally set defines, as those can fail silently).

 

I'll have to work out what it is doing...

Edited by trollson



You know, that code is used in Quake3, Crystal Space, the Titan Engine, etc.

If you want more info, take a look at http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf


Yes, I was reading that paper earlier; the method unravels to a Newton-Raphson approximation, with the 'magic number' chosen to give a reasonable first guess. That paper also derives a slightly better magic number.

 

While the error appears small, the suitability of the method depends on what you intend to do with the numbers afterwards -- for example, taking differences between two results would be very inadvisable. Although if you needed real accuracy you wouldn't be using floats in the first place.

 

Also, this approach may prevent the compiler from doing more platform-specific optimisation; compare how a compiler might pack a series of operations into a SIMD instruction.

For example, SSE includes a 'RSQRTPS' op which performs '1.0/sqrt(x)' on four single-precision floats in two cycles. Using GCC this can be invoked through the '__builtin_ia32_rsqrtps' function (-msse), or through the '_mm_rsqrt_ps' intrinsic ('xmmintrin.h'), which is also compatible with ICC.

Every software engineer should recite the words of Knuth each morning:

"
Premature optimization is the root of all evil"

Edited by trollson



All the while remembering Wirth's Law. It probably isn't premature if it's dealing with particle systems.

 

By the way, that was Knuth restating Hoare, and the full quote actually is:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

Edited by crusadingknight


It's not premature optimization. I've done two main optimizations: my pow cache and this invsqrt function. Both were done after profiling with gprof (a *great* tool, by the way!) showed that pow and sqrt were the two biggest time consumers. I made fast versions of pow and sqrt only after eliminating as many of those calls as possible. The best candidates for removal were cases where I had to get the magnitudes of two vectors and divide one by the other (you can divide the squared magnitudes and take a single square root of the quotient, eliminating one sqrt call).

 

In the effect system, there are a number of sections of code that are structured like:

 

#if 0  /* Straightforward, but slow, version */
    Some simple code here.
#else  /* The same code as above, faster but not as clear. */
    Some complex code here.
#endif

 

Even if there were no advantage to using a fast sqrt, there would be no disadvantage to wrapping it. In C/C++ there is no overhead for wrapping a function within another function, provided at least -O1 is enabled. I.e., this code:

 

some_fn()
{
    ...
    foo = sqrt(bar);
    ...
}

 

... runs at the same speed as this code:

 

some_fn()
{
    ...
    foo = my_sqrt(bar);
    ...
}

float my_sqrt(float f)
{
    return sqrt(f);
}

 

That is to say, modern compilers automatically inline short functions, generally under even minimal optimization conditions.

 

Try it out for yourself. :icon13:

 

Modern compiler optimizations are often so smart that they can actually make it hard to do proof-of-concept demonstrations of compiler optimization :devlish: For example, I once wrote a program to demonstrate to another person that a compiler would optimize a certain function. I had a big long loop, with a function call on the inside, and compared the run time with no optimizations to the run time of it with full optimization. However, the program returned instantly no matter how big the loop was. The compiler basically optimized out the entire program except the part that figured out what time it was at the beginning and the part that reported how much time elapsed at the end. :devlish:

 

Anyways, back to the original topic: this function is uberfast, the error margin is pretty tiny, most cases where you'd use inv_sqrt can tolerate much bigger error margins, it's used in a number of big-name games, it's public domain code, and I strongly recommend it anywhere a significant number of sqrt calls will be made (esp. in vector math). That's all. :)

Edited by KarenRei


Well, if you find other portions of the code where it can be used, feel free to do it, so long as you use a #define which can disable the hack and use the real CPU functions where available.


Oh, just to verify: I added in -msse to see how performance compares with that flag on, as recommended by trollson. Performance is 3-4 times better. I also tried -msse2 -mfpmath=sse. Performance is 2-3 times better.

 

If anyone wants to critique my benchmarking attempt, here's the code (done in C++ for simplicity):

 

float ret = 0;
u_int64_t start = get_time();
for (float f = 1e-6; f < 1e6; f *= 1.000001)
    ret += invsqrt(f);
cout << "invsqrt(f): " << (get_time() - start) << ": " << ret << endl;

 

ret = 0;
start = get_time();
for (float f = 1e-6; f < 1e6; f *= 1.000001)
    ret += 1.0 / sqrt(f);
cout << "1.0 / sqrt(f): " << (get_time() - start) << ": " << ret << endl;

 

Notes:

* get_time() returns the time, in microseconds, as a u_int64_t.

* invsqrt() is defined as above.

* Overhead for the benchmarks exists in the "f *=" portion and the "ret +=" portion. This means that the invsqrt() function is, if anything, even better than we're seeing here.

* The reason for the use of the variable "ret" is to stop the compiler from optimizing out the entire section (as described in the previous post). If you don't use the return value, the compiler may decide to ignore the entire loop.


Not to go too far off topic, but does anyone know why we load a squared constraint radius in particles.c, only to sqrt it for almost every comparison?

 

Or why we calculate sqrt(2) in 2d_objects.c? (sqrt(2) being a constant which means about as much to me as 1.41421356.)

 

Or even why calc_light_frustum calculates the distance (sqrt(x*x+y*y)) four times? None of which are really optimization problems, but given the way the same constants are recalculated several times, they might be maintainability problems.

 

More back on the topic, there doesn't seem to be any use of the inverse square root in ELc yet, aside from normals_sse.h, which itself is used only by the disabled -DTERRAIN flag.


calc_light_frustum() has 4 sqrts because the values of x & y change with each call. Yes, in 2d_objects.c the sqrt(2) can be optimized out.

