Kindar_Naar Report post Posted March 19, 2007 Mine says 1.5. You sure you can't update? Patch committed; the invsqrt function is now broken out. I left fastsqrt in eye candy since it's so simple; if we need to move it later, I will. Browsing through the CVS shows that the current revision is 1.4. Oh, and I think you forgot to add dont_optimize.h/cpp into CVS Share this post Link to post Share on other sites
KarenRei Report post Posted March 19, 2007 Ack, thanks for that catch! This is strange -- the thing with effect_breath. [meme@spathi eye_candy]$ cvs status effect_breath.cpp rei@cvs.elc.berlios.de's password: =================================================================== File: effect_breath.cpp Status: Up-to-date Working revision: 1.5 Repository revision: 1.5 /cvsroot/elc/elc/eye_candy/effect_breath.cpp,v Sticky Tag: (none) Sticky Date: (none) Sticky Options: (none) I'm going to try making a trivial edit to the file, then recommitting. Let's see if that forces an update for you. It's checked in now. Share this post Link to post Share on other sites
Kindar_Naar Report post Posted March 19, 2007 Ok I get the new files.. don't have time to build right now.... and I know the problem with effect_breath... I'm talking about the .h file not .cpp. The 10 that need to be changed to 10.0f is in the .h file Share this post Link to post Share on other sites
KarenRei Report post Posted March 19, 2007 Ah, that would explain it! There was a 10 that I converted to 10.0 in the .cpp earlier; didn't realize there was another in the .h. Fixed. Thanks! Share this post Link to post Share on other sites
Troca Report post Posted March 19, 2007 (edited) I have some information about the invsqrt problem. The problem is that optimization passes in the compiler assumes the strict aliasing rule. E.g. it assumes that a pointer to an int can't point to the same memory location as a pointer to a float. Therefore it can reorder the reads and writes to generate faster code. In the invsqrt function the strict aliasing rule is broken by the pointer casting and therefore invalid code may be generated. For more info see http://www.cellperformance.com/mike_acton/...t_aliasing.html I made the following test program, save it as invsqrt.c #include <stdio.h> #include <math.h> /* This one breakes strict aliasing rule */ __attribute__ ((noinline)) float invsqrt(float f) { float half = 0.5f * f; int i = *(int*)((void*)&f); i = 0x5f3759df - (i >> 1); f = *(float*)((void*)&i); f = f * (1.5f - half * f * f); return f; } /* This one is ok */ __attribute__ ((noinline)) float invsqrt2(float f) { union { int i; float f; } tmp; float half = 0.5f * f; tmp.f = f; tmp.i = 0x5f3759df - (tmp.i >> 1); f = tmp.f; f = f * (1.5f - half * f * f); return f; } int main(int argc, char **argv) { float f = 0.1; int i; for(i = 0; i < 100; i++) { printf("%f = %f, %f, %f\n", f, invsqrt(f), invsqrt2(f), (float) (1/sqrt(f))); f += 0.1; } return 0; } Try the following compilation commands: $ gcc invsqrt.c -lm $ gcc invsqrt.c -lm -O2 $ gcc invsqrt.c -lm -O2 -fno-strict-aliasing Running ./a.out after each and check the result. The middle one is bad for me but the first and last give good results. This is my version of gcc: gcc (GCC) 4.1.1 (Gentoo 4.1.1-r3) Edit: Try compiling with -S to get the assembler output. In the -O2 case gcc figured it could remove the whole line that calculates i (because the result "can't be used"), resulting in a float load from an uninitialized stack position. Edit2: Try removing the casts to void* and add -Wall to the compilation command. Now you get the following warnings: invsqrt.c: In function ‘invsqrt’: invsqrt.c:7: warning: dereferencing type-punned pointer will break strict-aliasing rules invsqrt.c:9: warning: dereferencing type-punned pointer will break strict-aliasing rules The void* casts hid this warnings. Maybe the casts where put there to get rid of the warnings, but that was the wrong fix. I've been bitten by this "bug" before and I hope the info I have given here will help others avoid it in the future. Keep up the good work, I like the look of the new features in EL. I am looking forward to start playing a bit again after the next update. Edited March 19, 2007 by Troca Share this post Link to post Share on other sites
KarenRei Report post Posted March 19, 2007 Troca, you're a dear. That's exactly what we need. I'll mess with the Linux Makefile and update the documentation for other platforms this evening. Share this post Link to post Share on other sites
Troca Report post Posted March 19, 2007 Troca, you're a dear. That's exactly what we need. I'll mess with the Linux Makefile and update the documentation for other platforms this evening. Adding -fno-strict-aliasing is just a workaround. The solution is to use the invsqrt2 function in my example code. The compiler can make better optimizations if you fix the code to follow the strict aliasing rule and compile with -fstrict-aliasing (the default with -O). Share this post Link to post Share on other sites
KarenRei Report post Posted March 19, 2007 (edited) I'm not sure how that's better, though. From what I read on the page that you linked, the union workaround only works because it first *copies* the value into the union. That's an extra operation. By using -fno-strict-aliasing, it won't be needing to do that copy. Given how short this code is, and given that there are no real "array" operations, I don't think that strict aliasing would speed it up. In short, it seems that no-strict-aliasing would be faster. What do you think? P.S. -- You're right, the void* casts were to get rid of those warnings. I googled for those warnings when I first got them (an older GCC didn't give them, and the code was working fine with the older GCC). One page recommended the void* cast, so that's what I did. Clearly, that was the wrong option. Edited March 19, 2007 by KarenRei Share this post Link to post Share on other sites
KarenRei Report post Posted March 19, 2007 (edited) Come to think of it, though: separating it into a different file, meaning a different .o, is preventing inlining, so you're right -- your version probably is faster. I'll do it that way. Actually, even better, I'll put it in the .h so that it's compiled each time: thus, files other than eye_candy.o will get faster square roots by being able to inline it. If I don't have to worry about the aliasing bugs due to the union, it doesn't matter if it gets inlined. This will, of course, mean making it a static function inside an object. Perhaps this should go in math cache, then... It seems frustrating to me that we have to add in an extra op to stop the compiler from doing something stupid. But it seems that it's either that or disable strict aliasing for a lot of code. Edited March 19, 2007 by KarenRei Share this post Link to post Share on other sites
Kindar_Naar Report post Posted March 19, 2007 Well in my opinion the new code for invsqrt is worth using. Let us not forget that the client is build on at least 3 different platforms with about 4-5 different compilers. Taking this into consideration I think we can sacrifice a bit of speed in favor of portability. We don't know if that '-fno-strict-aliasing' option can be enabled in all of them (I didn't see it it VC++; maybe I missed it) What's more, by using and union we get rid of the call to invsqrt_workaround. I tested this new invsqrt WITH SSE2 enabled and it works ok, so even though there is an union, we might actually gain by SSE2 optimization. Share this post Link to post Share on other sites
Troca Report post Posted March 19, 2007 Compile with $ gcc invsqrt.c -S -O2 -fno-strict-aliasing and look at the generated code. The difference between invsqrt and invsqrt2 is not that big. A quick look and I think there may be two additional stores to the stack in my version. The stack is in cache memory so that is not a big impact I think. Is it worth the extra complication of having a separate file compiled with different options, I don't know? Oh, I'm typing too slow, I see you already answered this yourself, well... Share this post Link to post Share on other sites
KarenRei Report post Posted March 19, 2007 Yeah, I think I've been neglecting the benefits of inlining a bit. I'm going to move all of the simple math functions from eye_candy.cpp to math_cache.h. It'll slow compiling down a tad, but it will give the code the option to inline those functions. You know, I think we're actually going to end up with faster code after all of this. Share this post Link to post Share on other sites
KarenRei Report post Posted March 20, 2007 (edited) Yeay! I got a crash! Normally that's not something to celebrate, except that it appears that my recent changes to hook in many more special effects are causing a crash right where someone earlier had reported problems with one. I wasn't able to reproduce it at the time, but now I can. Which means that I can now fix it. I'll keep you all posted. ... And, I think I've got it. But it's too late for me to fix it now. I'll try out my theory tomorrow. Edited March 20, 2007 by KarenRei Share this post Link to post Share on other sites
trollson Report post Posted March 20, 2007 invsqrt() and SSE Don't forget that SSE/MMX2 already includes reciprocal sqrt functions, rsqrtss and rsqrtps (four operations at once). These are equivalent to two iterations of the approximation used by the invsqrt function, giving ~14bit accuracy in 4-7 cycles. They are accessible from GCC, ICC, and VC through the Intel definitions. With the degree of approximation in invsqrt(), do not use it where the results may be compared, particulary if the difference between two results is important. Share this post Link to post Share on other sites
KarenRei Report post Posted March 20, 2007 (edited) With the degree of approximation in invsqrt(), do not use it where the results may be compared, particulary if the difference between two results is important. I never do. I tend to avoid floating point equality comparisons altogether; they're horribly unreliable unless you're looking for the chance that the exact same operations have been conducted on both variables. Rarely does a floating point equality check even make much sense. Greater than or less than comparisons, sure; they're just fine. Don't forget that SSE/MMX2 already includes reciprocal sqrt functions, rsqrtss and rsqrtps (four operations at once). These are equivalent to two iterations of the approximation used by the invsqrt function, giving ~14bit accuracy in 4-7 cycles. They are accessible from GCC, ICC, and VC through the Intel definitions. Not familiar with those functions, so thanks for the link. I have no use for four square roots at once, but rsqrtss may well be faster than invsqrt, since it's hardware. However, it appears to be an assembler statement, which would raise portability issues; I'd hesitate to use it. Are there any portable functions that I could call for "fast inverse square root" that would make use of these statements on appropriate architectures? I've benchmarked invsqrt() versus 1.0 / sqrt() with SSE2, and invsqrt is much faster. Any other portable square root functions? Just to head off anyone who might be thinking it, I already eliminiated square roots wherever possible. Cases where there was a square root over another square root were turned into a single square root. Cases where two distances were being compared to each other, I instead compare the distances squared. Etc. However, there are still many square root cases left that cannot be optimized out. Edited March 20, 2007 by KarenRei Share this post Link to post Share on other sites
trollson Report post Posted March 20, 2007 (edited) It wasn't floating point equality I was thinking of (which as you say is bad practice), but other comparisons, such at taking differences; for two similar values the difference can easily be less than the error, and you then have garbage in the system. The use of the approximation is ok if it is your answer, but be wary of the consequencies if it is an intermediate result. SSE/MMX functions can be found hidden away in "xmmintrin.h" (on GCC and VC), which is: Implemented from the specification included in the Intel C++ Compiler User Guide and Reference, version 8.0. I haven't got a quick introduction to using these functions, but google should provide. Incidentally, there was a reference that doing "X * RSQRTSS (X)" was quicked than plain old "sqrt (X)"; but I suspect that is very dependant on whether sqrt() has been implemented as a builtin operation. Edited March 20, 2007 by trollson Share this post Link to post Share on other sites
KarenRei Report post Posted March 20, 2007 (edited) but other comparisons, such at taking differences One of my favorite graphics books that I had in the mid 90s was the "Zen of Graphics Programming" by Michael Abrash. At the start of one chapter, he mentions how one of the asteroids in "The Empire Strikes Back" was a potato, while another was a shoe. How on Earth did they get away with that? Because everything is very "busy". Human eyes are very poor at perceiving imperfections in graphics when there's a lot of motion going on. You can cut corners left and right to improve speed. Even if it gives you some "wrong" answers, even ones with perceptable results, if those results are overshadowed by everything else that's going on, the viewer won't notice. In short, it's not a scientific application; all that matters in the end is whether the viewer can perceive any visual errors. I've tested with the quake invsqrt function and regular inverse square roots. No visual difference. Just a few examples of where square roots are used. * Determining the size to draw point sprites (if the size is off by a pixel, who will notice?). * Gradually shifting one velocity vector into another (if the vec shifts slightly faster or slower, who will notice?). * Maintaining "energy" levels in a gravity mover (simulating a point gravity source is prone to accumulated errors, so you have to regularly renormalize) (if the energy level is off for a frame, who will notice?) * Determining the gravitational force for particles in a gravity mover (if the force is slightly off, who will notice?) * Determining distances where you can't simplify by comparing two squared distances (many places) (if the distance determined is slightly wrong, who will notice?) * Scaling particles as the level of detail changes (X times as many particles generally implies a particle size 1.0/sqrt(X) times the old size) (if it's not exact on the size scalar, who will notice?) And so on. Anyways: I'll look into that header and see what I can find. Thanks! Edited March 20, 2007 by KarenRei Share this post Link to post Share on other sites
KarenRei Report post Posted March 20, 2007 (edited) I just did some benchmarks. Conclusion: xxmintrin.h is looking pretty darn nice! Once again, the quake routine blows away the default square root (by more than my last tests; interesting. It is on a different computer, though). However, even it gets blown away by the fast hardware routine, which seems to run almost as fast as the loop to determine overhead.** Assuming that it's as portable as you say it is, I'll hook this in surrounded by #ifdef __SSE__. Yeay, even more performance. Hmm, while I'm at it, have you heard of any fast power functions? I do exponents a good bit too -- they're needed to, among other things, deal with the fact that frames don't come at constant time intervals apart. If you want a particle's alpha to be cut in half every second, you need alpha *= (0.5 ^ time_in_seconds). I googled a while back and couldn't find any, so I wrote a powf cache that does linear interpolation between datapoints. It's better than nothing (the special case of 0.5 ^ x is ~40% the speed of powf; the case of 0.0 - 1.0 ^ x is ~75% the speed of powf), but I'm sure it could be done a lot better. ** While I didn't review the assembly code to make sure the compiler wasn't cheating, I did put in cheat-preventing features, such as storing up a sum of the accumulated results, then printing the sum so that the program has to actually compute the results. My best guess is that, since it can't unroll the for loop due to it's huge size, the pipeline that does the square root completes before the pipeline that deals with increasing the loop counter, the loop's conditional branch, and the sum's floating point addition. Edited March 20, 2007 by KarenRei Share this post Link to post Share on other sites
ttlanhil Report post Posted March 21, 2007 valgrind does have stuff about eye candy, however; here's a sample:==27001== Invalid read of size 4 well, I found the cause of this ValgrindThe NVIDIA OpenGL implementation makes use of self modifying code. To force Valgrind to retranslate this code after a modification you must run using the Valgrind command line option: --smc-check=all Without this option Valgrind may execute incorrect code causing incorrect behavior and reports of the form: ==30313== Invalid write of size 4 http://us.download.nvidia.com/XFree86/Linu...appendix-l.htmlit's not anything in your code Share this post Link to post Share on other sites
KarenRei Report post Posted March 21, 2007 Interesting I'm getting close to fixing all of the crash-potential issues here, so there should be a patch checked in tonight that will hopefully fix everyone's problems. Plus add in a number of new effects Share this post Link to post Share on other sites
KarenRei Report post Posted March 21, 2007 Latest update: * Speed improvements * Likely fixed the crash that some had reported. * Many new special effects: ** Fountains ** Various smokes ** Teleporters ** Teleport in/out ** Various fires, torches, and candles * Stayed up way, way, way too late. As always, please let me know if I introduced any new bugs that I didn't catch. Share this post Link to post Share on other sites
KarenRei Report post Posted March 21, 2007 I'll add a note that I plan to strip out the wind from the current wind effect into a separate object that can be accessed by other effects, including non-eye-candy effects. This would include smoke that blows in the wind, which I think would look pretty nice. Share this post Link to post Share on other sites
trollson Report post Posted March 21, 2007 Hmm, while I'm at it, have you heard of any fast power functions? I do exponents a good bit too -- they're needed to, among other things, deal with the fact that frames don't come at constant time intervals apart. If you want a particle's alpha to be cut in half every second, you need alpha *= (0.5 ^ time_in_seconds).The various extensions don't seem to include exp or pow functions[2]. The only thing I saw mentioned [1] is that using log() and exp() is "much" faster than pow(). Try comparing: y = exp (p * log (x); y = pow (x, p); And see if it helps (you may be able to cache log(x)?). [1] Other than citations or abstracts to papers that are pay-to-read. [2] The athlon maths library includes 'fastpow' and 'fastexp', but I'm not sure what that is exactly. Share this post Link to post Share on other sites
Kindar_Naar Report post Posted March 21, 2007 As always, please let me know if I introduced any new bugs that I didn't catch. I got one Beer for me extern "C" void ec_delete_effect_loc_type(float x, float y, ec_EffectEnum type) { for (int i = 0; i < (int)references.size(); ) { std::vector<ec_internal_reference*>::iterator iter = references.begin() + i; if ((*iter)->dead) { delete *iter; references.erase(iter); continue; } if (((*iter)->position.x == x) && ((*iter)->position.z == -y) && (type == (ec_EffectEnum)(*iter)->effect->get_type())) { (*iter)->effect->recall = true; continue; } i++; } } In my case: references.size() == 3 i = 2 if ((*iter)->dead) == false if (((*iter)->position.x == x) && ((*iter)->position.z == -y) && (type == (ec_EffectEnum)(*iter)->effect->get_type())) == true so... we continue and never increase the i. The i stays at 2 and we have infinite loop Happens at random time after starting campfire effect. Call stack > elc_d.exe!ec_delete_effect_loc_type(float x=52.750000, float y=73.250000, ec_EffectEnum type=EC_CAMPFIRE) Line 212 C++ elc_d.exe!remove_fire_at_tile(unsigned short x_tile=105, unsigned short y_tile=146) Line 587 + 0x15 bytes C elc_d.exe!process_message_from_server(const unsigned char * in_data=0x0d911a00, int data_length=7) Line 986 + 0x15 bytes C elc_d.exe!start_rendering() Line 110 + 0x12 bytes C elc_d.exe!Main(int argc=-1, char * * argv=0x00000000) Line 251 C elc_d.exe!WinMain(HINSTANCE__ * hInst=0x00400000, HINSTANCE__ * hPrev=0x00000000, char * lpCmd=0x00151ef7, int nShow=1) Line 306 + 0xd bytes C elc_d.exe!__tmainCRTStartup() Line 589 + 0x35 bytes C elc_d.exe!WinMainCRTStartup() Line 414 C Share this post Link to post Share on other sites
KarenRei Report post Posted March 21, 2007 (edited) Ack, yep, that'll do it. No campfire effects happened to be recalled while I was on, I suppose. I'll get that fix in tonight. If you want to try the fix out ahead of time, you just need to move the i++ up above that if statement. I'll check for any other code that might have this problem. Ed: Unfortunately, looks like there's several of these. I recently changed a bit of code in this area to simply recall effects instead of outright deleting their references; however, if I don't delete the reference, the array doesn't shrink, so the loop never breaks. I could check in the changes, but I haven't been able to test them (I'm at work) beyond compiling. Should I? They're pretty straightforward, and if there's a lockup, seems like checking is better than not checking in. Edited March 21, 2007 by KarenRei Share this post Link to post Share on other sites