Eternal Lands Official Forums
alvieboy

Excessive system calls in EL (linux)


Hello all,

 

I've been observing an excessive number of system calls made by EL on Linux, which seems to cause excessive CPU usage (+10% on my dual-core system), independently of FPS.

 

Let me give you an example. With the FPS limit set to 25 frames, in a "simple" GL area I get (unfocused, on another desktop):

 

  PID USER	  PR  NI  VIRT  RES  SHR S %CPU %MEM	TIME+  COMMAND																		 
4659 alvieboy  15   0  218m 131m  29m S   15  6.5   0:53.21 el.x86.linux.bi

 

With the fullscreen map on, and on another desktop, I get:

 

  PID USER	  PR  NI  VIRT  RES  SHR S %CPU %MEM	TIME+  COMMAND																		 
4659 alvieboy  15   0  216m 130m  27m S   10  6.4   0:59.75 el.x86.linux.bi

 

With FPS limited to 1, in normal (non-fullscreen map) mode, I get:

 

  PID USER	  PR  NI  VIRT  RES  SHR S %CPU %MEM	TIME+  COMMAND																		 
4659 alvieboy  15   0  217m 130m  27m S   10  6.5   1:09.12 el.x86.linux.bi

 

And with the fullscreen map on (still at 1 FPS), I get:

 

  PID USER	  PR  NI  VIRT  RES  SHR S %CPU %MEM	TIME+  COMMAND																		 
4659 alvieboy  15   0  218m 131m  27m S	9  6.5   1:11.93 el.x86.linux.bi

 

So, after speaking to Vegar, I redid some tests:

 

# time strace -p $(pidof el.x86.linux.bin.new) 2> strace.log

(press Ctrl+C after 1 second)

 

real 0m1.004s

user 0m0.004s

sys 0m0.010s

 

# wc -l strace.log

1415 strace.log

 

This is more than 1K syscalls per second.

 

# grep select strace.log | wc -l

548

 

Half of them are select() syscalls, with {0,0} as timeout:

 

select(8, [7], NULL, NULL, {0, 0}) = 0 (Timeout)

gettimeofday({1204313841, 142053}, NULL) = 0

nanosleep({0, 1000000}, {0, 1000000}) = 0

select(8, [7], NULL, NULL, {0, 0}) = 0 (Timeout)

select(8, [7], NULL, NULL, {0, 0}) = 0 (Timeout)

select(8, [7], NULL, NULL, {0, 0}) = 0 (Timeout)

select(8, [7], NULL, NULL, {0, 0}) = 0 (Timeout)
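To illustrate what that timeout means: a {0, 0} timeval makes select() return immediately, so calling it in a loop is effectively busy-polling. Here is a minimal sketch (illustrative only, not EL code; the fd and timeout values are made up) of the difference between a zero timeout and a real one:

#include <sys/select.h>

/* Busy-poll: with a zero timeout, select() returns at once even if
   nothing is readable, so the caller burns CPU calling it repeatedly. */
static int poll_fd(int fd)
{
    fd_set readfds;
    struct timeval tv = { 0, 0 };

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    return select(fd + 1, &readfds, NULL, NULL, &tv);
}

/* Blocking wait: with a real timeout the process sleeps in the kernel
   until data arrives or the timeout expires - far fewer syscalls. */
static int wait_fd(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    return select(fd + 1, &readfds, NULL, NULL, &tv);
}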

 

Vegar said this might be an SDL bug. Can anyone confirm whether this also happens on Windows?

 

I was even considering using AIO (asynchronous I/O) on Linux with newer kernels, which would allow us to remove the select() call entirely.

 

[edit]

This is also interesting:

# cat strace.log | cut -f1 -d\(| sort| uniq -c
433 gettimeofday
431 nanosleep
548 select

[/edit]

Álvaro

Edited by alvieboy


Please keep in mind that SDL & SDL_net are being used to make cross-platform support easier. It is better to see about getting a bug fix into them instead of adding more system-dependent code to EL or creating our own fork of SDL_net.

Please keep in mind that SDL & SDL_net are being used to make cross-platform support easier. It is better to see about getting a bug fix into them instead of adding more system-dependent code to EL or creating our own fork of SDL_net.

 

Sure, but realise that this might be related to how we use SDL_net, and not to SDL directly.

Anyway, you can clearly see that it uses more CPU in kernel mode than in user mode. I think we should at least take a look and see why this is happening (I do believe that this is not the intended behaviour).

 

[Edit]

An 11-second run:

# time strace -p $(pidof el.x86.linux.bin.new) 2> strace.log

real	0m11.628s
user	0m0.037s
sys	 0m0.116s

# cat strace.log | cut -f1 -d\(| sort| uniq -c
 20 futex
  2814 gettimeofday
  2340 nanosleep
  2313 sched_yield
  3696 select

[/Edit]

 

Álvaro

Edited by alvieboy


This is from the SDL_net source, inside SDLNet_CheckSockets() (the file I found this in hasn't changed since 2004):

/* Set up the timeout */
tv.tv_sec = timeout/1000;
tv.tv_usec = (timeout%1000)*1000;

/* Look! */
retval = select(maxfd+1, &mask, NULL, NULL, &tv);
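For reference, here is how that arithmetic maps the millisecond argument onto the timeval handed to select(), worked out from the two lines above (the 100 ms value is what multiplayer.c reportedly passes, as discussed later in this thread):

/* timeout = 0    ->  tv = {0, 0}       : select() returns immediately       */
/* timeout = 100  ->  tv = {0, 100000}  : select() blocks for up to 100 ms   */
/* timeout = 2500 ->  tv = {2, 500000}  : 2 s + 500000 us                    */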

 

This is certainly not passing {0, 0}, and I have no idea what else could be calling select().


Actually, it's not the system calls that take CPU time, but rendering (which also requires system calls); for me it is about 20% of one core.

Without rendering (in console mode for example) it takes less than 1% of user time and 0% sys :)

 

I see few select() calls; most of my calls are getpid().

sched_yield is generated by the rendering library.

select() is most likely from SDL_net (and a zero timeout is possible if 'timeout = 0' in your quote :) - a false statement, according to Vegar :)

 

And my getpid() is the same kind of thing as your gettimeofday(): most likely it is used as a NOP by some library (the rendering one, I think).

Edited by Alia


Further investigation using strace gives me this:

socket(PF_FILE, SOCK_STREAM, 0)
connect(6, {sa_family=AF_FILE, path="/tmp/.X11-unix/X0"}, 19)

That's the socket the excessive select() calls are checking.

 

Alia: It's not SDL_net that's causing the select()s in question. I've recompiled SDL_net with some extra printfs, and the timeout is always 100 (as passed in multiplayer.c).

Edited by Vegar

Further investigation using strace gives me this:

socket(PF_FILE, SOCK_STREAM, 0)
connect(6, {sa_family=AF_FILE, path="/tmp/.X11-unix/X0"}, 19)

That's the socket the excessive select() calls are checking.

 

Well, then it is also about graphics, and there is nothing we can do about it.

And for me, select() calls are less than 1%, so it may depend on the X server libraries or the graphics driver.


Please do the following experiment:

Go to VOTD, somewhere without lots of people.

Check to see how many syscalls you get per second (on average).

Do the same in Isla Prima (again, in an unpopulated area).

Then the same in Whitestone.

 

Please let me know if there is a significant difference in calls or system CPU time.


Looks like it's SDL_PollEvent:

 

Breakpoint 1, 0xb7abf070 in select () from /lib/libc.so.6
(gdb) back
#0  0xb7abf070 in select () from /lib/libc.so.6
#1  0xb7f2ee88 in ?? () from /usr/lib/libSDL-1.2.so.0
#2  0x00000008 in ?? ()
#3  0xb7f2eed3 in ?? () from /usr/lib/libSDL-1.2.so.0
#4  0x08b71418 in ?? ()
#5  0x00a8168b in ?? ()
#6  0x00000400 in ?? ()
#7  0xb7efd9f5 in SDL_PumpEvents () from /usr/lib/libSDL-1.2.so.0
#8  0xb7efdef7 in SDL_PollEvent () from /usr/lib/libSDL-1.2.so.0
#9  0x080b0098 in start_rendering () at main.c:137
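For context, here is a minimal sketch of per-frame event polling with SDL 1.2 (illustrative only, not EL's actual loop in main.c; the commented calls are placeholders). Each SDL_PollEvent() call pumps events, and on X11 that involves checking the display connection, which appears to be where those select() calls on the X socket come from:

#include <SDL/SDL.h>

static void frame_loop(void)
{
    int done = 0;
    SDL_Event event;

    while (!done) {
        /* Non-blocking: pumps pending events and returns 0 when the queue is empty. */
        while (SDL_PollEvent(&event)) {
            if (event.type == SDL_QUIT)
                done = 1;
        }
        /* render_frame(); SDL_GL_SwapBuffers(); fps_limiter_sleep(); */
    }
}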


After doing some more debugging (thanks, Vegar, for all the tips) and enabling SDL_EVENTTHREAD, the number of system calls was reduced by half, which decreased the CPU usage a bit.

 

The main reason I want the number of syscalls to be as low as possible is to save power on laptop systems. I know most EL players don't use laptops (or don't use them on battery power), but if we can have a "laptop mode" (even if #ifdef'ed), it would be welcomed by those with low power requirements.

 

On newer Intel systems (after the Pentium M), which support full ACPI sleep states, the more time you spend in deep power-saving states (such as C2/C3/C4), the less current you draw from the battery and the longer the system lasts. In the end, to save the most power you have to let the CPU sleep for as long as possible. On my system, with the full system (HW) running (including Bluetooth), around 850 context switches (CS) per second are reported. This is not very good. But when EL is running, the number of CS is around 2900 under the same conditions - meaning the CPU is, on average, woken every 1/2900 seconds. This completely kills the CPU power-saving features, because the latency of switching from the high-power state (C0) to C3 is so large that it is not feasible with average sleeps of only a few hundred microseconds.

 

So, I'll try to improve the "FPS limiter" to use the longest sleep times possible, so that power can be saved (by the CPU and by the GPU [the GPU is already being saved]).

 

This could later be enabled with #define LAPTOP if someone is interested in it, also because it would hurt high performance: I have to avoid computing accurate time, rely on the last known average frame time, and use some heuristic to keep the framerate roughly constant.

 

Álvaro

 

References: http://www.lesswatts.org/projects/powertop/powertop.php

I have to avoid computing accurate time, rely on the last known average frame time, and use some heuristic to keep the framerate roughly constant.

I have increased the number of SDL_GetTicks() calls in my last commit to get better accuracy on this, but I think it didn't change much, so I'll remove them, since it should also have increased the number of gettimeofday() calls...


OK, the limiter is not very accurate, and it requires Linux :/

 

But there are some improvements:

 

$ time strace -p $(pidof eltest) 2> log2

 

real 0m1.024s

user 0m0.011s

sys 0m0.022s

 

$ wc -l log2

867 log2

 

Most of those calls are now sched_yield(). Lots of them.

 

$ cat log2| cut -f1 -d\(| sort| uniq -c

47 gettimeofday

24 nanosleep

770 sched_yield

24 select

 

I'm using select() to sleep and to catch interrupted system calls (it returns the time not slept). My FPS is 20, so it's reasonably accurate.
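A small sketch of that select()-as-sleep trick, assuming Linux semantics (illustrative only, not the actual limiter code): on Linux, select() updates the timeout with the time not yet slept, so an interrupted sleep can simply be resumed.

#include <errno.h>
#include <sys/select.h>

/* Sleep for the remainder of the frame; no fds are watched, so select()
   is used purely as a sleep. On EINTR the (Linux-updated) tv holds the
   time that was not slept, so we just call select() again. */
static void sleep_usec(long usec)
{
    struct timeval tv;

    tv.tv_sec  = usec / 1000000;
    tv.tv_usec = usec % 1000000;
    while (select(0, NULL, NULL, NULL, &tv) == -1 && errno == EINTR)
        ;
}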

 

Will look at that sched_yield thing.

 

Álvaro


This yield() thing does not happen until some other actor enters the scene.

 

alvieboy@della:~$ time strace -p $(pidof eltest) 2>log3

 

real 0m1.049s

user 0m0.003s

sys 0m0.009s

 

$ wc -l log3

85 log3

 

This is before I see a fox, for example. After that it goes up, even when the fox leaves the scene. Any ideas? The sched_yield() is being called from within the GL libraries; I was unable to get a proper backtrace though.

This yield() thing does not happen until some other actor enters the scene.

 

alvieboy@della:~$ time strace -p $(pidof eltest) 2>log3

 

real 0m1.049s

user 0m0.003s

sys 0m0.009s

 

$ wc -l log3

85 log3

 

This is before I see a fox, for example. After that it goes up, even when the fox leaves the scene. Any ideas? The sched_yield() is being called from within the GL libraries; I was unable to get a proper backtrace though.

If it's in the GL libraries, I doubt you can do very much. About the only thing is whether or not you've configured it to wait for VSYNC before swapping frames.

If it's in the GL libraries, I doubt you can do very much. About the only thing is whether or not you've configured it to wait for VSYNC before swapping frames.

 

Not related to that, I think. And it makes me suspicious, because it only happens in certain situations.

 

But since it looks like it's being triggered, it might well be something we're not doing properly.

 

I'll try to isolate and see where it might come from.

 

Álvaro


Ok, found it.

 

And it does not surprise me at all. It's glReadPixels(), used for the mouse position check, which forces a full flush of the GPU pipeline so that it can read back the rendered scene.

 

I have been reading some documents (like this one) that have some clues about how readback can be optimized. I was wondering if one frame of latency could be acceptable (if the PBO extension is available) for checking whether the mouse was over some item - this would avoid the flush in principle, at the cost of that latency. This would help not only the framerate but also the power-saving issue, IMHO.
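For the record, a rough sketch of the one-frame-latency PBO readback idea (hedged: function and buffer names are illustrative, extension-pointer loading is omitted, and this is not EL code). Frame N issues glReadPixels into a pixel-pack buffer, which can return without stalling the pipeline; frame N+1 maps the buffer and inspects the pixel:

#include <GL/gl.h>
#include <GL/glext.h>

static GLuint mouse_pbo;   /* created once, e.g. at init time */

void init_mouse_pbo(void)
{
    glGenBuffersARB(1, &mouse_pbo);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, mouse_pbo);
    glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB, 4, NULL, GL_STREAM_READ_ARB);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
}

/* Frame N: queue the read; with a pack PBO bound the last argument is an
   offset into the buffer, not a client pointer, so this need not block. */
void queue_mouse_pixel_read(int x, int y)
{
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, mouse_pbo);
    glReadPixels(x, y, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
}

/* Frame N+1: map the buffer and look at the result from the previous frame. */
GLuint fetch_mouse_pixel(void)
{
    GLuint colour = 0;
    GLubyte *p;

    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, mouse_pbo);
    p = (GLubyte *) glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
    if (p) {
        colour = ((GLuint)p[0] << 16) | ((GLuint)p[1] << 8) | p[2];
        glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);
    }
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
    return colour;
}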

 

You OpenGL gurus, any ideas? I'm planning to try this out once I have some spare time.

 

Álvaro

Ok, found it.

 

And it does not surprise me at all. It's glReadPixels(), used for the mouse position check, which forces a full flush of the GPU pipeline so that it can read back the rendered scene.

 

I have been reading some documents (like this one) that have some clues about how readback can be optimized. I was wondering if one frame of latency could be acceptable (if the PBO extension is available) for checking whether the mouse was over some item - this would avoid the flush in principle, at the cost of that latency. This would help not only the framerate but also the power-saving issue, IMHO.

 

You OpenGL gurus, any ideas? I'm planning to try this out once I have some spare time.

 

Álvaro

Unless things have changed, ReadPixel is being called as the frame is being built up, to see if the pixel changes as each item near the cursor is drawn, in order to detect a change. What's needed is a faster/better method; a one-frame delay won't give the functionality to see which item is under the mouse.

 

There is a GL-specific method for finding the object under the mouse, but in previous discussions it was mentioned that it was worse in many ways.


You can try to increase the mouse limit in the option window. It'll test the objects under the mouse every X frames.

We can also do selection with the GL_SELECT method; I don't know if it'll be faster or if it'll save system calls, though...
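For comparison, a hedged sketch of what GL_SELECT-based picking looks like (illustrative only; draw_pickable_scene and the object names are made up, not EL code). A pick matrix restricts rendering to a few pixels around the cursor, names are pushed while drawing, and glRenderMode(GL_RENDER) returns the hit records:

#include <GL/gl.h>
#include <GL/glu.h>

#define PICK_BUFFER_SIZE 512

/* Returns the name of one hit object, or 0 if nothing was hit.
   draw_pickable_scene() is expected to call glLoadName(id) per object. */
GLuint pick_object_under_mouse(int mouse_x, int mouse_y,
                               void (*draw_pickable_scene)(void))
{
    GLuint buffer[PICK_BUFFER_SIZE];
    GLint  viewport[4], hits;

    glGetIntegerv(GL_VIEWPORT, viewport);
    glSelectBuffer(PICK_BUFFER_SIZE, buffer);
    glRenderMode(GL_SELECT);

    glMatrixMode(GL_PROJECTION);
    glPushMatrix();
    glLoadIdentity();
    /* Restrict rendering to a 5x5 pixel region around the cursor
       (GL's window origin is bottom-left, hence the flip). */
    gluPickMatrix((GLdouble) mouse_x, (GLdouble)(viewport[3] - mouse_y),
                  5.0, 5.0, viewport);
    /* ...apply the normal projection here (gluPerspective etc.)... */

    glInitNames();
    glPushName(0);
    draw_pickable_scene();

    glMatrixMode(GL_PROJECTION);
    glPopMatrix();
    glMatrixMode(GL_MODELVIEW);

    hits = glRenderMode(GL_RENDER);
    /* Each hit record is: name count, min depth, max depth, names...
       A real implementation would keep the record with the smallest depth. */
    return (hits > 0) ? buffer[3] : 0;
}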

 

EDIT: learner was faster ;)

 

However...

There is a GL-specific method for finding the object under the mouse, but in previous discussions it was mentioned that it was worse in many ways.

Do you have a link to these discussions? I don't see why it would be worse in many ways...?

Edited by Schmurk

a one-frame delay won't give the functionality to see which item is under the mouse.

 

Yes it does, if you "queue" the checks plus the results from glMapBuffer(). That's how I see it.

 

Just queue "check for AAAA, use buffer N in the next frame to see if we were there". I might be wrong about this, but it's worth a try.

 

Álvaro


The system basically works like this:

 

-draw N scenes

-draw scene for selection to backbuffer (no textures, all objects in a different solid color)

-read pixels near mouse position (like 20x20 pixels area)

-select object closest to mouse by color

 

The backbuffer then looks like this:

[attached screenshot of the colour-coded selection back buffer: gallery_42721_12_124741.png]

Also see this thread for a patch which extracts the whole backbuffer.

 

To prevent excessive glReadPixels calls you could grab the whole screen, but that is very slow. A better idea would be grabbing a larger area (maybe 200x200 pixels, or N*20 x N*20 with N being the number of frames between selection updates) around the cursor, and using this area for checks until it's outdated or the cursor moves outside it. This has the downside that you'd need to keep track of which area you've grabbed, but that should not be too big a problem.
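To make the readback step described above concrete, here is a small sketch of reading the area around the cursor and decoding the colour back into an object ID (the 20x20 size and the RGB-to-ID packing are illustrative, not EL's actual encoding):

#include <GL/gl.h>

#define PICK_SIZE 20   /* 20x20 pixel area around the cursor */

/* Assumes the selection pass has just been drawn into the back buffer. */
GLuint pick_from_backbuffer(int mouse_x, int mouse_y, int window_height)
{
    GLubyte pixels[PICK_SIZE * PICK_SIZE * 3];
    int     x, y, best_dist = PICK_SIZE * PICK_SIZE;
    GLuint  id = 0;

    glReadBuffer(GL_BACK);
    glReadPixels(mouse_x - PICK_SIZE / 2,
                 (window_height - mouse_y) - PICK_SIZE / 2,  /* GL origin is bottom-left */
                 PICK_SIZE, PICK_SIZE, GL_RGB, GL_UNSIGNED_BYTE, pixels);

    /* Pick the non-background pixel closest to the centre of the grabbed area. */
    for (y = 0; y < PICK_SIZE; y++) {
        for (x = 0; x < PICK_SIZE; x++) {
            const GLubyte *p = &pixels[(y * PICK_SIZE + x) * 3];
            int dx = x - PICK_SIZE / 2, dy = y - PICK_SIZE / 2;
            int dist = dx * dx + dy * dy;

            if ((p[0] | p[1] | p[2]) != 0 && dist < best_dist) {
                best_dist = dist;
                id = ((GLuint)p[0] << 16) | ((GLuint)p[1] << 8) | p[2];
            }
        }
    }
    return id;   /* 0 means nothing selectable near the cursor */
}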

-draw N scenes

How do you compute N?

 

-draw scene for selection to backbuffer (no textures, all objects in a different solid color)

-read pixels near mouse position (like 20x20 pixels area)

-select object closest to mouse by color

You mean, read pixels every time you write a new object there, right?

To prevent excessive glReadPixels calls you could grab the whole screen, but that is very slow. A better idea would be grabbing a larger area (maybe 200x200 pixels, or N*20 x N*20 with N being the number of frames between selection updates) around the cursor, and using this area for checks until it's outdated or the cursor moves outside it. This has the downside that you'd need to keep track of which area you've grabbed, but that should not be too big a problem.

 

If I understood the PBO stuff correctly, you can use them asynchronously - the actual data is only read back from the graphics card's VRAM when a glMapBuffer call is issued.

 

My idea was to only do the readback once everything has been flushed down (glFinish()). That way the CPU/driver can keep more operations in the pipeline and only flush them down to the GPU when it thinks it's necessary.

 

Again, I might be wrong about this - I don't know OpenGL that well - but I know hardware, and it all makes sense.

 

Álvaro


Xaphier tried the PBO, but I think there were some issues so it wasn't a viable solution, at least for now.

 

The number of scenes N is computed based on the mouse limit value.

 

Anyway, I think the best solution is to implement a very minimal OpenGL API in software (no textures needed), and do it on the CPU while the video card is busy with other stuff.

Of course, this is a lot of work, but the same system could be used for occlusion culling and other things. I learned that from an OpenGL guru.

