Seti@Home optimized science apps and information
 
Author Topic: Fighting the CUDA bug  (Read 1674 times)
Fred M
Alpha Tester
Knight o' The Realm
***
Offline Offline

Posts: 99



WWW
Fighting the CUDA bug
« on: 29 Jun 2009, 03:44:33 am »

On a number of computers I get CUDA WUs in the "waiting" state. That in itself is not a problem, but some of them are kept in memory, and that becomes a serious problem.
It causes CUDA to go into fallback mode, or even freezes XP completely.

I've come up with 2 solutions.

1) Automatically restart the system when the GPU temperature drops below a set value. This often works.
2) Automatically restart the system when the number of running GPU exes exceeds a set value. This only works if your card can hold an extra CUDA task in memory without crashing the system. Mine can hold 7 of them, so I set this value to > 4; that is, 2 extra in memory are allowed.
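
The two triggers above can be sketched as one decision function. This is only an illustration of the idea, not the code in the tool linked below; the names `WatchdogConfig`, `should_restart`, `min_gpu_temp_c` and `max_gpu_tasks`, and the default threshold values, are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class WatchdogConfig:
    # Hypothetical thresholds; pick values that suit your own card.
    min_gpu_temp_c: float = 50.0  # a colder GPU is assumed to be idle or in fallback
    max_gpu_tasks: int = 4        # restart when more GPU exes than this are in memory


def should_restart(gpu_temp_c: float, gpu_task_count: int, cfg: WatchdogConfig) -> bool:
    """Return True when either trigger fires."""
    too_cold = gpu_temp_c < cfg.min_gpu_temp_c      # trigger 1: temperature dropped
    too_many = gpu_task_count > cfg.max_gpu_tasks   # trigger 2: too many GPU exes
    return too_cold or too_many
```

With the defaults, `should_restart(65.0, 5, WatchdogConfig())` fires on the task count alone. A real watchdog would probably also want the condition to persist for a while before acting, so that a short dip in temperature does not cause a reboot.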


I have 6.6.36 installed but I think this is a problem in earlier versions as well.

If anyone has the same problem and wants to do some testing:
http://efmer.eu/boinc/ This version has 1) implemented.
2) is in beta testing; if anyone wants to test it, let me know.
Logged

TThrottle Keep your temperatures controlled.
BoincTasks The best way to view BOINC
Raistmer
Code Wizard
Knight who says 'Ni!'
*****
Online Online

Posts: 6024



Re: Fighting the CUDA bug
« Reply #1 on: 30 Jun 2009, 06:32:53 am »

Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?
Logged
Fred M
Alpha Tester
Knight o' The Realm
***
Offline Offline

Posts: 99



WWW
Re: Fighting the CUDA bug
« Reply #2 on: 30 Jun 2009, 06:45:15 am »

Quote from: Raistmer
Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?

It may, but better be safe...
I think the programs keep on running, sort of, or have crashed, without BOINC knowing. I haven't tested that, as it mostly happens when I'm not around.
And does it reallocate all the GPU memory in this case? I hope they fix this problem, but it has been around for so long, and I believe they think they already fixed it. Something like "the rules are OK, so this can't happen".
Logged

Raistmer
Code Wizard
Knight who says 'Ni!'
*****
Online Online

Posts: 6024



Re: Fighting the CUDA bug
« Reply #3 on: 30 Jun 2009, 07:02:43 am »

It's BOINC's problem. It should never leave GPU apps in memory...

Try restarting only BOINC (as an experiment). Even if the CUDA MB app is still running, it should exit after ~30 seconds with zero status (OK) and a "no heartbeat" message. Then it can be restarted from its checkpoint.
If BOINC is restarted sooner (and it should be), no additional apps will be launched (more exactly, they will exit immediately with a "can't aquire lock" message).

An OS reboot is too wasteful - so many CPU cycles lost... ;) (A BOINC restart is too, but it is still better than a task crash or CPU fallback, of course.)
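
The two mechanisms described above (exit when the client's heartbeat stops, and refuse to start when another instance holds the lock) can be sketched roughly as follows. This is not the BOINC API or the CUDA MB source; `HeartbeatWatcher`, `try_acquire_lock` and the lock-file approach are assumptions chosen for illustration, and only the ~30 second figure comes from the post.

```python
import os

HEARTBEAT_TIMEOUT_S = 30.0  # assumption based on the "~30 seconds" mentioned above


class HeartbeatWatcher:
    """Remembers when the client was last heard from."""

    def __init__(self, now: float) -> None:
        self.last_beat = now

    def beat(self, now: float) -> None:
        # Call this whenever a heartbeat arrives from the client.
        self.last_beat = now

    def should_exit(self, now: float) -> bool:
        # True once the client has been silent for longer than the timeout.
        return now - self.last_beat > HEARTBEAT_TIMEOUT_S


def try_acquire_lock(path: str) -> bool:
    """Create the lock file exclusively; False means another instance holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

An app built this way would call `try_acquire_lock` once at startup and exit immediately on `False`, then poll `should_exit` with the current time in its main loop and leave with status 0 when it returns `True`.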
Logged
popandbob
Guest


Email
Re: Fighting the CUDA bug
« Reply #4 on: 30 Jun 2009, 03:36:01 pm »

I believe the problem is caused by BOINC's safeguard for non-checkpointing apps. If an application hasn't reached a checkpoint, it will be left in memory regardless of the settings.
Bob
Logged
Raistmer
Code Wizard
Knight who says 'Ni!'
*****
Online Online

Posts: 6024



Re: Fighting the CUDA bug
« Reply #5 on: 30 Jun 2009, 03:37:15 pm »

Quote from: popandbob
I believe the problem is caused by BOINC's safeguard for non-checkpointing apps. If an application hasn't reached a checkpoint, it will be left in memory regardless of the settings.
Bob

CUDA MB does checkpoint. So that's not the case, unfortunately...
Logged
Richard Haselgrove
Alpha Tester
Knight who says 'Ni!'
***
Offline Offline

Posts: 970


Re: Fighting the CUDA bug
« Reply #6 on: 30 Jun 2009, 04:26:25 pm »

DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.
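
The before/after behaviour described above can be written down as a tiny rule. This is a sketch of the description in this post, not the actual BOINC client code; the function name and the `new_rule` flag are made up for illustration, and it assumes the user's preference is to remove preempted apps from memory.

```python
def stays_in_memory_after_preempt(has_checkpointed: bool, new_rule: bool) -> bool:
    """Model whether a preempted CUDA task is left in memory.

    Assumes the user preference is "do not leave applications in memory".
    new_rule=False models the current behaviour, new_rule=True the checked-in change.
    """
    if new_rule:
        # After the change: cleaned out even if it has never checkpointed.
        return False
    # Current behaviour: a task that has never checkpointed is left in memory.
    return not has_checkpointed
```

Under the current rule, a task preempted before its first checkpoint keeps its work but also keeps its GPU memory; under the new rule it frees the memory but starts again from scratch, which is why early checkpointing matters.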
Logged
Raistmer
Code Wizard
Knight who says 'Ni!'
*****
Online Online

Posts: 6024



Re: Fighting the CUDA bug
« Reply #7 on: 30 Jun 2009, 04:30:29 pm »

Quote from: Richard Haselgrove
so application developers, get your checkpointing code working early on in the development process.

Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D
Logged
sunu
Alpha Tester
Knight who says 'Ni!'
***
Offline Offline

Posts: 604



Re: Fighting the CUDA bug
« Reply #8 on: 30 Jun 2009, 04:31:46 pm »

Quote from: Richard Haselgrove
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.

Yes, currently, if the CUDA app is preempted in the first 30 seconds or so, while it is still initialising on the CPU, it is left in memory, no matter what settings you've got.
Logged
Jason G
Global Moderator
Knight who says 'Ni!'
*****
Offline Offline

Posts: 5876


Re: Fighting the CUDA bug
« Reply #9 on: 30 Jun 2009, 04:35:43 pm »

Quote from: Raistmer
Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

That's no joke. I had this in mind for multithreaded apps, triggered by Alex's treatment of spike finding code on Macs. Goodbye 80% of BoincAPI if the tasks can be fast enough to not need to bother checkpointing.
Logged
Richard Haselgrove
Alpha Tester
Knight who says 'Ni!'
***
Offline Offline

Posts: 970


Re: Fighting the CUDA bug
« Reply #10 on: 30 Jun 2009, 04:38:30 pm »

Quote from: Raistmer
Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

Are you volunteering to optimise the AQUA CUDA app? I've got one on my 9800GTX+ (84 GFLOPs) which is estimating 79 hours :o (and that's a big improvement - the last one took 89 hours).
Logged
Raistmer
Code Wizard
Knight who says 'Ni!'
*****
Online Online

Posts: 6024



Re: Fighting the CUDA bug
« Reply #11 on: 30 Jun 2009, 04:45:04 pm »

Quote from: Richard Haselgrove
Quote from: Raistmer
Or make your tasks so fast that they will never need to checkpoint ;D ;D ;D

Are you volunteering to optimise the AQUA CUDA app? I've got one on my 9800GTX+ (84 GFLOPs) which is estimating 79 hours :o (and that's a big improvement - the last one took 89 hours).

Hehe, no-no-no. Since Artifical Realm closed (w/o any explanation, sadly), it's SETI only here for now ;)
Since many users now run several BOINC projects, optimizing another project's app would help speed up SETI too, but that is too indirect a way to target it ;D
Logged
Claggy
Alpha Tester
Knight Templar
***
Online Online

Posts: 423


Re: Fighting the CUDA bug
« Reply #12 on: 30 Jun 2009, 04:52:32 pm »

Quote from: Richard Haselgrove
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.

DA has also 'checked in' another GPU related change after your question today.

Changeset 18531

Claggy
Logged
Richard Haselgrove
Alpha Tester
Knight who says 'Ni!'
***
Offline Offline

Posts: 970


Re: Fighting the CUDA bug
« Reply #13 on: 30 Jun 2009, 05:03:31 pm »


Quote from: Claggy
DA has also 'checked in' another GPU related change after your question today.

Changeset 18531

Claggy

And he's had to fix his own typos

Changeset 18533

Band-aid time, I reckon - but it's a [tacit] acknowledgement of the FIFO bug.....
Logged
Fred M
Alpha Tester
Knight o' The Realm
***
Offline Offline

Posts: 99



WWW
Re: Fighting the CUDA bug
« Reply #14 on: 01 Jul 2009, 06:04:42 am »

I had about 6 reboots last night. ;D Today I caught one just in time to see something, because it happens really quickly. I exited the BOINC manager and checked "Stop running science applications".
That did indeed close them, so they had not crashed.
After starting BOINC again, everything works again... for the time being, that is.
Logged
