Topic: Fighting the CUDA bug (Read 1337 times)
|
Fred M
|
On a number of computers I get CUDA WUs in the waiting state. That by itself is not a problem, but because some of them are kept in memory, it becomes a serious one: it causes CUDA to go into fallback mode, or even freezes XP completely. I've come up with 2 solutions.
1) Automatically restart the system when the GPU temperature goes below a set value. This works a lot of the time.
2) Automatically restart the system when the number of running GPU exes exceeds a set value. This can only work when your card can hold an extra CUDA task in memory without crashing the system... Mine can hold 7 of them, so I set this value to > 4, that is, 2 extra in memory are allowed.
I have 6.6.36 installed, but I think this is a problem in earlier versions as well. If anyone has the same problems and wants to do some testing: http://efmer.eu/boinc/ This version has 1) implemented. 2) is in beta testing; if anyone wants to test it, let me know.
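For anyone who wants to see the two triggers written down, here is a minimal sketch in Python. It is not the code of the tool linked above. The names `WatchdogConfig` and `should_restart` and the 50 °C threshold are assumptions made up for illustration; only the "> 4" task limit comes from the post. Reading the temperature and counting GPU processes is left out on purpose, since that depends on the driver and OS.

```python
from dataclasses import dataclass


@dataclass
class WatchdogConfig:
    # Hypothetical threshold: a GPU that has stopped crunching is assumed to cool down.
    min_gpu_temp_c: float = 50.0
    # From the post: restart when more than 4 GPU tasks are resident.
    max_gpu_tasks: int = 4


def should_restart(gpu_temp_c: float, gpu_task_count: int, cfg: WatchdogConfig) -> bool:
    """Return True when either trigger described in the post fires."""
    too_cold = gpu_temp_c < cfg.min_gpu_temp_c      # trigger 1: temperature below set value
    too_many = gpu_task_count > cfg.max_gpu_tasks   # trigger 2: too many GPU exes in memory
    return too_cold or too_many
```

A caller would poll the temperature and the process count on a timer and reboot (or, as discussed later in the thread, only restart BOINC) when `should_restart` returns True.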
|
Raistmer
|
Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?
|
Fred M
|
Quote from Raistmer: Do you restart the system (OS) or only BOINC? It seems a BOINC restart should be enough in this case, no?
It may, but better safe than sorry... I think the programs keep on running, sort of, or have crashed, without BOINC knowing. I haven't tested that, as it mostly happens when I'm not around. And does it reallocate all the GPU memory in this case? I hope they fix this problem, but it has been around sooo long, and I believe they think they fixed it. Something like: the rules are OK, so this can't happen.
|
Raistmer
|
It's BOINC's problem. It should never leave GPU apps in memory... Try restarting only BOINC (as an experiment). Even if the CUDA MB app is still running, it should exit after ~30 seconds with zero status (OK) and a "no heartbeat" message. Then it can be restarted from its checkpoint. If BOINC is restarted sooner (and it should be), no additional apps will be launched (more exactly, they will exit immediately with a "can't aquire lock" message). An OS reboot is too wasteful: so many CPU cycles lost... (A BOINC restart is too, but it is still better than a task crash or CPU fallback, of course.)
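The heartbeat timeout described above can be sketched roughly like this. It is an illustration only, not the BOINC API's actual code: the class `HeartbeatMonitor` and its method names are made up, and the 30-second figure is simply the approximate value quoted in the post. Time is passed in explicitly so the logic can be checked without waiting.

```python
HEARTBEAT_TIMEOUT_S = 30.0  # approximate value quoted in the post


class HeartbeatMonitor:
    """Tracks the last heartbeat an app received from the client."""

    def __init__(self, now: float, timeout_s: float = HEARTBEAT_TIMEOUT_S) -> None:
        self.last_heartbeat = now
        self.timeout_s = timeout_s

    def on_heartbeat(self, now: float) -> None:
        # Called whenever the client signals that it is still alive.
        self.last_heartbeat = now

    def should_exit(self, now: float) -> bool:
        # True once the client has been silent for longer than the timeout;
        # the app would then exit cleanly (status 0) and resume later from
        # its last checkpoint.
        return now - self.last_heartbeat > self.timeout_s
```

For example, a monitor created at t=0 that sees a heartbeat at t=20 keeps the app alive at t=45 but tells it to exit at t=51.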
|
popandbob
|
I believe the problem is caused by BOINC's safeguard against non-checkpointing apps. If an application doesn't reach a checkpoint, it will be left in memory regardless of what settings have been set.
Bob
|
Raistmer
|
Quote from popandbob: I believe the problem is caused by BOINC's safeguard against non-checkpointing apps. If an application doesn't reach a checkpoint, it will be left in memory regardless of what settings have been set.
CUDA MB does checkpoint. So that is not the case, unfortunately...
|
Richard Haselgrove
|
DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.
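As a rough illustration of the change described above (a sketch of the rule as stated in this post, not BOINC's source; the function name and the `new_rule` flag are made up here):

```python
def cuda_app_stays_in_memory(has_checkpointed: bool, new_rule: bool) -> bool:
    """Does a preempted CUDA app remain resident, per the rule described above?

    Old rule: it stays if it was preempted before its first checkpoint.
    New rule: it is cleaned out whether or not it has checkpointed.
    """
    if new_rule:
        return False
    return not has_checkpointed
```

Under the old rule a task preempted during start-up stays resident; under the new rule it is removed and has to start from scratch, which is why early checkpointing matters.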
|
sunu
|
Quote from Richard Haselgrove: DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.
Yes, currently, if the CUDA app is preempted in the first 30 seconds or so of its initialisation on the CPU, it is left in memory, no matter what settings you've got.
|
Jason G
|
That's no joke. I had this in mind for multithreaded apps, triggered by Alex's treatment of spike finding code on Macs. Goodbye 80% of BoincAPI if the tasks can be fast enough to not need to bother checkpointing.
|
Richard Haselgrove
|
Are you volunteering to optimise the AQUA CUDA app? I've got one on my 9800GTX+ (84 GFLOPs) which is estimating 79 hours (and that's a big improvement - the last one took 89 hours).
|
Raistmer
|
Quote from Richard Haselgrove: Are you volunteering to optimise the AQUA CUDA app? I've got one on my 9800GTX+ (84 GFLOPs) which is estimating 79 hours (and that's a big improvement - the last one took 89 hours).
Hehe, no-no-no. As Artifical Realm closed (w/o any explanation, sadly), it's SETI-only here for now. Because many users now run many BOINC projects, optimizing another project's app would help speed up SETI too, but that is too peripheral a target.
|
Claggy
|
Quote from Richard Haselgrove: DA has recently 'checked in' (i.e. modified the source code, but not yet compiled a new version) a change: previously/currently, BOINC would leave a CUDA app in memory if it was preempted before the first checkpoint. In future, it will be cleaned out even if it has never checkpointed - so application developers, get your checkpointing code working early on in the development process.
DA has also 'checked in' another GPU related change after your question today. Changeset 18531
Claggy
|
Richard Haselgrove
|
Quote from Claggy: DA has also 'checked in' another GPU related change after your question today. Changeset 18531
And he's had to fix his own typos: Changeset 18533
Band-aid time, I reckon - but it's a [tacit] acknowledgement of the FIFO bug.....
|
Fred M
|
I had about 6 reboots last night. Today I caught one just in time to see something, because it happens really quickly. I did an exit from BOINC Manager with "Stop running science applications" checked. That did indeed close them, so they had not crashed. After starting BOINC again, everything works again... for the time being, that is.