Seti@Home optimized science apps and information
 
Topic: Some thinking and theoretic discussion about seti client on GPU
Devaster
Some thinking and theoretic discussion about seti client on GPU
« on: 19 Dec 2007, 01:38:23 pm »

Now I am thinking about how best to parallelize the pulse find. In the standard code, pulses are calculated serially, by calling a function in the main analysis loop.

What happens if I do something like this: I take the loop that finds pulses for a given FFT size and run its iterations in NumPoints/fftlen threads?

I think this would be a nice parallelization. But there is one extreme case: for FFT sizes bigger than 4096, the number of parallel threads drops from 256 to 8. Maybe there will be a performance bottleneck, or GPU utilization will be very low...

I must test this tomorrow... see ya!
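
To put numbers on the thread-count worry, here is a minimal sketch of the NumPoints/fftlen arithmetic. It assumes NumPoints is 1048576 (the post does not state the value) and one thread per FFT chunk; `threads_for_fft_len` is a hypothetical helper, not a function from the SETI@home source.

```c
#include <assert.h>

/* Assumed number of complex samples in the work unit; not stated in the post. */
#define NUM_POINTS 1048576u

/* One thread per FFT chunk, so the thread count is NumPoints / fftlen. */
unsigned threads_for_fft_len(unsigned fft_len)
{
    assert(fft_len > 0 && NUM_POINTS % fft_len == 0);
    return NUM_POINTS / fft_len;
}
```

Under that assumption, an FFT length of 4096 gives 256 threads and a length of 131072 gives only 8, which matches the range quoted above.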
Logged

Devaster
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #1 on: 19 Dec 2007, 02:17:40 pm »

About pulse find: I think it would be better to write all the kernels by hand rather than have them generated automatically. Loop unrolling could be used too, for better performance...
Logged

popandbob
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #2 on: 20 Dec 2007, 12:25:34 am »

To follow up on my last question..

Once all is programmed in CUDA will the CPU usage still be 100%? I know that Folding@home's ATI GPU client is... but I believe that's due to them not using CUDA...

~BoB
Logged
Devaster
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #3 on: 20 Dec 2007, 07:59:49 am »

I don't know. There will still be some parts that run on the CPU...
Logged

popandbob
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #4 on: 20 Dec 2007, 11:11:22 pm »

Thanks for the reply Devaster, I do hope CPU usage won't be at 100%, because then at least we would have something that Folding@home doesn't: a GPU app that can run alongside CPU apps (i.e. we don't have to reserve a core for the GPU app).

~BoB
Logged
Devaster
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #5 on: 21 Dec 2007, 11:53:33 am »

But from my observations, that 100% CPU usage is only an "empty loop" waiting for the driver response. At home, when I run some pure GPU code from the CUDA SDK, I haven't seen any significant slowdown...
Logged

abachler
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #6 on: 22 Dec 2007, 02:56:56 am »

You are probably better off processing at least part of the WU on the CPU, so that it stays busy while the GPU is processing the rest. As for the FFT taking so long in RM: yes, due to the nature of the FFT algorithm, it is difficult to implement on a GPU without killing performance, but never fear, there is a workaround. :) Then again, since the CPU is idle, you should process a separate WU on the CPU while the GPU is processing the other. I think ultimately the BOINC client will have to take care of recognizing when it should start multiple clients, including one for the GPU, and of using only one client per GPU.
« Last Edit: 23 Dec 2007, 04:46:15 am by abachler » Logged
roisen.dubh
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #7 on: 30 Dec 2007, 01:48:49 am »

From what I understand, chirping the data is what takes the most crunching. If getting the FFTs to crunch on the GPU is what is causing the GPU client to go so slowly, why not have the GPUs chirp the data, and then send it to the CPU for the FFTs?

Or I could be completely mistaken
Logged
Jason G
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #8 on: 30 Dec 2007, 01:55:54 am »

From what I understand, chirping the data is what takes the most crunching. If getting the FFTs to crunch on the GPU is what is causing the GPU client to go so slowly, why not have the GPUs chirp the data, and then send it to the CPU for the FFTs?
From vague memory of when I did some profiling on my P4s [may or may not be relevant to GPU processing, don't know], from most intensive to slightly less intensive:
    Pulse Folding/Finding, sheer moving data about the place, then Chirping, then iFFTs & FFTs, then Gauss fitting.  Each of which varies by angle range and task content.

[Baseline Smoothing showed up somewhere too, but I don't remember how expensive that was... lower down on the list I think]

I remember at the time thinking these processing tasks each seemed to use a more even proportion of the total processing time than I would have expected. [Something like each major inner function taking around 4 to 11% of total execution time]


Jason
« Last Edit: 30 Dec 2007, 02:08:57 am by j_groothu » Logged
Devaster
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #9 on: 30 Dec 2007, 10:24:31 am »

Yes, I know that pulse find is the most time-consuming operation, but I must begin with something easy: FFT, power spectrum, data chirp...
If you take a look at the pulse find code, it is more complex than find spike, for example, and for now I am not good enough to easily convert/rewrite the code for a parallel architecture...
Logged

Jason G
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #10 on: 30 Dec 2007, 10:54:15 am »

One thing about the inner loops of the pulse folding, and the chirping routines also: there are a few different, very well hand-vectorised versions in there. Though I don't know much about GPU programming at all, I'd imagine they'd need a similar kind of loop iteration independence, blocking, etc. to take advantage of the parallelism. So it may actually help you to examine some of the SSE/SSE2 optimised/vectorised code rather than the standard C code. As a wild guess on my part, some of it may be possible to translate almost straight to GPU code. Though definitely not the fastest or most suitable for the chip, it may be closer to what you need than the stock code, at least in concept.
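
As an illustration of what loop iteration independence means here, the sketch below computes a power spectrum from complex FFT output. It is not taken from the SETI@home source; `complex_f` and `power_spectrum` are hypothetical names, and the re*re + im*im formula is the usual definition of power for a complex sample.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved complex sample, as an FFT would produce (hypothetical type). */
typedef struct { float re, im; } complex_f;

/* out[i] depends only on in[i], so iterations can run in any order:
   several at a time with SSE, or one per thread on a GPU. */
void power_spectrum(const complex_f *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i].re * in[i].re + in[i].im * in[i].im;
    }
}
```

A loop whose iteration i reads a result written by iteration i-1 would not have this property and would need restructuring before it could be vectorised or spread across threads.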

Let us know if you need help with understanding some of the SSE2 code and/or intrinsics used etc...

Just a thought.

Jason
Logged
Devaster
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #11 on: 31 Dec 2007, 11:14:48 am »

Now I am working on the find spikes code. The approach I have used is this:
in the original SETI code, all steps are called sequentially for every FFT chunk in the main analysis loop. I call the analysis functions for all chunks at once...
Original SETI code:
Code:
//main analyse - top analyse loop
for (icfft = state.icfft; icfft < num_cfft; icfft++)
.
.
.
       for (ifft = 0; ifft < NumFfts; ifft++)
       //inner loop for fft chunks
       fft calc;
       find spike
       .
       .
       .

With CUDA and its thread model I can create as many threads as there are FFT chunks and run them on the GPU. This eliminates the inner loop over FFT chunks, so the code looks like:
Code:
//main analyse - top analyse loop
for (icfft = state.icfft; icfft < num_cfft; icfft++)
.
.
.
fft calc; - for all chunks at one time
find spike - for all chunks at one time
       .
       .
       .
Imagine you have a CPU that can run 128K find spikes at once and return only the best spike, plus a result spike if it is bigger than the spike threshold...
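
The "one find spike per chunk" idea can be sketched in plain C as below. This is only an illustration under assumptions: `spike_t`, `best_spike_in_chunk` and `find_spikes_all_chunks` are hypothetical names, the power values are assumed to be stored chunk after chunk in one array, and the outer loop stands in for what would be a kernel launch on the GPU.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t bin;    /* index of the strongest bin inside the chunk */
    float  power;  /* power found in that bin */
} spike_t;

/* Work of one "thread": scan a single chunk of fft_len power values. */
spike_t best_spike_in_chunk(const float *power, size_t fft_len)
{
    assert(power != NULL && fft_len > 0);
    spike_t best = { 0, power[0] };
    for (size_t i = 1; i < fft_len; i++) {
        if (power[i] > best.power) {
            best.bin = i;
            best.power = power[i];
        }
    }
    return best;
}

/* Stand-in for the kernel launch: on a GPU each value of `chunk`
   would be handled by its own thread instead of a loop iteration. */
void find_spikes_all_chunks(const float *power, size_t fft_len,
                            size_t num_chunks, spike_t *best)
{
    for (size_t chunk = 0; chunk < num_chunks; chunk++) {
        best[chunk] = best_spike_in_chunk(power + chunk * fft_len, fft_len);
    }
}
```

Each chunk writes only to its own slot in `best`, so no synchronisation between chunks is needed; comparing against the spike threshold can then happen per chunk or in a later pass.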
Logged

Josef W. Segur
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #12 on: 31 Dec 2007, 03:41:02 pm »

As long as the logic can report the same first 30 spikes for an overflow, that seems excellent.
                                                          Joe
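
One way to keep the "same first 30" property is to let the parallel pass store a result per chunk and then pick the reported spikes in the original serial order. The sketch below assumes that layout; `first_spikes_in_serial_order` is a hypothetical helper, and it only covers ordering within one pass over the chunks, not across chirp rates or FFT lengths.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REPORTED 30  /* overflow limit of 30 spikes, as mentioned above */

/* Walk per-chunk results in serial chunk order and record the indices of
   the first MAX_REPORTED chunks whose best power exceeds the threshold.
   Returns how many indices were written to `reported`. */
size_t first_spikes_in_serial_order(const float *best_power, size_t num_chunks,
                                    float threshold, size_t *reported)
{
    assert(best_power != NULL && reported != NULL);
    size_t count = 0;
    for (size_t chunk = 0; chunk < num_chunks && count < MAX_REPORTED; chunk++) {
        if (best_power[chunk] > threshold) {
            reported[count++] = chunk;
        }
    }
    return count;
}
```

Because the selection walks chunk indices in ascending order, the result does not depend on the order in which the GPU threads happened to finish.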
Logged
Vyper
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #13 on: 02 Jan 2008, 06:06:53 am »

I remember in the good old assembly days when you could program the CPU to do other things while the other hardware was running; when the hardware was done it generated an interrupt, so the code would jump to a specific place, fetch what the hardware had just done or start the next part, and then go back to what it was doing before?!

Wonder if the s@h code is linear?
By that I mean: do you need to process the WU in a specific order, or could findpulse be ahead of FFT and vice versa?

Kind Regards Vyper
Logged
Josef W. Segur
Re: Some thinking and theoretic discussion about seti client on GPU
« Reply #14 on: 02 Jan 2008, 04:33:41 pm »

...
Wonder if the s@h code is linear?
By that I mean: do you need to process the WU in a specific order, or could findpulse be ahead of FFT and vice versa?

The basic code is of course linear because it is written for a single worker thread, but there are high level loops which could be modified to distribute to other processors. Here's a quick overview of processing:

1. Read data from the WU, convert to floating point, baseline smooth. This is done once per startup and produces an 8 MiB array.

2. Dechirp the above array into another same size array. This is done at a lot of incremental chirp rates from zero through +/- 100 Hz/sec. It loops between 37193 and 108194 times.

3. Do FFTs on the dechirped array to produce narrower frequency bands for analysis. The original array has 9765.625 Hz. bandwidth, we analyze at bandwidths ranging from 1220.7 Hz. to 0.0745 Hz. At zero chirp all 15 FFT lengths are used, at quite a few other chirps only one FFT length is used, so this would be an awkward place to try to parallelize on that basis. However, each FFT length is used multiple times; for instance length 8 is used 128K times and those can be done in parallel.

4. Convert the FFT output to PowerSpectrum data and analyze for Spikes, Gaussians, Triplets, and Pulses. If the telescope moved more than one beam width during recording of the work, for Triplets and Pulses the data is divided into chunks with just one beam width worth of data.

Basically the data has to be organized before it can be analyzed, but there are opportunities to split the processing into parallel paths.
                                                          Joe
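
The figures in step 3 can be checked with a little arithmetic. The sketch below assumes the FFT lengths are the powers of two from 8 to 131072, which is inferred from the quoted bandwidths rather than stated outright; `bin_bandwidth_hz` and `count_fft_lengths` are hypothetical helpers.

```c
#include <assert.h>

#define TOTAL_BANDWIDTH_HZ 9765.625  /* bandwidth of the original array, from step 3 */

/* Bandwidth of one frequency bin after an FFT of the given length. */
double bin_bandwidth_hz(unsigned fft_len)
{
    assert(fft_len > 0);
    return TOTAL_BANDWIDTH_HZ / fft_len;
}

/* Number of power-of-two FFT lengths from min_len to max_len inclusive
   (assumes the lengths double at each step). */
unsigned count_fft_lengths(unsigned min_len, unsigned max_len)
{
    assert(min_len > 0);
    unsigned count = 0;
    for (unsigned len = min_len; len <= max_len; len *= 2) {
        count++;
    }
    return count;
}
```

With those assumptions, length 8 gives about 1220.7 Hz per bin, length 131072 gives about 0.0745 Hz, and there are 15 lengths in total, in line with the numbers in the overview.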
Logged