|
|
Author
|
Topic: optimized sources (Read 44321 times)
|
|
Jason G
|
This block compiles on mine (for comparison; I can see no major functional difference from yours):
----------
CurrentSub = fftlen * (ifft + iC);
sah_complex *WorkArea = &WorkData[iC * fftlen / 2]; // assume sah_complex 2 floats
#if !(defined(USE_IPP) | defined(USE_FFTWF)) // makes memcpy inactive
    memcpy( WorkArea, &ChirpedData[CurrentSub], int(fftlen * sizeof(sah_complex)) );
#endif
#if defined( USE_IPP )
    ippsFFTInv_CToC_32fc(
        ( Ipp32fc * ) &ChirpedData[CurrentSub], // Source
        ( Ipp32fc * ) WorkArea,                 // Destination
        FftSpec[FftNum], FftBuf );
#elif defined( USE_FFTWF )
    fftwf_execute_dft( analysis_plans[FftNum], &ChirpedData[CurrentSub], WorkArea );
#else
    // replace time with freq - ooura FFT
    cdft( fftlen * 2, 1, WorkArea, BitRevTab[FftNum], CoeffTab[FftNum] );
#endif
----------
I did notice it went haywire if I missed out a ( Ipp32fc * ) typecast.
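To make the two routes being compared here concrete, a minimal C++ sketch follows. It does not call IPP, FFTW or the ooura code; `inverse_dft`, `inverse_dft_inplace`, `copy_then_inplace` and `out_of_place` are hypothetical names, and the transform is a naive O(n^2) unnormalised inverse DFT chosen only to keep the example self-contained. It shows only that "copy, then transform in place" and "transform straight from source to destination" give the same result; it says nothing about which is faster.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// Hypothetical stand-in for an out-of-place inverse FFT: reads src, writes dst.
// Naive O(n^2) unnormalised inverse DFT, for illustration only.
void inverse_dft(const cplx* src, cplx* dst, std::size_t n) {
    if (n == 0) return;
    assert(src != dst);  // out-of-place form: buffers must not alias
    const double two_pi = 6.283185307179586;
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = two_pi * double((j * k) % n) / double(n);
            acc += std::complex<double>(src[j]) *
                   std::complex<double>(std::cos(angle), std::sin(angle));
        }
        dst[k] = cplx(float(acc.real()), float(acc.imag()));
    }
}

// Hypothetical in-place form: one buffer is both source and destination.
void inverse_dft_inplace(cplx* srcdst, std::size_t n) {
    if (n == 0) return;
    std::vector<cplx> tmp(srcdst, srcdst + n);  // private copy of the input
    inverse_dft(tmp.data(), srcdst, n);
}

// Old route: copy the chirped data into the work area, then transform in place.
std::vector<cplx> copy_then_inplace(const std::vector<cplx>& chirped) {
    std::vector<cplx> work(chirped);  // plays the role of the memcpy
    inverse_dft_inplace(work.data(), work.size());
    return work;
}

// New route: transform straight from the chirped data into the work area.
std::vector<cplx> out_of_place(const std::vector<cplx>& chirped) {
    std::vector<cplx> work(chirped.size());
    inverse_dft(chirped.data(), work.data(), chirped.size());
    return work;
}
```

Calling both functions on the same input vector should return identical work areas and leave the input untouched, which is the property the no-memcpy variant relies on.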
« Last Edit: 28 Oct 2007, 12:10:13 pm by j_groothu »
_heinz
|
Yes, it compiles on mine too --->
analyzeFuncs.cpp
-----IPP-----
-----SSE2-----
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 0 warning(s)
----------------------------------------------------------------------------------------
heinz
Jason G
|
Ahh, good one. I'm thinking that this new way:
--- Using no memcopy
--- Using the IPP function as intended
is better than the old way:
--- Using a memcopy (even an optimised one, which I was looking at)
--- Using the IPP function in a weird way
Of course, only a test can show whether this makes any speed difference. It will be a while before I can look at a rebuild, as I have more schoolwork and have to give some tutoring this week. Even if it is slower I don't mind, because it has still helped me to understand a small piece more of the code. The next step for me after testing this would probably be to look at Joe's even better suggestions. There are many now! Thanks for trying this, and keep plugging away! Back later in the week!
Jason
_heinz
|
changed benchmark.cpp ----->
--------------------------------------------------------------------------------------------------------
for(loops = 0; loops < 25 && (end_cyc-total_run)< MAX_CYCLES; loops++)
{
    if(pre_test == zero_out) memset( out_buf, 0, test_size );
    if(pre_test == fill_in) memcpy( out_buf, workBuf, test_size );
    ramming_speed();
    cycles = cycleCount();
    switch ( bench_list[idx].token )
    {
    case _FFT:
#if defined( USE_IPP )
        if(pre_test == zero_out)
        {
            ippsFFTInv_CToC_32fc( ( Ipp32fc * ) out_buf, ( Ipp32fc * ) out_buf, FftSpec, NULL );
        }
        else
        {
            ippsFFTInv_CToC_32fc(
                ( Ipp32fc * ) workBuf, // This is the source data, this is not overwritten
                ( Ipp32fc * ) out_buf, // This is some other Buffer destination
                // no memcpy required
                FftSpec, NULL );
        }
#endif
//seti_britta:
#if defined( USE_FFTWF )
        fftwf_execute_dft( da_fft_plan, (sah_complex *)&in_buf[0], (sah_complex *)&out_buf );
#endif
        break;
--------------------------------------------------------------------------------------------------------
It compiles well --->
benchmark.cpp
-----IPP-----
-----SSE2-----
-----ipp-----
-----sse2-----
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer\Release32-NOGFX\BuildLog.htm"
Optimizer - 0 error(s), 0 warning(s)
--------------------------------------------------------------------------------------------------------
I will try this and see whether it works well.... See you again here.
Regards,
heinz
Jason G
|
Ahah, I see.... now that IPP call is "In Place". You can do this:
...
if(pre_test == zero_out)
{
    ippsFFTInv_CToC_32fc(
        // ( Ipp32fc * ) out_buf, // Commented out this to make it inplace
        ( Ipp32fc * ) out_buf,    // This is both source and destination
        FftSpec, NULL );
}
...
Whether it makes any difference is another question. Questions I have are:
- Why benchmark an array of zeroes?
- If a zeroed array needs to be benched, why not test it 'fully' out of place (separate src/dest buffer like below)?
« Last Edit: 28 Oct 2007, 01:35:16 pm by j_groothu »
_heinz
|
Quote from Jason G:
Questions I have are:
- Why benchmark an array of zeroes?
- If a zeroed array needs to be benched, why not test it 'fully' out of place (separate src/dest buffer like below)?

Hmm... maybe Alex Kan or Joe has a good answer.
Josef W. Segur
|
Quote from Jason G:
Questions I have are:
- Why benchmark an array of zeroes?
- If a zeroed array needs to be benched, why not test it 'fully' out of place (separate src/dest buffer like below)?
Quote from _heinz:
Hmm... maybe Alex Kan or Joe has a good answer.

The 2.2B benchmark.cpp source doesn't set pre_test to zero_out anywhere. Setting pre_test = fill_in makes sense for the in place transform, so that it always works on the same random data; that's not needed for out of place. But the FFT benchmark is timing only, and wasted time at that except in standalone runs with -bench or -verbose, since it is not used to choose a "best" variant. The lunatics.at 2.4 builds don't run the FFT benchmark test, though Crunch3r's 2.4V builds, which use IPP FFTs, do. I don't know why Ben Herndon used the out of place form of parameters in the ippsFFTInv_CToC_32fc() calls, but he may have checked the actual code produced and determined that it was slightly more efficient.
Joe
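Joe's point about fill_in can be sketched in a few lines of C++. This is not the benchmark.cpp code: `kernel_inplace`, `BenchResult` and `bench_inplace` are hypothetical names, and the kernel is a trivial recurrence standing in for an in-place FFT. The sketch assumes only that the routine under test overwrites its buffer, so the buffer has to be refilled (outside the timed region) if every loop is to work on the same data.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical in-place kernel standing in for the transform under test.
// It overwrites its buffer, just as an in-place FFT would.
void kernel_inplace(std::vector<float>& buf) {
    float carry = 0.0f;
    for (float& x : buf) {
        carry = 0.5f * carry + x;
        x = carry;
    }
}

struct BenchResult {
    double seconds;                  // time spent inside the kernel only
    std::vector<float> last_output;  // buffer contents after the final loop
};

// Time `loops` runs of the in-place kernel. With refill == true the work
// buffer is reloaded from `source` before each run, so every run sees the
// same input; the reload itself is not timed.
BenchResult bench_inplace(const std::vector<float>& source, int loops, bool refill) {
    assert(loops > 0);
    std::vector<float> work(source);
    auto total = std::chrono::steady_clock::duration::zero();
    for (int i = 0; i < loops; ++i) {
        if (refill) std::copy(source.begin(), source.end(), work.begin());
        const auto t0 = std::chrono::steady_clock::now();
        kernel_inplace(work);
        total += std::chrono::steady_clock::now() - t0;
    }
    return { std::chrono::duration<double>(total).count(), work };
}
```

Without the refill, each loop transforms the previous loop's output, so later loops time the kernel on different data. An out-of-place call never modifies its source, which is why the refill is unnecessary there.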
Jason G
|
Quote from Josef W. Segur:
I don't know why Ben Herndon used the out of place form of parameters in the ippsFFTInv_CToC_32fc() calls, but he may have checked the actual code produced and determined that was slightly more efficient. Joe

I racked my brain about this, and ultimately came to a similar (though more convoluted and speculative) conclusion. It would make sense to me if an explicit out of place call could make better use of the prefetch, cache and parallelism mechanisms we have discussed in a different context. An explicit in place call could not (so far as I can see for now, because of read/write dependencies). After considering that, another possibility presented itself: for the same reasons, the code as originally presented, a memcopy followed by the out of place form of the call (with in place parameters), may simply be faster than the 'true out of place' way we're playing with. If so, I suspect a 'cache doubling effect' from using the same source and destination. The flipside is that if that effect shows up verifiably, it might even indicate that the particular calls are not using streaming writes to start with... possibly bringing your hybridised codelet phased processing screaming to a new sense of urgency. This is more speculation than hard data at the moment. I'll think about some small, simple external tests for a while and stew on it for a couple of weeks.
Jason
_heinz
|
Quote from Jason G:
Ahah, I see.... now that IPP call is "In Place". You can do this:
...
if(pre_test == zero_out)
{
    ippsFFTInv_CToC_32fc(
        // ( Ipp32fc * ) out_buf, // Commented out this to make it inplace
        ( Ipp32fc * ) out_buf,    // This is both source and destination
        FftSpec, NULL );
}

If we do this we get an error message ---->
.\benchmark.cpp(634) : error C2660: 'w7_ippsFFTInv_CToC_32fc' : function does not take 3 arguments
So leave it as it is --->
if(pre_test == zero_out)
{
    ippsFFTInv_CToC_32fc( ( Ipp32fc * ) out_buf, ( Ipp32fc * ) out_buf, FftSpec, NULL );
}
--------------------------------------------
That way it compiles.
heinz
Jason G
|
Quote from _heinz:
so it compiles
heinz

Yes, as we have discovered before, I must need my eyes checked. And it would make sense, if it was ever used in the zero fill context, to leave it using the same form as might occur in a real analysis anyway. For the sake of information, here is the form for the out of place inverse FFT (as exists):
IppStatus ippsFFTInv_CToC_32fc( const Ipp32fc* pSrc, Ipp32fc* pDst, const IppsFFTSpec_C_32fc* pFFTSpec, Ipp8u* pBuffer );
And here is the form for in place:
IppStatus ippsFFTInv_CToC_32fc_I( Ipp32fc* pSrcDst, const IppsFFTSpec_C_32fc* pFFTSpec, Ipp8u* pBuffer );
I am currently learning much about what is connected to what by trying to separate out the benchmark (for exploratory purposes). Piece by piece it connects to almost the whole codebase. There are still a few external references to track down, but I may end up with a stripped down custom testbed for examining the function of different algorithms, libraries and optimised functions. The main reason for this unnecessary but educational exploration is that I may wish to try to see actual differences between the FFT libraries, different compilers and flags, without touching my main copy of the code anymore. I am also interested to see how close to ideal the forward and inverse transforms are when a 'Maximum Length Sequence' is applied as input, rather than zeroes or random data. (I hope I'll get a constant power spectrum, with no spikes etc... We'll see.)
Jason
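The flat-spectrum expectation for a maximum length sequence can be checked with a small C++ sketch. Everything here is an illustrative assumption rather than code from the thread: `make_mls31` and `power_spectrum` are hypothetical names, the sequence comes from a 5-bit LFSR with polynomial x^5 + x^2 + 1 (period 31), and the spectrum is computed with a naive DFT because the sequence length 2^n - 1 is not a power of two, so it would not drop straight into a power-of-two FFT.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// One period (2^5 - 1 = 31 samples) of a maximum length sequence mapped to
// +1/-1, from a 5-bit Fibonacci LFSR with recurrence
// a[t+5] = a[t+2] XOR a[t], i.e. polynomial x^5 + x^2 + 1.
std::vector<double> make_mls31() {
    unsigned state = 1u;  // any non-zero 5-bit seed gives the same cycle
    std::vector<double> seq;
    seq.reserve(31);
    for (int i = 0; i < 31; ++i) {
        seq.push_back((state & 1u) ? 1.0 : -1.0);
        const unsigned feedback = (state ^ (state >> 2)) & 1u;
        state = (state >> 1) | (feedback << 4);
    }
    return seq;
}

// Naive O(n^2) DFT power spectrum |X[k]|^2 of a real sequence.
std::vector<double> power_spectrum(const std::vector<double>& x) {
    assert(!x.empty());
    const double two_pi = 6.283185307179586;
    const std::size_t n = x.size();
    std::vector<double> power(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = -two_pi * double((j * k) % n) / double(n);
            acc += x[j] * std::polar(1.0, angle);
        }
        power[k] = std::norm(acc);
    }
    return power;
}
```

For a +1/-1 sequence of length N = 31 the periodic autocorrelation is N at lag 0 and -1 elsewhere, so every non-DC bin of `power_spectrum(make_mls31())` should come out at N + 1 = 32 and the DC bin at 1. Any spike or dip a real FFT library showed against that reference would point at the transform rather than the input.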
_heinz
|
Hi Jason, here you see the output of ET, which I use to measure the code pieces of two functions, p1 and p2:
--------------------------------------------------------------------------------------------------------------------
ET v1.0 test seti
-------------------
Timer Frequency in:
Hz = 3579545
MHz = 3.57955
GHz = 0.00358
Start Time = 1080132967465 Ticks
Stop Time = 1080134441029 Ticks
Duration in Ticks = 1473564
Duration in seconds = 0.4116623760841
--------------------------------------
Start Time = 1080134443291 Ticks
Stop Time = 1080138377735 Ticks
Duration in Ticks = 3934444
Duration in seconds = 1.0991463998916
--------------------------------------
P1 = 1473564
P2 = 3934444
dif = 2460880
Solution: P1 is faster than P2
Press the Enter Key!
------------------------------------------------------------------------------------------------
So we see the success without running a test WU....
heinz
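The arithmetic behind the ET listing above is simply elapsed ticks divided by the timer frequency. A minimal C++ sketch of that conversion follows; `ticks_to_seconds` and `slowdown_ratio` are hypothetical helper names and are not part of ET or etimer.lib.

```cpp
#include <cassert>
#include <cstdint>

// Convert a start/stop pair of timer ticks to seconds, given the timer
// frequency in Hz (3579545 Hz in the listing above).
double ticks_to_seconds(std::int64_t start_ticks, std::int64_t stop_ticks,
                        std::int64_t freq_hz) {
    assert(freq_hz > 0 && stop_ticks >= start_ticks);
    return static_cast<double>(stop_ticks - start_ticks) /
           static_cast<double>(freq_hz);
}

// How many times longer run B took than run A, from their tick counts.
double slowdown_ratio(std::int64_t ticks_a, std::int64_t ticks_b) {
    assert(ticks_a > 0);
    return static_cast<double>(ticks_b) / static_cast<double>(ticks_a);
}
```

With the figures from the listing, `ticks_to_seconds(1080132967465, 1080134441029, 3579545)` gives about 0.4117 s for P1, and `slowdown_ratio(1473564, 3934444)` shows P2 taking roughly 2.67 times as long.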
Jason G
|
Cool, thanks for the links by PM. They could be quite handy for the things I intend to be looking at soon.... but LOL, where is the etimer.lib file that is discussed on the Intel site? The link at the end of the etimer article is giving me some 3D transform program files instead. If I can't find it I should probably let Intel know their link is broken.... [LOL, now they fixed it! Maybe they read Lunatics]
« Last Edit: 03 Nov 2007, 07:42:27 am by j_groothu »
_heinz
|
maybe....we are one of the most accessed, now more than 22 000 ...... 
Jason G
|
'Tis truly an Epical Thread.... But wait, there's more!.... Using the timers, I ran some big-loop math array test pieces to establish the best optimisation configuration on my old P4 Northwood. With everything else equal:
- the xW sse2 setting I've been using all along = 14.15 secs (repeated runs to make sure)
- the xN sse2 setting I wanted to test properly = 12.8 secs (repeated runs to make sure)
That makes xN builds nearly 10% faster on my old clunker with looping math code! This means that:
- The good news is I may already have found a way to achieve my 5 to 10% speed improvement goal for this machine (without doing much at all.... Hmmm... better start thinking of a new goal!)
- The bad news is that I now have to go and rebuild the seti projects with my new settings to see if it will work... and no time this week!
Surprise, surprise, a QxN build is faster on my Northwood. LOL
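For reference, the "nearly 10%" figure can be worked out two ways from the two timings, and they give slightly different numbers. A small C++ sketch, with hypothetical helper names `time_saved_fraction` and `throughput_gain_fraction`:

```cpp
#include <cassert>

// Fraction of run time saved when a job drops from old_secs to new_secs.
double time_saved_fraction(double old_secs, double new_secs) {
    assert(old_secs > 0.0);
    return (old_secs - new_secs) / old_secs;
}

// Relative gain in throughput (work done per second) for the same change.
double throughput_gain_fraction(double old_secs, double new_secs) {
    assert(new_secs > 0.0);
    return old_secs / new_secs - 1.0;
}
```

For 14.15 s versus 12.8 s, `time_saved_fraction` gives about 0.095 (9.5% less run time) and `throughput_gain_fraction` about 0.105 (10.5% more work per second), so "nearly 10%" holds under either reading.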
« Last Edit: 03 Nov 2007, 01:43:35 pm by j_groothu »
_heinz
|
lol... make a copy of your current seti folder and set it parallel to the boinc folder...so you need not touch the old one.