Topic: sources with Orcas
Jason G
Yeah, I keep images with TruImage. I had Ghost before, but sometimes the disks didn't work for me when I went to grab some files off them, so I changed. Oh well, .NET seems to have reinstalled OK anyway and my build environment is back up. Still can't help feeling the whole thing is a big slow pig made of eggshells. Probably time to think about getting a lottery ticket for buying a Mac.
Jason
« Last Edit: 09 Nov 2007, 11:03:59 am by j_groothu »
_heinz
I now have some modified code versions under measurement and tuning:
AKFCOMP: Alex modified compact code
PFCASE: modified case construct of pulsefind.cpp
PFLOOP: compact loop construct of pulsefind.cpp
FPUCOMP: compact construct of opt_FPU.cpp
heinz
_heinz
A look at the asm code shows that SSE instructions are used, and how opt_v_GetPowerSpectrum performs as a part of FPUCOMP. Three SSE registers (XMM0, XMM1, XMM2) are used; note that the instructions in this listing are the scalar forms (movss, mulss, addss), so each one handles a single float. But keep in mind, all code must be measured.....
heinz
---------------------------------------------------------------------------------------------------------------------------------------
PUBLIC ?opt_v_GetPowerSpectrum@@YAXPAY01MPAMHHH@Z ; opt_v_GetPowerSpectrum
EXTRN __fltused:DWORD
; Function compile flags: /Ogtpy
; File c:\i\sc\pultimb_5\optimizer\opt_fpu.cpp
; COMDAT ?opt_v_GetPowerSpectrum@@YAXPAY01MPAMHHH@Z
_TEXT SEGMENT
tv1161 = -8 ; size = 4
_i$ = -4 ; size = 4
_FreqData$ = 8 ; size = 4
_PowerSpectrum$ = 12 ; size = 4
_this_fft_len$ = 16 ; size = 4
_bin_off$ = 20 ; size = 4
_bin_len$ = 24 ; size = 4
?opt_v_GetPowerSpectrum@@YAXPAY01MPAMHHH@Z PROC ; opt_v_GetPowerSpectrum, COMDAT
; 36 : register int i, bin; //seti_britta: register
; 37 : float *workBuf = (float *)FreqData;
; 38 : // float psNum; //seti_britta: no longer necessary
; 39 :
; 40 : ALIGNED_YES( FreqData );
; 41 : ALIGNED_YES( PowerSpectrum );
; 42 : for ( i = 0, bin = 0; i < this_fft_len; i++, bin += bin_len)
    mov eax, DWORD PTR _FreqData$[esp-4]
    sub esp, 8
    push ebx
    push ebp
    mov ebp, DWORD PTR _this_fft_len$[esp+12]
    xor ecx, ecx
    xor ebx, ebx
    cmp ebp, 4
    push esi
    mov esi, DWORD PTR _bin_len$[esp+16]
    jl $LC9@opt_v_GetP
    mov edx, DWORD PTR _PowerSpectrum$[esp+16]
    mov ecx, DWORD PTR _bin_off$[esp+16]
    add ebp, -4 ; fffffffcH
    shr ebp, 2
    inc ebp
    mov DWORD PTR tv1161[esp+20], ebp
    lea ecx, DWORD PTR [edx+ecx*4]
    add ebp, ebp
    lea edx, DWORD PTR [eax+8]
    add eax, 12 ; 0000000cH
    add ebp, ebp
    push edi
    mov DWORD PTR _i$[esp+24], ebp
    mov ebp, DWORD PTR tv1161[esp+24]
    lea edi, DWORD PTR [esi*4]
    npad 1
$LL10@opt_v_GetP:
; 43 : {
; 44 : // psNum = FreqData[i][0] * FreqData[i][0] + FreqData[i][1] * FreqData[i][1];
; 45 : // PowerSpectrum[bin_off + bin] = // Large cache miss here...can it be fixed?
; 46 : // workBuf[i] = psNum;
; 47 : //seti_britta: new statement
; 48 : PowerSpectrum[bin_off + bin] = workBuf[i] = (FreqData[i][0] * FreqData[i][0]) + (FreqData[i][1] * FreqData[i][1]);
    movss xmm1, DWORD PTR [eax-8]
    movss xmm0, DWORD PTR [eax-12]
    mulss xmm0, xmm0
    movaps xmm2, xmm1
    mulss xmm2, xmm1
    addss xmm0, xmm2
    movss DWORD PTR [edx-8], xmm0
    movss DWORD PTR [ecx], xmm0
    movss xmm1, DWORD PTR [eax-4]
    movss xmm0, DWORD PTR [eax]
    mulss xmm0, xmm0
    add ecx, edi
    movaps xmm2, xmm1
    mulss xmm2, xmm1
    addss xmm0, xmm2
    movss DWORD PTR [edx-4], xmm0
    movss DWORD PTR [ecx], xmm0
    movss xmm1, DWORD PTR [eax+4]
    movss xmm0, DWORD PTR [eax+8]
    mulss xmm0, xmm0
    add ecx, edi
    movaps xmm2, xmm1
    mulss xmm2, xmm1
    addss xmm0, xmm2
    movss DWORD PTR [edx], xmm0
    movss DWORD PTR [ecx], xmm0
    movss xmm1, DWORD PTR [eax+12]
    movss xmm0, DWORD PTR [eax+16]
    add ebx, esi
    add ebx, esi
    add ecx, edi
    movaps xmm2, xmm1
    mulss xmm0, xmm0
    mulss xmm2, xmm1
    addss xmm0, xmm2
    add ebx, esi
    movss DWORD PTR [edx+4], xmm0
    movss DWORD PTR [ecx], xmm0
    add ebx, esi
    add ecx, edi
    add edx, 16 ; 00000010H
    add eax, 32 ; 00000020H
    sub ebp, 1
    jne $LL10@opt_v_GetP
    mov ebp, DWORD PTR _this_fft_len$[esp+20]
    mov eax, DWORD PTR _FreqData$[esp+20]
    mov ecx, DWORD PTR _i$[esp+24]
    pop edi
$LC9@opt_v_GetP:
; 36 : register int i, bin; //seti_britta: register
; 37 : float *workBuf = (float *)FreqData;
; 38 : // float psNum; //seti_britta: no longer necessary
; 39 :
; 40 : ALIGNED_YES( FreqData );
; 41 : ALIGNED_YES( PowerSpectrum );
; 42 : for ( i = 0, bin = 0; i < this_fft_len; i++, bin += bin_len)
    cmp ecx, ebp
    jge SHORT $LN8@opt_v_GetP
    mov edx, DWORD PTR _bin_off$[esp+16]
    add esi, esi
    add esi, esi
    add ebx, edx
    mov edx, DWORD PTR _PowerSpectrum$[esp+16]
    lea edx, DWORD PTR [edx+ebx*4]
    npad 9
$LC3@opt_v_GetP:
; 43 : {
; 44 : // psNum = FreqData[i][0] * FreqData[i][0] + FreqData[i][1] * FreqData[i][1];
; 45 : // PowerSpectrum[bin_off + bin] = // Large cache miss here...can it be fixed?
; 46 : // workBuf[i] = psNum;
; 47 : //seti_britta: new statement
; 48 : PowerSpectrum[bin_off + bin] = workBuf[i] = (FreqData[i][0] * FreqData[i][0]) + (FreqData[i][1] * FreqData[i][1]);
    movss xmm1, DWORD PTR [eax+ecx*8+4]
    movss xmm0, DWORD PTR [eax+ecx*8]
    movaps xmm2, xmm1
    mulss xmm0, xmm0
    mulss xmm2, xmm1
    addss xmm0, xmm2
    movss DWORD PTR [eax+ecx*4], xmm0
    movss DWORD PTR [edx], xmm0
    inc ecx
    add edx, esi
    cmp ecx, ebp
    jl SHORT $LC3@opt_v_GetP
$LN8@opt_v_GetP:
    pop esi
    pop ebp
    pop ebx
; 49 :
; 50 : }
; 51 : }
    add esp, 8
    ret 0
?opt_v_GetPowerSpectrum@@YAXPAY01MPAMHHH@Z ENDP ; opt_v_GetPowerSpectrum
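For anyone who would rather not decode the listing, the C++ loop quoted in the compiler annotations above can be restated as a small standalone function. This is only a sketch under assumptions: `power_spectrum_ref` is a made-up name, and it leaves out the in-place `workBuf` store and the `ALIGNED_YES` hints of the original.

```cpp
// Hypothetical reference version of the loop quoted in the listing.
// freq holds n complex samples as (re, im) pairs; the power of sample i
// is written to out[bin_off + i * bin_len].
void power_spectrum_ref(const float (*freq)[2], float* out,
                        int n, int bin_off, int bin_len)
{
    for (int i = 0, bin = 0; i < n; ++i, bin += bin_len) {
        const float re = freq[i][0];
        const float im = freq[i][1];
        out[bin_off + bin] = re * re + im * im;  // |z|^2, one float at a time
    }
}
```

A plain version like this can serve as the baseline when checking that an unrolled or packed-SSE variant still produces the same values.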
Jason G
Hmmm, time to get out the P4 optimisation reference. Some of the things in those SSE loops might be really good on a Core/Core 2, I don't know, but I'm a little bit weirded out by a few things. Some of the instruction ordering could be improved (I'm not entirely convinced out-of-order execution would fix that). I also think that where the Core 2 seems designed to like moderately tight loops, the P4 might appreciate a further manual unroll (or maybe the other way round! Both are worth a try because of the different architecture generations). Did you use any optimisation yet? (Or is that the cleaned-up version of what I did with QxN? Nope, different function. Whew.) I'm a bit surprised at some of the code generated.
Jason
« Last Edit: 21 Nov 2007, 04:00:41 am by j_groothu »
_heinz
No hurry with all this... everything must be measured and run against the original-code version to see if there is real progress. It is easy to destroy a well-performing loop with some simple changes. As Joe mentioned, it's not a good idea to use the pultime project to measure code using the MS compiler. (I don't own the whole Intel performance package, with the Intel compiler, VTune and so on.) I am now using the etimer project to see any differences. That way we can test short pieces of code better. And if we find any progress, it is still necessary to compile with the Intel compiler to see if it really rocks. Do you like my point of view of seeing systems of equations when looking at the code? Solutions can be found by using mathematical methods. Therefore some of my code constructs are a little bit crazy. But you know mathematicians are crazy people. And I'm a mathematician.
Regards heinz
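To make "measure against the original" concrete, here is a minimal timing sketch. It is not the etimer project's code; `best_time_ns` is a hypothetical helper built only on the standard `std::chrono::steady_clock`, and taking the fastest of several runs is just one common way to reduce noise.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: call fn() `reps` times and return the fastest run
// in nanoseconds. With reps <= 0 nothing is run and the int64 maximum is
// returned.
template <typename Fn>
std::int64_t best_time_ns(Fn fn, int reps)
{
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < reps; ++r) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns < best) best = ns;  // keep the least disturbed run
    }
    return best;
}
```

Timing the original and the modified routine on the same input with the same `reps`, then comparing the two minima, gives a first indication. A compiler may also optimise away work whose result is never used, so the timed code should write to something observable.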
Jason G
No, no hurry, my holidays are coming in a few weeks. I still consider this "Orientation".
------------------------Detour------------
Quote from: _heinz
And I'm a mathematician.
I think the Mathematicians Anonymous meeting is three doors down ... Mathematicians may be crazy or not, but I always wondered where my lecturers did 'stash their flagons' before class, and who knitted them the stylish brown vest that is two sizes too small (jokes). If you have a formula to get crazy, I can examine its computational complexity well enough (comp sci), though I'm a little out of practice, then build it in hardware to IPC class 3 military standards (Electronic Engineering). It is a shame the algorithm for sanity is O(n^3) and requires too many connections to implement on an FPGA, so sorry I can't help you there. [Though I can't offer any sanity, I do have a drawer full of high speed logic I can let you dig through ....]
-------------------------------------------------------------------------
My P4 selects PwrSpectrumOnly_ptt( sse_GetPSO_sc16_npr ). Take a look, it is nice. No author's name is listed. Maybe it is Ben & Joe?
- It takes care with the SSE pipelines (it is even laid out showing them)
- It is using many more registers (those aren't really variable assignments, no)
- It looks to be unrolled to help the pipelines/cache/prefetch, very pretty.
I'm more impressed with that function. It may help you to compare against it so you can see how to use the hardware better.
Jason
« Last Edit: 21 Nov 2007, 10:48:21 am by j_groothu »
Josef W. Segur
It hasn't been mentioned lately that in 2004 Eric Korpela set up a setiboinc sourceforge project to encourage submission of optimized code. Ben Herndon participated in that, and his code in ../setiboinc/client/opt is where I got the original form of the GetPSO functions. I just made it into four separate versions to compare the prefetching. On some systems and some builds there have been clear differences, particularly on my Pentium-M system and on Core 2 systems. On P4 or PD systems, OTOH, the differences have been small.
Francois Piednoel's PowerSpectrum code from the "Who? optimizations Part 2" NC thread may be even better. I hope he'll release source soon, but that part would be fairly easy to fit into our codebase anyhow.
Joe
Jason G
Thanks Joe. Though I did run maybe a couple of workunits when sah classic first started (I remember the news release on TV here in Oz... Incidentally, I could never remember the email I used then....), I just came back around last November....
so really '2004' is indeed before my time... Thanks for the link, more reading is always handy.
Just opened the Pulse Timing project for the first time. Looks like I'll be occupied for the weekend trying to get a working build. It looks extremely handy for testing some of the things being worked on behind the scenes.
Jason
_heinz
I am now looking at vectorizing the most used functions of S@H. Here is a modified FillTrigArray as it performs as part of chirpfft.cpp. Have fun
Heinz
---------------------------------------------------------------------------
PUBLIC ?FillTrigArray@@YAXH@Z ; FillTrigArray
; Function compile flags: /Ogtpy
; COMDAT ?FillTrigArray@@YAXH@Z
_TEXT SEGMENT
_k$ = 8 ; size = 4
?FillTrigArray@@YAXH@Z PROC ; FillTrigArray, COMDAT
; 731 : CurrentTrig[k].Sin = ((CurrentTrig[k].Sin * TrigStep[k].Cos) + (CurrentTrig[k].Cos * TrigStep[k].Sin));
    mov edx, DWORD PTR ?TrigStep@@3PAUSinCosArray@@A ; TrigStep
    mov ecx, DWORD PTR ?CurrentTrig@@3PAUSinCosArray@@A ; CurrentTrig
    mov eax, DWORD PTR _k$[esp-4]
    shl eax, 4
    movsd xmm1, QWORD PTR [eax+edx+8]
    mulsd xmm1, QWORD PTR [eax+ecx]
    movsd xmm0, QWORD PTR [eax+ecx+8]
    mulsd xmm0, QWORD PTR [eax+edx]
    addsd xmm0, xmm1
    movsd QWORD PTR [eax+ecx], xmm0
; 732 : CurrentTrig[k].Cos = ((CurrentTrig[k].Cos * TrigStep[k].Cos) - (CurrentTrig[k].Sin * TrigStep[k].Sin));
    mov edx, DWORD PTR ?TrigStep@@3PAUSinCosArray@@A ; TrigStep
    movsd xmm0, QWORD PTR [eax+edx+8]
    push esi
    mov esi, DWORD PTR ?CurrentTrig@@3PAUSinCosArray@@A ; CurrentTrig
    movsd xmm1, QWORD PTR [eax+esi]
    mulsd xmm0, QWORD PTR [eax+esi+8]
    mulsd xmm1, QWORD PTR [eax+edx]
    lea ecx, DWORD PTR [eax+esi+8]
    subsd xmm0, xmm1
    movsd QWORD PTR [ecx], xmm0
    pop esi
; 733 : }
    ret 0
?FillTrigArray@@YAXH@Z ENDP ; FillTrigArray
_TEXT ENDS
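Lines 731 and 732 of the listing are the angle-addition recurrence: sin(a+b) = sin a cos b + cos a sin b and cos(a+b) = cos a cos b - sin a sin b. As posted, line 732 appears to read CurrentTrig[k].Sin after line 731 has already overwritten it (the second block reloads the same slot that the first block stored to). Below is a minimal sketch of the recurrence with the old values held in temporaries. It assumes a struct laid out the way the offsets in the listing suggest (Sin at offset 0, Cos at offset 8); `SinCos`, `from_angle` and `advance` are illustrative names, not the project's.

```cpp
#include <cmath>

struct SinCos {   // illustrative stand-in for one SinCosArray element
    double Sin;
    double Cos;
};

// Build the pair for a given angle in radians.
inline SinCos from_angle(double a)
{
    return SinCos{std::sin(a), std::cos(a)};
}

// Advance `cur` by the angle encoded in `step`, using the old Sin and Cos
// for both outputs.
inline void advance(SinCos& cur, const SinCos& step)
{
    const double s = cur.Sin;
    const double c = cur.Cos;
    cur.Sin = s * step.Cos + c * step.Sin;  // sin(a + b)
    cur.Cos = c * step.Cos - s * step.Sin;  // cos(a + b)
}
```

Whether the posted ordering matters for the real code would need checking against the original FillTrigArray source. The sketch only shows the textbook form.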
_heinz
I. Going parallel, or how to cut the leek!
This morning I was in the kitchen making a leek salad. After washing the leaves I took one to cut it into suitable pieces. But how big is a suitable piece? 1 mm, 10 mm, 100 mm? I have a relatively big leaf of 24 cm, so we choose 10 mm = 1 cm as a suitable piece. Now we know I must cut it into 24 pieces. How to do that?
1. We take the knife and cut it into 24 pieces, one after the other. We need 24-1 = 23 cuts. We have just one tree to cut. This is sequential work.
Or we do the following:
2. We cut the leaf into two equal parts (1 cut), then lay both parts parallel to each other and cut them together. We need 12-1 = 11 cuts + 1 extra cut from the first step, 12 cuts in total. This is parallel work. We have 2 parallel trees to cut. The one extra cut and laying both parts parallel is the overhead (organizing the parallel work).
3. We cut the leaf into 2 parts (1 cut), lay both parts parallel, cut again into 2 parts (1 cut), lay the parts parallel again (we now have 4 parallel) and cut. We need 6-1 = 5 cuts plus the 2 extra cuts, 7 cuts in total. This is much more parallel. We have 4 trees. The overhead has now grown to 2 cuts plus laying 4 parts parallel.
4. We cut the leaf into 2 parts (1 cut), lay both parts parallel, cut again into 2 parts (1 cut), lay the parts parallel again and cut into 2 parts once more (1 cut), lay them parallel (we now have 8 parts parallel) and cut. We need 3-1 = 2 cuts plus 3 extra cuts, 5 cuts in total. The overhead now grows to 3 cuts plus laying parts parallel 3 times (2³ = 8 parts).
Summary of all:
1. sequential = 23 cuts --> no overhead
2. parallel (2) = 11+1 = 12 cuts (1 cut overhead)
3. parallel (4) = 5+2 = 7 cuts (2 cuts overhead)
4. parallel (8) = 2+3 = 5 cuts (3 cuts overhead)
-----------------------------------------------------------------------
In this way (4.) we can do the same work with just 5 cuts, against the 23 cuts of (1.) if we work sequentially.
But attention: the overhead of 3 cuts is bigger than the real work cuts (2), and we run 8 trees in parallel. This method of organizing parallel work is called blocking. The problem is to determine the length of the pieces and the number of parallel trees to get maximal performance. I believe the maximum-performance solution is a different one for every given amount of work (1000), (100 000), (1 000 000) and every machine. Therefore this topic is relatively complex and so difficult to handle. Believe me, the parallel-cut salad has a fine taste. Have fun
regards heinz
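The cut counts in the leek example follow one pattern: after k halvings there are 2^k stacked strips, so the total is k overhead cuts plus (24 / 2^k - 1) work cuts. A small sketch of that arithmetic (`cuts_needed` is a made-up name, and it only handles piece counts that divide evenly among the strips):

```cpp
#include <cassert>

// Cuts needed to turn one leaf into `pieces` equal parts when it is first
// halved and stacked `halvings` times (giving 2^halvings parallel strips).
// Returns -1 if the pieces do not divide evenly among the strips.
int cuts_needed(int pieces, int halvings)
{
    assert(pieces > 0 && halvings >= 0 && halvings < 31);
    const int strips = 1 << halvings;
    if (pieces % strips != 0) return -1;
    const int per_strip = pieces / strips;
    return halvings + (per_strip - 1);  // overhead cuts + work cuts
}
```

For 24 pieces this gives 23, 12, 7 and 5 cuts for 0, 1, 2 and 3 halvings, matching the summary. A fourth halving is rejected because 24 does not divide by 16.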
Jason G
...Or the 'other' kind of coarse-grained parallelism ... where you ask your mum to make you a sandwich instead, then you watch TV in parallel [0 cuts + 1 small communication overhead]
Jason
« Last Edit: 08 Dec 2007, 06:08:44 am by j_groothu »
_heinz
Hi Jason, great: your example means starting a task (asking mum to make a sandwich) in parallel with the main program (the TV program). You must still wait until the sandwich (the task) is ready. We can extend this too: start a variable number of tasks in parallel with the main program. We can do that later. But first we have to resolve some basics on the way to going parallel, such as "Load balanced parallel execution of a fixed number of independent loop iterations" and some others.
heinz
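One simple reading of "load balanced parallel execution of a fixed number of independent loop iterations" is a static split in which no worker gets more than one iteration above any other. The sketch below only computes each worker's range. It is an assumption about what is meant, not code from the project; `iteration_range` is a hypothetical name, and dynamic scheduling (workers pulling iterations from a shared counter) would be another valid approach.

```cpp
#include <cassert>
#include <utility>

// Half-open range [first, last) of iterations assigned to `worker` when
// `n` independent iterations are split across `workers` workers. The first
// n % workers workers get one extra iteration, so sizes differ by at most 1.
std::pair<int, int> iteration_range(int n, int workers, int worker)
{
    assert(n >= 0 && workers > 0 && worker >= 0 && worker < workers);
    const int base  = n / workers;
    const int extra = n % workers;
    const int first = worker * base + (worker < extra ? worker : extra);
    const int last  = first + base + (worker < extra ? 1 : 0);
    return std::make_pair(first, last);
}
```

With 10 iterations and 4 workers the ranges come out as [0,3), [3,6), [6,8) and [8,10). Each worker would then run the loop body over its own range on its own thread.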
Josef W. Segur
1. If the leaves you're cutting are always the same size and shape, an ideal tool would make the cuts all at once. If the leaves come in a few different sizes, either a tool for each size or an even more complex tool with suitable adjustments is needed.
2. The characteristics of the Validator need to be kept in mind when thinking about dividing the work differently. When it is comparing results it checks that each signal in result A has a matching signal in result B, then checks that each signal in result B has a matching signal in result A. For the ~95% of WUs which have fewer than 31 reportable signals, the order in which signals are found wouldn't make a difference. But for the ~5% which overflow, we need to be sure we'll report the same subset as the stock app does.
Joe
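The two-way check described in point 2 can be sketched as follows. Everything specific here is an assumption for illustration: the `Signal` fields, the relative-tolerance rule in `similar`, and all the function names are made up, and the real validator certainly compares more than this.

```cpp
#include <cmath>
#include <vector>

struct Signal {      // hypothetical fields; the real validator compares more
    double freq;
    double power;
};

// Assumed matching rule for illustration: both fields agree within rel_tol.
bool similar(const Signal& a, const Signal& b, double rel_tol)
{
    return std::fabs(a.freq - b.freq)   <= rel_tol * std::fabs(a.freq) &&
           std::fabs(a.power - b.power) <= rel_tol * std::fabs(a.power);
}

// True if every signal in `from` has at least one match in `in`.
bool all_matched(const std::vector<Signal>& from,
                 const std::vector<Signal>& in, double rel_tol)
{
    for (const Signal& s : from) {
        bool found = false;
        for (const Signal& t : in) {
            if (similar(s, t, rel_tol)) { found = true; break; }
        }
        if (!found) return false;
    }
    return true;
}

// Two-way check as described: A against B, then B against A.
bool results_agree(const std::vector<Signal>& a,
                   const std::vector<Signal>& b, double rel_tol)
{
    return all_matched(a, b, rel_tol) && all_matched(b, a, rel_tol);
}
```

In this sketch the order of the signals plays no role, only the set that was reported. That is why a result holding a different subset of signals fails the check even when every signal in it is genuine.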
Jason G
Quote from: Josef W. Segur
1. If the leaves you're cutting are always the same size and shape, an ideal tool would make the cuts all at once. If the leaves come in a few different sizes, either a tool for each size or an even more complex tool with suitable adjustments is needed.
Or perhaps a modular tool, with a set of adaptors designed to fit each possible variation [or groups of variations], with a different plan/tool adapter for each one of the finite set of possibilities. (A single complex tool is large and unwieldy; many different tools are more efficient but maybe even larger in total, with redundancy (and they require selection); a modular tool seems an ideal compromise but also brings selection/adaptation overhead)... mmm, all food for thought.
Quote from: Josef W. Segur
2. The characteristics of the Validator need to be kept in mind when thinking about dividing the work differently. When it is comparing results it checks that each signal in result A has a matching signal in result B, then checks that each signal in result B has a matching signal in result A. For the ~95% of WUs which have less than 31 reportable signals the order signals are found wouldn't make a difference. But for the ~5% which overflow we need to be sure we'll report the same subset as the stock app does. Joe
So even though a faster overflow detection mechanism may be possible, the positive overflow will still require the same processing order/results... [You seem to be saying the order of signals is important in those ~5% where overflow occurs.] Thinking about that a little, I can probably live with the current speed, or even reduced speed, where it results in overflow. I wonder if there may be benefit in quickly disproving [or just detecting reduced likelihood of] the overflow condition early on... (then we may perhaps tactically reorder detection)
Jason
« Last Edit: 10 Dec 2007, 04:58:19 am by j_groothu »
Josef W. Segur
Quote from: Jason G
..., so even though a faster overflow detection mechanism may be possible, the positive overflow will still require the same processing order/results...[You seem to be saying the order of signals is important in those ~5% where overflow occurs] thinking about that a little I can probably live with the current speed, or even reduced speed, where it results in overflow. I wonder if there may be benefit to quickly disproving [or just detecting reduced likelihood of] overflow condition early on... (then we may perhaps tactically reorder detection) Jason
The order of the signals within the output result file never matters, but I can see no practical way to select the right subset of what may be a very large number of potential signals other than using the same sequence of searches as stock.
Prechecking for possible overflow is certainly an interesting concept. If someone came up with a really efficient way to do that, the project might consider putting that code in the splitter. In the science app, maybe the best opportunity is during baseline smoothing.
I'll also note that if we found a way of dividing the work much more effectively, the changes could be applied to the official sources prior to the next stock release. That release could be named setiathome_multibeam or something similar, and all participants would have to upgrade.
Joe