Topic: optimized sources (Read 39494 times)
_heinz
|
Quote: "Surprise Surprise, a Qx N build is faster on my Northwood"

LOL, I have a Northwood too --->

CPU(s)
  Number of CPUs: 1
  Name: Intel Pentium 4
  Code Name: Northwood
  Specification: Intel(R) Pentium(R) 4 CPU 2.66GHz
  Family / Model / Stepping: F 2 7
  Extended Family / Model: 0 0
  Brand ID: 9
  Package: mPGA-478
  Core Stepping: C1
  Technology: 0.13 um
  Supported Instructions Sets: MMX, SSE, SSE2
  CPU Clock Speed: 2672.8 MHz
  Clock multiplier: x 20.0
  Front Side Bus Frequency: 133.6 MHz
  Bus Speed: 534.6 MHz
  L1 Data Cache: 8 KBytes, 4-way set associative, 64 Bytes line size
  L1 Trace Cache: 12 Kuops, 8-way set associative
  L2 Cache: 512 KBytes, 8-way set associative, 64 Bytes line size
  L2 Speed: 2672.8 MHz (Full)
  L2 Location: On Chip
  L2 Data Prefetch Logic: yes
  L2 Bus Width: 256 bits
-----------------------------------------------------------------------------------------
Let us speed up the old machines --->
|
|
|
|
|
Logged
|
|
|
|
|
Jason G
|
Boincstats Host cpus, top 10 highest number on seti@home:

| Pos. | CPU | # | Total Credit |
| 1 | Intel(R) Pentium(R) 4 CPU 3.00GHz | 104,449 | 1,920,980,979.29 |
| 2 | Intel(R) Pentium(R) 4 CPU 2.80GHz | 88,848 | 1,254,181,274.59 |
| 3 | Intel(R) Pentium(R) 4 CPU 2.40GHz | 57,309 | 633,952,931.43 |
| 4 | Intel(R) Pentium(R) 4 CPU 3.20GHz | 45,737 | 875,822,530.51 |
| 5 | AMD Athlon(tm) 64 Processor 3000+ | 31,878 | 257,872,702.50 |
| 6 | AMD Athlon(tm) 64 Processor 3200+ | 30,304 | 288,741,370.07 |
| 7 | AMD Athlon(tm) Processor | 27,726 | 129,774,610.58 |
| 8 | Intel(R) Pentium(R) 4 CPU 2.00GHz | 21,701 | 197,541,843.70 |
| 9 | Intel(R) Pentium(R) 4 CPU 2.66GHz | 19,200 | 208,668,039.95 |
| 10 | AMD Athlon(tm) 64 Processor 3500+ | 19,049 | 191,994,766.55 |

We're both in the top 10 most popular. I have a #8 & a #4.
[Doesn't it feel good to know you're with the 'in crowd'?]
[Must get around to trying to strip mine those inner pulse folding loops for the p4 64k / 1meg aliasing problem]
|
|
|
|
« Last Edit: 05 Nov 2007, 12:31:43 pm by j_groothu »
|
Logged
|
|
|
|
|
_heinz
|
It is worth speeding them up....
Although Dr. Who is already running his code... we give the old boxes a chance.
I squeezed the code of pulsefind.cpp again; sum1 and sum2 are no longer needed.
Here is the case construct --->

switch (i) {
    // case 30:
    //     sum1 = one[29] + two[29];       sum2 = one[28] + two[28];
    //     sum1 += three[29];              sum2 += three[28];
    //     P->dest[29] = sum1;             P->dest[28] = sum2;
    //     if (sum1 > tmax1) tmax1 = sum1; if (sum2 > tmax2) tmax2 = sum2;
    //seti_britta: new code:
    case 30:
        P->dest[29] = one[29] + two[29] + three[29];
        P->dest[28] = one[28] + two[28] + three[28];
        // sum1 += three[29]; sum2 += three[28];
        // P->dest[29] = sum1; P->dest[28] = sum2;
        if (P->dest[29] > tmax1) tmax1 = P->dest[29];
        if (P->dest[28] > tmax2) tmax2 = P->dest[28];

and so on for all cases
----------------------------------------------------------------------------------------------------------------------------------------------------
and here is the loop construct

// ----------------------------------------------------------------------------
// Function:    sum_func_ptt( sw_sum3_t31 )
// Type:        float
// Content:     folding subroutines, FPU optimized
// Parameter:   sw_sum3_t31
// Last update: 23.09.2007 by: seti_britta, new function
// ----------------------------------------------------------------------------
sum_func_ptt( sw_sum3_t31 ) {
    register int i, j, k;
    float tmax2, tmax1;                 //seti_britta: new
    float *one = ss[0];
    float *two = ss[0]+P->tmp0;
    float *three = ss[0]+P->tmp1;
    tmax2 = tmax1 = (0.0f);             //seti_britta: no convert !!
    i = P->di;
    if ( i & 1 ) {
        i -= 1;
        P->dest[i] = tmax1 = one[i] + two[i] + three[i];   //seti_britta:new
    }
    for ( j = i-1, k = i-2; j > 0; j -= 2, k -= 2 ) {
        P->dest[j] = one[j] + two[j] + three[j];
        P->dest[k] = one[k] + two[k] + three[k];
        if (P->dest[j] > tmax1) tmax1 = P->dest[j];
        if (P->dest[k] > tmax2) tmax2 = P->dest[k];
    }
    if (tmax1 > tmax2) return tmax1;
    return tmax2;
}
-------------------------------------------------------------------------------------------------------------------------------------------
Maybe the compact loop has a chance.
So far it compiles well... now we must measure to find the fastest.
Have fun,
regards heinz
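For anyone who wants to try the loop construct outside the seti_boinc sources, here is a minimal stand-alone sketch. The function name fold3_max and its plain pointer parameters are assumptions for illustration only; the real routine takes its inputs from ss[0], P->tmp0, P->tmp1 and P->di through the sum_func_ptt macro, which is not reproduced here.

```cpp
// Hypothetical stand-alone version of the loop construct from the post.
// Sums three equally long segments element by element into dest and
// returns the largest sum seen. Both running maxima start at 0.0f,
// as in the original.
float fold3_max(const float* one, const float* two, const float* three,
                float* dest, int n) {
    float tmax1 = 0.0f, tmax2 = 0.0f;
    int i = n;
    if (i & 1) {                       // odd length: handle the last element alone
        i -= 1;
        dest[i] = tmax1 = one[i] + two[i] + three[i];
    }
    // Two elements per iteration, walking down from the top of the array.
    for (int j = i - 1, k = i - 2; j > 0; j -= 2, k -= 2) {
        dest[j] = one[j] + two[j] + three[j];
        dest[k] = one[k] + two[k] + three[k];
        if (dest[j] > tmax1) tmax1 = dest[j];
        if (dest[k] > tmax2) tmax2 = dest[k];
    }
    return (tmax1 > tmax2) ? tmax1 : tmax2;
}
```

Keeping two independent maxima (tmax1, tmax2) means the two compares in the loop body do not depend on each other, which is presumably why the original is written that way.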
|
|
|
|
|
Logged
|
|
|
|
|
Jason G
|
Yes, I think I would like to carefully go back and re-examine Joe's ideas/posts in the other thread for incorporating 3-phase processing / block prefetch in some places. I'll get a chance to look next weekend, and hopefully plan a methodical approach that might be able to handle striping for the p4 at the same time. Intel theories suggest a possible 3 to 5 times improvement in certain code by fixing those p4 problems, and the 3-phase & prefetch techniques [à la AMD paper] even more. If it adds up to a 10 to 20% crunch time improvement I'll be happy, because it would bring my p4 3.2 back over 1000 RAC.
|
|
|
|
« Last Edit: 06 Nov 2007, 05:18:39 am by j_groothu »
|
Logged
|
|
|
|
|
Jason G
|
Progress so far, long way to go:
[Each compared against preset 2.3S9 xW SSE2 IPP build, on vs2005/ICC, p4 Northwood 2.0A@2.1GHz, NoHT, WinXP]

| Tactic | Type | Status | Effect |
| 1- Better memcpy in GetFixedPot | Generic x86 | Prelim Tests | ~0.3% |
| 2- Out of Place FFTs / eliminating associated memcopies | Intel IPP | Initial | ~?.?% |
| 3- Once off seti.cpp 8meg memcpy | Generic x86 | Untested | ~0.?% |
| 4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp | Generic x86 | Untested | ~?.?% |
| 5- Compiler Flags (xN SSE2 p4 Specific) | P4 specific | Tested | ~10% |
| 6- Strip Mined Inner loops (p4 specific, 64k & 1M variants) | P4, possible x86 | Untested | ~??% |
| 7- GaussFit Improvements | To be Determined | | |

~ means approximate, my system, 'your mileage may vary'.
[Please anyone feel free to suggest additions, updates or corrections to this list: either fairly generic OR p4 specific will do. Consider equivalent xP SSE3 builds as already on the list for later]
Jason
|
|
|
|
« Last Edit: 06 Nov 2007, 09:54:55 am by j_groothu »
|
Logged
|
|
|
|
|
Jason G
|
Quote: 4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp | Generic x86 | Untested | ~?.?%

Took a quick look between school and work; looks like this may be easier to try than I thought. On my configuration the consistently selected chirping function is the outstanding "sse2_ChirpData_ak". Nice one.

The structure is already there for potential 3-phase processing, though it is currently straight SSE2, rendering it vectorised SIMD as far as I can see. The existing prefetch, processing and writing sections are all SSE2, clearly laid out, and exhibit the clean, crystal-vase-like 'niceness' quality that makes you reluctant to tamper.

With few other adaptations, adjusting the prefetch, changing the processing to FPU, and suitably adjusting the streaming writes should do the trick, ... though for the p4 I would like to try to keep the aliasing issue in mind, which might just dictate some of the block sizes and the order they are processed in. Oh for the weekend.
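To make the 3-phase idea concrete, here is a rough portable sketch of the block structure only: load a block, compute on it, write it back. Everything in it is an assumption for illustration (the name three_phase_scale, the block size kBlock, the trivial multiply standing in for the chirp math), and plain copies stand in for the prefetch and streaming-store phases. It is not the sse2_ChirpData_ak code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical block size; a real build would tune this to the cache
// and, on the p4, to the 64k aliasing behaviour discussed above.
constexpr std::size_t kBlock = 1024;

// Processes `in` block by block in three separate phases.
// Plain copies stand in for the prefetch and streaming-store phases.
void three_phase_scale(const std::vector<float>& in, std::vector<float>& out,
                       float gain) {
    out.resize(in.size());
    float block[kBlock];
    for (std::size_t base = 0; base < in.size(); base += kBlock) {
        const std::size_t n = std::min(kBlock, in.size() - base);
        // Phase 1: pull the block into a small buffer that stays in cache.
        std::copy(in.begin() + base, in.begin() + base + n, block);
        // Phase 2: compute entirely on the buffered data.
        for (std::size_t i = 0; i < n; ++i) block[i] *= gain;
        // Phase 3: write the finished block out in one pass.
        std::copy(block, block + n, out.begin() + base);
    }
}
```

Separating the phases is what lets each one be tuned on its own, for example swapping phase 1 for explicit prefetches or phase 3 for non-temporal stores, without touching the arithmetic in the middle.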
|
|
|
|
« Last Edit: 07 Nov 2007, 07:22:50 am by j_groothu »
|
Logged
|
|
|
|
|
Jason G
|
First run of original code [ will need to run more times for a baseline though ] : ( Very nice function already )
-------------------------------------------------------------------------------------- Testing xN SSE2 Build.
sse2_ChirpData_ak:
NumDataPoints = 1024*1024 test_points = 32768
Timer Frequency in:
Hz = 3579545 MHz = 3.57955 GHz = 0.00358
Start Time = 1585115997106 Ticks Stop Time = 1585116003199 Ticks
Duration in Ticks = 6093 Duration in seconds = 0.0017021716447
--------------------------------------------------------------------------------------
Inner loop executes 8192 times
|
|
|
|
« Last Edit: 07 Nov 2007, 11:10:42 am by j_groothu »
|
Logged
|
|
|
|
|
_heinz
|
Measuring is the best way to try code and find the optimal variants. The loop construct in pulsefind.cpp is ready now, but not measured. Today I will squeeze the case-construct code. I still have some good ideas to eliminate code here and there... we will see...
|
|
|
|
|
Logged
|
|
|
|
|
Jason G
|
Quote: "measure its the best to try code and find optimal variants. the loop construct in pulsefind.cpp is ready now, but not measured. Today I will squeeze the case-construct code. have still some good ideas to eleminate code else and there...we will see..."

Great! A pulsefind baseline will be good too. Underneath pulsefind, it seems my machine also always selects the AK folding routines and spends much of its time in the x2AL version. I am running vtune on the chirp one now to look for any p4 specific slowdowns; wickedly fast code though.
|
|
|
|
|
Logged
|
|
|
|
|
_heinz
|
Quote: "I am running vtune on the chirp one now to look for any p4 specific slowdowns, wickedly fast code though"

I have a heavily modified chirpfft.cpp which we can try too.
|
|
|
|
|
Logged
|
|
|
|
|
_heinz
|
Now we can easily compile all 3 cases with a preprocessor definition --->
---------------------------------------------------------------------------------------------------
// USE_PFLOOP --> preprocessor directive
// USE_PFCASE --> preprocessor directive
#if defined( USE_PFLOOP )
    #pragma message ("-----PFLOOP-----")
    #include "pfloop.h"       //use the loop-construct
#else
    #if defined( USE_PFCASE )
        #pragma message ("-----PFCASE-----")
        #include "pfcase.h"   //use the modified case-construct
    #else
        //use original code
    #endif // USE_PFCASE
#endif // USE_PFLOOP
-----------------------------------------------------------------------------------------
------ Build started: Project: seti_boinc, Configuration: Release32-NOGFX Win32 ------
Compiling...
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.20404 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
cl /Od /Ob2 /Oi /Ot /Oy /GT /I "." /I "../../../boinc/api" /I "../../../boinc/client/win" /I "../../../boinc/lib" /I ".." /I "glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\db" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\jpeglib" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\image_libs" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX" /I "C:\I\SC\vs90\boinc" /I "C:\I\SC\vs90\boinc\api" /I "C:\I\SC\vs90\boinc\client\win" /I "C:\I\SC\vs90\boinc\lib" /D "WIN32" /D "_WIN32" /D "_WINDOWS" /D "NBOINC_APP_GRAPHICS" /D "CLIENT" /D "_MT" /D "USE_IPP" /D "USE_SSE2" /D "_DEBUG" /D " USE_PFLOOP" /D "_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /Gm /EHsc /MTd /Zp16 /Gy /Fp".\Release/seti_boinc.pch" /Fo".\Release32-NOGFX\\" /Fd".\Release32-NOGFX\vc90.pdb" /FR".\Release32-NOGFX\\" /W3 /c /Wp64 /Zi /TP "..\pulsefind.cpp"
pulsefind.cpp
-----PFLOOP-----
..\pulsefind.cpp(1487) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 1 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
regards
|
|
|
|
|
Logged
|
|
|
|
|
Jason G
|
Quote: "have a strong modified chirpfft.cpp which we can try too"

Good, we'll do that; I think it is a very good idea.

I have p4 sse2 primary performance data (vtune) for sse2_ChirpData_ak: 10000 loops on a p4 Northwood with 512k L2 cache, which took a total of 10 secs execution time (19 runs' worth of data gathered; preliminary data, subject to verification with further runs):

64k aliasing: almost none... accounts for 1.34% of function workload (about 0.13 secs)
Second level cache misses: account for 10.28% of the workload (about 1 second)

Other statistics (preliminary, subject to verification):
128 bit MMX instructions: ~82 million (no 64 bit MMX instructions counted)
Packed double precision floating point SSE instructions: ~1.4 billion (thousand million)
Packed single precision floating point SSE instructions: ~4 billion (thousand million)
Mispredicted branches = 0 !!!
No machine clear counts (pipeline flushes), split loads or blocked store forwards at all.

I think that's a really good function, with much better statistics than the pulse folding functions gave me, but I'll have to retest those in isolation too, as I'm getting better at selecting the correct compiler settings and driving vtune.

Well, I'll check a few build settings and run the primary performance measures again to verify those results, and add secondary performance indicators to see what else turns up.... Then on the weekend maybe fiddle with that 3-phase idea to see if it actually works.... All good fun...
Jason
|
|
|
|
« Last Edit: 08 Nov 2007, 05:06:50 am by j_groothu »
|
Logged
|
|
|
|
|
_heinz
|
The modified PFCASE is ready now
-----------------------------------------------
------ Build started: Project: seti_boinc, Configuration: Release32-NOGFX Win32 ------
Compiling...
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.20404 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
cl /Od /Ob2 /Oi /Ot /Oy /GT /I "." /I "../../../boinc/api" /I "../../../boinc/client/win" /I "../../../boinc/lib" /I ".." /I "glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\db" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\jpeglib" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\image_libs" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX" /I "C:\I\SC\vs90\boinc" /I "C:\I\SC\vs90\boinc\api" /I "C:\I\SC\vs90\boinc\client\win" /I "C:\I\SC\vs90\boinc\lib" /D "WIN32" /D "_WIN32" /D "_WINDOWS" /D "NBOINC_APP_GRAPHICS" /D "CLIENT" /D "_MT" /D "USE_IPP" /D "USE_SSE2" /D "_DEBUG" /D "USE_PFCASE" /D "_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /Gm /EHsc /MTd /Zp16 /Gy /Fp".\Release/seti_boinc.pch" /Fo".\Release32-NOGFX\\" /Fd".\Release32-NOGFX\vc90.pdb" /FR".\Release32-NOGFX\\" /W3 /c /Wp64 /Zi /TP "..\pulsefind.cpp"
pulsefind.cpp
-----PFCASE-----
..\pulsefind.cpp(1487) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 1 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
|
|
|
|
|
Logged
|
|
|
|
|
_heinz
|
modified PFCASE rocks

here as it was before --->
ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5 Totals:          Ratio     Ticks
standard:                    1.000     87303043476
Plan < 512 FPU swi ! :       0.575     50201832416
Plan < 512 AK SSE ! :        0.634     55338411648
Plan < 512 BHx SSE ! :       0.993     86661631716
Plan < 512 BH SSE ! :        0.774     67545465584

PFCASE ---->
ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5 Totals:          Ratio     Ticks
standard:                    1.000     87387438720
Plan < 512 FPU swi ! :       0.504     44014700492
Plan < 512 AK SSE ! :        0.633     55324520388
Plan < 512 BHx SSE ! :       0.992     86681643504
Plan < 512 BH SSE ! :        0.773     67531081560
----------------------------------------------------------------------------------------------------
modified PFCASE ---> ~13% faster
heinz
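The only row that moves noticeably between the two runs is "FPU swi", so the ~13% presumably refers to it. There are two common ways to put a number on that, sketched below; the helper names tick_reduction and speedup are made up for this example.

```cpp
// Fraction of ticks saved going from `before` to `after`
// (0.10 would mean 10% fewer ticks).
double tick_reduction(double before, double after) {
    return (before - after) / before;
}

// Speedup factor: how many times faster `after` is than `before`.
double speedup(double before, double after) {
    return before / after;
}
```

For the FPU swi ticks (50201832416 before, 44014700492 after) the first gives about 12.3% fewer ticks and the second a factor of about 1.14, so the quoted ~13% sits between the two conventions.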
|
|
|
|
|
Logged
|
|
|
|
|
Jason G
|
Woohoo! It's the weekend! That function was with just the changes you made before? I'd guess that maybe the compiler did vectorise some of that. I would like to look at the disassembly output; if the compiler was smart enough to put in prefetch plus FPU plus streaming stores, then that IS 3-phase. Anything is possible. Have you compared for accuracy as well?
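On the accuracy question, one simple check is to run the original and the modified routine on the same input and look at the largest difference between the two result arrays. A minimal sketch, where max_abs_diff is a made-up helper and the two vectors stand for the outputs of the two variants:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest absolute difference between two result arrays.
// Returns infinity if the lengths differ, so a size mismatch never passes.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return std::numeric_limits<float>::infinity();
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

Requiring a result of exactly 0.0f is the strictest test. Whether bit-identical output should be expected from the rewritten sums is an assumption that depends on the compiler settings, so a small tolerance may be the more practical threshold.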
|
|
|
|
« Last Edit: 09 Nov 2007, 01:50:00 am by j_groothu »
|
Logged
|
|
|
|
|