Topic: BOINC as library (Read 4313 times)
|
|
Jason G
|
Well, "results strong similar" ....
Nice, sounds to me like only a few compiler flags might differ, for just a 2% difference! Jason
|
Raistmer
|
Here are all the diffs I made to compile the 2.4 sources with VS 2005 and trial versions of ICC and IPP. (Actually they are the 2.39S sources, but there are only 2 differences, both in #define strings; those were added to the diffs after the build and did not warrant rebuilding the client.) opt_config.h was added to simplify tuning of the conditional defines and compilation across the few source files affected.
|
Jason G
|
Good one to record the changes like that, mine are scribbled on an old envelope. Looks like similar changes overall. Did you end up with favourite compiler settings? The 2.4lunatics one for QxN looks pretty close to the ones I've played with on my P4. Jason
|
Raistmer
|
Well, I didn't even record my changes. I just downloaded the lunatic 2.4 sources yesterday from the link on the main SETI board, ran the WinDiff utility and collected all the discrepancies in one rar. I used SSE2 build options because that binary was intended to run and be profiled on an AMD 64 host. I use CodeAnalyst as the profiling tool (on the assumption that AMD should know their own CPUs better than Intel does). It would be interesting to compare your vTune data with the CodeAnalyst data to highlight areas of interest for improvement.

I probably need to check the option set more precisely, because my build is a little less than optimal. Another possibility: the options are fine and the 2% difference in speed comes from the trial nature of my IPP installation. Intel allows only dynamic linking with the trial IPP library, so there are DLL calls... I really don't know whether this could account for the 2% slowdown or not (even the 2% is still preliminary, tested only on a short WU).
|
Jason G
|
Very good idea to compare SSE2 QxN P4 vTune data against your SSE2 AMD build. There are some arguments about that! Maybe you found hotspots in an inner folding routine? Mine chooses FoldArrayBy2AL and spends about 10% of total time in there. Maybe yours chooses a different routine? Either way, we could even compare the asm listing output of those, which might explain some differences between the chips! (Those functions don't depend on IPP as far as I know.)

[Note also that, because I am using ICC, about 11% of time is being spent in _Intel_fast_memcpy. Having looked at a mixture of improved memcopies, elimination of them, and hybrid processing techniques in other areas, there might be some generally applicable improvements to make (not just for Intel chips).]

Even though yours calls the dynamic library, it would be nice to see whether the dispatching is calling the same IPP functions (but DLL versions) or some different, maybe more generic, ones. Mine calls the w7 static ones, which are P4 SSE2, but the internal names given by vTune / CodeAnalyst will give the real names. Jason
« Last Edit: 17 Nov 2007, 01:09:41 pm by j_groothu »
|
Raistmer
|
Well, some initial results. My version spends most of its time in sse3_ChirpData_ak (1 function, 78 instructions, Total: 12300 samples, 19.37% of samples in the module, 5.12% of total session samples). These lines take the most:

- Address 0x4a9dfe, line 125, 3989 timer samples: m = vec_recip3(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)));
- Address 0x4a9d76, line 111, 3631 timer samples: c = _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(y, CC3),

It's pretty strange because I used SSE2 build options... Maybe the function name is not quite accurate? (Or maybe something is wrong with the profiler or with my understanding of its results.)

Next is fastcopy_I (1 function, 24 instructions, Total: 8714 samples, 13.72% of samples in the module, 3.63% of total session samples).

In the IPP DLL most samples hit ippsZero_8u (1 function, 1160 instructions, Total: 46923 samples, 99.89% of samples in the module, 19.55% of total session samples). This instruction leads:

- Address 0x200ede9, code bytes 0F 28 4C 32 10, instruction movaps xmm1,[edx+esi+10h], symbol ippsZero_8u+1473833, 5965 timer samples (1 instruction, Total: 5965 samples, 4.87% of samples in module p:\bin\intel\ipp\5.3\ia32\bin\ippst7-5.3.dll, 0.99% of total session samples)

As one can see, it's almost the only function called in the whole DLL (very strange too). It was a 240 sec profiling run. What time scale is best suited to profiling all the main app activities? I will try to increase the profiling time; maybe that will give more adequate results.

An addendum: sse_sum2,3,4,5 and sse_f_GetPeak have the highest number of unaligned accesses. sse3_ChirpData_ak and fastcopy_I have the most data cache misses.
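On the unaligned accesses mentioned in the addendum: aligned SSE loads and stores such as movaps need their address on a 16-byte boundary. The following is only a minimal sketch of how a buffer pointer could be checked for that; is_aligned16 and floats_until_aligned16 are hypothetical helper names, not functions from the SETI sources, and the sketch assumes 4-byte floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p sits on a 16-byte boundary,
// which aligned SSE loads/stores (e.g. movaps) require.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: number of leading floats to handle with scalar code
// before p reaches a 16-byte boundary (0 if p is already aligned).
inline std::size_t floats_until_aligned16(const float* p) {
    // Precondition: p is at least float-aligned.
    assert(reinterpret_cast<std::uintptr_t>(p) % sizeof(float) == 0);
    std::uintptr_t misalign = reinterpret_cast<std::uintptr_t>(p) & 15u;
    return misalign == 0 ? 0 : (16u - misalign) / sizeof(float);
}
```

A routine that is handed an arbitrary offset into a buffer could use the second helper to peel off a few scalar iterations and then run its aligned vector loop on the rest.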
« Last Edit: 22 Nov 2007, 06:39:19 pm by Raistmer »
|
Josef W. Segur
|
Well, some initial results. Most time my version spends in sse3_ChirpData_ak ... It's pretty strange cause I used SSE2 build options... Maybe function name not quite adequate?... (or maybe smth wrong with profiler or my understanding of its results)
The program design is to build all the hand-optimized code with at least whatever minimum options are required, then use run-time testing of the host to decide which of those routines to test. So the opt_SSE3.cpp module is built with its needed SSE3 setting, your CPU supports SSE3, and it tests faster than the other chirp routines on your system, so it is chosen as the one to use during actual crunching.

The fraction of time spent chirping is very much affected by the angle range. The reason WUs at high angle range are quick is that they do no Gaussian fitting and not much Pulse or Triplet finding. Chirping is also reduced, but not so much, so it becomes more of the total run time.

I don't know CodeAnalyst, so I don't understand the "19.37% of samples in the module, 5.12% of total session samples" distinction. Joe
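The selection scheme described above (build every variant, test the host, time the usable variants, keep the fastest) might look roughly like this. It is a sketch under assumptions, not the application's actual code: Routine, Candidate and pick_fastest are hypothetical names, the real chirp routines take different arguments, and the CPU feature check is assumed to have been done elsewhere and stored in the supported flag.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Signature shared by all candidate routines (hypothetical).
using Routine = void (*)(float* data, std::size_t n);

struct Candidate {
    const char* name;   // label for reporting
    Routine fn;         // implementation to time
    bool supported;     // result of a CPU feature check done elsewhere
};

// Time each supported candidate on the same buffer and return the index of
// the fastest one, or -1 if none is supported.
inline int pick_fastest(const std::vector<Candidate>& candidates,
                        std::vector<float>& scratch, int repeats = 5) {
    assert(repeats > 0);
    int best = -1;
    std::chrono::steady_clock::duration best_time{};
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (!candidates[i].supported) continue;  // never run unsupported code
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r)
            candidates[i].fn(scratch.data(), scratch.size());
        auto elapsed = std::chrono::steady_clock::now() - start;
        if (best < 0 || elapsed < best_time) {
            best = static_cast<int>(i);
            best_time = elapsed;
        }
    }
    return best;
}
```

With a scheme like this, a routine named for SSE3 can end up in the profile of a build made with SSE2 options, because only the one module containing it needs the higher setting.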
|
Jason G
|
...so don't understand the "19.37% of samples in the module, 5.12% of total session samples" distinction. Joe
As well as the familiar/traditional instrumented 'device under test' style of profiling, vTune (and, I guess from this data, CodeAnalyst too) collects the OS/system counters, so data is available on all processes/threads running at the time of the test.

Without having seen the rest of the data (and presuming time-based sampling was used rather than event-based sampling): from the given information, if it were vTune, for the module/process which spent 20% of its time in the chirp routine, that 20% self time constituted about 5% of system time. This 'might' imply that the total self time of the module makes up 25% of the system time. That might suggest a single-threaded module going full pelt (constant 100% usage) on 1 core of a quad. Constant 100% usage would be one of the first system-level optimisation goals. Once that goal is reached, move on to the further optimisation levels: application architecture level, then microarchitecture level.

Otherwise, if it's a dual or single core, then it may be using less than 100% of the available system CPU time: either other processes were running and taking system resources during the profile (you can diagnose system problems like this), or the module is IO or memory bound (which might suggest deeper optimisation once system problems are eliminated).

Again, those are just guesses / general guidelines without looking at other data. At system level, for example, what proportion the total module samples were of the total system samples, and especially a CPU usage graph by module, might tell you that you forgot to stop BOINC (done it many times), that maybe a virus scan had started, or a Windows update, or maybe you were watching a DVD? LOL (joke) Jason
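The "20% of the module is about 5% of the system, so the module is about 25% of the system" step can be written out as a small calculation. This is only a sketch of the arithmetic, assuming both percentages refer to the same set of samples; module_share_of_session is a hypothetical name.

```cpp
#include <cassert>

// Share of all session samples that belong to one module, derived from a
// single function's two percentages:
//   pct_of_module  = function samples / module samples  * 100
//   pct_of_session = function samples / session samples * 100
// so module / session = pct_of_session / pct_of_module.
inline double module_share_of_session(double pct_of_module, double pct_of_session) {
    assert(pct_of_module > 0.0);
    return 100.0 * pct_of_session / pct_of_module;
}
```

With the figures quoted for sse3_ChirpData_ak (19.37% and 5.12%), this gives roughly 26% for the SETI module, in line with the estimate of about 25%. Applying the same ratio to the ippsZero_8u figures posted earlier (99.89% and 19.55%) gives roughly 19.6% for the IPP DLL, if the profiler reports it as a separate module.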
« Last Edit: 23 Nov 2007, 01:59:14 am by j_groothu »
|
Raistmer
|
Not watched DVD. The host under test is an AMD 64 3200 Venice; SSE3 support is indeed available. Yes, with a timer-based profile CodeAnalyst gathers data on the whole system. Yes, BOINC was running in the background (the Einstein project at that very time). I'm interested only in the time distribution inside the SETI exe and the IPP DLL, so I didn't care about stopping/restarting BOINC during the test. That makes "total system time %" meaningless, sure. But SETI should take ~50% of CPU time in this situation, not just 25%. Maybe CodeAnalyst counted the IPP DLL as a distinct module?

Work Unit Info: True angle range: 0.405774

Any comments about why ippsZero_8u takes the most time, please? And (according to the DLL name) it seems the IPP dispatcher chose the "standard" library version, not one of the specifically optimized ones (not w7, for example).
|
Jason G
|
Not watched DVD ... But SETI should take ~50% of CPU time in this situation, not just 25%. ....
Right, so single core (like my non-HT P4). More guesses:
- ~50%: 1 Einstein task
- ~1 to 5%: BOINC (it is higher because of context switching on a single core; I've measured it, see below)
- ~2 to 20%: CodeAnalyst (a high sampling rate increases the load)
- ~2 to 10%: other system/kernel drivers & services
- Subtotal: 55% to 85%, average ~70%
- Remaining: 45% to 15%, average 30%, for your SETI run

So before you can move on to a deeper optimisation level, you need to measure/graph with CodeAnalyst whatever its equivalents are of these system counters (vTune names):
1) With BOINC + Einstein + your SETI task (same conditions as you used)
- "System: Processor Queue Length"
- "System: Context Switches/sec" might also be helpful
2) Without BOINC + Einstein, just your SETI task
- "System: Processor Queue Length"
- "System: Context Switches/sec" might also be helpful

Maybe some memory usage figures might show something too, if you have limited physical RAM etc.

"System: Processor Queue Length" (vTune name) gives a reading of how many NON-IDLE threads are waiting in the queue for CPU time. On my 2.0GHz non-HT P4 this typically averages about 5 with a seti run (but no boinc+seti). That means I could benefit, for the software I run, from a dual core of at least 2GHz, preferably a bit more, to bring it into the range of 1 to 2. (A fast quad would probably be wasted on me, but would give practically every running thread, on average, a fresh whole core to itself...)

"System: Context Switches/sec" might also give an idea of how much priority competition is happening on your machine (threads/modules competing). You see this rise slightly during mouse moves, or when more active background programs poll for something regularly (e.g. speedfan, boincview); that looks like speed humps in the context switches/sec.

Any comments about why ippsZero_8u takes most time, please? And (according to the DLL name) it seems the IPP dispatcher chose the "standard" library version, not one of the specifically optimized ones (not w7, for example).
Mine spends some large times in a few of the IPP functions. When you get to do some application and architectural level performance measurement you will see the reasons; in some small way it might partially be related to the 'denormal data' issue you brought up before (take a look at the IPP flash tutorials about that). I've been thinking about ways to approach a custom (stripped down) FFTW build for a while now, but am not ready yet.

The use of the standard library, and the fact that it would be a DLL on a single core, would be an issue too (probably extra context switches / CPU queue length). It means that, like me, you'd probably benefit from an extra core, so if you need to justify more cores for Santa to bring, then "I need one for software development purposes" is probably a pretty good reason to add to the list.

IMO, from the measurements I get, it is a myth that software doesn't benefit from multicore or even HT yet. Who runs only 1 single-threaded process at a time? Only DOS! [And perhaps reviewers doing synthetic benchmarks.] The Windows OS handles all the thread switching much better with multicore or even HT, for DLLs and services. Even without boinc/seti running, system responsiveness and use of system resources would improve for us. Jason
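The CPU budget guessed at the start of this post can be checked by adding the low and high ends of each range. A small sketch of that arithmetic follows; Range, add, midpoint and remainder_of_100 are hypothetical names, and the input percentages are the guesses from the post, not measurements.

```cpp
#include <cassert>

struct Range { double lo, hi; };  // percent of one core: low and high guess

inline Range add(Range a, Range b) { return {a.lo + b.lo, a.hi + b.hi}; }
inline double midpoint(Range r) { return (r.lo + r.hi) / 2.0; }

// What is left of 100% once the other consumers are subtracted.
// The high guess for the others gives the low bound for the remainder.
inline Range remainder_of_100(Range others) {
    assert(others.lo <= others.hi && others.hi <= 100.0);
    return {100.0 - others.hi, 100.0 - others.lo};
}
```

Feeding in 50 to 50 (Einstein), 1 to 5 (BOINC), 2 to 20 (CodeAnalyst) and 2 to 10 (other system load) reproduces the 55% to 85% subtotal and leaves 15% to 45% for the SETI run.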
« Last Edit: 23 Nov 2007, 07:18:40 am by j_groothu »
|
Raistmer
|
Yes... but if it were multicore there would be multiple seti/einstein processes to eat the CPU too. It seems my version is still not suitable for profiling; it is better suited to debugging: checkpointing is broken.
|
Jason G
|
Yes.... but if it would be multicore there were multi seti/einstein processes to eat CPU too. It seems my version still not appropriate for profiling, it better suits for debugging - checkpointing broken.
LOL, good point, though you would tend to use the profile data from fully loaded cores just for overall system performance analysis rather than for program profile information. You would stop BOINC for a deeper module profile so as not to obscure the run. Checkpointing? Sounds like a boincapi problem, maybe.
« Last Edit: 23 Nov 2007, 04:39:55 pm by j_groothu »
|
Josef W. Segur
|
...checkpointing broken.
The default checkpoint interval is 300 seconds. When running with BOINC, the "Write to disk" preference overrides that; when running standalone you need to use an init_data.xml file to supply that and maybe a useful memory size. The knabench package has a suitable one, but I often use this simpler one:
-----------------------------------------------------------------------------
<app_init_data>
<wu_cpu_time>0</wu_cpu_time>
<checkpoint_period>60.000000</checkpoint_period>
<host_info>
<m_nbytes>134217728.000000</m_nbytes>
</host_info>
</app_init_data>
-----------------------------------------------------------------------------
Joe
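For test runs with different checkpoint intervals, a file like the one above could be generated rather than edited by hand. The following is a minimal sketch only: make_init_data is a hypothetical helper, it emits just the elements shown in the post, and whether the application needs further elements is not covered here.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: build the text of a minimal init_data.xml with the
// same elements as the file quoted above.
inline std::string make_init_data(double checkpoint_period_s, double mem_bytes) {
    assert(checkpoint_period_s > 0.0 && mem_bytes > 0.0);
    char buf[512];
    std::snprintf(buf, sizeof buf,
        "<app_init_data>\n"
        "<wu_cpu_time>0</wu_cpu_time>\n"
        "<checkpoint_period>%f</checkpoint_period>\n"
        "<host_info>\n"
        "<m_nbytes>%f</m_nbytes>\n"
        "</host_info>\n"
        "</app_init_data>\n",
        checkpoint_period_s, mem_bytes);
    return std::string(buf);
}
```

The values in the post correspond to make_init_data(60.0, 134217728.0), that is, a 60 second checkpoint period and 128 MiB of memory.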
|
Raistmer
|
Thank you, but my app does write a checkpoint (maybe only every 300 sec, but it does). It can't restore the computation state from the saved data; that is what I meant when I wrote "checkpointing broken".
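A restore failure of this kind can sometimes be narrowed down outside the application with a save/restore round-trip check. The sketch below only illustrates the idea: State and its fields are hypothetical, the real checkpoint holds different data in a different format, and save_state, restore_state and round_trips are made-up names.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical state; the real application's checkpoint holds other fields.
struct State {
    long chirp_index = 0;
    double progress = 0.0;  // fraction done, 0..1
};

inline std::string save_state(const State& s) {
    assert(s.progress >= 0.0 && s.progress <= 1.0);
    std::ostringstream out;
    out.precision(17);  // enough digits to round-trip a double
    out << s.chirp_index << ' ' << s.progress << '\n';
    return out.str();
}

// Returns false when the text cannot be parsed, leaving s unchanged.
inline bool restore_state(const std::string& text, State& s) {
    std::istringstream in(text);
    State tmp;
    if (!(in >> tmp.chirp_index >> tmp.progress)) return false;
    s = tmp;
    return true;
}

// True when saving and then restoring reproduces the same state.
inline bool round_trips(const State& s) {
    State r;
    return restore_state(save_state(s), r)
        && r.chirp_index == s.chirp_index && r.progress == s.progress;
}
```

If the round trip passes in isolation but a restart still fails, the problem is more likely in when or where the file is read than in the format itself.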
« Last Edit: 25 Nov 2007, 04:31:49 am by Raistmer »