Seti@Home optimized science apps and information
 
Author Topic: optimized sources  (Read 39494 times)
_heinz
Code Wizard
Knight who says 'Ni!'
*****
Online Online

Posts: 692


View Profile
Re: optimized sources
« Reply #240 on: 05 Nov 2007, 11:53:36 am »

Surprise, surprise: a QxN build is faster on my Northwood.
LOL
I have a Northwood too:
CPU(s)   
Number of CPUs 1
 
Name Intel Pentium 4
Code Name Northwood
Specification Intel(R) Pentium(R) 4 CPU 2.66GHz
Family / Model / Stepping F 2 7
Extended Family / Model 0 0
Brand ID 9
Package mPGA-478
Core Stepping C1
Technology 0.13 um
Supported Instructions Sets MMX, SSE, SSE2
CPU Clock Speed 2672.8 MHz
Clock multiplier x 20.0
Front Side Bus Frequency 133.6 MHz
Bus Speed 534.6 MHz
L1 Data Cache 8 KBytes, 4-way set associative, 64 Bytes line size
L1 Trace Cache 12 Kuops, 8-way set associative
L2 Cache 512 KBytes, 8-way set associative, 64 Bytes line size
L2 Speed 2672.8 MHz (Full)
L2 Location On Chip
L2 Data Prefetch Logic yes
L2 Bus Width 256 bits
-----------------------------------------------------------------------------------------
Let us speed up the old machines.


Logged
Jason G
Global Moderator
Knight who says 'Ni!'
*****
Online Online

Posts: 1684


View Profile
Re: optimized sources
« Reply #241 on: 05 Nov 2007, 12:21:38 pm »

Boincstats host CPUs, top 10 by number of hosts on SETI@home:

Pos.  CPU                                      Hosts        Total Credit
 1    Intel(R) Pentium(R) 4 CPU 3.00GHz      104,449    1,920,980,979.29
 2    Intel(R) Pentium(R) 4 CPU 2.80GHz       88,848    1,254,181,274.59
 3    Intel(R) Pentium(R) 4 CPU 2.40GHz       57,309      633,952,931.43
 4    Intel(R) Pentium(R) 4 CPU 3.20GHz       45,737      875,822,530.51
 5    AMD Athlon(tm) 64 Processor 3000+       31,878      257,872,702.50
 6    AMD Athlon(tm) 64 Processor 3200+       30,304      288,741,370.07
 7    AMD Athlon(tm) Processor                27,726      129,774,610.58
 8    Intel(R) Pentium(R) 4 CPU 2.00GHz       21,701      197,541,843.70
 9    Intel(R) Pentium(R) 4 CPU 2.66GHz       19,200      208,668,039.95
10    AMD Athlon(tm) 64 Processor 3500+       19,049      191,994,766.55

We're both in the top 10 most popular. I have a #8 and a #4. [Doesn't it feel good to know you're with the 'in crowd'?]

[Must get around to trying to strip-mine those inner pulse folding loops for the P4 64k / 1 MB aliasing problem]
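
A minimal sketch of what strip mining a folding loop could look like, assuming the fold simply sums three float rows into a destination row. The function name `fold3_strip_mined` and the `strip` parameter are hypothetical and not taken from pulsefind.cpp. It shows only the loop structure; the strip length that actually avoids the P4 aliasing penalty would have to be found by measurement.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper, not from pulsefind.cpp: sums three rows into dest
// one strip at a time and returns the largest sum seen (starting from 0,
// so it assumes non-negative data). Working on a short strip keeps the set
// of live addresses small, which is the property strip mining is after.
float fold3_strip_mined(const float* one, const float* two, const float* three,
                        float* dest, std::size_t n, std::size_t strip) {
    if (strip == 0) strip = 1;                       // guard against an endless loop
    float tmax = 0.0f;
    for (std::size_t base = 0; base < n; base += strip) {   // outer loop: one strip
        const std::size_t end = std::min(base + strip, n);
        for (std::size_t i = base; i < end; ++i) {          // inner loop: inside the strip
            dest[i] = one[i] + two[i] + three[i];
            if (dest[i] > tmax) tmax = dest[i];
        }
    }
    return tmax;
}
```

The outer loop walks the strips and the inner loop does the same work as the unblocked fold, so results are identical for any strip length; only the memory access pattern changes.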
« Last Edit: 05 Nov 2007, 12:31:43 pm by j_groothu » Logged
_heinz
« Reply #242 on: 05 Nov 2007, 10:02:53 pm »

It is worth speeding them up.

Although Dr. Who is already running his code, we give the old boxes a chance.

I squeezed the code of pulsefind.cpp again;
sum1 and sum2 are no longer needed.

Here is the case construct:
  switch (i) {
//    case 30:
//      sum1 = one[29] + two[29];           sum2 = one[28] + two[28];
//      sum1 += three[29];                  sum2 += three[28];
//      P->dest[29] = sum1;                 P->dest[28] = sum2;
//      if (sum1 > tmax1) tmax1 = sum1;     if (sum2 > tmax2) tmax2 = sum2;
 //seti_britta: new code:
    case 30:
      P->dest[29]= one[29] + two[29]+three[29];           P->dest[28]= one[28] + two[28]+three[28];
 //     sum1 += three[29];                  sum2 += three[28];
 //     P->dest[29] = sum1;                 P->dest[28] = sum2;
      if (P->dest[29] > tmax1) tmax1 = P->dest[29];     if (P->dest[28] > tmax2) tmax2 = P->dest[28];

and so on for all cases
----------------------------------------------------------------------------------------------------------------------------------------------------

and here the loop construct
// ----------------------------------------------------------------------------
//   Function   : sum_func_ptt( sw_sum3_t31 )
//   Type       : float
//   Contents   : folding subroutines, FPU optimized
//   Parameter  : sw_sum3_t31
//   Last update: 23.09.2007   by: seti_britta   new function
// ----------------------------------------------------------------------------
sum_func_ptt( sw_sum3_t31 ) {
  register int i, j, k;
  float tmax2, tmax1; //seti_britta: new
  float *one   = ss[0];
  float *two   = ss[0]+P->tmp0;
  float *three = ss[0]+P->tmp1;
  tmax2 = tmax1 = (0.0f); //seti_britta: no convert !!
  i = P->di;
  if ( i & 1 )  // odd length: fold the last element on its own
  {
    i -= 1;
    P->dest[i] = tmax1 = one[i] + two[i] + three[i]; //seti_britta:new
  }
  // fold the remaining elements in pairs, tracking two running maxima
  for ( j = i-1, k = i-2; j > 0; j -= 2, k -= 2 )
  {
    P->dest[j] = one[j] + two[j] + three[j];        P->dest[k] = one[k] + two[k] + three[k];
    if (P->dest[j] > tmax1) tmax1 = P->dest[j];     if (P->dest[k] > tmax2) tmax2 = P->dest[k];
  }
  if (tmax1 > tmax2) return tmax1;
  return tmax2;
}
-------------------------------------------------------------------------------------------------------------------------------------------
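
For measuring outside the application, the loop construct can be exercised on its own. The sketch below is a rewrite under assumptions: `fold3_max` and its plain pointer parameters are hypothetical stand-ins for the `sum_func_ptt` macro, `ss[0]`, `P->tmp0`, `P->tmp1`, `P->di` and `P->dest`, whose real definitions are not shown in this thread.

```cpp
// Hypothetical standalone version of the loop construct: adds three rows
// element-wise into dest and returns the maximum sum. Two independent
// running maxima (tmax1, tmax2) mirror the paired layout of the original.
// As in the original, the maxima start at 0, so non-negative data is assumed.
float fold3_max(const float* one, const float* two, const float* three,
                float* dest, int n) {
    float tmax1 = 0.0f, tmax2 = 0.0f;
    int i = n;
    if (i & 1) {                        // odd length: handle the last element alone
        i -= 1;
        dest[i] = tmax1 = one[i] + two[i] + three[i];
    }
    for (int j = i - 1, k = i - 2; j > 0; j -= 2, k -= 2) {
        dest[j] = one[j] + two[j] + three[j];
        dest[k] = one[k] + two[k] + three[k];
        if (dest[j] > tmax1) tmax1 = dest[j];
        if (dest[k] > tmax2) tmax2 = dest[k];
    }
    return (tmax1 > tmax2) ? tmax1 : tmax2;
}
```

Comparing its output against a plain one-element-at-a-time loop on the same input is a quick way to check that a squeezed variant still produces identical sums before timing it.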
Maybe the compact loop has a chance.
So far it compiles well; now we must measure to find the fastest variant.
Have fun.
Regards, heinz
Logged
Jason G
« Reply #243 on: 06 Nov 2007, 02:32:21 am »

Yes, I think I would like to carefully go back and re-examine Joe's ideas and posts in the other thread for incorporating 3-phase processing / block prefetch in some places. I'll get a chance to look next weekend, and hopefully plan a methodical approach that might be able to handle striping for the P4 at the same time.

Intel theories suggest a possible 3 to 5 times improvement in certain code from fixing those P4 problems, and the 3-phase and prefetch techniques [à la the AMD paper] even more. If it adds up to a 10 to 20% crunch time improvement I'll be happy, because it would bring my P4 3.2 back over 1000 RAC.

« Last Edit: 06 Nov 2007, 05:18:39 am by j_groothu » Logged
Jason G
« Reply #244 on: 06 Nov 2007, 07:34:37 am »

Progress so far (long way to go):
[Each compared against preset 2.3S9 xW SSE2 IPP build, on vs2005/ICC, p4 Northwood 2.0A@2.1GHz, NoHT, WinXP]

Tactic                                                              Type              Status        Effect
1- Better memcpy in GetFixedPot                                     Generic x86       Prelim Tests  ~0.3%
2- Out of Place FFTs / eliminating associated memcopies             Intel IPP         Initial       ~?.?%
3- Once off seti.cpp 8meg memcpy                                    Generic x86       Untested      ~0.?%
4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp  Generic x86       Untested      ~?.?%
5- Compiler Flags (xN SSE2 p4 Specific)                             P4 specific       Tested        ~10%
6- Strip Mined Inner loops (p4 specific, 64k & 1M variants)         P4, possible x86  Untested      ~??%
7- GaussFit Improvements                                            To be Determined

~ means approximate, my system, 'your mileage may vary'.

[Please anyone feel free to suggest additions, updates or corrections to this list:
            either fairly generic OR p4 specific will do. Consider equivalent xP SSE3 builds as already on the list for later]

Jason
« Last Edit: 06 Nov 2007, 09:54:55 am by j_groothu » Logged
Jason G
« Reply #245 on: 07 Nov 2007, 07:13:57 am »

Quote
4- Chirp function Block Prefetch, memcpy++ zerocase & 3phase chirp                  Generic x86   Untested        ~?.?%

Took a quick look between school and work; it looks like this may be easier to try than I thought. On my configuration the consistently selected chirping function is the outstanding "sse2_ChirpData_ak". Nice one.

The structure is already there for potential 3-phase processing, though it is currently straight SSE2, which makes it vectorised SIMD as far as I can see. The existing prefetch, processing and writing sections are all SSE2, clearly laid out, and have the clean, crystal-vase-like 'niceness' that makes you reluctant to tamper.

With a few other adaptations, adjusting the prefetch, changing the processing to FPU, and suitably adjusting the streaming writes should do the trick,
  ... though for the P4 I would like to keep the aliasing issue in mind, which might just dictate some of the block sizes and the order in which they are processed.
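
As a rough illustration of the three-phase idea (block prefetch, processing, write-out), here is a portable sketch. It is not `sse2_ChirpData_ak`: the function `scale_three_phase`, the block size of 1024 floats, the assumed 64-byte cache line and the simple scaling in phase 2 are all assumptions made for the example. A real version would likely replace the phase 3 copy with streaming stores and pick block sizes by measurement.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical three-phase loop. For each block:
//   phase 1 reads one float per (assumed) 64-byte cache line to pull the block into cache,
//   phase 2 computes into a small temporary buffer that stays cache-resident,
//   phase 3 writes the finished block out in one sequential burst.
void scale_three_phase(const float* src, float* dst, std::size_t n, float factor) {
    constexpr std::size_t kBlock = 1024;               // floats per block (assumed, needs tuning)
    constexpr std::size_t kLine = 64 / sizeof(float);  // floats per assumed cache line
    float tmp[kBlock];
    volatile float sink = 0.0f;     // keeps the touch reads from being optimized away

    for (std::size_t base = 0; base < n; base += kBlock) {
        const std::size_t len = std::min(kBlock, n - base);
        for (std::size_t i = 0; i < len; i += kLine)   // phase 1: block prefetch by touching
            sink = src[base + i];
        for (std::size_t i = 0; i < len; ++i)          // phase 2: process
            tmp[i] = src[base + i] * factor;
        std::copy(tmp, tmp + len, dst + base);         // phase 3: write
    }
    (void)sink;
}
```

Keeping the three phases in separate inner loops is the point of the technique: reads, arithmetic and writes each run as one uninterrupted stream instead of being interleaved element by element.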

Oh for the weekend!

« Last Edit: 07 Nov 2007, 07:22:50 am by j_groothu » Logged
Jason G
« Reply #246 on: 07 Nov 2007, 11:05:59 am »

First run of the original code (I will need more runs for a baseline, though). It is a very nice function already:

--------------------------------------------------------------------------------------
Testing xN SSE2 Build.

sse2_ChirpData_ak:

NumDataPoints = 1024*1024
test_points = 32768

Timer Frequency in:

Hz  =       3579545
MHz =       3.57955
GHz =    0.00358

Start Time =    1585115997106 Ticks
Stop Time  =    1585116003199 Ticks

Duration in Ticks   =  6093
Duration in seconds =  0.0017021716447

--------------------------------------------------------------------------------------

Inner loop executes 8192 times
« Last Edit: 07 Nov 2007, 11:10:42 am by j_groothu » Logged
_heinz
« Reply #247 on: 07 Nov 2007, 11:47:04 am »

Measuring is the best way to try code and find optimal variants.

The loop construct in pulsefind.cpp is ready now, but not measured.
Today I will squeeze the case-construct code.
I still have some good ideas to eliminate code here and there; we will see.

Logged
Jason G
« Reply #248 on: 07 Nov 2007, 12:14:29 pm »

Quote
Measuring is the best way to try code and find optimal variants.

The loop construct in pulsefind.cpp is ready now, but not measured.
Today I will squeeze the case-construct code.
I still have some good ideas to eliminate code here and there; we will see.

Great! A pulsefind baseline will be good too. Underneath pulsefind, it seems my machine also always selects the AK folding routines and spends much of its time in the x2AL version. I am running VTune on the chirp one now to look for any P4-specific slowdowns. Wickedly fast code, though.
Logged
_heinz
« Reply #249 on: 07 Nov 2007, 01:55:39 pm »




Quote
I am running vtune on the chirp one now to look for any p4 specific slowdowns, wickedly fast code though

I have a heavily modified chirpfft.cpp which we can try too.
Logged
_heinz
« Reply #250 on: 07 Nov 2007, 04:47:27 pm »

We can now easily compile all 3 cases by setting a preprocessor definition:
---------------------------------------------------------------------------------------------------
// USE_PFLOOP  --> preprocessor definition
// USE_PFCASE  --> preprocessor definition
#if defined( USE_PFLOOP )
   #pragma message ("-----PFLOOP-----")
   #include "pfloop.h" //use the loop-construct
#elif defined( USE_PFCASE )
   #pragma message ("-----PFCASE-----")
   #include "pfcase.h" //use the modified case-construct
#else
   //use original code
#endif // USE_PFLOOP / USE_PFCASE
-----------------------------------------------------------------------------------------
------ Build started: Project: seti_boinc, Configuration: Release32-NOGFX Win32 ------
Compiling...
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.20404 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
cl /Od /Ob2 /Oi /Ot /Oy /GT /I "." /I "../../../boinc/api" /I "../../../boinc/client/win" /I "../../../boinc/lib" /I ".." /I "glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\db" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\jpeglib" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\image_libs" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX" /I "C:\I\SC\vs90\boinc" /I "C:\I\SC\vs90\boinc\api" /I "C:\I\SC\vs90\boinc\client\win" /I "C:\I\SC\vs90\boinc\lib" /D "WIN32" /D "_WIN32" /D "_WINDOWS" /D "NBOINC_APP_GRAPHICS" /D "CLIENT" /D "_MT" /D "USE_IPP" /D "USE_SSE2" /D "_DEBUG" /D "USE_PFLOOP" /D "_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /Gm /EHsc /MTd /Zp16 /Gy /Fp".\Release/seti_boinc.pch" /Fo".\Release32-NOGFX\\" /Fd".\Release32-NOGFX\vc90.pdb" /FR".\Release32-NOGFX\\" /W3 /c /Wp64 /Zi /TP "..\pulsefind.cpp"
pulsefind.cpp
-----PFLOOP-----
..\pulsefind.cpp(1487) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 1 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

Regards

Logged
Jason G
« Reply #251 on: 08 Nov 2007, 04:50:05 am »

Quote
have a strong modified chirpfft.cpp which we can try  too

Good, we'll do that; I think it is a very good idea. I have P4 SSE2 primary performance data (VTune) for sse2_ChirpData_ak: 10000 loops on a P4 Northwood with 512k L2 cache, which took a total execution time of 10 secs (19 runs' worth of data gathered).
(preliminary data, subject to verification with further runs)
   64k aliasing: almost none. Accounts for 1.34% of the function workload (about 0.13 secs)
   Second-level cache misses: account for 10.28% of the workload (about 1 second)

Other statistics (preliminary, subject to verification):
128-bit MMX instructions: ~82 million (no 64-bit MMX instructions counted)
Packed double-precision floating-point SSE instructions: ~1.4 billion (thousand million)
Packed single-precision floating-point SSE instructions: ~4 billion (thousand million)

Mispredicted branches = 0 !!!

No machine clear counts (pipeline flushes), split loads or blocked store forwards at all.

I think that's a really good function, with much better statistics than the pulse folding functions gave me, but I'll have to retest those in isolation too, as I'm getting better at selecting the correct compiler settings and at driving VTune.

Well, I'll check a few build settings and run the primary performance measures again to verify those results, and add secondary performance indicators to see what else turns up. Then on the weekend I may fiddle with that 3-phase idea to see if it actually works. All good fun.

Jason


« Last Edit: 08 Nov 2007, 05:06:50 am by j_groothu » Logged
_heinz
« Reply #252 on: 08 Nov 2007, 12:12:38 pm »

the modified PFCASE is ready now
-----------------------------------------------
------ Build started: Project: seti_boinc, Configuration: Release32-NOGFX Win32 ------
Compiling...
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.20404 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
cl /Od /Ob2 /Oi /Ot /Oy /GT /I "." /I "../../../boinc/api" /I "../../../boinc/client/win" /I "../../../boinc/lib" /I ".." /I "glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\db" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\glut" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\jpeglib" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\Optimizer" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\image_libs" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build" /I "C:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX" /I "C:\I\SC\vs90\boinc" /I "C:\I\SC\vs90\boinc\api" /I "C:\I\SC\vs90\boinc\client\win" /I "C:\I\SC\vs90\boinc\lib" /D "WIN32" /D "_WIN32" /D "_WINDOWS" /D "NBOINC_APP_GRAPHICS" /D "CLIENT" /D "_MT" /D "USE_IPP" /D "USE_SSE2" /D "_DEBUG" /D "USE_PFCASE" /D "_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /Gm /EHsc /MTd /Zp16 /Gy /Fp".\Release/seti_boinc.pch" /Fo".\Release32-NOGFX\\" /Fd".\Release32-NOGFX\vc90.pdb" /FR".\Release32-NOGFX\\" /W3 /c /Wp64 /Zi /TP "..\pulsefind.cpp"
pulsefind.cpp
-----PFCASE-----
..\pulsefind.cpp(1487) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
Build log was saved at "file://c:\I\SC\vs90\seti_boinc_2k3_2.2B-Ben-Joe\client\win_build\Release32-NOGFX\BuildLog.htm"
seti_boinc - 0 error(s), 1 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
Logged
_heinz
« Reply #253 on: 08 Nov 2007, 09:50:55 pm »

The modified PFCASE rocks.

Here is how it was before:
ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5    Totals:  Ratio            Ticks
             standard:  1.000      87303043476
Plan < 512 FPU swi ! :  0.575      50201832416
 Plan < 512 AK SSE ! :  0.634      55338411648
Plan < 512 BHx SSE ! :  0.993      86661631716
 Plan < 512 BH SSE ! :  0.774      67545465584

With the modified PFCASE:
ar=0.435000 done. Total flop count: 108711033335.208650

PulTimB 0.5    Totals:  Ratio            Ticks
             standard:  1.000      87387438720
Plan < 512 FPU swi ! :  0.504      44014700492
 Plan < 512 AK SSE ! :  0.633      55324520388
Plan < 512 BHx SSE ! :  0.992      86681643504
 Plan < 512 BH SSE ! :  0.773      67531081560
----------------------------------------------------------------------------------------------------
Modified PFCASE: ~13% faster on the "FPU swi" plan (ratio 0.575 before, 0.504 after)
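
The gain can be checked directly from the tick counts in the two tables. A small sketch; the helper names `time_saved` and `speedup` are hypothetical and only illustrate two common ways of expressing the same difference.

```cpp
// Fraction of run time saved when going from `before` ticks to `after` ticks.
double time_saved(double before, double after) {
    return (before - after) / before;
}

// Throughput gain: how many times more work fits into the same time.
double speedup(double before, double after) {
    return before / after;
}
```

For the "Plan < 512 FPU swi" row, `time_saved(50201832416.0, 44014700492.0)` is about 0.123 and `speedup(50201832416.0, 44014700492.0)` is about 1.14, so the ~13% figure sits between the two ways of counting.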
heinz
Logged
Jason G
« Reply #254 on: 09 Nov 2007, 01:45:24 am »

Woohoo, it's the weekend! Was that result with just the changes you made before? I'd guess the compiler may have vectorised some of that. I would like to look at the disassembly output: if the compiler was smart enough to combine prefetch, FPU processing and streaming stores, then that IS 3-phase. Anything is possible. Have you compared for accuracy as well?

« Last Edit: 09 Nov 2007, 01:50:00 am by j_groothu » Logged