Seti@Home optimized science apps and information
Topic: AVX Optimized App Development  (Read 33639 times)
Win95GUI
AVX Optimized App Development
« on: 31 Jan 2011, 05:28:46 pm »

Hey all,
I just wanted to see if anyone is working on using the AVX extensions in a future build.  There have been questions/comments flying about at S@H about this.  And yes, I am aware of the chipset bug that has surfaced recently.

Todd

BANZAI56
Re: AVX Optimized App Development
« Reply #1 on: 31 Jan 2011, 10:09:13 pm »

It will be interesting to watch and see how this will progress.

I say that as we're still watching the progress and development of the GPU apps.


Lots of talented folks here and y'all have my appreciation and thanks for what you do!

Win95GUI
Re: AVX Optimized App Development
« Reply #2 on: 01 Feb 2011, 01:17:25 am »

Please see this thread at S@H if you are interested in receiving hardware to develop this application.  Of course this stuff will still be mine and I do want it returned in a reasonable timeframe following the development efforts.  Or I could put it out there on the internet for your usage as long as need be.

http://setiathome.berkeley.edu/forum_thread.php?id=63033

Todd

_heinz
Re: AVX Optimized App Development
« Reply #3 on: 01 Feb 2011, 08:16:46 pm »

Hi,
in case you have not seen it, AVX is in preparation.
I have been working on it for a while, together with the ATOM build.
http://lunatics.kwsn.net/2-windows/optimized-sources.msg35172.html#msg35172

heinz

Frizz
Re: AVX Optimized App Development
« Reply #4 on: 14 Feb 2011, 03:27:46 pm »

AFAIK Jason is working on supporting AVX.

From what I understood, the only improvement in Intel's version of AVX, besides non-destructive instructions, will basically be to extend the 4 float operations to 4 double operations.

AMD (Bulldozer) will allow AVX to be used in a more flexible way: either 4 x double or 8 x float operations in parallel. And Bulldozer will support XOP and FMA4.

I always thought I would get a Sandy Bridge system as soon as it became available. But after the most recent facts + rumours (benchmarks) I will wait for Bulldozer and compare both platforms.

Question (for Jason?): It seems a lot of guys at the S@H forum have high hopes for AVX. But do you really think we will get such a tremendous speedup? I mean, does MB really use that much double precision?
« Last Edit: 14 Feb 2011, 05:04:07 pm by Frizz »

Please stop using these 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993

Jason G
Re: AVX Optimized App Development
« Reply #5 on: 14 Feb 2011, 04:05:29 pm »

Quote from: Frizz
Question (for Jason?): It seems a lot of guys at the S@H forum have high hopes for AVX. But do you really think we will get such a tremendous speedup? I mean, does MB really use that much double precision?

AVX supports 256 bit vectors of single floats AFAIK, and there are ample execution units in Sandy Bridge to handle the operations in parallel.  The problem is that the existing core code is hard-coded for 128 bit, so it requires a recode to 256 bit.  Relying on compilers to do that does not work.  Putting something in hardware to attempt to run those 128 bit ops in parallel is a nice idea, but dependencies won't allow full parallelism there, as you rarely get two 128 bit vector operations in a row that are not dependent somehow.  The required changes are algorithmic, high-level code ones.
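To put numbers on the vector widths being discussed: the element count per register is just the register width divided by the element width. A minimal sketch in plain C, assuming 4-byte float and 8-byte double as on x86. The helper name lanes is made up for this illustration and is not part of any SETI@home code.

```c
#include <stddef.h>

/* Hypothetical helper: how many elements of a given size fit in one
 * vector register of the given width in bits. */
static size_t lanes(size_t register_bits, size_t element_bytes)
{
    return register_bits / (element_bytes * 8u);
}
```

With these assumptions, lanes(128, sizeof(float)) is 4 and lanes(256, sizeof(float)) is 8, while lanes(256, sizeof(double)) is 4. So a 256 bit register holds eight single floats, not only four doubles.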

As is, even 128 bit vectors in SSE are challenging to program for 'properly', mostly due to the diversity of architectures, which vary significantly in memory/cache subsystem.  Poorly coded SSE+ tends to stall the cache anyway (e.g. crappy codec tearing ;) ).  That will only get harder as processors keep doubling performance every so often, while RAM only gets ~10% faster in the same timeframe.  You mitigate that with cache management, and *most* code doesn't do that well at all.
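As a rough illustration of what "cache management" can mean in code, here is a sketch that walks an array one page-sized block at a time and requests each cache line of the block before doing the arithmetic. Everything here is an assumption for illustration: the names sum_blocked, PREFETCH_READ, BLOCK_FLOATS and LINE_FLOATS are invented, a 64-byte cache line is assumed, and __builtin_prefetch is a GCC/Clang builtin (other compilers get a no-op). Whether explicit prefetching helps at all depends on the hardware and has to be measured.

```c
#include <stddef.h>

/* GCC and Clang provide __builtin_prefetch; elsewhere this is a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0) /* read, no temporal locality */
#else
#define PREFETCH_READ(p) ((void)0)
#endif

#define BLOCK_FLOATS 1024u /* one 4096-byte page of floats */
#define LINE_FLOATS  16u   /* assumes 64-byte cache lines */

/* Sums n floats, issuing prefetch hints block by block before the work. */
static float sum_blocked(const float *x, size_t n)
{
    float total = 0.0f;
    for (size_t base = 0; base < n; base += BLOCK_FLOATS) {
        size_t end = (n - base < BLOCK_FLOATS) ? n : base + BLOCK_FLOATS;
        /* Request every cache line of this block first ... */
        for (size_t i = base; i < end; i += LINE_FLOATS)
            PREFETCH_READ(&x[i]);
        /* ... then do the actual arithmetic on the block. */
        for (size_t i = base; i < end; i++)
            total += x[i];
    }
    return total;
}
```

The prefetch is only a hint and does not change the result; the point of the sketch is the block structure, which keeps the memory requests ahead of the arithmetic.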

Since the Intel and AMD patent sharing stuff is back on, and the CPUs show a remarkable convergence in some key aspects, especially the memory subsystem, it should be easier to juggle things into line for more portable hand-vectorised code.  3 operand instructions, combined with less code to worry about for the new class of machines, should see things go further.

So summing up, naive compiler-based 'optimisation' will not get the job done.  There's a lot of work to do to extract the potential of both architectures.
Jason
« Last Edit: 14 Feb 2011, 04:19:51 pm by Jason G »

Frizz
Re: AVX Optimized App Development
« Reply #6 on: 14 Feb 2011, 04:23:10 pm »

Quote from: Jason G
So summing up, naive compiler-based 'optimisation' will not get the job done.  There's a lot of work to do to extract the potential of both architectures.

Yes, I was thinking about hand optimized code (like it's already used in the Lunatics apps). My question was more: Will there really be any substantial advantage using AVX(Intel flavour)?

AFAIK Intel (Sandy Bridge) will not be able to split an FPU. So if they are running non-AVX code, their 8 256-bit FPUs are 8 128-bit FPUs. For Bulldozer, when they run non-AVX code, they have 16 128-bit FPUs.

So the only real benefit of AVX(Intel) will be that they can do 8 256-bit(double) instead of 8 128-bit(float) with SSE.

Hence my question: Is there really so much double precision code in MB?
« Last Edit: 14 Feb 2011, 04:25:50 pm by Frizz »

Jason G
Re: AVX Optimized App Development
« Reply #7 on: 14 Feb 2011, 04:36:54 pm »

Quote from: Frizz
Yes, I was thinking about hand optimized code (like it's already used in the Lunatics apps). My question was more: Will there really be any substantial advantage using AVX(Intel flavour)?

AFAIK Intel (Sandy Bridge) will not be able to split an FPU. So if they are running non-AVX code, their 8 256-bit FPUs are 8 128-bit FPUs. For Bulldozer, when they run non-AVX code, they have 16 128-bit FPUs.

That splitting into extra 128 bit FPUs was what I was angling at with the mention of dependencies.   Let's look at the dechirp from Astropulse for a clear example:

Quote
  #if TWINDECHIRP
   #define NUMPERPAGE 1024 // 4096/sizeof(float)
   // sign mask: flips the sign of elements 0 and 2 when XORed in
   static const __m128 NEG_S = {-0.0f, 0.0f, -0.0f, 0.0f};

   if (negredy != dm) {
     unsigned int kk, tlbmsk = fft_len*2-1;
     __m128 tmp1, tmp2, tmp3;
     float* tinP = temp_in_neg[0];

     for (kk=0; kk<fft_len*2; kk += NUMPERPAGE) {
       tlbt1 = dataP[(kk+NUMPERPAGE)&tlbmsk];   // TLB priming
       tlbt2 = chirpP[(kk+NUMPERPAGE)&tlbmsk];  // TLB priming
       // prefetch entire blocks, one 32 byte P3 cache line per loop
       for (i=kk+8; i<kk+NUMPERPAGE; i+=8) {
         _mm_prefetch((char*)&dataP[i], _MM_HINT_NTA);
         _mm_prefetch((char*)&chirpP[i], _MM_HINT_NTA);
       }
       // process 4 floats per loop
       // (lane comments list each pair with the higher-indexed element first)
       for (i=kk; i<kk+NUMPERPAGE; i+=4) {
         tmp1=_mm_load_ps(&chirpP[i]);            //  s,  c
         tmp2=_mm_load_ps(&dataP[i]);             //  i,  r
         tmp3=tmp1;
         tmp1=_mm_movehdup_ps(tmp1);              //  s,  s
         tmp3=_mm_moveldup_ps(tmp3);              //  c,  c
         tmp1=_mm_xor_ps(tmp1, NEG_S);            //  s, -s
         tmp3=_mm_mul_ps(tmp3, tmp2);             // ic, rc
         tmp2=_mm_shuffle_ps(tmp2, tmp2, 0xb1);   //  r,  i
         tmp1=_mm_mul_ps(tmp1, tmp2);             // rs,-is
         tmp2=tmp1;
         tmp2=_mm_add_ps(tmp2, tmp3);
         _mm_store_ps(&tinP[i], tmp2);            // ic+rs, rc-is
         tmp3=_mm_sub_ps(tmp3, tmp1);
         _mm_store_ps(&tempP[i], tmp3);           // ic-rs, rc+is
       }
     } //kk
     negredy = dm;
   }

Here you have dependent sequences of 128 bit instructions.  You must recode this entirely for 256 bit by hand.
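For anyone decoding the lane comments in the excerpt, here is a scalar sketch of what the two stores appear to compute per complex sample: the data multiplied by the chirp, and the data multiplied by the conjugate of the chirp. The interleaved layout (real part at even index, imaginary part at odd index) is read off the comments and is an assumption, and the names dechirp_scalar, out_mul and out_mul_conj are made up here; they stand in for the tinP and tempP stores.

```c
#include <stddef.h>

/*
 * Scalar sketch of the SSE loop's arithmetic.
 * Assumed layout: data[k] = r, data[k+1] = i, chirp[k] = c, chirp[k+1] = s.
 * n_floats is the number of floats, i.e. twice the number of complex samples.
 */
static void dechirp_scalar(const float *data, const float *chirp,
                           float *out_mul, float *out_mul_conj,
                           size_t n_floats)
{
    for (size_t k = 0; k + 2 <= n_floats; k += 2) {
        const float r = data[k],  i = data[k + 1];
        const float c = chirp[k], s = chirp[k + 1];

        /* (r + i*j) * (c + s*j): the "ic+rs, rc-is" store */
        out_mul[k]     = r * c - i * s;
        out_mul[k + 1] = i * c + r * s;

        /* (r + i*j) * (c - s*j): the "ic-rs, rc+is" store */
        out_mul_conj[k]     = r * c + i * s;
        out_mul_conj[k + 1] = i * c - r * s;
    }
}
```

Each output element needs both products of its sample, which is where the chain of dependent multiply, shuffle and add/sub steps in the vector version comes from.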

If you leave this as is, most of the 'legacy' 128 bit operations must be done in sequence to arrive at the correct answers, so executing more in parallel has to be done at a higher level, via a rewrite of the innermost loop that changes i+=4 to i+=8.  Architectural improvements will indeed make this 'legacy' code faster, but nowhere near as much as if it were rewritten to take advantage of 256 bit wide vectors & 3 operand instructions.
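A hedged sketch of what such a rewrite of the inner loop might look like with AVX intrinsics, stepping 8 floats per iteration. This is not the Lunatics or Astropulse code: the function name dechirp_wide and the output names are invented, unaligned loads and stores are used so the sketch makes no alignment assumption (the original uses aligned loads), the TLB priming and prefetching are left out, and nothing here has been benchmarked. The AVX path is only compiled when the compiler defines __AVX__ (for example with -mavx); otherwise the scalar tail loop does all the work.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Assumed layout: interleaved complex floats, data = (r, i), chirp = (c, s).
 * out_mul gets data * chirp, out_mul_conj gets data * conj(chirp). */
static void dechirp_wide(const float *data, const float *chirp,
                         float *out_mul, float *out_mul_conj,
                         size_t n_floats)
{
    size_t i = 0;
#if defined(__AVX__)
    /* Sign mask: flips the even elements, the 256 bit analogue of NEG_S. */
    const __m256 neg_even = _mm256_setr_ps(-0.0f, 0.0f, -0.0f, 0.0f,
                                           -0.0f, 0.0f, -0.0f, 0.0f);
    for (; i + 8 <= n_floats; i += 8) {             /* 8 floats per loop */
        __m256 ch = _mm256_loadu_ps(&chirp[i]);     /* c, s, c, s, ...   */
        __m256 d  = _mm256_loadu_ps(&data[i]);      /* r, i, r, i, ...   */
        __m256 ss = _mm256_movehdup_ps(ch);         /* s, s              */
        __m256 cc = _mm256_moveldup_ps(ch);         /* c, c              */
        __m256 dc = _mm256_mul_ps(cc, d);           /* rc, ic            */
        __m256 sw = _mm256_permute_ps(d, 0xB1);     /* i, r              */
        __m256 ds = _mm256_mul_ps(_mm256_xor_ps(ss, neg_even), sw); /* -is, rs */
        _mm256_storeu_ps(&out_mul[i],      _mm256_add_ps(dc, ds));  /* rc-is, ic+rs */
        _mm256_storeu_ps(&out_mul_conj[i], _mm256_sub_ps(dc, ds));  /* rc+is, ic-rs */
    }
#endif
    /* Scalar tail (and full fallback when AVX is not enabled). */
    for (; i + 2 <= n_floats; i += 2) {
        const float r = data[i],  im = data[i + 1];
        const float c = chirp[i], s  = chirp[i + 1];
        out_mul[i]          = r * c - im * s;
        out_mul[i + 1]      = im * c + r * s;
        out_mul_conj[i]     = r * c + im * s;
        out_mul_conj[i + 1] = im * c - r * s;
    }
}
```

Note that the explicit register copies of the SSE version (tmp3=tmp1, tmp2=tmp1) are gone: with non-destructive 3 operand forms each result can go to a fresh register. All the operations used stay within their 128 bit half of the register, which is why the SSE sequence carries over almost line for line.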

16x or 8x 32 bit wide FPUs working on this code would be starving either way, since the elaborate & slow mechanisms there are more to do with memory speed and triggering cache prefetches etc.

You could expect a pure AVX variant to have exactly half as many cache misses, due to exactly half the number of load requests.
« Last Edit: 14 Feb 2011, 04:40:54 pm by Jason G »

Frizz
Re: AVX Optimized App Development
« Reply #8 on: 14 Feb 2011, 04:51:47 pm »

I think we need a phone conference ... or a beer ... or both :D ... we are talking at cross purposes.

Let me put my question this way: AVX (Intel flavour) will not improve performance compared to existing SSE code, since all AVX does is extend 128bit(float) to 256bit(double). And we are not using much double precision in MB and AP. No?

Raistmer
Re: AVX Optimized App Development
« Reply #9 on: 14 Feb 2011, 04:53:54 pm »

@Frizz
Check your arithmetic.
SSE allows only 4 float instructions per register, not 8.

Frizz
Re: AVX Optimized App Development
« Reply #10 on: 14 Feb 2011, 04:57:57 pm »

Quote from: Raistmer
@Frizz
Check your arithmetic.
SSE allows only 4 float instructions per register, not 8.

Darn ... I had 4 first, then later modified it to 8 ... got confused with number of registers vs. floating point numbers per register ;)

Point is:

- AVX (Intel flavour) doesn't double the number of operations - it only doubles the width of the register files (128 -> 256)

- AVX (AMD flavour) allows splitting, so it effectively doubles the number of operations performed in parallel compared to SSE.
« Last Edit: 14 Feb 2011, 05:05:20 pm by Frizz »

Raistmer
Re: AVX Optimized App Development
« Reply #11 on: 14 Feb 2011, 04:59:52 pm »

Maybe. I haven't looked into the AVX ISA yet, I'm just reading and making corrections ;)

Jason G
Re: AVX Optimized App Development
« Reply #12 on: 14 Feb 2011, 05:04:54 pm »

Quote from: Raistmer
@Frizz
Check your arithmetic.
SSE allows only 4 float instructions per register, not 8.

Quote from: Frizz
Darn ... I had 4 first, then later modified it to 8 ;)

Point is:

- AVX (Intel flavour) doesn't double the number of operations - only doubles the width of the register files (128 -> 256)

- AVX (AMD flavour) allows to split, so effectively doubles the number of operations performed in parallel compared to SSE.

Which is what I am saying code dependencies prevent in legacy SSE code, unless the chip has a special magic loop unroller that will change the number of loop iterations.

Raistmer
Re: AVX Optimized App Development
« Reply #13 on: 14 Feb 2011, 05:07:42 pm »

AFAIK outlaw made an AVX build on the SETI forums.
But I haven't seen any benchmarks so far... This "just rebuild" approach could at least give a starting point, but for now we don't even have that.

Frizz
Re: AVX Optimized App Development
« Reply #14 on: 14 Feb 2011, 05:11:01 pm »

Quote from: Jason G
Which is what I am saying code dependencies prevent in legacy SSE code, unless the chip has a special magic loop unroller that will change the number of loop iterations.

I am aware of the fact that the code needs (more) hand optimization, ifdefs for AVX, Intel, AMD, etc. ... and that we don't get this for free (the magic loop unroller that you mentioned *g*).

Point is:

- It won't matter for Intel AVX (we still only have 4 operations in parallel)

- It might (will imho) matter for AMD AVX (we will have 8 operations in parallel)

No?