Topic: AVX Optimized App Development (Read 44009 times)
Raistmer
Re: AVX Optimized App Development
« Reply #15 on: 14 Feb 2011, 05:14:45 pm »

It can depend on how many cycles the CPU uses to do the same operation via an AVX register versus an XMM register.
Even if it does the same 4 operations, the speed could be different. An instruction set per se, without knowledge of the cost of each operation in CPU cycles, means nothing.

Frizz
« Reply #16 on: 14 Feb 2011, 05:27:43 pm »

Quote from: Raistmer
It can depend on how many cycles the CPU uses to do the same operation via an AVX register versus an XMM register.
Even if it does the same 4 operations, the speed could be different. An instruction set per se, without knowledge of the cost of each operation in CPU cycles, means nothing.

That's true.

Assuming both architectures use about the same number of CPU cycles, Bulldozer has at least the potential to be 2x faster compared to "old" SSE, while for Intel it won't matter.

By the way ... I'm still thinking about Jason's comment ("16x or 8x 32 bit wide FPUs working on this code would be starving either way") ... so true. And I still have to get used to it ... what I've learnt from my OpenCL experiments: "Keep the ALUs busy at all cost - avoid memory access" :) ... guess that will be true for SSE/AVX too.

Please stop using this 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993
Raistmer
« Reply #17 on: 14 Feb 2011, 05:33:37 pm »

Yes, good rule. On the GPU one has shared memory for managing access directly. On the CPU we have only cache and more or less implicit prefetches (quite implicit, actually, due to hardware prefetching). So CPU memory access is even more tricky ;)

Josef W. Segur
« Reply #18 on: 15 Feb 2011, 12:32:19 am »

Sandy Bridge AVX does have 256 bit packed single float operations, basically the VEX.256 encoding is available for all mathematical functions we might use. But I agree with Jason that the difficulty will be getting the data to and from memory. And I think it would be a mistake to believe Intel marketing hype and expect Sandy Bridge to challenge GPUs for S@H processing.

Still, there are parts of the vectorized code which are probably compute bound and will benefit from AVX, such as the MB dechirping. For the stock code, an analyzeFuncs_avx.cpp with dechirping and perhaps 8x8 transpose functions would be fairly straightforward.
                                                                                                  Joe
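
For readers unfamiliar with dechirping, here is a minimal scalar sketch in Python. It is not the stock S@H code: the function name, the `sample_rate` parameter, and the phase model (a linear drift of `chirp_rate` Hz/s removed by multiplying with exp(-i*pi*chirp_rate*t^2)) are assumptions for illustration. The point is only that each output depends on one input sample and its index, so the work is arithmetic-heavy per byte loaded, which is what makes it a plausible compute-bound candidate for wider vectors.

```python
import cmath
import math

def dechirp(samples, chirp_rate, sample_rate):
    """Remove a linear frequency drift (chirp_rate, in Hz/s) from complex samples.

    Hypothetical helper: assumes the drift contributes a phase of
    pi * chirp_rate * t^2, which is cancelled by the opposite rotation.
    """
    out = []
    for i, x in enumerate(samples):
        t = i / sample_rate                      # time of sample i in seconds
        phase = -math.pi * chirp_rate * t * t    # opposite of the assumed drift phase
        out.append(x * cmath.exp(1j * phase))    # one complex multiply per sample
    return out

# Usage: a synthetic chirped tone becomes (nearly) constant after dechirping.
rate, fs = 3.0, 64.0
chirped = [cmath.exp(1j * math.pi * rate * (i / fs) ** 2) for i in range(256)]
flat = dechirp(chirped, rate, fs)
max_dev = max(abs(z - 1.0) for z in flat)
print(f"largest deviation from 1+0j: {max_dev:.2e}")
```

A vectorized version would process several consecutive samples per instruction; the sine/cosine evaluation is usually the expensive part and is what an optimized routine would approximate or tabulate.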

Frizz
« Reply #19 on: 15 Feb 2011, 03:53:08 am »

I checked Intel's AVX examples on their web page and they really can operate on 8 x float in parallel ... stupid me, what was I thinking?

Sorry for getting confused yesterday ;)

It all comes down to this here:

Intel Sandy Bridge: 1 x 128 bit (SSE) or 1 x 256 bit (AVX) per clock cycle
AMD Bulldozer:      2 x 128 bit (SSE) or 1 x 256 bit (AVX) per clock cycle
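
Taking those figures at face value (they were pre-release claims, and say nothing about latency or memory bandwidth), the single-precision lanes per clock work out as in this small Python sketch. The helper name is made up for illustration.

```python
FLOAT_BITS = 32  # width of one single-precision float

def floats_per_cycle(units, width_bits):
    """Peak 32-bit float lanes per clock for `units` pipes of `width_bits` each."""
    return units * width_bits // FLOAT_BITS

# (units, width) pairs copied from the two lines above.
table = {
    ("Sandy Bridge", "SSE"): floats_per_cycle(1, 128),
    ("Sandy Bridge", "AVX"): floats_per_cycle(1, 256),
    ("Bulldozer", "SSE"): floats_per_cycle(2, 128),
    ("Bulldozer", "AVX"): floats_per_cycle(1, 256),
}

for (cpu, isa), lanes in table.items():
    print(f"{cpu:13s} {isa}: {lanes} floats per cycle")
```

On these numbers Sandy Bridge doubles its peak by moving from SSE to AVX, while Bulldozer reaches the same peak with either instruction set.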
« Last Edit: 15 Feb 2011, 03:55:52 am by Frizz »

Raistmer
« Reply #20 on: 15 Feb 2011, 04:03:54 am »

And now, are you sure about "per clock cycle" for both?
AMD is known for a very poor initial SSE3 implementation where SSE3 instructions, while supported, took too many cycles (because internally they were computed as 2x64 instead of 1x128) to be useful...

Frizz
« Reply #21 on: 15 Feb 2011, 04:55:28 am »

Quote from: Raistmer
And now, are you sure about "per clock cycle" for both?

As sure as I can be without having the actual piece of hardware in my hands ;)

John Fruehe/AMD: "The Flex FP unit is built on two 128-bit FMAC units. The FMAC building blocks are quite robust on their own. Each FMAC can do an FMAC, FADD or a FMUL per cycle."

computerbase.de (translated from German): "So for 'Sandy Bridge' this means: per functional unit and clock, either 1× 128-bit (SSE) or 1× 256-bit (AVX) wide instructions can be processed. The expected competition from AMD is more clever here: 'Bulldozer' handles either a full 256 bits or 2× 128 bits per clock; however, the unit, called Flex FP, is shared by two cores within a 'Bulldozer' module."


EDIT: Who knows what will happen to AMD, Bulldozer, etc. in the near future (AMD Pops 5 % On Dell Takeover Rumor)
« Last Edit: 15 Feb 2011, 06:06:09 am by Frizz »

Josef W. Segur
« Reply #22 on: 28 Apr 2011, 08:26:32 pm »

I've done some coding using AVX intrinsics for possible addition to the S@H v7 at S@H Beta, and of course here too. But I have not yet succeeded in getting either of the emulation capabilities from Intel working, so I'm just going to post a test here. It's basically the 'optimal function test' section of the stock code separated out, and it runs like this on my Win2k Pentium-M laptop:

Code:
=========================================================
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00129 0.00000  test
             v_vGetPowerSpectrum 0.00076 0.00000  test
            v_vGetPowerSpectrum2 0.00126 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00073 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00126 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00073 0.00000  choice

                     v_ChirpData 0.05096 0.00000  test
                   fpu_ChirpData 0.05843 0.00000  test
               fpu_opt_ChirpData 0.05117 0.00000  test
             v_vChirpData_x86_64 0.16249 0.00000  test
               sse1_ChirpData_ak 0.03466 0.00000  test
               sse2_ChirpData_ak 0.02976 0.00000  test
               sse2_ChirpData_ak 0.02976 0.00000  choice

                     v_Transpose 0.12368 0.00000  test
                    v_Transpose2 0.06344 0.00000  test
                    v_Transpose4 0.03413 0.00000  test
                    v_Transpose8 0.05463 0.00000  test
                  v_pfTranspose2 0.06328 0.00000  test
                  v_pfTranspose4 0.03372 0.00000  test
                  v_pfTranspose8 0.05253 0.00000  test
                   v_vTranspose4 0.03367 0.00000  test
                 v_vTranspose4np 0.03455 0.00000  test
                v_vTranspose4ntw 0.02493 0.00000  test
              v_vTranspose4x8ntw 0.02046 0.00000  test
             v_vTranspose4x16ntw 0.02077 0.00000  test
            v_vpfTranspose8x4ntw 0.02486 0.00000  test
              v_vTranspose4x8ntw 0.02046 0.00000  choice

                 FPU opt folding 0.00624 0.00000  test
                  AK SSE folding 0.00266 0.00000  test
                  BH SSE folding 0.00248 0.00000  test
                  BH SSE folding 0.00248 0.00000  choice

                   Test duration   13.79 seconds

Ftst_v7 completed successfully.

That output is appended to a stderr.txt file for each invocation of the program. With an AVX capable CPU and Win7 SP1 there should also be an AVX PowerSpectrum function, two AVX Chirp functions, and two AVX Transpose functions.

It's a 32-bit console-mode program. After extracting it from the 7zip archive to a convenient folder you can just double-click it, and it will create a console window with "Ftst_v7 starting...." at the top. In that case the window will close when the program finishes. If you prefer to first open an "MS-DOS prompt" window and run from there, you'd see something like:

C:\Test>Ftst_v7_6.91_J28_W32
Ftst_v7 starting....
Ftst_v7 completed, details appended to stderr.txt.

C:\Test>


Assuming it runs and doesn't crash on appropriate systems, I'm interested in seeing whether there's a significant speedup and whether I've gotten the right output data where it should go so the 'error' terms are acceptable.
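
The general idea of such a test (time every candidate, compare its output with a reference, pick the fastest one whose error is acceptable) can be sketched in Python as below. This is an illustration only, not the stock selection code: the tolerance, the repeat count, the max-absolute-difference error measure, and all function names are assumptions.

```python
import time

def pick_fastest(candidates, reference, data, tolerance=1e-5, repeats=20):
    """Time each candidate and choose the fastest one that matches the reference.

    Returns (results, choice); each entry is (name, seconds_per_call, max_abs_error).
    """
    expected = reference(data)
    results = []
    for name, func in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            out = func(data)
        elapsed = (time.perf_counter() - start) / repeats
        error = max(abs(a - b) for a, b in zip(out, expected))
        results.append((name, elapsed, error))
    # A fast routine with wrong output must never win.
    valid = [r for r in results if r[2] <= tolerance]
    choice = min(valid, key=lambda r: r[1])
    return results, choice

def power_loop(pairs):
    out = []
    for re, im in pairs:
        out.append(re * re + im * im)
    return out

def power_comprehension(pairs):
    return [re * re + im * im for re, im in pairs]

def power_broken(pairs):
    # Deliberately wrong: drops the imaginary part.
    return [re * re for re, im in pairs]

data = [(0.5 * i, 0.25 * i + 1.0) for i in range(1000)]
candidates = [
    ("power_loop", power_loop),
    ("power_comprehension", power_comprehension),
    ("power_broken", power_broken),
]
results, choice = pick_fastest(candidates, power_loop, data)
for name, secs, err in results:
    print(f"{name:>22s} {secs:.6f} {err:.5f}  test")
print(f"{choice[0]:>22s} {choice[1]:.6f} {choice[2]:.5f}  choice")
```

The broken variant may well post the best time, but its nonzero error keeps it out of the running.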

It runs at normal priority, so it won't be impacted by CPU tasks being run by BOINC, but GPU tasks with the -hp priority boost that some of Raistmer's builds support could affect timings. Just run it several times in that case.
                                                                                                 Joe

Edit: attachment deleted, see later post for an updated test.
« Last Edit: 01 May 2011, 12:03:19 am by Josef W. Segur »

Jason G
« Reply #23 on: 28 Apr 2011, 08:45:18 pm »

oooh, my wallet just twinged...

arkayn
« Reply #24 on: 28 Apr 2011, 09:44:19 pm »

Runs fine on my Q8200
Code:
=========================================================
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00050 0.00000  test
             v_vGetPowerSpectrum 0.00030 0.00000  test
            v_vGetPowerSpectrum2 0.00021 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00017 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00020 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00017 0.00000  choice

                     v_ChirpData 0.01733 0.00000  test
                   fpu_ChirpData 0.02611 0.00000  test
               fpu_opt_ChirpData 0.01718 0.00000  test
             v_vChirpData_x86_64 0.08318 0.00000  test
               sse1_ChirpData_ak 0.01189 0.00000  test
               sse2_ChirpData_ak 0.01225 0.00000  test
               sse3_ChirpData_ak 0.01158 0.00000  test
               sse3_ChirpData_ak 0.01158 0.00000  choice

                     v_Transpose 0.04329 0.00000  test
                    v_Transpose2 0.02241 0.00000  test
                    v_Transpose4 0.01175 0.00000  test
                    v_Transpose8 0.01840 0.00000  test
                  v_pfTranspose2 0.02277 0.00000  test
                  v_pfTranspose4 0.01191 0.00000  test
                  v_pfTranspose8 0.01807 0.00000  test
                   v_vTranspose4 0.01170 0.00000  test
                 v_vTranspose4np 0.01159 0.00000  test
                v_vTranspose4ntw 0.00818 0.00000  test
              v_vTranspose4x8ntw 0.00862 0.00000  test
             v_vTranspose4x16ntw 0.00624 0.00000  test
            v_vpfTranspose8x4ntw 0.00836 0.00000  test
             v_vTranspose4x16ntw 0.00624 0.00000  choice

                 FPU opt folding 0.00344 0.00000  test
                  AK SSE folding 0.00124 0.00000  test
                  BH SSE folding 0.00121 0.00000  test
                  BH SSE folding 0.00121 0.00000  choice

                   Test duration    6.02 seconds

Ftst_v7 completed successfully.

Jason G
« Reply #25 on: 28 Apr 2011, 10:16:04 pm »

Similar result here on the E8400 (of course). Darn, now I'm CPU shopping.

* stderr.txt (2.16 KB - downloaded 114 times.)

Josef W. Segur
« Reply #26 on: 28 Apr 2011, 10:42:21 pm »

Quote from: arkayn
Runs fine on my Q8200
...

Thanks, that's a better basis for comparison since it includes the SSE3 chirp which 'most everyone will see. And although I'm not particularly concerned about the 13 lines of assembly code which check the CPU and OS to decide whether AVX is supported, confirmation that Win7 SP1 by itself isn't enough is good.
                                                                                                 Joe
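
The assembly itself is not shown in this thread. The check Intel documents for AVX has three parts: CPUID leaf 1 must report AVX (ECX bit 28) and OSXSAVE (ECX bit 27), and XGETBV must show that the OS has enabled both SSE and AVX register state in XCR0 (bits 1 and 2). Python cannot execute CPUID or XGETBV directly, so the sketch below models only the decision logic on register values passed in; the function and constant names are hypothetical.

```python
CPUID1_ECX_OSXSAVE = 1 << 27   # OS has enabled XSAVE/XGETBV
CPUID1_ECX_AVX = 1 << 28       # CPU implements AVX
XCR0_SSE_STATE = 1 << 1        # OS saves/restores XMM registers
XCR0_AVX_STATE = 1 << 2        # OS saves/restores the upper YMM halves

def avx_usable(cpuid1_ecx, read_xcr0):
    """Decide whether AVX code may run, given CPUID.1:ECX and an XCR0 reader.

    `read_xcr0` is a callable so it is only invoked when OSXSAVE is set,
    mirroring the fact that XGETBV is not safe to execute otherwise.
    """
    if not cpuid1_ecx & CPUID1_ECX_AVX:
        return False                      # CPU has no AVX at all
    if not cpuid1_ecx & CPUID1_ECX_OSXSAVE:
        return False                      # cannot query OS support
    mask = XCR0_SSE_STATE | XCR0_AVX_STATE
    return (read_xcr0() & mask) == mask   # OS must preserve XMM and YMM state

# Usage with made-up register values:
no_avx_cpu = avx_usable(0, lambda: 0b111)
avx_cpu_old_os = avx_usable(CPUID1_ECX_AVX, lambda: 0b111)
avx_cpu_ymm_disabled = avx_usable(CPUID1_ECX_AVX | CPUID1_ECX_OSXSAVE, lambda: 0b011)
avx_cpu_new_os = avx_usable(CPUID1_ECX_AVX | CPUID1_ECX_OSXSAVE, lambda: 0b111)
print(no_avx_cpu, avx_cpu_old_os, avx_cpu_ymm_disabled, avx_cpu_new_os)
```

The first case corresponds to the situation above: an AVX-aware OS on a CPU without AVX still has to fall back to the SSE paths.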

Josef W. Segur
« Reply #27 on: 29 Apr 2011, 11:17:30 am »

From dnolan via PM at NC, result on his i7 2600 w/W7 64 SP1:

Code:
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00010 0.00000  test
             v_vGetPowerSpectrum 0.00005 0.00000  test
            v_vGetPowerSpectrum2 0.00006 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00005 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00007 0.00000  test
           v_avxGetPowerSpectrum 0.00004 38.07197  test
     v_vGetPowerSpectrumUnrolled 0.00005 0.00000  choice

                     v_ChirpData 0.00444 0.00000  test
                   fpu_ChirpData 0.01053 0.00000  test
               fpu_opt_ChirpData 0.00444 0.00000  test
             v_vChirpData_x86_64 0.05060 0.00000  test
               sse1_ChirpData_ak 0.00590 0.00000  test
               sse2_ChirpData_ak 0.00567 0.00000  test
               sse3_ChirpData_ak 0.00556 0.00000  test
                 avx_ChirpData_a 0.00230 0.85637  test
                 avx_ChirpData_b 0.00231 0.85637  test
                     v_ChirpData 0.00444 0.00000  choice

                     v_Transpose 0.00270 0.00000  test
                    v_Transpose2 0.00292 0.00000  test
                    v_Transpose4 0.00149 0.00000  test
                    v_Transpose8 0.00271 0.00000  test
                  v_pfTranspose2 0.00161 0.00000  test
                  v_pfTranspose4 0.00149 0.00000  test
                  v_pfTranspose8 0.00313 0.00000  test
                   v_vTranspose4 0.00088 0.00000  test
                 v_vTranspose4np 0.00114 0.00000  test
                v_vTranspose4ntw 0.00716 0.00000  test
              v_vTranspose4x8ntw 0.00298 0.00000  test
             v_vTranspose4x16ntw 0.00085 0.00000  test
            v_vpfTranspose8x4ntw 0.00719 0.00000  test
            v_avxTranspose8x4ntw 0.00299 0.00000  test
            v_avxTranspose8x8ntw 0.00232 9696326.77324  test
             v_vTranspose4x16ntw 0.00085 0.00000  choice

                 FPU opt folding 0.00204 0.00000  test
                  AK SSE folding 0.00045 0.00000  test
                  BH SSE folding 0.00043 0.00000  test
                  BH SSE folding 0.00043 0.00000  choice

                   Test duration    2.53 seconds

Ftst_v7 completed successfully.

Nice speedups on the Chirp functions, but I obviously need to rework data shuffling.
                                                                                                       Joe
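
To put numbers on the chirp speedup, the timing column can be read straight out of the log. The Python sketch below uses three 'test' lines copied from the output above; the parser is a hypothetical helper that assumes every result line ends with timing, error, and the word "test".

```python
def parse_results(text):
    """Map function name -> (timing, error) for every line tagged 'test'."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[-1] == "test":
            name = " ".join(parts[:-3])   # names such as 'FPU opt folding' contain spaces
            rows[name] = (float(parts[-3]), float(parts[-2]))
    return rows

log = """\
                     v_ChirpData 0.00444 0.00000  test
               sse3_ChirpData_ak 0.00556 0.00000  test
                 avx_ChirpData_a 0.00230 0.85637  test
                     v_ChirpData 0.00444 0.00000  choice
"""

rows = parse_results(log)
avx_time, avx_error = rows["avx_ChirpData_a"]
for name in ("v_ChirpData", "sse3_ChirpData_ak"):
    ratio = rows[name][0] / avx_time
    print(f"{name} takes {ratio:.2f}x as long as avx_ChirpData_a")
print(f"avx_ChirpData_a error: {avx_error}")
```

That gives roughly 1.9x against the chosen v_ChirpData and 2.4x against the SSE3 version, with the caveat that the nonzero error column means the AVX output did not yet match the reference, so the timing is provisional.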

Jason G
« Reply #28 on: 29 Apr 2011, 11:44:39 am »

Quote from: Josef W. Segur
Nice speedups on the Chirp functions, but I obviously need to rework data shuffling.

Numbered bottlecaps help with that for me.  Good to see some hints that with work the architecture additions may perform very well.

Jason

Claggy
« Reply #29 on: 29 Apr 2011, 12:35:07 pm »

Quote from: Jason G
Similar result here on the E8400 (of course). Darn, now I'm CPU shopping.

This is what an E8500 @ 4.14GHz gets (with BOINC, v7 Seti Beta CPU apps, an NV Seti CUDA MB app and an ATI OpenCL Seti MB app running) (ran it 5 times):

Code:
Ftst_v7 started.

Optimal function choices:
-------------------------------------------------------
                            name  timing   error
-------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.00013 0.00000  test
             v_vGetPowerSpectrum 0.00006 0.00000  test
            v_vGetPowerSpectrum2 0.00006 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00005 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.00006 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.00005 0.00000  choice

                     v_ChirpData 0.03146 0.00000  test
                   fpu_ChirpData 0.01685 0.00000  test
               fpu_opt_ChirpData 0.02659 0.00000  test
             v_vChirpData_x86_64 0.04977 0.00000  test
               sse1_ChirpData_ak 0.00881 0.00000  test
               sse2_ChirpData_ak 0.00886 0.00000  test
               sse3_ChirpData_ak 0.00829 0.00000  test
               sse3_ChirpData_ak 0.00829 0.00000  choice

                     v_Transpose 0.00389 0.00000  test
                    v_Transpose2 0.00476 0.00000  test
                    v_Transpose4 0.00464 0.00000  test
                    v_Transpose8 0.01212 0.00000  test
                  v_pfTranspose2 0.00397 0.00000  test
                  v_pfTranspose4 0.00477 0.00000  test
                  v_pfTranspose8 0.01263 0.00000  test
                   v_vTranspose4 0.00396 0.00000  test
                 v_vTranspose4np 0.00585 0.00000  test
                v_vTranspose4ntw 0.00690 0.00000  test
              v_vTranspose4x8ntw 0.00649 0.00000  test
             v_vTranspose4x16ntw 0.00532 0.00000  test
            v_vpfTranspose8x4ntw 0.00568 0.00000  test
                     v_Transpose 0.00389 0.00000  choice

                 FPU opt folding 0.00194 0.00000  test
                  AK SSE folding 0.00072 0.00000  test
                  BH SSE folding 0.00071 0.00000  test
                  BH SSE folding 0.00071 0.00000  choice

                   Test duration    4.21 seconds

Ftst_v7 completed successfully.

Claggy

* stderr.7z (0.99 KB - downloaded 101 times.)
« Last Edit: 01 May 2011, 08:13:35 pm by Claggy »