Topic: AVX Optimized App Development (Read 33664 times)
Raistmer
It can depend on how many cycles the CPU uses to do the same operation via an AVX register and via an XMM register. Even if it does the same 4 operations, the speed could be different. An instruction set per se, without knowledge of the cost of each operation in CPU cycles, means nothing.
Frizz
Quote from Raistmer: It can depend on how many cycles the CPU uses to do the same operation via an AVX register and via an XMM register. Even if it does the same 4 operations, the speed could be different. An instruction set per se, without knowledge of the cost of each operation in CPU cycles, means nothing.
That's true. Assuming both architectures use about the same number of CPU cycles, Bulldozer has at least the potential to be 2x faster compared to "old" SSE, while for Intel it won't matter.
By the way, I'm still thinking about Jason's comment ("16x or 8x 32 bit wide FPUs working on this code would be starving either way"). So true, and I still have to get used to it. What I've learnt from my OpenCL experiments: "Keep the ALUs busy at all cost - avoid memory access". I guess that will be true for SSE/AVX too.
Josef W. Segur
Sandy Bridge AVX does have 256 bit packed single float operations; basically the VEX.256 encoding is available for all mathematical functions we might use. But I agree with Jason that the difficulty will be getting the data to and from memory. And I think it would be a mistake to believe Intel marketing hype and expect Sandy Bridge to challenge GPUs for S@H processing.
Still, there are parts of the vectorized code which are probably compute bound and will benefit from AVX, such as the MB dechirping. For the stock code, an analyzeFuncs_avx.cpp with dechirping and perhaps 8x8 transpose functions would be fairly straightforward.
Joe
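To make the "compute bound" point concrete, here is a minimal scalar sketch of what a dechirp loop does: each sample is rotated by a phase that grows quadratically with time. The function name `dechirp`, its parameters, and the sign and scaling of the phase are assumptions for illustration, not the stock S@H code. Every sample is independent, which is why eight of them would fit one 256-bit operation.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for a dechirp step. Names and the phase
// convention are assumptions for illustration only.
std::vector<std::complex<float>> dechirp(
    const std::vector<std::complex<float>>& in,
    double chirp_rate,   // assumed unit: Hz per second
    double sample_rate)  // assumed unit: samples per second
{
    const double two_pi = 6.283185307179586;
    std::vector<std::complex<float>> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const double t = static_cast<double>(i) / sample_rate;
        // Quadratic phase of a linear chirp, counted in turns.
        double turns = 0.5 * chirp_rate * t * t;
        turns -= std::floor(turns);  // keep only the fractional turn
        const double ang = two_pi * turns;
        // One sin/cos pair and one complex multiply per sample:
        // a lot of arithmetic for every 8 bytes read and written.
        const std::complex<float> rot(static_cast<float>(std::cos(ang)),
                                      static_cast<float>(std::sin(ang)));
        out[i] = in[i] * rot;
    }
    return out;
}
```

With a chirp rate of zero the rotation is the identity, which gives a simple sanity check for any vectorized variant compared against this reference.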
Frizz
I checked Intel's AVX examples on their web page and they really can operate on 8 x float in parallel ... stupid me, what was I thinking? Sorry for getting confused yesterday. It all comes down to this:
Intel Sandy Bridge: 1 x 128 bit (SSE) or 1 x 256 bit (AVX) per clock cycle
AMD Bulldozer: 2 x 128 bit (SSE) or 1 x 256 bit (AVX) per clock cycle
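Taking the per-clock figures in this post at face value (they are claims from pre-release material, not measurements), the peak number of 32-bit floats handled per clock works out as below. This is only a sketch of the arithmetic; `floats_per_clock` and the constant names are hypothetical.

```cpp
// Peak single-precision lanes per clock, computed from the figures quoted
// in the post above. These are claimed issue rates, not measured ones.
constexpr int floats_per_clock(int ops_per_clock, int width_bits) {
    return ops_per_clock * (width_bits / 32);  // 32 bits per float
}

constexpr int sandy_bridge_sse = floats_per_clock(1, 128);  // 1 x 128 bit
constexpr int sandy_bridge_avx = floats_per_clock(1, 256);  // 1 x 256 bit
constexpr int bulldozer_sse    = floats_per_clock(2, 128);  // 2 x 128 bit
constexpr int bulldozer_avx    = floats_per_clock(1, 256);  // 1 x 256 bit
```

On these numbers AVX doubles the peak for Sandy Bridge (4 to 8 floats), while Bulldozer reaches 8 floats with either SSE or AVX. Peak width alone says nothing about instruction latency or memory traffic.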
« Last Edit: 15 Feb 2011, 03:55:52 am by Frizz »
Raistmer
And now, are you sure about "per clock cycle" for both? AMD is known for a very poor initial SSE3 implementation, where SSE3 instructions, while supported, took too many cycles (because internally they were computed as 2x64 instead of 1x128) to be useful...
Frizz
Quote from Raistmer: And now, are you sure about "per clock cycle" for both?
As sure as I can be without having the actual piece of hardware in my hands.
John Fruehe/AMD: "The Flex FP unit is built on two 128-bit FMAC units. The FMAC building blocks are quite robust on their own. Each FMAC can do an FMAC, FADD or a FMUL per cycle."
computerbase.de (translated from German): "So for 'Sandy Bridge': per functional unit and clock, either 1x 128-bit (SSE) or 1x 256-bit (AVX) wide instructions can be processed. The expected competition from AMD is cleverer here: 'Bulldozer' addresses either a full 256 bits or 2x 128 bits per clock; however, the unit, called Flex FP, is shared by two cores within a 'Bulldozer' module."
EDIT: Who knows what will happen to AMD, Bulldozer, etc. in the near future (AMD Pops 5 % On Dell Takeover Rumor)
« Last Edit: 15 Feb 2011, 06:06:09 am by Frizz »
Josef W. Segur
I've done some coding using AVX intrinsics for possible addition to the S@H v7 at S@H Beta, and of course here too. But I have not yet succeeded in getting either of the emulation capabilities from Intel working, so I'm just going to post a test here. It's basically the 'optimal function test' section of the stock code separated out, and runs like this on my Win2k Pentium-M laptop:

=========================================================
Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing   error
-------------------------------------------------------
v_BaseLineSmooth (no other)

v_GetPowerSpectrum            0.00129  0.00000  test
v_vGetPowerSpectrum           0.00076  0.00000  test
v_vGetPowerSpectrum2          0.00126  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00073  0.00000  test
v_vGetPowerSpectrumUnrolled2  0.00126  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00073  0.00000  choice

v_ChirpData                   0.05096  0.00000  test
fpu_ChirpData                 0.05843  0.00000  test
fpu_opt_ChirpData             0.05117  0.00000  test
v_vChirpData_x86_64           0.16249  0.00000  test
sse1_ChirpData_ak             0.03466  0.00000  test
sse2_ChirpData_ak             0.02976  0.00000  test
sse2_ChirpData_ak             0.02976  0.00000  choice

v_Transpose                   0.12368  0.00000  test
v_Transpose2                  0.06344  0.00000  test
v_Transpose4                  0.03413  0.00000  test
v_Transpose8                  0.05463  0.00000  test
v_pfTranspose2                0.06328  0.00000  test
v_pfTranspose4                0.03372  0.00000  test
v_pfTranspose8                0.05253  0.00000  test
v_vTranspose4                 0.03367  0.00000  test
v_vTranspose4np               0.03455  0.00000  test
v_vTranspose4ntw              0.02493  0.00000  test
v_vTranspose4x8ntw            0.02046  0.00000  test
v_vTranspose4x16ntw           0.02077  0.00000  test
v_vpfTranspose8x4ntw          0.02486  0.00000  test
v_vTranspose4x8ntw            0.02046  0.00000  choice

FPU opt folding               0.00624  0.00000  test
AK SSE folding                0.00266  0.00000  test
BH SSE folding                0.00248  0.00000  test
BH SSE folding                0.00248  0.00000  choice

Test duration 13.79 seconds
Ftst_v7 completed successfully.

That output is appended to a stderr.txt file for each invocation of the program. With an AVX capable CPU and Win7 SP1 there should also be an AVX PowerSpectrum function, two AVX Chirp functions, and two AVX Transpose functions.

It's a 32 bit console mode program. After extracting it from the 7zip archive to a convenient folder you can just double click and it will create a console window with "Ftst_v7 starting...." at the top. In that case when the program finishes its window will close. If you prefer to first open an "MS-DOS prompt" window and run from there you'd see something like:

C:\Test>Ftst_v7_6.91_J28_W32
Ftst_v7 starting....
Ftst_v7 completed, details appended to stderr.txt.
C:\Test>

Assuming it runs and doesn't crash on appropriate systems, I'm interested in seeing whether there's a significant speedup and whether I've gotten the right output data where it should go so the 'error' terms are acceptable. It runs at normal priority, so won't be impacted by CPU tasks being run by BOINC, but GPU tasks with the -hp priority boost some of Raistmer's builds support could affect timings. Just run it several times in that case.
Joe
Edit: attachment deleted, see later post for an updated test.
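For readers unfamiliar with the "test"/"choice" lines, the idea can be sketched as: time every candidate implementation, compare its output with a reference, and pick the fastest one whose error is acceptable. The code below is only a rough model under assumptions. `Candidate`, `choose_fastest`, the summed absolute error metric, and the use of the first candidate as reference are all hypothetical and not taken from the stock code.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical candidate: a label plus a function that fills `out` from `in`.
struct Candidate {
    std::string name;
    std::function<void(const std::vector<float>&, std::vector<float>&)> fn;
};

// Time each candidate, compare its output with candidate 0 (the assumed
// reference), and return the index of the fastest one within max_error.
std::size_t choose_fastest(const std::vector<Candidate>& cands,
                           const std::vector<float>& in,
                           double max_error, int repeats = 100)
{
    std::vector<float> ref(in.size());
    cands.at(0).fn(in, ref);

    std::size_t best = 0;
    double best_time = -1.0;
    for (std::size_t c = 0; c < cands.size(); ++c) {
        std::vector<float> out(in.size());
        const auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r) cands[c].fn(in, out);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;

        // Assumed error metric: summed absolute difference to the reference.
        double err = 0.0;
        for (std::size_t i = 0; i < in.size(); ++i)
            err += std::fabs(static_cast<double>(out[i]) - ref[i]);

        if (err > max_error) continue;  // fast but wrong: never chosen
        if (best_time < 0.0 || elapsed.count() < best_time) {
            best = c;
            best_time = elapsed.count();
        }
    }
    return best;
}
```

Under this model a function that is faster but produces a nonzero error above the threshold is listed as tested yet never becomes the choice.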
« Last Edit: 01 May 2011, 12:03:19 am by Josef W. Segur »
Jason G
oooh, my wallet just twinged...
arkayn
Runs fine on my Q8200

=========================================================
Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing   error
-------------------------------------------------------
v_BaseLineSmooth (no other)

v_GetPowerSpectrum            0.00050  0.00000  test
v_vGetPowerSpectrum           0.00030  0.00000  test
v_vGetPowerSpectrum2          0.00021  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00017  0.00000  test
v_vGetPowerSpectrumUnrolled2  0.00020  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00017  0.00000  choice

v_ChirpData                   0.01733  0.00000  test
fpu_ChirpData                 0.02611  0.00000  test
fpu_opt_ChirpData             0.01718  0.00000  test
v_vChirpData_x86_64           0.08318  0.00000  test
sse1_ChirpData_ak             0.01189  0.00000  test
sse2_ChirpData_ak             0.01225  0.00000  test
sse3_ChirpData_ak             0.01158  0.00000  test
sse3_ChirpData_ak             0.01158  0.00000  choice

v_Transpose                   0.04329  0.00000  test
v_Transpose2                  0.02241  0.00000  test
v_Transpose4                  0.01175  0.00000  test
v_Transpose8                  0.01840  0.00000  test
v_pfTranspose2                0.02277  0.00000  test
v_pfTranspose4                0.01191  0.00000  test
v_pfTranspose8                0.01807  0.00000  test
v_vTranspose4                 0.01170  0.00000  test
v_vTranspose4np               0.01159  0.00000  test
v_vTranspose4ntw              0.00818  0.00000  test
v_vTranspose4x8ntw            0.00862  0.00000  test
v_vTranspose4x16ntw           0.00624  0.00000  test
v_vpfTranspose8x4ntw          0.00836  0.00000  test
v_vTranspose4x16ntw           0.00624  0.00000  choice

FPU opt folding               0.00344  0.00000  test
AK SSE folding                0.00124  0.00000  test
BH SSE folding                0.00121  0.00000  test
BH SSE folding                0.00121  0.00000  choice

Test duration 6.02 seconds
Ftst_v7 completed successfully.
Jason G
Similar result here on the E8400 (of course). Darn, now I'm CPU shopping 
Josef W. Segur
Quote from arkayn: Runs fine on my Q8200 ...
Thanks, that's a better basis for comparison since it includes the SSE3 chirp which 'most everyone will see. And although I'm not particularly concerned about the 13 lines of assembly code which check the CPU and OS to decide whether AVX is supported, confirmation that Win7 SP1 by itself isn't enough is good.
Joe
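The usual logic behind such a check can be sketched as follows. This is not the 13 lines of assembly mentioned above, just a C++ rendering of the documented rule: the CPU must report AVX and OSXSAVE in CPUID leaf 1, and the OS must have enabled XMM and YMM state saving in XCR0. The function names `avx_usable` and `detect_avx` are hypothetical, and the hardware query assumes GCC or Clang on x86.

```cpp
#include <cstdint>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// Pure decision logic: takes CPUID.1:ECX and XCR0 as plain numbers.
bool avx_usable(std::uint32_t cpuid1_ecx, std::uint64_t xcr0) {
    const bool osxsave = ((cpuid1_ecx >> 27) & 1u) != 0;  // OS uses XSAVE/XGETBV
    const bool avx     = ((cpuid1_ecx >> 28) & 1u) != 0;  // CPU implements AVX
    const bool ymm_on  = (xcr0 & 0x6u) == 0x6u;           // XMM and YMM state enabled
    return osxsave && avx && ymm_on;
}

// Hardware query; assumes GCC/Clang on x86, reports false elsewhere.
bool detect_avx() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    if (((ecx >> 27) & 1u) == 0) return false;  // XGETBV would fault without OSXSAVE
    unsigned int lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return avx_usable(ecx, (static_cast<std::uint64_t>(hi) << 32) | lo);
#else
    return false;  // other compilers or architectures are not handled here
#endif
}
```

Both halves matter: a CPU without the AVX bit fails the test no matter which OS is installed, and an AVX CPU fails it when the OS has not enabled YMM state.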
Josef W. Segur
From dnolan via PM at NC, result on his i7 2600 w/W7 64 SP1:

Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing           error
-------------------------------------------------------
v_BaseLineSmooth (no other)

v_GetPowerSpectrum            0.00010        0.00000  test
v_vGetPowerSpectrum           0.00005        0.00000  test
v_vGetPowerSpectrum2          0.00006        0.00000  test
v_vGetPowerSpectrumUnrolled   0.00005        0.00000  test
v_vGetPowerSpectrumUnrolled2  0.00007        0.00000  test
v_avxGetPowerSpectrum         0.00004       38.07197  test
v_vGetPowerSpectrumUnrolled   0.00005        0.00000  choice

v_ChirpData                   0.00444        0.00000  test
fpu_ChirpData                 0.01053        0.00000  test
fpu_opt_ChirpData             0.00444        0.00000  test
v_vChirpData_x86_64           0.05060        0.00000  test
sse1_ChirpData_ak             0.00590        0.00000  test
sse2_ChirpData_ak             0.00567        0.00000  test
sse3_ChirpData_ak             0.00556        0.00000  test
avx_ChirpData_a               0.00230        0.85637  test
avx_ChirpData_b               0.00231        0.85637  test
v_ChirpData                   0.00444        0.00000  choice

v_Transpose                   0.00270        0.00000  test
v_Transpose2                  0.00292        0.00000  test
v_Transpose4                  0.00149        0.00000  test
v_Transpose8                  0.00271        0.00000  test
v_pfTranspose2                0.00161        0.00000  test
v_pfTranspose4                0.00149        0.00000  test
v_pfTranspose8                0.00313        0.00000  test
v_vTranspose4                 0.00088        0.00000  test
v_vTranspose4np               0.00114        0.00000  test
v_vTranspose4ntw              0.00716        0.00000  test
v_vTranspose4x8ntw            0.00298        0.00000  test
v_vTranspose4x16ntw           0.00085        0.00000  test
v_vpfTranspose8x4ntw          0.00719        0.00000  test
v_avxTranspose8x4ntw          0.00299        0.00000  test
v_avxTranspose8x8ntw          0.00232  9696326.77324  test
v_vTranspose4x16ntw           0.00085        0.00000  choice

FPU opt folding               0.00204        0.00000  test
AK SSE folding                0.00045        0.00000  test
BH SSE folding                0.00043        0.00000  test
BH SSE folding                0.00043        0.00000  choice

Test duration 2.53 seconds
Ftst_v7 completed successfully.

Nice speedups on the Chirp functions, but I obviously need to rework data shuffling.
Joe
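The nonzero error values on the AVX rows suggest data landing in the wrong place. The thread does not say what the actual bug was. One common pitfall when widening SSE shuffles to AVX, shown here purely as an illustration, is that most 256-bit shuffle and unpack operations work separately within each 128-bit half. The scalar model below imitates the documented behaviour of `_mm256_unpacklo_ps` next to what a "wider SSE" mental model would expect; `Vec8`, `unpacklo_in_lane` and `unpacklo_naive` are hypothetical names.

```cpp
#include <array>

using Vec8 = std::array<float, 8>;  // stands in for one 256-bit register

// Scalar model of the 256-bit unpack-low: interleaves the low two floats
// of a and b separately inside each 128-bit lane.
Vec8 unpacklo_in_lane(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (int lane = 0; lane < 2; ++lane) {
        const int base = lane * 4;  // first element of this 128-bit lane
        r[base + 0] = a[base + 0];
        r[base + 1] = b[base + 0];
        r[base + 2] = a[base + 1];
        r[base + 3] = b[base + 1];
    }
    return r;
}

// What one might expect if AVX were simply SSE at twice the width:
// interleave the low four floats of a and b across the whole register.
Vec8 unpacklo_naive(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (int i = 0; i < 4; ++i) {
        r[2 * i]     = a[i];
        r[2 * i + 1] = b[i];
    }
    return r;
}
```

For a = {0..7} and b = {10..17} the in-lane version yields 0, 10, 1, 11, 4, 14, 5, 15, while the naive model yields 0, 10, 1, 11, 2, 12, 3, 13. An 8x8 transpose built on the naive assumption would scatter half of its elements.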
Jason G
Quote from Josef W. Segur: Nice speedups on the Chirp functions, but I obviously need to rework data shuffling.
Numbered bottlecaps help with that for me. Good to see some hints that, with work, the architecture additions may perform very well.
Jason
Claggy
Quote from Jason G: Similar result here on the E8400 (of course). Darn, now I'm CPU shopping
This is what an E8500 @ 4.14GHz gets (with Boinc, v7 Seti Beta CPU apps, an NV Seti Cuda MB app and an ATI OpenCL Seti MB app running; ran it 5 times):

Ftst_v7 started.
Optimal function choices:
-------------------------------------------------------
name                          timing   error
-------------------------------------------------------
v_BaseLineSmooth (no other)

v_GetPowerSpectrum            0.00013  0.00000  test
v_vGetPowerSpectrum           0.00006  0.00000  test
v_vGetPowerSpectrum2          0.00006  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00005  0.00000  test
v_vGetPowerSpectrumUnrolled2  0.00006  0.00000  test
v_vGetPowerSpectrumUnrolled   0.00005  0.00000  choice

v_ChirpData                   0.03146  0.00000  test
fpu_ChirpData                 0.01685  0.00000  test
fpu_opt_ChirpData             0.02659  0.00000  test
v_vChirpData_x86_64           0.04977  0.00000  test
sse1_ChirpData_ak             0.00881  0.00000  test
sse2_ChirpData_ak             0.00886  0.00000  test
sse3_ChirpData_ak             0.00829  0.00000  test
sse3_ChirpData_ak             0.00829  0.00000  choice

v_Transpose                   0.00389  0.00000  test
v_Transpose2                  0.00476  0.00000  test
v_Transpose4                  0.00464  0.00000  test
v_Transpose8                  0.01212  0.00000  test
v_pfTranspose2                0.00397  0.00000  test
v_pfTranspose4                0.00477  0.00000  test
v_pfTranspose8                0.01263  0.00000  test
v_vTranspose4                 0.00396  0.00000  test
v_vTranspose4np               0.00585  0.00000  test
v_vTranspose4ntw              0.00690  0.00000  test
v_vTranspose4x8ntw            0.00649  0.00000  test
v_vTranspose4x16ntw           0.00532  0.00000  test
v_vpfTranspose8x4ntw          0.00568  0.00000  test
v_Transpose                   0.00389  0.00000  choice

FPU opt folding               0.00194  0.00000  test
AK SSE folding                0.00072  0.00000  test
BH SSE folding                0.00071  0.00000  test
BH SSE folding                0.00071  0.00000  choice

Test duration 4.21 seconds
Ftst_v7 completed successfully.

Claggy
« Last Edit: 01 May 2011, 08:13:35 pm by Claggy »