Visual C 2005 more than 3 times slower than Visual C 2003 !

While recompiling my cryptographic library, which implements the ex-USSR GOST algorithm, under Visual C 2005, I found that optimization in 2005 is almost entirely absent!

Here are the benchmark results:

P4-2.0 GHz:
MSVC 2003: 33433 Kbytes/sec
MSVC 2005: 8960 Kbytes/sec
Ratio: 3.73 times slower!

P4-3.4 GHz
MSVC 2003: 48691 Kbytes/sec
MSVC 2005: 11520 Kbytes/sec
Ratio: 4.22 times slower!

It is even slower than MSVC 6.0! The 2003 build uses blended CPU optimization; with P4 optimizations the difference is even bigger!

Here is my test case: http://mike.qnx.org.ru/temp/perfomance.zip which reproduces the problem described above. The archive has the project and solution files for both Visual C 2003 and 2005, so that the Visual C team can check which optimization options are enabled. It contains the source, executables, and object files.

Any comments? Thanks in advance!



  • Stage_0

    I looked at the disassembly and found that 2005 uses three times as many jump instructions. 2005 uses "js", "jb" and "jg", whereas 2003 uses just "jl" (see http://www.jegerlehner.ch/intel/opcode.html )

    i.e. in 2003

     if ((time(NULL)-starttime)>=10)

    004015BF push 0
    004015C1 inc edi
    004015C2 call time (4015FBh)
    004015C7 sub eax,ebx
    004015C9 add esp,4
    004015CC cmp eax,0Ah
    004015CF jl main+40h (4015A0h)

     

     in 2005:

    if ((time(NULL)-starttime)>=10)

    0040154B push 0
    0040154D add ebx,1
    00401550 call edi
    00401552 add esp,4
    00401555 sub eax,ebp
    00401557 sbb edx,dword ptr [esp+14h]
    0040155B mov dword ptr [esp+1Ch],edx
    0040155F js main+50h (401530h)
    00401561 jg main+88h (401568h)
    00401563 cmp eax,0Ah
    00401566 jb main+50h (401530h)

     


  • Bent

    Seems accessing the key array is very expensive in 2005, even if I declare locally and as const. If I substitute the values in the key array directly in the code, i.e.

    n2 ^= f(n1+0);
    n1 ^= f(n2+1);
    n2 ^= f(n1+0);
    n1 ^= f(n2+0);
    n2 ^= f(n1+0);
    n1 ^= f(n2+0);
    n2 ^= f(n1+0);
    n1 ^= f(n2+0);

    then the speed increases to faster than 2003. In 2003, the speed is identical regardless of whether I access the key array or substitute directly.


  • bill1000

    Here's the response that I got back from one of the developers on the optimizer team:

    ==========

    My guess is that we (a) hit a store forwarding problem, and (b) use memory where earlier we used registers only.

    Whidbey generates the following code for the inner loop: instead of mov/shift/extract, it stores a DWORD and then loads part of it:

    lea edx, DWORD PTR [edi+ecx]
    mov DWORD PTR tv1192[esp+12], edx
    movzx ebx, BYTE PTR tv1192[esp+14]
    mov ebx, DWORD PTR _STable[ebx*4+2048]
    movzx ebp, dh
    xor ebx, DWORD PTR _STable[ebp*4+1024]
    mov ebp, edx
    shr ebp, 24 ; 00000018H
    xor ebx, DWORD PTR _STable[ebp*4+3072]
    and edx, 255 ; 000000ffH
    xor ebx, DWORD PTR _STable[edx*4]

    Everett uses mov/shift:

    lea edx, DWORD PTR [edi+ecx]
    mov ebx, edx
    shr ebx, 16 ; 00000010H
    movzx ebx, bl
    mov ebx, DWORD PTR _STable[ebx*4+2048]
    movzx ebp, dh
    xor ebx, DWORD PTR _STable[ebp*4+1024]
    mov ebp, edx
    shr ebp, 24 ; 00000018H
    xor ebx, DWORD PTR _STable[ebp*4+3072]
    mov DWORD PTR tv1192[esp+12], edx
    and edx, 255 ; 000000ffH
    xor ebx, DWORD PTR _STable[edx*4]

    As you can see, Everett generates more instructions, but they are “better” ones.

    To test, I ran the program on both a P4 and an Opteron; on Opteron there are far fewer store-forward stalls than on P4. On P4 the Whidbey-compiled program runs 72% slower than the Everett-compiled one. On Opteron (in 32-bit mode) it runs “only” 32% slower. I believe the remaining slowdown is caused by the use of memory instead of registers: mov/shift/extract has lower latency than store/load.

    This is definitely a regression, and it can hit other cryptographic code as well.

    ===========

    They also believe that they have a fix - and with the fix the code runs 10% faster than with Visual C++ 2003.



  • rajveer31

    I've changed all time() calls to GetTickCount(), which gives me about 400 KB/sec extra, but IT DOESN'T MATTER in my case: compare 12 MB/sec with 48 MB/sec. My benchmark test case now uses only one CRT function, printf, called twice: once before the loop and once after it.


  • Reactive

     Ted. wrote:

    Seems accessing the key array is very expensive in 2005, even if I declare locally and as const. If I substitute the values in the key array directly in the code, i.e.

          n2 ^= f(n1+0);
          n1 ^= f(n2+1);
          n2 ^= f(n1+0);
          n1 ^= f(n2+0);
          n2 ^= f(n1+0);
          n1 ^= f(n2+0);
          n2 ^= f(n1+0);
          n1 ^= f(n2+0);

    then the speed increases to faster than 2003.  In 2003, the speed is identical regardless of whether I access the key array or substitute directly.

    On my machine (AthlonXP 2500+), the performance difference is about 50%.

    I disassembled the code and found that the code generated by 2005 (note that link-time code generation is on) has already substituted the contents of testkey and inBlock. (In fact, the signature of GostEncryptBlock is changed to GostEncryptBlock(unsigned char* const outBlock) in 2005.)

    Code block in 2005 is in fact as follows (dword_403018 is the address of testkey, ecx holds the content of n1):

    .text:00401100                 mov     edx, dword_403018
    .text:00401106                 add     edx, ecx
    .text:00401108                 mov     [esp+18h+var_4], edx
    .text:0040110C                 movzx   ebx, byte ptr [esp+18h+var_4+2]
    .text:00401111                 mov     ebx, dword_403BC0[ebx*4]
    .text:00401118                 movzx   ebp, dh
    .text:0040111B                 xor     ebx, dword_4037C0[ebp*4]
    .text:00401122                 mov     ebp, edx
    .text:00401124                 shr     ebp, 18h
    .text:00401127                 xor     ebx, dword_403FC0[ebp*4]
    .text:0040112E                 and     edx, 0FFh
    .text:00401134                 xor     ebx, dword_4033C0[edx*4]

    It seems that the only difference between 2003 and 2005 is movzx   ebx, byte ptr [esp+18h+var_4+2] (2005) versus shr     ebx, 10h followed by movzx   ebx, bl (2003).

    When 0 and 1 are used directly, 2005 uses shr     ebx, 10h and and  ebx, 0xff. So I think the use of byte ptr [esp+18h+var_4+2] is the cause of the slowdown, since that address is not 4-byte aligned.


  • LOURENÇO MANTOVANI

    Read this first:

    http://www.codeproject.com/cpp/improved2005crt.asp

    Also the compiler intrinsics might be much slower than the CRT code!

    http://lab.msdn.microsoft.com/ProductFeedback/viewFeedback.aspx?feedbackId=FDBK47075

    Try and use

    #pragma function(strcmp, labs, strcpy, _rotl, memcmp, strlen, _rotr, memcpy, _lrotl, _strset, memset, _lrotr, abs, strcat)



  • jhermiz

    To Jonathan Caves - MSFT:

    So I should post this regression report to the labs anyway.


  • Samuelson Ido

    Yeah, sorry, you're right. I was just about to post that: after converting to GetTickCount like you did, there's no difference at all in that part of the disassembly. I'm now looking at the area you mentioned.

    in 2005 

     n2 ^= f(n1+MX_SWAPENDIAN(key[0]));

    00401100  lea         edx,[edi+ecx]
    00401103  mov         dword ptr [esp+14h],edx
    00401107  movzx       ebx,byte ptr [esp+16h]
    0040110C  mov         ebx,dword ptr [ebx*4+403838h]
    00401113  movzx       ebp,dh
    00401116  xor         ebx,dword ptr STable+400h (403438h)[ebp*4]
    0040111D  mov         ebp,edx
    0040111F  shr         ebp,18h
    00401122  xor         ebx,dword ptr [ebp*4+403C38h]
    00401129  and         edx,0FFh
    0040112F  xor         ebx,dword ptr STable (403038h)[edx*4]

    in 2003

    n2 ^= f(n1+MX_SWAPENDIAN(key[0]));

    00401160  lea         edx,[edi+ecx]
    00401163  mov         ebx,edx
    00401165  shr         ebx,10h
    00401168  movzx       ebx,bl
    0040116B  mov         ebx,dword ptr [ebx*4+403838h]
    00401172  movzx       ebp,dh
    00401175  xor         ebx,dword ptr [ebp*4+403438h]
    0040117C  mov         ebp,edx
    0040117E  shr         ebp,18h
    00401181  xor         ebx,dword ptr [ebp*4+403C38h]
    00401188  mov         dword ptr [esp+14h],edx
    0040118C  and         edx,0FFh
    00401192  xor         ebx,dword ptr [edx*4+403038h]

     Also, strangely, if I compile the 2005 version as a debug version (no optimizations at all) I get slightly better performance than the release 2005 version!

     

     


  • Ben Taylor UK

    Did you ever get a look at my source code? I'm using two printf() calls and one time() call. The iteration counts in the code are large enough to rule out slow CRT code as the cause.


  • balo

    That's just the tip of the iceberg. On my P4 3.4 GHz this loop executes only about 280 times per 10 seconds. Look at what Visual C 2005 did in the main GostEncryptBlock() function (which is the real bottleneck in this test case): it constantly reloads all registers from memory, does movzx/movsx, and performs silly register operations such as xor eax, eax; mov eax,10; etc.


  • misteloe

    And thank you all: Ted., Jonathan Caves, vbvan for the help !!!
  • Eunsu

    I can confirm that there is a significant slow down.

    You should report this issue at:

    http://lab.msdn.microsoft.com/ProductFeedback/

    [ The reason for this is that customer bugs are given more "weight" than internal bugs ]

    But in the meantime I'll give the optimizer guys a heads up and see if they have any ideas on why there is such a slowdown.



  • Rod Kimmel

    When I get into the office this morning I'll take a look at this, and if I find anything I'll get the optimizer team to take a look as well. We do know that there are some very specific code patterns that can cause the optimizer to generate sub-optimal code.

  • Jeff Lundstrom

    I would highly recommend posting to labs.msdn.microsoft.com/productfeedback. Then others can vote on and validate it. Even though they found and fixed the problem, the fix may only make it into Orcas (a year or so away) instead of VC2005 SP1 (a few months away). Posting it will make it more likely to be included in SP1. You may also want to try going through Microsoft PSS to get a hotfix for it (making it even more likely to be included).