While recompiling my cryptographic library, which implements the ex-USSR GOST algorithm, under Visual C 2005, I found that optimization in 2005 is almost entirely absent!
Here are the benchmark results:
P4-2.0 GHz:
MSVC 2003: 33433 Kbytes/sec
MSVC 2005: 8960 Kbytes/sec
Ratio: 3.73 times slower!
P4-3.4 GHz
MSVC 2003: 48691 Kbytes/sec
MSVC 2005: 11520 Kbytes/sec
Ratio: 4.22 times slower!
It is even slower than MSVC 6.0! The 2003 build uses blended CPU optimization; with P4 optimizations the difference is even bigger!
Here is my test case, which reproduces the problem described above: http://mike.qnx.org.ru/temp/perfomance.zip. The archive has project and solution files for both Visual C 2003 and 2005, so the Visual C team can check which optimization options are enabled. It contains the source, executables and object files.
Any comments? Thanks in advance!
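For readers without the archive, here is a minimal sketch of how a Kbytes/sec figure like the ones above could be measured. It is not the code from the zip: dummy_encrypt_block and measure_kbytes_per_sec are hypothetical names, the dummy body is filler standing in for GostEncryptBlock(), and the 8-byte block size is assumed from GOST's 64-bit block.

```c
#include <stddef.h>
#include <time.h>

#define BLOCK_BYTES 8  /* assumed: one 64-bit GOST block */

/* Hypothetical stand-in for the real block cipher; just scrambles bytes. */
static void dummy_encrypt_block(unsigned char *block)
{
    for (size_t i = 0; i < BLOCK_BYTES; ++i)
        block[i] = (unsigned char)(block[i] * 31u + 7u);
}

/* Runs encrypt() repeatedly for at least `seconds` (must be > 0) and
   returns the throughput in Kbytes per second. */
static double measure_kbytes_per_sec(void (*encrypt)(unsigned char *),
                                     double seconds)
{
    unsigned char block[BLOCK_BYTES] = {0};
    unsigned long long blocks = 0;
    clock_t start = clock();
    double elapsed;

    do {
        /* Batch the calls so the timer is not polled on every block. */
        for (int i = 0; i < 4096; ++i)
            encrypt(block);
        blocks += 4096;
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < seconds);

    return (double)blocks * BLOCK_BYTES / 1024.0 / elapsed;
}
```

Calling measure_kbytes_per_sec(dummy_encrypt_block, 10.0) would mirror the 10-second window used in the test case. Batching the calls keeps the timer overhead small relative to the work being measured.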

Visual C 2005 is more than 3 times slower than Visual C 2003!
Stage_0
I looked at the disassembly and found that 2005 uses three times as many jump instructions. 2005 uses "js", "jb" and "jg" whereas 2003 just uses "jl" (see http://www.jegerlehner.ch/intel/opcode.html )
i.e. in 2003
if ((time(NULL)-starttime)>=10)
004015BF push 0
004015C1 inc edi
004015C2 call time (4015FBh)
004015C7 sub eax,ebx
004015C9 add esp,4
004015CC cmp eax,0Ah
004015CF jl main+40h (4015A0h)
in 2005:
if ((time(NULL)-starttime)>=10)
0040154B push 0
0040154D add ebx,1
00401550 call edi
00401552 add esp,4
00401555 sub eax,ebp
00401557 sbb edx,dword ptr [esp+14h]
0040155B mov dword ptr [esp+1Ch],edx
0040155F js main+50h (401530h)
00401561 jg main+88h (401568h)
00401563 cmp eax,0Ah
00401566 jb main+50h (401530h)
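A note on the listing above: the sub/sbb pair and the three jumps are what a 64-bit signed subtract-and-compare looks like on a 32-bit target, which fits with Visual C 2005 making time_t 64 bits wide by default (defining _USE_32BIT_TIME_T restores the 32-bit type). Below is a rough sketch in C of how the `>= 10` test decomposes into a high-half and a low-half check. The function name is hypothetical and the mapping to registers is my reading of the disassembly, not compiler documentation.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the js / jg / jb sequence:
   returns 1 when (now - start) >= 10 for 64-bit signed times. */
static int diff_at_least_10(int64_t now, int64_t start)
{
    /* Subtract as unsigned so wraparound is well defined, then split. */
    uint64_t diff = (uint64_t)now - (uint64_t)start;
    /* High half (edx after sbb); the cast to int32_t relies on the usual
       two's-complement conversion. Low half is eax after sub. */
    int32_t  hi = (int32_t)(uint32_t)(diff >> 32);
    uint32_t lo = (uint32_t)diff;

    if (hi < 0)
        return 0;        /* js: difference is negative */
    if (hi > 0)
        return 1;        /* jg: difference is far above 10 */
    return lo >= 10u;    /* cmp eax,0Ah / jb: unsigned test of the low half */
}
```

This check runs once per pass of the outer timing loop, so on its own it would cost only a few extra instructions per iteration.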
Bent
It seems that accessing the key array is very expensive in 2005, even if I declare it locally and as const. If I substitute the values of the key array directly into the code, i.e.
n2 ^= f(n1+0);
n1 ^= f(n2+1);
n2 ^= f(n1+0);
n1 ^= f(n2+0);
n2 ^= f(n1+0);
n1 ^= f(n2+0);
n2 ^= f(n1+0);
n1 ^= f(n2+0);
then the speed rises above that of 2003. In 2003, the speed is identical regardless of whether I access the key array or substitute the values directly.
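To make the two variants being compared concrete, here is a small sketch. Everything in it is an assumption for illustration: f is a placeholder (a plain rotate, not the library's S-box lookup), the key contents {0, 1, 0, ...} are inferred from the constants in the snippet above, and the MX_SWAPENDIAN step of the real code is left out.

```c
#include <stdint.h>

/* Placeholder round function (hypothetical): an 11-bit rotate,
   NOT the library's real table-driven f(). */
static uint32_t f(uint32_t x)
{
    return (x << 11) | (x >> 21);
}

/* Variant 1: subkeys are read from an array on every round. */
static void rounds_from_array(uint32_t *n1, uint32_t *n2,
                              const uint32_t key[8])
{
    for (int i = 0; i < 8; i += 2) {
        *n2 ^= f(*n1 + key[i]);
        *n1 ^= f(*n2 + key[i + 1]);
    }
}

/* Variant 2: the same eight rounds with the key values written in by hand. */
static void rounds_hardcoded(uint32_t *n1, uint32_t *n2)
{
    *n2 ^= f(*n1 + 0);  *n1 ^= f(*n2 + 1);
    *n2 ^= f(*n1 + 0);  *n1 ^= f(*n2 + 0);
    *n2 ^= f(*n1 + 0);  *n1 ^= f(*n2 + 0);
    *n2 ^= f(*n1 + 0);  *n1 ^= f(*n2 + 0);
}

/* Returns 1 when both variants give the same output for one input. */
static int variants_agree(uint32_t n1, uint32_t n2)
{
    static const uint32_t key[8] = {0, 1, 0, 0, 0, 0, 0, 0};
    uint32_t a1 = n1, a2 = n2, b1 = n1, b2 = n2;

    rounds_from_array(&a1, &a2, key);
    rounds_hardcoded(&b1, &b2);
    return a1 == b1 && a2 == b2;
}
```

Both variants compute the same result, so any speed difference between them comes from the code the compiler generates, not from the algorithm.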
bill1000
Here's the response that I got back from one of the developers on the optimizer team:
==========
My guess is that we (a) hit a store-forwarding problem and (b) use memory where earlier we used registers only.
Whidbey generates the following code for the inner loop; instead of mov/shift/extract it stores a DWORD and then loads part of it:
lea edx, DWORD PTR [edi+ecx]
mov DWORD PTR tv1192[esp+12], edx
movzx ebx, BYTE PTR tv1192[esp+14]
mov ebx, DWORD PTR _STable[ebx*4+2048]
movzx ebp, dh
xor ebx, DWORD PTR _STable[ebp*4+1024]
mov ebp, edx
shr ebp, 24 ; 00000018H
xor ebx, DWORD PTR _STable[ebp*4+3072]
and edx, 255 ; 000000ffH
xor ebx, DWORD PTR _STable[edx*4]
Everett uses mov/shift:
lea edx, DWORD PTR [edi+ecx]
mov ebx, edx
shr ebx, 16 ; 00000010H
movzx ebx, bl
mov ebx, DWORD PTR _STable[ebx*4+2048]
movzx ebp, dh
xor ebx, DWORD PTR _STable[ebp*4+1024]
mov ebp, edx
shr ebp, 24 ; 00000018H
xor ebx, DWORD PTR _STable[ebp*4+3072]
mov DWORD PTR tv1192[esp+12], edx
and edx, 255 ; 000000ffH
xor ebx, DWORD PTR _STable[edx*4]
As you can see, Everett generates more instructions, but they are “better” ones.
To test, I ran that program on both a P4 and an Opteron; on the Opteron there are far fewer store-forwarding stalls than on the P4. On the P4 the Whidbey-compiled program runs 72% slower than the Everett-compiled one. On the Opteron (in 32-bit mode) it runs “only” 32% slower. I believe the remaining slowdown is caused by the use of memory instead of registers: mov/shift/extract has lower latency than store/load.
This is definitely a regression, and it can hit other cryptographic code as well.
===========
They also believe that they have a fix - and with the fix the code runs 10% faster than with Visual C++ 2003.
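The two code shapes in the developer's reply can be sketched in C as follows. This is an illustration under assumptions, not the library's source. STable here is a hypothetical stand-in filled with arbitrary data, the sub-table offsets (0, 256, 512, 768 entries) are read off the disassembly, and the through-memory variant assumes a little-endian machine. The volatile is only there to force the store and byte reload that Whidbey emitted on its own.

```c
#include <stdint.h>

/* Hypothetical stand-in for the combined S-box table:
   four 256-entry sub-tables, one per input byte. */
static uint32_t STable[4 * 256];

static void init_table(void)
{
    uint32_t v = 0x12345678u;
    for (int i = 0; i < 4 * 256; ++i) {
        v = v * 1664525u + 1013904223u;  /* arbitrary filler values */
        STable[i] = v;
    }
}

/* Everett-style: byte 2 is extracted in registers (shift, then mask). */
static uint32_t f_shift(uint32_t x)
{
    return STable[x & 0xFFu]
         ^ STable[256 + ((x >> 8) & 0xFFu)]
         ^ STable[512 + ((x >> 16) & 0xFFu)]
         ^ STable[768 + (x >> 24)];
}

/* Whidbey-style: the DWORD is stored, then byte 2 is loaded back from
   memory. A narrow load from a just-written wider store is the pattern
   the reply blames for store-forwarding stalls. Little-endian only. */
static uint32_t f_memory(uint32_t x)
{
    volatile uint32_t spill = x;
    unsigned char b2 = ((volatile unsigned char *)&spill)[2];

    return STable[x & 0xFFu]
         ^ STable[256 + ((x >> 8) & 0xFFu)]
         ^ STable[512 + b2]
         ^ STable[768 + (x >> 24)];
}

/* Counts inputs for which the two variants disagree (expected: none). */
static int count_mismatches(void)
{
    int bad = 0;
    uint32_t x = 1u;

    init_table();
    for (int i = 0; i < 1000; ++i) {
        x = x * 1103515245u + 12345u;
        if (f_shift(x) != f_memory(x))
            ++bad;
    }
    return bad;
}
```

Both functions return the same value; they differ only in whether byte 2 travels through the stack. Timing them against each other on a P4-class CPU would be one way to try to reproduce the effect, though results will depend on the compiler and microarchitecture.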
rajveer31
I've changed all time() calls to GetTickCount(), which gains me about 400 KB per second, but IT DOESN'T MATTER in my case: compare 12 MB/sec with 48 MB/sec. My benchmark test case now uses only one CRT function, printf, twice: once before the loop and once after it.
Reactive
On my machine (AthlonXP 2500+), the performance difference is about 50%.
I disassembled the code and found that the code generated by 2005 (note that link-time code generation is on) has already substituted the contents of testkey and inBlock. (In fact, the signature of GostEncryptBlock is changed to GostEncryptBlock(unsigned char* const outBlock) in 2005.)
The code block in 2005 is in fact as follows (dword_403018 is the address of testkey, ecx holds the content of n1):
.text:00401100 mov edx, dword_403018
.text:00401106 add edx, ecx
.text:00401108 mov [esp+18h+var_4], edx
.text:0040110C movzx ebx, byte ptr [esp+18h+var_4+2]
.text:00401111 mov ebx, dword_403BC0[ebx*4]
.text:00401118 movzx ebp, dh
.text:0040111B xor ebx, dword_4037C0[ebp*4]
.text:00401122 mov ebp, edx
.text:00401124 shr ebp, 18h
.text:00401127 xor ebx, dword_403FC0[ebp*4]
.text:0040112E and edx, 0FFh
.text:00401134 xor ebx, dword_4033C0[edx*4]
It seems that the only difference between 2003 and 2005 is movzx ebx, byte ptr [esp+18h+var_4+2] (2005) versus shr ebx, 10h followed by movzx ebx, bl (2003).
When I use 0 and 1 directly, 2005 uses shr ebx, 10h followed by and ebx, 0xff. So I think the use of byte ptr [esp+18h+var_4+2] is the cause of the slowdown, since the address is not 4-byte aligned.
LOURENÇO MANTOVANI
Read this first:
http://www.codeproject.com/cpp/improved2005crt.asp
Also the compiler intrinsics might be much slower than the CRT code!
http://lab.msdn.microsoft.com/ProductFeedback/viewFeedback.aspx?feedbackId=FDBK47075
Try and use
#pragma function(strcmp, labs, strcpy, _rotl, memcmp, strlen, _rotr, memcpy, _lrotl, _strset, memset, _lrotr, abs, strcat)
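For anyone unfamiliar with it, #pragma function tells MSVC to emit real calls to the listed CRT functions instead of expanding them inline as intrinsics. A minimal sketch of where it goes is below. The helper copy_string is hypothetical, and the _MSC_VER guard is only there so the file still builds with other compilers.

```c
#include <stddef.h>
#include <string.h>

#ifdef _MSC_VER
/* Must appear at file scope, before the functions that use these calls. */
#pragma function(memcpy, strlen)
#endif

/* Hypothetical helper: copies src into dst (capacity cap, including the
   terminator) and returns the number of characters copied. */
static size_t copy_string(char *dst, size_t cap, const char *src)
{
    size_t n = strlen(src);

    if (cap == 0)
        return 0;
    if (n >= cap)
        n = cap - 1;          /* truncate to fit */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Whether this helps in the benchmark above is doubtful, since the hot loop does not call any of these functions.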
jhermiz
To Jonathan Caves - MSFT:
So I must post this regression report to the labs anyway?
Samuelson Ido
Yeah, sorry, you're right. I was just about to post that, after converting to GetTickCount like you did, there's no difference at all in that part of the disassembly. I'm now looking at the area you mentioned.
in 2005
n2 ^= f(n1+MX_SWAPENDIAN(key[0]));
00401100 lea edx,[edi+ecx]
00401103 mov dword ptr [esp+14h],edx
00401107 movzx ebx,byte ptr [esp+16h]
0040110C mov ebx,dword ptr [ebx*4+403838h]
00401113 movzx ebp,dh
00401116 xor ebx,dword ptr STable+400h (403438h)[ebp*4]
0040111D mov ebp,edx
0040111F shr ebp,18h
00401122 xor ebx,dword ptr [ebp*4+403C38h]
00401129 and edx,0FFh
0040112F xor ebx,dword ptr STable (403038h)[edx*4]
in 2003
n2 ^= f(n1+MX_SWAPENDIAN(key[0]));
00401160 lea edx,[edi+ecx]
00401163 mov ebx,edx
00401165 shr ebx,10h
00401168 movzx ebx,bl
0040116B mov ebx,dword ptr [ebx*4+403838h]
00401172 movzx ebp,dh
00401175 xor ebx,dword ptr [ebp*4+403438h]
0040117C mov ebp,edx
0040117E shr ebp,18h
00401181 xor ebx,dword ptr [ebp*4+403C38h]
00401188 mov dword ptr [esp+14h],edx
0040118C and edx,0FFh
00401192 xor ebx,dword ptr [edx*4+403038h]
Also, strangely, if I compile the 2005 version as a debug version (no optimizations at all) I get slightly better performance than the release 2005 version!
Ben Taylor UK
Did you ever look at my source code? I'm using two printf() calls and one time() call. The iteration count is big enough to rule out bad CRT code as the cause.
balo
That's just the tip of the iceberg. This loop on my P4 3.4GHz executes about 280 times per 10 seconds; look at what Visual C 2005 did in the main GostEncryptBlock() function (which is the real bottleneck in this test case). It constantly reloads all registers from memory, does movzx/movsx, and performs silly register operations like xor eax, eax; mov eax,10; etc.
misteloe
Eunsu
I can confirm that there is a significant slowdown.
You should report this issue at:
http://lab.msdn.microsoft.com/ProductFeedback/
[ The reason for this is that customer bugs are given more "weight" than internal bugs ]
But in the meantime I'll give the optimizer guys a heads-up and see if they have any ideas on why there is such a slowdown.
Rod Kimmel
Jeff Lundstrom