Performance of exported template classes

g***@hotmail.com

2010-03-11 16:42:59 UTC

Hello all,

This may be a difficult problem to explain, and I need some assembly
to show the difference. In a DLL we export some STL containers to
minimize code bloat, like:

template class __declspec(dllexport) std::vector<int>;
typedef std::vector<int> int_vector;

In a simple test program I now see a huge difference in performance.
The C++ function is as follows (same as std::fill, but this is just
an example):

void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
{
    for (size_t n = 0; n != nLoop; ++n)
    {
        const int_vector::iterator itEnd = pVector->end();

        for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
        {
            *it = nValue;
        }
    }
}
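
For reference, a minimal sketch of the std::fill form that the loop above is said to match. The name PrfMemoryFill is hypothetical and not part of the original test program; the sketch only shows the equivalent behaviour and makes no claim about the code either compiler generates for it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

typedef std::vector<int> int_vector;

// Hypothetical baseline: writes nValue to every element, nLoop times,
// using std::fill instead of the hand-written iterator loop.
void PrfMemoryFill(int_vector* pVector, int nValue, std::size_t nLoop)
{
    for (std::size_t n = 0; n != nLoop; ++n)
    {
        std::fill(pVector->begin(), pVector->end(), nValue);
    }
}
```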

In the assembly code, exception handling has somehow been put in, and
stack values get reloaded and written back inside the loop, which is a
major performance issue (see '//! <- difference'):

void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
{
00401D30 push 0FFFFFFFFh
00401D32 push offset __ehhandler$?PrfMemoryIterator@@YAXPAV?$***@HV?$***@H@std@@@std@@***@Z (403718h)
00401D37 mov eax,dword ptr fs:[00000000h]
00401D3D push eax
00401D3E mov dword ptr fs:[0],esp
00401D45 sub esp,4Ch
00401D48 mov eax,dword ptr [___security_cookie (406270h)]
00401D4D xor eax,esp
00401D4F push edi
00401D50 mov edi,ecx

<snip>

for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
00401D7D lea ecx,[esp+4]
00401D81 push ecx
00401D82 mov ecx,ebx
00401D84 call dword ptr [__imp_std::vector<int,std::allocator<int> >::begin (404004h)]
00401D8A mov eax,dword ptr [esp+4]
00401D8E cmp eax,dword ptr [esp+8]
00401D92 je PrfMemoryIterator+79h (401DA9h)
{
*it = nValue;
00401D94 mov dword ptr [eax],esi
00401D96 mov eax,dword ptr [esp+4] //! <- difference
00401D9A mov ecx,dword ptr [esp+8] //! <- difference
00401D9E add eax,4
00401DA1 cmp eax,ecx
00401DA3 mov dword ptr [esp+4],eax //! <- difference
00401DA7 jne PrfMemoryIterator+64h (401D94h)

However, if we do not export the STL containers, the generated code is
different:

void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
{
00401F60 sub esp,44h
00401F63 mov eax,dword ptr [___security_cookie (406290h)]
00401F68 xor eax,esp
00401F6A push edi
00401F6B mov edi,ecx

<snip>

for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
00401F86 mov eax,dword ptr [ebx+4]
00401F89 cmp eax,ecx
00401F8B je PrfMemoryIterator+39h (401F99h)
00401F8D lea ecx,[ecx]
{
*it = nValue;
00401F90 mov dword ptr [eax],esi
00401F92 add eax,4
00401F95 cmp eax,ecx
00401F97 jne PrfMemoryIterator+30h (401F90h)

I use vstudio 2003 here, but I noticed something similar with the
_SECURE_SCL option in vstudio 2008, which also makes a difference from
a performance perspective.

Can anyone help? It is probably somewhere in the exception handling
corner, but why would it make a difference whether the classes are
exported or not?

Thx in advance.

Alexander Grigoriev

2010-03-12 03:50:21 UTC

Normally, the STL-generated code can get heavily optimized and inlined. But
if you export the code, the non-inlined, imported functions will be called
instead.
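
One way to keep imported member functions out of the hot loop, sketched here as an assumption rather than a verified fix: fetch a plain pointer range once per pass and iterate over that. PrfMemoryPointer is a hypothetical name. The calls to empty(), operator[] and size() may still go through the DLL, but they sit outside the inner loop. Whether a given compiler then emits the tight four-instruction loop has to be confirmed in the disassembly.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<int> int_vector;

// Hypothetical variant: the inner loop works on raw int pointers only,
// so it does not depend on any iterator member function being inlined.
void PrfMemoryPointer(int_vector* pVector, int nValue, std::size_t nLoop)
{
    for (std::size_t n = 0; n != nLoop; ++n)
    {
        if (pVector->empty())
            continue; // &(*pVector)[0] is not valid on an empty vector

        int* p = &(*pVector)[0];                   // first element
        int* const pEnd = p + pVector->size();     // one past the last

        for (; p != pEnd; ++p)
        {
            *p = nValue;
        }
    }
}
```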

g***@hotmail.com

2010-03-12 08:18:22 UTC

Post by Alexander Grigoriev
Normally, the STL-generated code can get heavily optimized and inlined. But
if you export the code, the no-inline functions will be used.

00401D92 je PrfMemoryIterator+79h (401DA9h)
{
*it = nValue;
00401D94 mov dword ptr [eax],esi
00401D96 mov eax,dword ptr [esp+4] //! <- difference
00401D9A mov ecx,dword ptr [esp+8] //! <- difference
00401D9E add eax,4
00401DA1 cmp eax,ecx
00401DA3 mov dword ptr [esp+4],eax //! <- difference
00401DA7 jne PrfMemoryIterator+64h (401D94h)

Yes, but an optimizer could conclude from the assembly code that it
stores and loads the value of eax again and again in [esp + 4].
Even the ecx register gets reloaded every time, without being changed
in the loop. So my conclusion would be that it is somehow essential
that this eax value gets written back to [esp + 4] in the loop, or
otherwise it may be a bug. I also do not use the volatile keyword, so
the optimizer is free to use all its power.

g***@hotmail.com

2010-03-14 23:48:16 UTC

I made 2 changes to the original code:
1) used a const_iterator as the end iterator
2) pulled the iterator declaration out of the for loop

And now the values of the iterator aren't reloaded again and again in
the for loop. No idea why; could a compiler specialist help here?

void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
{
    PRF_FUNCTION();

    for (size_t n = 0; n != nLoop; ++n)
    {
        const int_vector::const_iterator itEnd = pVector->end();
        int_vector::iterator it;

        for (it = pVector->begin(); it != itEnd; ++it)
        {
            *it = nValue;
        }
    }
}
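
To check such a change by measurement as well as by reading the disassembly, a small timing helper can be used. The following is only a sketch: TimeVariant and PrfFunc are hypothetical names, std::clock is a coarse timer, and the PRF_FUNCTION() macro from the code above is left out because its definition was not posted.

```cpp
#include <cstddef>
#include <ctime>
#include <vector>

typedef std::vector<int> int_vector;
typedef void (*PrfFunc)(int_vector*, int, std::size_t);

// The modified loop from the post, without the PRF_FUNCTION() macro.
void PrfMemoryIterator(int_vector* pVector, int nValue, std::size_t nLoop)
{
    for (std::size_t n = 0; n != nLoop; ++n)
    {
        const int_vector::const_iterator itEnd = pVector->end();
        int_vector::iterator it;

        for (it = pVector->begin(); it != itEnd; ++it)
        {
            *it = nValue;
        }
    }
}

// Hypothetical helper: runs one variant and returns the CPU time in seconds.
double TimeVariant(PrfFunc pfn, int_vector* pVector, int nValue, std::size_t nLoop)
{
    const std::clock_t start = std::clock();
    pfn(pVector, nValue, nLoop);
    const std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

Calling TimeVariant(&PrfMemoryIterator, &v, 1, nLoop) for each variant, with a large enough vector and nLoop, gives numbers that can be compared between the exported and non-exported builds.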

I also saw another nice effect (which may or may not be related):
'Inconsistent inlining of C++ class template member functions across DLLs'
https://connect.microsoft.com/VisualStudio/feedback/details/511979/inconsistent-inlining-of-c-class-template-member-functions-across-dlls