Martin B.
2010-06-09 20:04:37 UTC
MultiByteToWideChar does not correctly detect an invalid UTF-8 string
when it contains certain single-byte values that are valid Latin1
characters but invalid UTF-8 encodings.
If you hand a Latin1 string to MultiByteToWideChar and the only bytes
above 0x7F in that string are e.g. U+00C4 / U+00D6 / U+00E4 / ..., then
MB2WC will *not* fail with CP_UTF8+MB_ERR_INVALID_CHARS - contrary to
what the docs state. (U+00A* and U+00F* bytes seem to be detected
correctly as invalid UTF-8.)
(Find a complete source code example at the end.)
From the MSDN entry for MultiByteToWideChar:
(as seen here:
http://msdn.microsoft.com/query/dev10.query?appId=Dev10IDEF1&l=EN-US&k=k%28MULTIBYTETOWIDECHAR%29;k%28DevLang-%22C%2B%2B%22%29&rd=true
)
----
MultiByteToWideChar Function
(...)
MB_ERR_INVALID_CHARS - (...) Windows XP: Fail if an invalid input
character is encountered. (...)
(...)
Note For UTF-8 (...) dwFlags must be set to either 0 or
MB_ERR_INVALID_CHARS. (...)
The function fails if MB_ERR_INVALID_CHARS is set and an invalid
character is encountered (...)
Starting with Windows Vista, this function fully conforms with the
Unicode 4.1 specification for UTF-8 and UTF-16. The function used on
earlier operating systems encodes or decodes lone surrogate halves or
mismatched surrogate pairs (...)
----
Since I currently do not have access to a Vista test box (only XP here),
can anyone tell me whether the behaviour is corrected on Vista/Win7 or
whether it still incorrectly swallows some Latin1 bytes?
cheers,
Martin
--- Example : main.cpp ---
// Tested with VS 2005 + VS 2010 Express
#include <Windows.h>
#include <stdio.h>
#include <stdlib.h> // _countof

// Latin1-encoded test strings; each contains exactly one byte above 0x7F,
// so none of them is valid UTF-8.
const char* const test_str[] = {
    "Copyright : U+00A9 (A9) : **[\xA9]**",
    "Degree : U+00B0 (B0) : **[\xB0]**",
    "Micro : U+00B5 (B5) : **[\xB5]**",
    "Capital A with diaeresis : U+00C4 (C4) : **[\xC4]**",
    "Capital O with diaeresis : U+00D6 (D6) : **[\xD6]**",
    "Small a with diaeresis : U+00E4 (E4) : **[\xE4]**",
    "Small u with diaeresis : U+00FC (FC) : **[\xFC]**",
    "Small letter y with diaeresis : U+00FF (FF) : **[\xFF]**"
};

int main()
{
    printf("Testing MultiByteToWideChar with Latin1 strings and UTF-8 conversion ...\n");
    printf("Each of the following strings should fail a UTF-8 to UTF-16 conversion as they are *not* valid UTF-8 ...\n");
    for (int i = 0, e = (int)_countof(test_str); i != e; ++i) {
        wchar_t wbuf[1024];
        // Returns the number of wide characters written, or 0 on failure.
        int conv_succeeded = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                                 test_str[i], -1, wbuf, (int)_countof(wbuf));
        if (conv_succeeded) {
            printf("(ERR) Conversion succeeded for string [%s]\n", test_str[i]);
            // With MSVC, %s in wprintf expects a wide string.
            wprintf(L" Result: [%s]\n", wbuf);
        }
        else {
            DWORD err = GetLastError(); // DWORD is unsigned long, hence %lu
            printf("( OK) Conversion Failed with error (%lu) for string [%s]\n", err, test_str[i]);
        }
        printf("\n");
    }
    return 0;
}
--- ---
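For comparison, here is a minimal sketch of a strict UTF-8 check that does not depend on the OS, assuming the well-formedness rules from the UTF-8 definition (RFC 3629 / Unicode Table 3-7). The names is_valid_utf8 and self_check are hypothetical helpers, not part of the Windows API. It shows why all eight test strings are invalid: A9/B0/B5 are continuation bytes without a lead byte, FC/FF never occur in UTF-8, and C4/D6/E4 are lead bytes that must be followed by continuation bytes but are followed by ']' here. That last group is the one XP appears to let through, possibly because the incomplete sequence is dropped rather than reported, though that is only a guess.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true only if s is well-formed UTF-8
// (no stray continuation bytes, no truncated or overlong sequences,
// no surrogates, nothing above U+10FFFF).
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }          // plain ASCII

        std::size_t len = 0;                         // total sequence length
        unsigned char lo = 0x80, hi = 0xBF;          // allowed range of 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;               // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;               // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;               // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;               // reject > U+10FFFF
        } else {
            return false;   // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence
        }

        if (n - i < len) return false;               // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;  // not a continuation byte
        }
        i += len;
    }
    return true;
}

// Checks the byte patterns used by the test strings in the example above.
void self_check()
{
    assert(!is_valid_utf8("**[\xA9]**"));   // lone continuation byte
    assert(!is_valid_utf8("**[\xC4]**"));   // 2-byte lead followed by ']'
    assert(!is_valid_utf8("**[\xE4]**"));   // 3-byte lead followed by ']'
    assert(!is_valid_utf8("**[\xFF]**"));   // byte that never occurs in UTF-8
    assert(is_valid_utf8("**[\xC3\x84]**")); // U+00C4 correctly encoded
}
```

On a system where MultiByteToWideChar accepts such input, running a check like this on the byte string before the conversion would be one possible workaround.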