Discussion:
wcout and unicode?
David Webber
2009-10-08 14:57:33 UTC
I have a Unicode console app.

It has the usual _tmain() [where t means Unicode, as the Unicode setting is
on] with argc and argv.

As one of my arguments I have the string Iván, where the third letter is
produced with AltGr+A on my British keyboard. (U+00E1)

Suppose

CString sName;

is set to contain this string.

1. the debugger watch window shows it correctly.
2. If I output it in a message box, it is fine: exactly what I'd expect.
3. However

std::wcout << sName;

or

std::wcout << (LPCTSTR)sName;

buggers it up. Running the program

myprog "Iván"

in a console window gives me Ivßn (third character is German "scharfes s" =
U+00DF).

The same thing happens if I do

myprog "Iván" >> newfile.txt

(and the file which is created is ASCII).

Something strange seems to be happening with wcout. Any ideas?

[My machine is English Windows Vista with default code page (I believe) set
to 1252 = Western European.]

Dave
--
David Webber
Author of 'Mozart the Music Processor'
http://www.mozart.co.uk
For discussion/support see
http://www.mozart.co.uk/mozartists/mailinglist.htm
David Webber
2009-10-08 16:07:15 UTC
Post by David Webber
myprog "Iván"
in a console window gives me Ivßn (third character is German "scharfes s"
= U+00DF).
I have found what must be part of an answer in the code page tables of
Nadine Kano's book (written when win95 and NT3.51 were the latest thing
since sliced bread).

In code page 1252 (Western European = Windows Latin 1) 'á' is character 225
(=0xE1)
(and it is Unicode U+00E1)

In code page 850 ("MS-DOS Latin 1") character 225 is 'ß'.

Bingo!

It appears that wcout << takes wide character strings and dumps them in a
console window or ASCII text file using code page 850?

Is that possible? (I haven't AFAIK set any funny code pages as default on
my machine.) Hasn't Vista sort of forgotten about MS-DOS, or have I been
misled!?????

Dave
Igor Tandetnik
2009-10-08 17:29:16 UTC
Post by David Webber
Post by David Webber
myprog "Iván"
in a console window gives me Ivßn (third character is German
"scharfes s" = U+00DF).
I have found what must be part of an answer in the code page tables of
Nadine Kano's book (written when win95 and NT3.51 were the latest
thing since sliced bread).
In code page 1252 (Western European = Windows latin 1) 'á' is
character 225 (=0xE1)
(and it is unicode U+00E1)
In code page 850 ("MS-DOS Latin 1") character 225 is 'ß'.
Bingo!
It appears that wcout << takes wide character strings and dumps them
in a console window or ASCII text file using code page 850?
No. It takes a character code 225 and dumps it as character code 225. The console chooses to interpret this code according to CP
850, and shows you a character other than the one you expected to see.
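
A minimal sketch of the point above, not taken from the thread: the same byte value decodes to different characters depending on which code page the reader applies. The helper name `decode_byte` is hypothetical, and only the few table entries relevant to this thread are filled in (real code page tables have 256 entries each).

```cpp
// Hypothetical helper: decode one byte under a given code page number.
// Only the entries discussed in this thread are listed; anything else
// above the ASCII range returns U+FFFD (the replacement character).
char32_t decode_byte(unsigned char byte, int code_page)
{
    if (byte < 0x80) {
        return byte;                          // ASCII range is the same in both
    }
    if (code_page == 1252) {                  // Windows Latin 1 (ANSI)
        if (byte == 0xE1) return 0x00E1;      // a-acute
        if (byte == 0xDF) return 0x00DF;      // sharp s
    } else if (code_page == 850) {            // MS-DOS Latin 1 (OEM)
        if (byte == 0xE1) return 0x00DF;      // sharp s
        if (byte == 0xA0) return 0x00E1;      // a-acute
    }
    return 0xFFFD;                            // not covered by this sketch
}
```

Under these assumptions, `decode_byte(0xE1, 1252)` gives U+00E1 and `decode_byte(0xE1, 850)` gives U+00DF, which matches the Ivßn seen in the console.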

Console uses OEM code pages by default, has always done so, and will very likely continue to do so for the foreseeable future. See
also

http://blogs.msdn.com/michkap/archive/2005/02/08/369197.aspx
http://blogs.msdn.com/michkap/archive/2009/08/05/9857486.aspx

How do you examine the text file? Do you echo it to the console, by any chance? Try looking at it in a hex editor.
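
If no hex editor is to hand, a few lines of C++ can do the same job. This is only a sketch; the names `read_bytes` and `hex_dump` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes (binary mode, so nothing is translated).
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Format bytes as space-separated two-digit lowercase hex values.
std::string hex_dump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `hex_dump(read_bytes("newfile.txt"))` shows what was really written: a single-byte á from code page 1252 appears as `e1`, while a ß from that code page would appear as `df`.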
--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea. It is hard to be sure where they are going
to land, and it could be dangerous sitting under them as they fly overhead. -- RFC 1925
Alex Blekhman
2009-10-08 17:33:25 UTC
Post by David Webber
It appears that wcout << takes wide character strings and dumps
them in a console window or ASCII text file using code page 850?
Is that possible? (I haven't AFAIK set any funny code pages as
default on my machine.) Hasn't Vista sort of forgotten about
MS-DOS or have i been misled!?????
Yes, console window still employs funny codepages. Here's a bit
more info about the phenomenon:

"File redirection corruption?"
http://blogs.msdn.com/michkap/archive/2006/04/07/570980.aspx

"Why is the default console codepage called "OEM"?"
http://blogs.msdn.com/oldnewthing/archive/2005/08/29/457483.aspx

Usually explicitly running CMD with Unicode switch (/U) helps:

CMD /U myprog "Iván" >> newfile.txt


HTH
Alex
Jochen Kalmbach [MVP]
2009-10-08 18:05:02 UTC
Hi David!
Post by David Webber
Post by David Webber
in a console window gives me Ivßn (third character is German
"scharfes s" = U+00DF).
I have found what must be part of an answer in the code page tables of
Nadine Kano's book (written when win95 and NT3.51 were the latest thing
since sliced bread).
It will work if you specify the codepage for the "translation":
http://blog.kalmbachnet.de/?postid=98

This only works starting with VS2005.
--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
David Webber
2009-10-09 14:50:17 UTC
Thanks for the answers chaps. I'm looking into it, and getting part way
there with your help.

In particular Jochen's answer half works.

If I insert (near the start of my program):

_setmode( _fileno(stdout), _O_U16TEXT );

Then outputting Unicode character 225 (á = a-acute) = U+00E1 via

wcout << szWideString;

does indeed send 'á' correctly to the console. I thought my troubles were
over. But

myprog.exe >> testfile.txt

still doesn't work.

Instead of an ASCII text file it now creates a wide character text file with
no BOM. That's ok as the comments in <fcntl.h> say that's what it does.
The flag _O_U16TEXT is not actually mentioned in the documentation of
_setmode() [the version which came with VS 2008] but the comments in the
source code are clear. [It looks like _O_WTEXT will create a file with a
BOM.]
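
For reference, a sketch of the _setmode() call with the headers it needs. The wrapper name `enable_utf16_stdout` is hypothetical, and the `_WIN32` guard is an addition so the snippet also compiles on other platforms, where it simply does nothing.

```cpp
#include <cstdio>

#ifdef _WIN32
#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode
#endif

// Hypothetical wrapper: switch stdout to UTF-16 text mode (no BOM).
// Returns true if the mode was changed. Off Windows there is no
// _setmode, so this sketch just reports false.
bool enable_utf16_stdout()
{
#ifdef _WIN32
    // _setmode returns the previous mode, or -1 on failure.
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;
#else
    return false;
#endif
}
```

It would be called once near the start of _tmain(), before the first `std::wcout <<`. As far as I know, once stdout is in this mode only the wide output functions should be used on it; mixing in narrow printf or cout output is likely to trip a CRT assertion.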

However the wide character which ends up in the file is still 223 (U+00DF)
just as previously (without the _setmode()) it left an ASCII character
223.

Aaaaaaaaaaaaaaaargghhhhh!!!!!

Now of course I have just redirected output from stdout to a file, so it
looks like the argument _fileno(stdout) does not apply to this situation.
A (FILE *) can be used instead, but of course I don't have a FILE *: the
program thinks it's using stdout and my batch file redirects this to a file.

Any ideas?

Dave

PS I am trying in vain to understand what is happening! It looks like

wcout << szWideString;

must now be sending a U+00E1. I know it sounds daft, but somewhere before
it ends up in the text file something is saying "Aha this is character
225", looking it up in code page 850 (MS-DOS Latin) or similar, finding
it's a scharfes-s, and then saying "now where is scharfes-s in Unicode? Aha
code point U+00DF" and dropping that into the text file!
Jochen Kalmbach [MVP]
2009-10-09 15:02:31 UTC
Hi David!
Post by David Webber
Now of course I have just redirected output from stdout to a file, so it
looks like the argument _fileno(stdout) does not apply to this
situation. A (FILE *) can be used instead, but of course I don't have a
FILE *: the program thinks it's using stdout and my batch file
redirects this to a file.
Any ideas?
You can dig deeper into the CRT source code. There is some detection of
"file-redirection" of the console output... and maybe you need to change
some settings ;)

By the way: What do you expect to be stored in the redirected file?

Greetings
Jochen
David Webber
2009-10-09 16:42:25 UTC
Post by Jochen Kalmbach [MVP]
Hi David!
Post by David Webber
Now of course I have just redirected output from stdout to a file, so it
looks like the argument _fileno(stdout) does not apply to this
situation. A (FILE *) can be used instead, but of course I don't have a
FILE *: the program thinks it's using stdout and my batch file redirects
this to a file.
Any ideas?
You can dig deeper into the CRT source code. There is some detection of
"file-redirection" of the console output... and maybe you need to change
some settings ;)
Sounds complicated!
Post by Jochen Kalmbach [MVP]
By the way: What do you expect to be stored in the redirected file?
I'm generating (User-name, Serial-number) pairs for my software. (As a
minimal degree of security, the installer will ask for the user-name and
serial number.) I have an algorithm to generate the serial number for any
user from his user name (including accented characters, but probably not yet
non-Latin scripts).

For example if I get a telephone order, my simple program implements this so
I can type

MyAlgorithm.exe "user-name" >> textfile.txt

and it adds a line to the text file of the form:

user-name serial-number

Simple! Well I thought so. :-)

The serial number is coming out fine, but testing it with the name of my
Panamanian friend Iván, I can't get out the user name I typed in! I
haven't even started testing on characters which are not present in
code-page 1252 (my next test will be my Czech friend Frantisek). I really
thought that Unicode was supposed to be a done deal in Vista and VS2008
!!!!!! :-(

And I'm nowhere near doing the same thing in php for on-line orders (the
next step). That works fine too for unaccented characters, and rumour has
it that php can handle Unicode via UTF-8. But I'm a bit of a newbie at php,
so I thought I'd make it work in C++ first (at which I'm a relatively old
hand). But I guess php is another discussion group: at the moment it's the
C++ which is incredibly frustrating.

I suppose if the worst comes to the worst I can open the file explicitly,
append the data - e.g. by writing to a (FILE *) for which I've done a
_setmode - and then close it. But that's horribly inelegant and more work
:-(
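
A rough sketch of that fallback, under some assumptions of mine: rather than a wide-mode (FILE *), it encodes each record as UTF-8 by hand and appends the bytes with std::ofstream, so that it does not depend on the console or on any code page. The names `to_utf8` and `append_record` are hypothetical; the "user-name serial-number" line layout is the one described above.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Encode a wide string as UTF-8. UTF-16 surrogate pairs are combined,
// so this works whether wchar_t is 16 bits (Windows) or 32 bits.
std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        char32_t cp = static_cast<char32_t>(ws[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            char32_t lo = static_cast<char32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;                          // consumed the low surrogate
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Append one "user-name serial-number" line to the file as UTF-8.
// Returns false if the file could not be opened or written.
bool append_record(const std::string& path,
                   const std::wstring& name,
                   const std::wstring& serial)
{
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::app);
    if (!out) return false;
    out << to_utf8(name) << ' ' << to_utf8(serial) << '\n';
    return static_cast<bool>(out);
}
```

With this, the program would take the output file name as an extra argument instead of relying on `>>` redirection. UTF-8 is chosen here only because the php side is expected to use it; a UTF-16 file would work on the same principle.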

Dave