Discussion:
Writing unsigned char to std::ostream
David Wilkinson
2007-09-02 16:37:44 UTC
In my MFC project I have a piece of code like

void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm.put(unsigned char(0xEF)); // 1
ostrm.put(unsigned char(0xBB));
ostrm.put(unsigned char(0xBF));
ostrm << "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
}

where (if it matters) ostrm is opened in text mode. This code compiles
and (I think) runs as I intended on VC6-VC9.

However, I have been playing around with VMWare and Linux (Ubuntu) and
got it into my head to compile the non-GUI classes in my app using g++
(version 4.1.2). It won't compile.

I now see that my code was wrong (non-portable), because it
requires conversion of unsigned char to char, which is
implementation-defined when the value does not fit (I just learned).

But this is not why it fails to compile on g++. It fails due to the use
of unsigned char as a type in this context. Both of the following
compile correctly (not sure about running):

ostrm.put(unsigned(0xEF)); // 2
ostrm.put((unsigned char)0xEF); // 3

The same thing happens on Comeau (1 fails, 2 and 3 compile).

Questions:

1. Is this a parsing bug in g++/Comeau?

2. Do you think the following is correct and portable?

void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm << static_cast<unsigned char>(0xEFU);
ostrm << static_cast<unsigned char>(0xBBU);
ostrm << static_cast<unsigned char>(0xBFU);
ostrm << "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
}

Thanks.
--
David Wilkinson
Visual C++ MVP
David Webber
2007-09-02 17:00:01 UTC
...
ostrm.put(unsigned char(0xEF)); // 1
...
It fails due to the use of unsigned char as a type in this context. Both
ostrm.put(unsigned(0xEF)); // 2
ostrm.put((unsigned char)0xEF); // 3
My 2d-worth: I'd have used the last, or maybe even

ostrm.put( (unsigned char)(0xEF) );

just because it looks like stretching things to destruction to have a
function-style cast with an apparent "function name" containing a space.
1. Is this a parsing bug in g++/Comeau?
If it is, it's not one which surprises me!
2. Do you think the following is correct and portable?
void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm << static_cast<unsigned char>(0xEFU);
ostrm << static_cast<unsigned char>(0xBBU);
ostrm << static_cast<unsigned char>(0xBFU);
ostrm << "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
}
Not sure - but it looks ugly with all those <<<< :-)

My instinct for clarity and security would be

ostrm.put( (unsigned char)(0xEF) );

Alternatively, how about

typedef unsigned char __uint8;

ostrm.put( __uint8(0xEF) );

:-)

Dave
--
David Webber
Author of 'Mozart the Music Processor'
http://www.mozart.co.uk
For discussion/support see
http://www.mozart.co.uk/mzusers/mailinglist.htm
David Wilkinson
2007-09-02 17:13:28 UTC
Post by David Webber
...
ostrm.put(unsigned char(0xEF)); // 1
...
It fails due to the use of unsigned char as a type in this context.
ostrm.put(unsigned(0xEF)); // 2
ostrm.put((unsigned char)0xEF); // 3
My 2d-worth: I'd have used the last, or maybe even
ostrm.put( (unsigned char)(0xEF) );
just because it looks like stretching things to destruction to have a
function-style cast with an apparent "function name" containing a space.
1. Is this a parsing bug in g++/Comeau?
If it is, it's not one which surprises me!
2. Do you think the following is correct and portable?
void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm << static_cast<unsigned char>(0xEFU);
ostrm << static_cast<unsigned char>(0xBBU);
ostrm << static_cast<unsigned char>(0xBFU);
ostrm << "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
}
Not sure - but it looks ugly with all those <<<< :-)
My instinct for clarity and security would be
ostrm.put( (unsigned char)(0xEF) );
Alternatively, how about
typedef unsigned char __uint8;
ostrm.put( __uint8(0xEF) );
Dave:

I think the trouble with your suggestions (and my original code) is that
ostream::put() takes a char argument, and conversion from unsigned char
to char is undefined behavior.

My immediate instinct was to change the "function-style cast" to a
"C-style cast". This certainly compiles, but may not run as intended on
some systems (in VC it works I think).
--
David Wilkinson
Visual C++ MVP
Ulrich Eckhardt
2007-09-03 09:12:29 UTC
Post by David Wilkinson
Post by David Webber
...
ostrm.put(unsigned char(0xEF)); // 1
...
It fails due to the use of unsigned char as a type in this context.
ostrm.put(unsigned(0xEF)); // 2
ostrm.put((unsigned char)0xEF); // 3
My 2d-worth: I'd have used the last, or maybe even
ostrm.put( (unsigned char)(0xEF) );
just because it looks like stretching things to destruction to have a
function-style cast with an apparent "function name" containing a space.
1. Is this a parsing bug in g++/Comeau?
If it is, it's not one which surprises me!
My personal gut feeling is that Comeau and g++ are both correct and MSC is
jumping through some hoops for user convenience.
Post by David Wilkinson
Post by David Webber
2. Do you think the following is correct and portable?
void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm << static_cast<unsigned char>(0xEFU);
ostrm << static_cast<unsigned char>(0xBBU);
ostrm << static_cast<unsigned char>(0xBFU);
ostrm << "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
}
The static_cast is the same as the former function-style cast, but it works
correctly and portably even with the space in the type's name.

BTW:

unsigned char const utf8_bom[] = { 0xef, 0xbb, 0xbf, 0};
ostrm << utf8_bom << header << std::endl;

Surprisingly, there are in fact overloads for signed and unsigned char in
iostreams.
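
To illustrate the overloads just mentioned, here is a minimal sketch, assuming a standard-conforming library, showing that operator<< treats an unsigned char as a character and not as a number. The helper names AsCharacter and AsNumber are made up for this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: insert an unsigned char with operator<<.
// The character overload is selected, so the stream receives the character itself.
std::string AsCharacter(unsigned char uc)
{
    std::ostringstream oss;
    oss << uc;
    return oss.str();
}

// Hypothetical helper: convert to int first to get the numeric value as text.
std::string AsNumber(unsigned char uc)
{
    std::ostringstream oss;
    oss << static_cast<int>(uc);
    return oss.str();
}
```

So a byte such as 0xEF goes out as one character with the first form, while the second form would write the three digits "239".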
Post by David Wilkinson
Post by David Webber
My instinct for clarity and security would be
ostrm.put( (unsigned char)(0xEF) );
Well, that's the same as a static_cast, except that C-style casts are
frowned upon.
Post by David Wilkinson
Post by David Webber
Alternatively, how about
typedef unsigned char __uint8;
ostrm.put( __uint8(0xEF) );
This is really bad advice. Any identifier containing two consecutive
underscores is reserved for the implementation. You should never define such
names yourself, and you should always be careful when using existing ones,
because their meaning is typically non-portable.
Post by David Wilkinson
I think the trouble with your suggestions (and my original code) is that
ostream::put() takes a char argument, and conversion from unsigned char
to char is undefined behavior.
Wasn't that implementation-defined? Implementation-defined is something I'll
live with, but undefined is something I'd rather avoid...
Post by David Wilkinson
My immediate instinct was to change the "function-style cast" to a
"C-style cast".
Never use C-style casts in C++, they only serve to hide broken code and move
the error detection from compile-time to runtime.
Post by David Wilkinson
This certainly compiles, but may not run as intended on
some systems (in VC it works I think).
Well, that's what unit tests are for.

Uli
Giovanni Dicanio
2007-09-03 09:33:46 UTC
Post by Ulrich Eckhardt
Never use C-style casts in C++, they only serve to hide broken code and move
the error detection from compile-time to runtime.
Hi,

I think that we should prefer C++-style casts, as you wrote, but there are
(rare) cases when IMHO C-style casts could be OK, especially when working
with C APIs, e.g. consider the following code to add a string to a
list control (LVITEM needs an LPTSTR, and we have a CString as input):

http://www.codeproject.com/listctrl/listctrldemo.asp

<code>
LVITEM lvi;
CString strItem;
lvi.pszText = (LPTSTR)(LPCTSTR)(strItem);
</code>

What's the C++ version?
Is the following correct: ?

lvi.pszText = const_cast< LPTSTR >( static_cast<LPCTSTR>(strItem) );

...a bit more verbose.

OK, but "the beauty is in the eye of the beholder" :)

Giovanni
Ulrich Eckhardt
2007-09-03 10:47:13 UTC
Post by Giovanni Dicanio
Post by Ulrich Eckhardt
Never use C-style casts in C++, they only serve to hide broken code and move
the error detection from compile-time to runtime.
Hi,
I think that we should prefer C++-style casts, as you wrote, but there are
(rare) cases when IMHO C-style casts could be OK, especially when working
with C APIs[...]
The only really valid case I have found is when casting a void pointer to a
function pointer type, e.g. in the context of GetProcAddress().
Post by Giovanni Dicanio
, e.g. consider the following code to add a string to
http://www.codeproject.com/listctrl/listctrldemo.asp
<code>
LVITEM lvi;
CString strItem;
lvi.pszText = (LPTSTR)(LPCTSTR)(strItem);
</code>
What's the C++ version?
// Note: I'm not sure about the method name
lvi.pszText = strItem.GetBuffer();
Post by Giovanni Dicanio
Is the following correct: ?
lvi.pszText = const_cast< LPTSTR >( static_cast<LPCTSTR>(strItem) );
...a bit more verbose.
OK, but "the beauty is in the eye of the beholder" :)
I think you could have written 'const_cast<LPTSTR>(strItem.operator
LPCTSTR())'. Anyway, isn't there a CListBox class that gives a seamless
integration with CString? In such a class, I would even find it okay to use
a slightly more verbose syntax, as long as it works correctly and the
syntax for the (high-level) code built on it is improved.

This is not a black/white thing though, it's definitely an area that is
gray.

regards

Uli
David Wilkinson
2007-09-03 10:51:47 UTC
Post by Ulrich Eckhardt
My personal gut feeling is that Comeau and g++ are both correct and MSC is
jumping through some hoops for user convenience.
unsigned char const utf8_bom[] = { 0xef, 0xbb, 0xbf, 0};
ostrm << utf8_bom << header << std::endl;
Surprisingly, there are in fact overloads for signed and unsigned char in
iostreams.
Post by David Wilkinson
I think the trouble with your suggestions (and my original code) is that
ostream::put() takes a char argument, and conversion from unsigned char
to char is undefined behavior.
Wasn't that implementation-defined? Implementation-defined is something I'll
live with, but undefined is something I'd rather avoid...
Post by David Wilkinson
My immediate instinct was to change the "function-style cast" to a
"C-style cast".
Never use C-style casts in C++, they only serve to hide broken code and move
the error detection from compile-time to runtime.
Post by David Wilkinson
This certainly compiles, but may not run as intended on
some systems (in VC it works I think).
Well, that's what unit tests are for.
Hi Ulrich:

Thanks for the reply.

Yes, sorry, I meant that conversion of unsigned char to char is
implementation defined (not undefined). One of the most stupid and
dangerous features in the language, IMHO.

I think there are really two issues here:

1. The use of "constructor syntax" with the unsigned char type.

2. How does ostream deal with unsigned char?

I actually got into this due to Issue 1 (which is easily worked around).
But Issue 2 is much more troubling.

Issue 1
-------

On Comeau, if I write

char c = char(1);

it compiles. But if I write

unsigned char uc = unsigned char(1); // 1

it does not. The same thing happens for other "compound types" like long
int. However, if I do

typedef unsigned char unsigned_char;
unsigned_char uc = unsigned_char(1); // 2

then it works. VC allows both 1 and 2, which seems correct to me.
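
For reference, a small sketch of spellings that conforming compilers accept. The function-style form needs the type to be named by a single specifier or a typedef name, which is why the two-word "unsigned char" is rejected. The alias byte_type and the function names are hypothetical.

```cpp
#include <cassert>

typedef unsigned char byte_type;  // hypothetical alias, not from the original code

// Function-style cast through a typedef name.
inline unsigned char ViaTypedef() { return byte_type(0xEF); }

// static_cast spells the two-word type name directly.
inline unsigned char ViaStaticCast() { return static_cast<unsigned char>(0xEF); }

// C-style cast: also accepted, though generally discouraged in C++.
inline unsigned char ViaCStyleCast() { return (unsigned char)0xEF; }
```

All three produce the same unsigned char value. Only the spelling "unsigned char(0xEF)" is a problem.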

Issue 2
-------

ostream::put() can only take a char as argument, so using it to output
an unsigned char relies on an implementation-defined conversion. This is
why I switched to operator << (which is overloaded for unsigned char).
But what does it do?

In the VC source (in <ostream>) it just casts the unsigned char to char
(using a C-style cast), and then uses ostream::put(). This works (for me)
on VC because 0xFF gives -1 when converted to char. But on some systems
(apparently) 0xFF converts to 127, which seems just crazy to me.

Getting back to my original problem, it seems that I should use your idea

unsigned char const utf8_bom[] = { 0xef, 0xbb, 0xbf, 0};
ostrm << utf8_bom << header << std::endl;

which works because the ostream implementation just casts the pointer to
const char*, not the bits of the individual characters. Thanks for this.

It seems that the ostream overload should be defined as

ostream& operator << (ostream& ostrm, unsigned char uc)
{
const char* p = reinterpret_cast<const char*>(&uc);
return ostrm.put(*p);
}

but it is not.
--
David Wilkinson
Visual C++ MVP
Hendrik Schober
2007-09-03 14:00:17 UTC
Post by David Wilkinson
[...]
Issue 1
-------
On Comeau, if I write
char c = char(1);
it compiles. But if I write
unsigned char uc = unsigned char(1); // 1
it does not. The same thing happens for other "compound types" like long
int. However, if I do
typedef unsigned char unsigned_char;
unsigned_char uc = unsigned_char(1); // 2
then it works. VC allows both 1 and 2, which seems correct to me.
I'm on Ulrich's side on this. I didn't even know VC would
accept this. I used to keep forgetting to put parentheses
around compound type names and various compilers barked at
me. (Nowadays I use the new cast syntax.) Also, IME, if VC
and Comeau disagree on something, I have never, ever found
Comeau to be wrong, but always VC.
Post by David Wilkinson
Issue 2
-------
ostream::put() can only take a char as argument, so using it to output
an unsigned char relies on an implementation-defined conversion. This is
why I switched to operator << (which is overloaded for unsigned char).
But what does it do?
'ostream::put()' is for unformatted output, 'operator<<()'
for formatted output.
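
A minimal sketch of that difference, assuming a standard-conforming library. The function names are made up for illustration. Formatted insertion honours the stream's field width, while put() writes the one character as-is.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Formatted output: operator<< honours the field width and pads the character.
std::string FormattedInsert()
{
    std::ostringstream oss;
    oss << std::setw(3) << 'x';
    return oss.str();
}

// Unformatted output: put() writes the single character and ignores the width.
std::string UnformattedPut()
{
    std::ostringstream oss;
    oss.width(3);
    oss.put('x');
    return oss.str();
}
```

With the default fill and adjustment, the first function yields two spaces followed by 'x', and the second yields just 'x'.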
Post by David Wilkinson
In the VC source (in <ostream>) it just casts the unsigned char to char
(using a C-style cast), and then uses ostream::put() This works (for me)
on VC because 0xFF gives -1 when converted to char. But on some systems
(apparently) 0xFF converts to 127, which seems just crazy to me.
Whether 'char' is signed or unsigned is implementation-
defined.
Post by David Wilkinson
[...]
Schobi
--
***@gmx.de is never read
I'm HSchober at gmx dot de
"A patched buffer overflow doesn't mean that there's one less way attackers
can get into your system; it means that your design process was so lousy
that it permitted buffer overflows, and there are probably thousands more
lurking in your code."
Bruce Schneier
David Wilkinson
2007-09-03 14:44:42 UTC
Post by Hendrik Schober
Post by David Wilkinson
[...]
Issue 1
-------
On Comeau, if I write
char c = char(1);
it compiles. But if I write
unsigned char uc = unsigned char(1); // 1
it does not. The same thing happens for other "compound types" like long
int. However, if I do
typedef unsigned char unsigned_char;
unsigned_char uc = unsigned_char(1); // 2
then it works. VC allows both 1 and 2, which seems correct to me.
I'm on Ulrich's side on this. I didn't even know VC would
accept this. I used to keep forgetting to put parentheses
around compound type names and various compilers barked at
me. (Nowadays I use the new cast syntax.) Also, IME, if VC
and Comeau disagree on something, I have never, ever found
Comeau to be wrong, but always VC.
Post by David Wilkinson
Issue 2
-------
ostream::put() can only take a char as argument, so using it to output
an unsigned char relies on an implementation-defined conversion. This is
why I switched to operator << (which is overloaded for unsigned char).
But what does it do?
'ostream::put()' is for unformatted output, 'operator<<()'
for formatted output.
Post by David Wilkinson
In the VC source (in <ostream>) it just casts the unsigned char to char
(using a C-style cast), and then uses ostream::put() This works (for me)
on VC because 0xFF gives -1 when converted to char. But on some systems
(apparently) 0xFF converts to 127, which seems just crazy to me.
Whether 'char' is signed or unsigned is implementation-
defined.
Hi Schobi:

I guess I didn't know that these compound names behaved differently. I
thought "unsigned char" was a single token. I thought I was "doing good"
by using constructor syntax rather than C-cast syntax, but I guess not.
As you say, static_cast is the way to go.

We are not talking about whether plain char is signed or not. We are
talking about how unsigned values are converted to signed ones when they
are "too big".
--
David Wilkinson
Visual C++ MVP
Hendrik Schober
2007-09-03 14:51:42 UTC
Post by David Wilkinson
[...]
Post by Hendrik Schober
Post by David Wilkinson
In the VC source (in <ostream>) it just casts the unsigned char to char
(using a C-style cast), and then uses ostream::put() This works (for me)
on VC because 0xFF gives -1 when converted to char. But on some systems
(apparently) 0xFF converts to 127, which seems just crazy to me.
Whether 'char' is signed or unsigned is implementation-
defined.
I guess I didn't know that these compound names behaved differently. I
thought "unsigned char" was a single token. [...]
I don't think there's a token containing whitespace
characters.
Post by David Wilkinson
We are not talking about whether plain char is signed or not. We are
talking about how unsigned values are converted to signed ones when they
are "too big".
I thought you were talking about converting 'unsigned
char' to 'char'. Might be my fault.

Schobi
David Wilkinson
2007-09-03 15:08:41 UTC
Post by Hendrik Schober
I thought you were talking about converting 'unsigned
char' to 'char'. Might be my fault.
Schobi:

Yes we are. But the ambiguity is not just whether char is signed or not
(compiler issue), but how unsigned char is converted to signed char
(which I think is driven by hardware).
--
David Wilkinson
Visual C++ MVP
Hendrik Schober
2007-09-03 15:14:27 UTC
Post by David Wilkinson
Post by Hendrik Schober
I thought you were talking about converting 'unsigned
char' to 'char'. Might be my fault.
Yes we are.
OK.
Post by David Wilkinson
But the ambiguity is not just whether char is signed or not
(compiler issue), but how unsigned char is converted to signed char
(which I think is driven by hardware).
Wait. I thought you were talking about conversion
from 'unsigned char' to 'char'. Where did 'signed
char' come into this?

Schobi
David Wilkinson
2007-09-03 15:31:10 UTC
Post by Hendrik Schober
Post by David Wilkinson
Post by Hendrik Schober
I thought you were talking about converting 'unsigned
char' to 'char'. Might be my fault.
Yes we are.
OK.
Post by David Wilkinson
But the ambiguity is not just whether char is signed or not
(compiler issue), but how unsigned char is converted to signed char
(which I think is driven by hardware).
Wait. I thought you were talking about conversion
from 'unsigned char' to 'char'. Where did 'signed
char' come into this?
Schobi:

I know that char and signed char are separate types, but when plain char
is signed, the issue of converting unsigned char to char is the same as
with converting unsigned char to signed char (not all machines do it the
same).
--
David Wilkinson
Visual C++ MVP
David Webber
2007-09-03 09:22:04 UTC
Post by David Wilkinson
I think the trouble with your suggestions (and my original code) is that
ostream::put() takes a char argument,
Ah, I had missed that.
Post by David Wilkinson
and conversion from unsigned char to char is undefined behavior.
Is it? I thought the usual wrap around with 0xFF <-> -1 etc was pretty
much prescribed (assuming char is 8 bits)? But...

I see my (ancient edition of) Stroustrup says

"When an integer is converted to an unsigned type, the value is the least
unsigned integer congruent to the signed integer (modulo 2^n where n is the
number of bits used to represent the unsigned type). In a two's complement
representation this conversion is conceptual and there is no change in the
bit pattern.

"When an integer is converted to a signed type, the value is unchanged if it
can be represented in the new type; otherwise the value is implementation
dependent."

Elsewhere he says that "char", "signed char", and "unsigned char" are three
separate types which must be the same size.

So it LOOKS to me as if -1 -> 0xFF always, but casting the other way does
*not* imply 0xFF -> -1. Thus

unsigned char a = 0xFF;
char b = (char)a; // Implementation dependent
unsigned char c = (unsigned char)b;

is allowed to give c!=0xFF !!!!!!

In practice are there machines on which this happens? If so, I think it
will be a lot more than your cast which is broken!
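
The round trip in question can be written down as a small sketch. The function names are hypothetical. Under the rules quoted from Stroustrup, only the middle step is implementation-dependent. On the usual two's complement implementations the value comes back unchanged, and C++20 has since made that behaviour mandatory.

```cpp
#include <cassert>
#include <climits>

// Signed to unsigned is fully specified: the result is reduced modulo 2^n.
inline unsigned char FromMinusOne() { return static_cast<unsigned char>(-1); }

// unsigned char -> char -> unsigned char.
// The first conversion is the implementation-defined one (in C++98/03)
// when plain char is signed and the value does not fit.
inline unsigned char RoundTrip(unsigned char a)
{
    char b = static_cast<char>(a);
    return static_cast<unsigned char>(b);
}
```

On a typical two's complement platform, RoundTrip(0xFF) gives 0xFF again, and FromMinusOne() gives UCHAR_MAX on every implementation.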

BTW if "char", "signed char", "unsigned char" are 3 different types, are
there systems in which "char" and "unsigned char" are the same (but "signed
char" is different)? If so should we all be writing "signed char" if we
definitely want a signed variable?

Dave
Ulrich Eckhardt
2007-09-03 10:40:41 UTC
Post by David Webber
So it LOOKS to me as if -1 -> 0xFF always, but casting the other way does
*not* imply 0xFF -> -1. Thus
unsigned char a = 0xFF;
char b = (char)a; // Implementation dependent
unsigned char c = (unsigned char)b;
is allowed to give c!=0xFF !!!!!!
In practice are there machines on which this happens? If so, I think it
will be a lot more than your cast which is broken!
IIRC, there are machines that simply don't use two's complement but a
sign/magnitude representation for integers. On those machines, the
value 'minus zero' could then represent a so-called trap value, i.e. an
invalid value for an integer.
Post by David Webber
BTW if "char", "signed char", "unsigned char" are 3 different types, are
there systems in which "char" and "unsigned char" are the same (but
"signed char" is different)?
Yes, it is even pretty common. In fact most compilers allow you to switch
char between signed and unsigned, probably to accommodate for broken code
that assumes either variant.
Post by David Webber
If so should we all be writing "signed char" if we definitely want
a signed variable?
Yes, definitely! I'd be tempted to use a typedef though, in particular when
I'm referring to either octets or bytes.
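
A short sketch of both points, with made-up names. The signedness of plain char can be queried through std::numeric_limits, and the three character types can be overloaded on separately, which shows they are distinct types even when two of them share a range. The typedef int8_type is only an illustration of the alias idea.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// True when plain char behaves as a signed type on this implementation.
inline bool PlainCharIsSigned() { return std::numeric_limits<char>::is_signed; }

// Three separate overloads compile side by side because the three
// character types are distinct types.
inline int Which(char)          { return 0; }
inline int Which(signed char)   { return 1; }
inline int Which(unsigned char) { return 2; }

// Hypothetical alias for code that needs a small integer that is signed everywhere.
typedef signed char int8_type;
```

Code that needs a definitely signed byte would use the alias (or signed char directly) and not rely on the compiler's default for plain char.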

Uli
David Webber
2007-09-03 12:13:57 UTC
Post by Ulrich Eckhardt
Post by David Webber
BTW if "char", "signed char", "unsigned char" are 3 different types, are
there systems in which "char" and "unsigned char" are the same (but
"signed char" is different)?
Yes, it is even pretty common. In fact most compilers allow you to switch
char between signed and unsigned, probably to accommodate for broken code
that assumes either variant.
Thanks Ulrich. I must admit to wondering a long time ago why char was
signed - given that it seems reasonable to want to use all 256 values for
characters (even if the original ASCII set was only concerned with 128).
Post by Ulrich Eckhardt
Post by David Webber
If so should we all be writing "signed char" if we definitely want
a signed variable?
Yes, definitely! I'd be tempted to use a typedef though, in particular when
I'm referring to either octets or bytes.
I wonder how many do :-) Still I suppose doing 1-byte integer arithmetic
is not that common.

Dave
Christof Meerwald
2007-09-03 13:36:29 UTC
Post by David Wilkinson
In my MFC project I have a piece of code like
void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm.put(unsigned char(0xEF)); // 1
ostrm.put(unsigned char(0xBB));
ostrm.put(unsigned char(0xBF));
Why don't you use:

ostrm << '\xef' << '\xbb' << '\xbf';

or even

ostrm << "\xef\xbb\xbf";


Christof
--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org
David Wilkinson
2007-09-03 14:50:21 UTC
Post by Christof Meerwald
Post by David Wilkinson
In my MFC project I have a piece of code like
void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm.put(unsigned char(0xEF)); // 1
ostrm.put(unsigned char(0xBB));
ostrm.put(unsigned char(0xBF));
ostrm << '\xef' << '\xbb' << '\xbf';
or even
ostrm << "\xef\xbb\xbf";
Christof:

Because it seems it is not guaranteed to work.
--
David Wilkinson
Visual C++ MVP
Giovanni Dicanio
2007-09-03 21:04:03 UTC
Post by David Wilkinson
Post by Christof Meerwald
ostrm << '\xef' << '\xbb' << '\xbf';
or even
ostrm << "\xef\xbb\xbf";
Because it seems it is not guaranteed to work.
David: it seems just fine (both compile and run) on both Windows and Kubuntu
(g++) ...

I don't understand the problem...

Giovanni
David Wilkinson
2007-09-03 21:54:06 UTC
Post by Giovanni Dicanio
Post by David Wilkinson
Post by Christof Meerwald
ostrm << '\xef' << '\xbb' << '\xbf';
or even
ostrm << "\xef\xbb\xbf";
Because it seems it is not guaranteed to work.
David: it seems just fine (both compile and run) on both Windows and Kubuntu
(g++) ...
I don't understand the problem...
Giovanni:

I don't think it is a compiler issue so much as a hardware issue. Both
Windows and Kubuntu run on Intel platform, which has "sane" behavior.

But there are (apparently) systems where

char c = 0xFF;

creates a value that tests equal to 127. On such systems, it seems to
me, the BOM will not be written correctly. But Uli's method

unsigned char const utf8_bom[] = { 0xef, 0xbb, 0xbf, 0};
ostrm << utf8_bom << header << std::endl;

will work, because the bits in the individual characters are not changed
by the implementation of ostream::operator <<() for const unsigned char*
(which just casts to const char*).
--
David Wilkinson
Visual C++ MVP
Giovanni Dicanio
2007-09-03 22:40:30 UTC
Post by David Wilkinson
I don't think it is a compiler issue so much as a hardware issue. Both
Windows and Kubuntu run on Intel platform, which has "sane" behavior.
But there are (apparently) systems where
char c = 0xFF;
creates a value that tests equal to 127. On such systems,
Thanks, I understand now :)

G.
David Wilkinson
2007-09-04 11:51:02 UTC
Post by Giovanni Dicanio
Post by David Wilkinson
I don't think it is a compiler issue so much as a hardware issue. Both
Windows and Kubuntu run on Intel platform, which has "sane" behavior.
But there are (apparently) systems where
char c = 0xFF;
creates a value that tests equal to 127. On such systems,
Thanks, I understand now :)
Hi Giovanni:

See my second response to Christof. The right way to do this, as he
points out, is with hexadecimal character constants, not hexadecimal
(integer) constants.
--
David Wilkinson
Visual C++ MVP
David Wilkinson
2007-09-04 11:47:08 UTC
Post by David Wilkinson
Post by Christof Meerwald
Post by David Wilkinson
In my MFC project I have a piece of code like
void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm.put(unsigned char(0xEF)); // 1
ostrm.put(unsigned char(0xBB));
ostrm.put(unsigned char(0xBF));
ostrm << '\xef' << '\xbb' << '\xbf';
or even
ostrm << "\xef\xbb\xbf";
Because it seems it is not guaranteed to work.
Christof:

My apologies. I missed the crucial distinction between "hexadecimal
constants" and "hexadecimal character constants".

I'm sure the history of this fiasco is that originally I tried

ostrm.put(0xEF);

and got integer truncation warnings. This caused me to do

ostrm.put(unsigned char(0xEF));

A big, BIG, mistake, which caused me to get involved with compound type
name issues and integer conversion issues. What I should have done was

ostrm.put('\xEF');

Nothing wrong with this, I think, but for a text file perhaps operator
<<() would be better, as you suggest.

I always knew that if I was taken out of my ASCII/8 bit char/2's
complement-centric world then things might get ugly, and I was right!

Thanks again. Hexadecimal character constants. Got it now.
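
Putting that conclusion together, a minimal sketch of the header writer using hexadecimal escapes in a string literal. WriteXmlHeader is a hypothetical free-function stand-in for XMLHelper::WriteHeader, and HeaderBytes exists only so the result can be inspected. It assumes an 8-bit char, as everywhere else in this thread.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical free-function version of XMLHelper::WriteHeader.
void WriteXmlHeader(std::ostream& ostrm)
{
    // UTF-8 byte order mark written as hexadecimal escapes: no casts,
    // no integer-to-char conversions at the call site.
    ostrm << "\xEF\xBB\xBF";
    ostrm << "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
}

// Helper for inspection: returns everything the function writes.
std::string HeaderBytes()
{
    std::ostringstream oss;
    WriteXmlHeader(oss);
    return oss.str();
}
```

Keeping the BOM in its own literal also avoids any risk of a following hex digit being absorbed into the last escape sequence.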
--
David Wilkinson
Visual C++ MVP
Ben Voigt [C++ MVP]
2007-09-04 22:08:36 UTC
Post by David Wilkinson
Post by David Wilkinson
Post by Christof Meerwald
Post by David Wilkinson
In my MFC project I have a piece of code like
void XMLHelper::WriteHeader(std::ostream& ostrm)
{
ostrm.put(unsigned char(0xEF)); // 1
ostrm.put(unsigned char(0xBB));
ostrm.put(unsigned char(0xBF));
ostrm << '\xef' << '\xbb' << '\xbf';
or even
ostrm << "\xef\xbb\xbf";
Because it seems it is not guaranteed to work.
My apologies. I missed the crucial distinction between "hexadecimal
constants" and "hexadecimal character constants".
I'm sure the history of this fiasco is that originally I tried
ostrm.put(0xEF);
and got integer truncation warnings. This caused me to do
ostrm.put(unsigned char(0xEF));
A big, BIG, mistake, which caused me to get involved with compound type
name issues and integer conversion issues. What I should have done was
ostrm.put('\xEF');
Nothing wrong with this, I think, but for a text file perhaps operator
<<() would be better, as you suggest.
I think put() is better than operator<< for this, because operator<<, when
passed a non-ASCII character such as you have, might do multibyte encoding
in some environments!
Post by David Wilkinson
I always knew that if I was taken out of my ASCII/8 bit char/2's
complement-centric world then things might get ugly, and I was right!
Thanks again. Hexadecimal character constants. Got it now.
--
David Wilkinson
Visual C++ MVP
Ulrich Eckhardt
2007-09-05 07:03:13 UTC
Post by Ben Voigt [C++ MVP]
Post by David Wilkinson
ostrm.put('\xEF');
Nothing wrong with this, I think, but for a text file perhaps operator
<<() would be better, as you suggest.
I think put() is better than operator<< for this, because operator<<, when
passed a non-ASCII character such as you have, might do multibyte encoding
in some environments!
Huh? Transcoding between the internally and externally used character sets is
done by the code conversion facet (codecvt) of the file buffer, not by the
iostreams themselves. Or is it something else you're referring to?

Uli
Ben Voigt [C++ MVP]
2007-09-05 18:20:51 UTC
Post by Ulrich Eckhardt
Post by Ben Voigt [C++ MVP]
Post by David Wilkinson
ostrm.put('\xEF');
Nothing wrong with this, I think, but for a text file perhaps operator
<<() would be better, as you suggest.
I think put() is better than operator<< for this, because operator<<, when
passed a non-ASCII character such as you have, might do multibyte encoding
in some environments!
Huh? Transcoding between the internally and externally used character sets is
done by the code conversion facet (codecvt) of the file buffer, not by the
iostreams themselves. Or is it something else you're referring to?
OK, never mind then.
Post by Ulrich Eckhardt
Uli