Is this a BIG bug in all VC++ versions? About Unicode CR/LF translation.

xmllmx

2010-11-13 11:08:22 UTC

4 lines of code are worth a thousand words.

wofstream fout("test.txt", ios::out|ios::trunc);
fout.imbue(locale(locale::classic(), new codecvt_utf16<wchar_t,
0x10FFFFUL, little_endian>)); // UTF-16LE
fout << L'\n';
fout.close();

// Compiled with VC++ 2010 that supports C++0x.

Please note that the file is opened in Unicode text mode. Because
fout.close() will automatically replace every LF with CR-LF,
test.txt should now hold 4 bytes: 0x0D, 0x00, 0x0A, 0x00.

However, test.txt should just hold 3 bytes which are 0x0D, 0x0A, 0x00.
Notepad will fail to recognize this file.

Is this really a BIG bug?

David Lowndes

2010-11-13 11:30:26 UTC

So, test.txt now should hold 4 bytes which are 0x0D, 0x00, 0x0A, 0x00.
However, test.txt should just hold 3 bytes which are 0x0D, 0x0A, 0x00.

Which one is it that you want?

Notepad will fail to recognize this file.

Possibly because you've not written a BOM (Byte Order Mark)?

Dave

xmllmx

2010-11-13 12:30:26 UTC

Post by David Lowndes

So,test.txt now should hold 4 bytes which are 0x0D, 0x00, 0x0A, 0x00.
However, test.txt should just hold 3 bytes which are 0x0D, 0x0A, 0x00.

Which one is it that you want?

Notepad will fail to recognize this file.

Possibly because you've not written a BOM (Byte Order Mark)?
Dave

I want 4 bytes 0x0D, 0x00, 0x0A, 0x00 rather than 3 bytes 0x0D, 0x0A,
0x00.

The BOM is unrelated to this issue.

xmllmx

2010-11-13 12:35:43 UTC

Post by xmllmx

Post by David Lowndes

So,test.txt now should hold 4 bytes which are 0x0D, 0x00, 0x0A, 0x00.
However, test.txt should just hold 3 bytes which are 0x0D, 0x0A, 0x00.

Which one is it that you want?

Notepad will fail to recognize this file.

Possibly because you've not written a BOM (Byte Order Mark)?
Dave

I want 4 bytes 0x0D, 0x00, 0x0A, 0x00 rather than 3 bytes 0x0D, 0x0A,
0x00.
The BOM is unrelated to this issue.

In UTF-16, LF is encoded as 0x000A and CR as 0x000D. So each LF
(0x000A) should become 0x000D, 0x000A once the runtime library has
replaced every LF with CR-LF.

David Lowndes

2010-11-13 23:50:03 UTC

Post by xmllmx
I want 4 bytes 0x0D, 0x00, 0x0A, 0x00 rather than 3 bytes 0x0D, 0x0A,
0x00.

OK, it wasn't clear to me what you were saying in your original text
as you said "However, test.txt should just hold 3 bytes which are
0x0D, 0x0A, 0x00.", but presumably you meant "... only has 3 bytes..".

I would agree, this looks like a bug to me - but I don't pretend to
understand the mysteries of the stream IO libraries.

I suggest that you submit a bug report on the MS connect site:
https://connect.microsoft.com/VisualStudio

Please post a link to your bug report back here.

Dave

Ulrich Eckhardt

2010-11-15 08:55:07 UTC

Post by xmllmx
wofstream fout("test.txt", ios::out|ios::trunc);
fout.imbue(locale(locale::classic(), new codecvt_utf16<wchar_t,
0x10FFFFUL, little_endian>)); // UTF-16LE
fout << L'\n';
fout.close();

[...]

Post by xmllmx
Please note that the file is opened in unicode text mode. Because
fout.close() will automatically replace every LF with CR-LF, So,
test.txt now should hold 4 bytes which are 0x0D, 0x00, 0x0A, 0x00.
However, test.txt should just hold 3 bytes which are 0x0D, 0x0A, 0x00.
Notepad will fail to recognize this file.
Is this really a BIG bug?

I'm not sure. The point is that the default is to open files in text mode,
and that means on MS Windows that '\n' is replaced with '\r\n'. This is
exactly what you got, even though it is not what you wanted, but computers
don't guess what you want.

I'd suggest that you explicitly write what you want:

// note that out and trunc are implied
wofstream out("test.utf-16", ios_base::binary);
// UTF-16LE
out.imbue(locale(locale::classic(),
new codecvt_utf16<wchar_t,
0x10FFFFUL,
little_endian>));
out << L"\r\n";
out.close();

As a side effect, your code becomes a bit more portable. Alternatively, you
could use UTF-8 instead of UTF-16.

Uli

--
C++ FAQ: http://parashift.com/c++-faq-lite

Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

xmllmx

2010-11-15 15:12:38 UTC

Post by Ulrich Eckhardt

[...]

I'm not sure. The point is that the default is to open files in text mode,
and that means on MS Windows that '\n' is replaced with '\r\n'. This is
exactly what you got, even though it is not what you wanted, but computers
don't guess what you want.
// note that out and trunc are implied
wofstream out("test.utf-16", ios_base::binary);
// UTF-16LE
out.imbue(locale(locale::classic(),
new codecvt_utf16<wchar_t,
0x10FFFFUL,
little_endian>));
out << L"\r\n";
out.close();
As a side effect, your code becomes a bit more portable. Alternatively, you
could use UTF-8 instead of UTF-16.
Uli

Your solution works nicely for fstream, but it doesn't work for wcout.