Discussion:
Is this a BIG bug in all VC++ versions? About Unicode CR/LF translation.
xmllmx
2010-11-13 11:08:22 UTC
4 lines of code are worth a thousand words.

wofstream fout("test.txt", ios::out|ios::trunc);
fout.imbue(locale(locale::classic(), new codecvt_utf16<wchar_t,
0x10FFFFUL, little_endian>)); // UTF-16LE
fout << L'\n';
fout.close();

// Compiled with VC++ 2010 that supports C++0x.

Please note that the file is opened in Unicode text mode. Because
every LF is automatically replaced with CR-LF when the stream is
written out, test.txt should now hold 4 bytes: 0x0D, 0x00, 0x0A, 0x00.

However, test.txt should just hold 3 bytes which are 0x0D, 0x0A, 0x00.
Notepad will fail to recognize this file.

Is this really a BIG bug?
David Lowndes
2010-11-13 11:30:26 UTC
Post by xmllmx
So, test.txt now should hold 4 bytes which are 0x0D, 0x00, 0x0A, 0x00.
However, test.txt should just hold 3 bytes which are 0x0D, 0x0A, 0x00.
Which one is it that you want?
Post by xmllmx
Notepad will fail to recognize this file.
Possibly because you've not written a BOM (Byte Order Mark)?

Dave
xmllmx
2010-11-13 12:30:26 UTC
I want 4 bytes 0x0D, 0x00, 0x0A, 0x00 rather than 3 bytes 0x0D, 0x0A,
0x00.

The BOM is unrelated to this issue.
xmllmx
2010-11-13 12:35:43 UTC
In UTF-16, LF is encoded as 0x000A and CR is encoded as 0x000D. So each
LF (0x000A) should become 0x000D, 0x000A once the runtime library has
replaced every LF with CR-LF.
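
To make the expected byte layout concrete, here is a minimal sketch that serialises UTF-16 code units as little-endian bytes. The helper name to_utf16le_bytes is hypothetical and only serves this illustration; it is not part of the standard library or of the code posted above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Serialise UTF-16 code units as little-endian bytes (low byte first).
// Hypothetical helper, written only to illustrate the expected layout.
std::vector<std::uint8_t> to_utf16le_bytes(const std::u16string& units)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t unit : units) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));        // low byte
        bytes.push_back(static_cast<std::uint8_t>((unit >> 8) & 0xFF)); // high byte
    }
    return bytes;
}
```

Passing u"\r\n" to this helper gives the four bytes 0x0D, 0x00, 0x0A, 0x00, which is the content a correctly translated UTF-16LE file would need.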
David Lowndes
2010-11-13 23:50:03 UTC
Post by xmllmx
I want 4 bytes 0x0D, 0x00, 0x0A, 0x00 rather than 3 bytes 0x0D, 0x0A,
0x00.
OK, it wasn't clear to me what you were saying in your original text
as you said "However, test.txt should just hold 3 bytes which are
0x0D, 0x0A, 0x00.", but presumably you meant "... only has 3 bytes..".

I would agree, this looks like a bug to me - but I don't pretend to
understand the mysteries of the stream IO libraries.

I suggest that you submit a bug report on the MS connect site:
https://connect.microsoft.com/VisualStudio

Please post a link to your bug report back here.

Dave
Ulrich Eckhardt
2010-11-15 08:55:07 UTC
Post by xmllmx
wofstream fout("test.txt", ios::out|ios::trunc);
fout.imbue(locale(locale::classic(), new codecvt_utf16<wchar_t,
0x10FFFFUL, little_endian>)); // UTF-16LE
fout << L'\n';
fout.close();
[...]
Post by xmllmx
Please note that the file is opened in unicode text mode. Because
fout.close() will automatically replace every LF with CR-LF, So,
test.txt now should hold 4 bytes which are 0x0D, 0x00, 0x0A, 0x00.
However, test.txt should just hold 3 bytes which are 0x0D, 0x0A, 0x00.
Notepad will fail to recognize this file.
Is this really a BIG bug?
I'm not sure. The point is that the default is to open files in text mode,
and that means on MS Windows that '\n' is replaced with '\r\n'. This is
exactly what you got, even though it is not what you wanted, but computers
don't guess what you want.
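
The 3-byte result is consistent with the newline translation being applied to the already encoded byte stream, one byte at a time. The sketch below models that assumption; translate_lf_bytewise is a hypothetical function that only mimics the observed behaviour and is not the actual runtime library code.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of text-mode output: every 0x0A byte gets a 0x0D
// byte inserted in front of it, with no knowledge of the text encoding.
std::vector<std::uint8_t> translate_lf_bytewise(const std::vector<std::uint8_t>& in)
{
    std::vector<std::uint8_t> out;
    out.reserve(in.size());
    for (std::uint8_t b : in) {
        if (b == 0x0A) {
            out.push_back(0x0D); // single CR byte, not a 16-bit CR
        }
        out.push_back(b);
    }
    return out;
}
```

Feeding it the UTF-16LE encoding of L'\n' (0x0A, 0x00) produces 0x0D, 0x0A, 0x00, the same three bytes reported in the original post.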

I'd suggest that you explicitly write what you want:

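#include <codecvt>
#include <fstream>
#include <locale>
using namespace std;

int main()
{
    // out and trunc are implied; binary suppresses the CR-LF translation
    wofstream out("test.utf-16", ios_base::binary);
    // UTF-16LE
    out.imbue(locale(locale::classic(),
                     new codecvt_utf16<wchar_t,
                                       0x10FFFFUL,
                                       little_endian>));
    out << L"\r\n";
    out.close();
}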

As a side effect, your code becomes a bit more portable. Alternatively, you
could use UTF-8 instead of UTF-16.
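
One reason UTF-8 is less fragile here, sketched under the assumption that the translation works on individual bytes: in UTF-8 the byte 0x0A only ever encodes U+000A, whereas in UTF-16LE it can also be half of another code unit (U+010A, for example, is stored as 0x0A, 0x01). The names utf8_encode and has_lf_byte below are hypothetical helpers for this illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-8 (illustrative; no validation
// of surrogates or out-of-range values).
std::vector<std::uint8_t> utf8_encode(char32_t cp)
{
    auto byte = [](char32_t v) { return static_cast<std::uint8_t>(v); };
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(byte(cp));
    } else if (cp < 0x800) {
        out.push_back(byte(0xC0 | (cp >> 6)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(byte(0xE0 | (cp >> 12)));
        out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(byte(0xF0 | (cp >> 18)));
        out.push_back(byte(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    }
    return out;
}

// True if any byte of the encoded sequence equals 0x0A.
bool has_lf_byte(const std::vector<std::uint8_t>& bytes)
{
    return std::find(bytes.begin(), bytes.end(), std::uint8_t{0x0A}) != bytes.end();
}
```

With these helpers, has_lf_byte(utf8_encode(U'\n')) is true, while has_lf_byte(utf8_encode(0x010A)) is false because U+010A encodes as 0xC4, 0x8A.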

Uli
--
C++ FAQ: http://parashift.com/c++-faq-lite

Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
xmllmx
2010-11-15 15:12:38 UTC
Your solution is nice for fstream, but doesn't work on wout.
