CString::Replace corrupts Unicode strings

Discussion:

dududuil

2010-04-28 09:24:01 UTC

Hi,
I have an application that manipulates strings.
It's compiled without _UNICODE.

I'm trying to do:
mySrt.Remove('\r');

My problem is that although all \r are removed correctly, some characters
are being changed as well!
I don't read JP, but it's clear that some letters have been changed after this
\r removal.

What is going on here? Please help.

Tom Serface

2010-04-28 11:43:05 UTC

If you are doing Unicode you should use _T('\r') to make it a TCHAR. You
could also use L'\r' to force Unicode.

Tom

Post by dududuil
Hi,
I have an application that manipulates strings.
It's compiled without _UNICODE.
mySrt.Remove('\r');
My problem is that although all \r are removed correctly, some characters
are being changed as well!
I don't read JP, but it's clear that some letters have been changed after this
\r removal.
What is going on here? Please help.

dududuil

2010-04-28 12:35:01 UTC

My application is compiled without _UNICODE, so _T() is ignored, and L'\r'
will be narrowed back to '\r' when calling Remove.

Tom Serface

2010-04-28 13:43:16 UTC

Ah... then perhaps the problem is that the string you are modifying really
is a Unicode string (which CString can hold even in a non-Unicode build).
Are you reading the string from a file that might actually be Unicode or
UTF-8 or some other encoding? If so you may still have to use CStringW and
still use the L'\r' method for the character.

Tom

Post by dududuil
My application is compiled without _UNICODE, so _T() is ignored, and
L'\r'
will be narrowed back to '\r' when calling Remove.

Ulrich Eckhardt

2010-04-28 12:02:55 UTC

Post by dududuil
I have an application that manipulates strings.
It's compiled without _UNICODE.
mySrt.Remove('\r');
My problem is that although all \r are removed correctly, some characters
are being changed as well!

How about an example?

Post by dududuil
I don't read JP, but it's clear that some letters have been changed after
this \r removal.

What is JP?

Post by dududuil
What is going on here?

What does all that have to do with Unicode strings?

Uli

--
C++ FAQ: http://parashift.com/c++-faq-lite

Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

dududuil

2010-04-28 12:56:05 UTC

My application reads from a file and puts the text in a CString.
On a JP (Japanese) machine, the file might contain Unicode characters.

Although my application isn't compiled with _UNICODE, CString still supports
Unicode characters, and I can do the simple task of replacing a certain string.

After this Replace, I print the string back to the file and notice that some
characters are replaced with other characters.

This is my problem.

Jochen Kalmbach [MVP]

2010-04-28 13:04:28 UTC

Hi dududuil!

Post by dududuil
Although my application isn't compiled with _UNICODE, CString still supports
Unicode characters,

No, it does not! It only supports it if you compile with _UNICODE!
In the current setting it is compiled as ANSI or MBCS, so it does
not know anything about Unicode.

Post by dududuil
and I can do the simple task of replacing a certain string.

No, you can't, because it will replace the "character" on a per-"byte"
basis, which has the effect you see (it corrupts the string).
This is because CString does not know about the encoding of your string,
and it assumes ASCII if you have not set the thread locale!
If you set the thread locale, it will correctly replace the characters!
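
To illustrate the per-byte behaviour described above, here is a minimal portable C++ sketch. It does not use CString; the helper names (is_sjis_lead_byte, remove_bytewise, remove_sjis_aware) are hypothetical, and it assumes the text is Shift-JIS (code page 932), which the thread does not confirm. In that encoding a trail byte can equal an ASCII byte such as '\' (0x5C), so a byte-wise removal can cut a two-byte character in half. Shift-JIS trail bytes lie in 0x40-0x7E and 0x80-0xFC, so '\r' (0x0D) itself would not be hit this way; the sketch only shows the general failure mode.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: the text is Shift-JIS (code page 932), where a lead byte in
// 0x81-0x9F or 0xE0-0xFC is followed by exactly one trail byte.
inline bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive removal: drops every byte equal to `target`, trail bytes included.
inline std::string remove_bytewise(const std::string& in, char target) {
    std::string out;
    for (char c : in) {
        if (c != target) out.push_back(c);
    }
    return out;
}

// Lead-byte-aware removal: copies each two-byte character as a unit, so only
// standalone single-byte occurrences of `target` are removed.
inline std::string remove_sjis_aware(const std::string& in, char target) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (is_sjis_lead_byte(b) && i + 1 < in.size()) {
            out.push_back(in[i]);    // lead byte
            out.push_back(in[++i]);  // trail byte, copied unconditionally
        } else if (in[i] != target) {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

For example, given the bytes 0x83 0x5C (the katakana ソ in Shift-JIS) followed by "a\b", remove_bytewise(s, '\\') leaves a stray 0x83 lead byte, while remove_sjis_aware(s, '\\') keeps the pair intact and removes only the standalone backslash.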

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/

Jochen Kalmbach [MVP]

2010-04-28 12:40:55 UTC

Hi dududuil!

Post by dududuil
mySrt.Remove('\r');
My problem is that although all \r are removed correctly, some characters
are being changed as well!
I don't read JP, but it's clear that some letters have been changed after this
\r removal.
What is going on here? Please help.

I think your encoding does not match your settings / current locale.
Maybe you are using UTF-8 in the string but the CRT/MFC locale is "C". So
the UTF-8 multibyte characters will also be treated as "normal" chars, and
therefore it will remove any '\r' (0x0d) in the multibyte character.

--
Greetings
Jochen

Ulrich Eckhardt

2010-04-28 14:03:54 UTC

Post by Jochen Kalmbach [MVP]
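Maybe you are using UTF-8 in the string but the CRT/MFC locale is "C". So
the UTF-8 multibyte characters will also be treated as "normal" chars, and
therefore it will remove any '\r' (0x0d) in the multibyte character.

The nice thing about UTF-8 is that no ASCII byte will ever have a different
meaning than the one it has for ASCII. All bytes of a multibyte character
have their bit 7 set, so they are outside the ASCII range.
For that reason I think we can rule out UTF-8, otherwise it should work. ;)

Uli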

Tom Serface

2010-04-28 17:45:55 UTC

I wish that CString had better built-in handling for in-memory UTF-8. As it
is, I typically only use UTF-8 for files and convert to Unicode for memory
use. It takes more memory, but makes it much easier to interact with other
SDKs.

Tom

Post by Ulrich Eckhardt

The nice thing about UTF-8 is that no ASCII byte will ever have a different
meaning than the one it has for ASCII. All bytes of a multibyte character
have their bit 7 set, so they are outside the ASCII range.
For that reason I think we can rule out UTF-8, otherwise it should work. ;)
Uli
--
C++ FAQ: http://parashift.com/c++-faq-lite
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
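
The UTF-8 property quoted above can be checked with a small portable C++ sketch. The helper names (remove_byte, count_high_bit_bytes) are hypothetical and std::string stands in for a narrow CString; the point is only that every byte of a multibyte UTF-8 sequence has bit 7 set, so removing an ASCII byte such as '\r' cannot touch one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Removes every byte equal to `target` using the erase-remove idiom.
inline std::string remove_byte(std::string s, char target) {
    s.erase(std::remove(s.begin(), s.end(), target), s.end());
    return s;
}

// Counts bytes with bit 7 set, i.e. bytes that belong to multibyte
// UTF-8 sequences and therefore lie outside the ASCII range.
inline std::size_t count_high_bit_bytes(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (static_cast<unsigned char>(c) & 0x80) ++n;
    }
    return n;
}
```

With the UTF-8 bytes E6 97 A5 and E6 9C AC (the characters 日 and 本), each followed by "\r\n", remove_byte(s, '\r') drops only the two carriage returns and count_high_bit_bytes reports the same value before and after.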

Ulrich Eckhardt

2010-04-29 06:56:48 UTC

As it is I typically only use UTF-8 for files and convert to Unicode for
memory use.

UTF-8 _is_ Unicode, what do you mean?

Uli

Tim Roberts

2010-05-01 03:55:35 UTC

You have an MBCS string here. In MBCS, Japanese characters are represented
by two bytes. CString::Replace doesn't realize that.

So, you can try to set up your locale correctly for the strings you are
reading in, but the BETTER solution is to make your application Unicode.
Then all of these problems go away.

--
Tim Roberts, ***@probo.com
Providenza & Boekelheide, Inc.
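
As a rough illustration of why working in wide (UTF-16) strings avoids this class of problem, here is a portable C++ sketch. It uses std::u16string as a stand-in for a wide CString, and the helper names (remove_unit, to_utf16le_bytes) are hypothetical. Whether the original poster's file was UTF-16 is not established in the thread; the sketch only shows that removal by code unit leaves a character such as U+300D intact, even though its low byte is 0x0D.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Removes every UTF-16 code unit equal to `target`. The comparison is on
// whole 16-bit units, so U+300D is not mistaken for U+000D ('\r').
inline std::u16string remove_unit(std::u16string s, char16_t target) {
    s.erase(std::remove(s.begin(), s.end(), target), s.end());
    return s;
}

// Serializes code units as little-endian bytes, the way the text would look
// if it were loaded into a narrow (byte) string by mistake.
inline std::string to_utf16le_bytes(const std::u16string& s) {
    std::string out;
    for (char16_t u : s) {
        out.push_back(static_cast<char>(u & 0xFF));         // low byte
        out.push_back(static_cast<char>((u >> 8) & 0xFF));  // high byte
    }
    return out;
}
```

For the text u"\u300C\u65E5\u672C\u300D\r\n", remove_unit(s, u'\r') removes exactly one code unit. The byte view from to_utf16le_bytes contains two 0x0D bytes (one from '\r', one from U+300D), which is why a byte-wise removal on such data would damage the closing bracket.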