How to conveniently convert a stream of bytes into UNICODE?

Discussion:

Jack

2009-08-04 09:25:48 UTC

Hi,
I am just looking for a handy way to convert
a stream of BYTEs into UNICODE.
Does anyone know of one?
Thanks
Jack

Ulrich Eckhardt

2009-08-04 10:23:24 UTC

Post by Jack
I am just looking for a handy way to convert
a stream of BYTEs into UNICODE.
Does anyone know of one?

MultiByteToWideChar

Uli

--
C++ FAQ: http://parashift.com/c++-faq-lite

Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

David Wilkinson

2009-08-04 11:10:31 UTC

Post by Jack
Hi,
I am just looking for a handy way to convert
a stream of BYTEs into UNICODE.
Does anyone know of one?

It depends what the stream of bytes is. Any file is a stream of bytes. If it is
text, it might be 8-bit text using the local code page (or some other code
page), it might be 8-bit UTF-8 text, it might already be 16-bit Unicode text, it
might be ... anything.
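
One cheap first check on an unknown byte stream is to look for a byte order mark (BOM) at the start. The following is only a minimal sketch: the names `BomEncoding` and `detect_bom` are made up for illustration, and the absence of a BOM proves nothing, since BOM-less UTF-8, UTF-16 and code-page text are all common.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodings this sketch can recognise from a byte order mark (BOM).
// UTF-32 BOMs are deliberately ignored to keep the example short.
enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of the data for a BOM.
// None means "no BOM found", not "not Unicode".
BomEncoding detect_bom(const std::vector<std::uint8_t>& data) {
    if (data.size() >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BomEncoding::Utf8;      // EF BB BF
    if (data.size() >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BomEncoding::Utf16LE;   // FF FE
    if (data.size() >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BomEncoding::Utf16BE;   // FE FF
    return BomEncoding::None;
}
```

If this returns `None`, the encoding still has to come from documentation or from inspecting the data by hand.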

--
David Wilkinson
Visual C++ MVP

Jack

2009-08-05 00:24:23 UTC

Hello,
Thanks to both of you. It seems this is going to be hard;
I have been trying to convert the whole thing for two days.
If it is a cocktail of 8-bit and 16-bit characters, would that be more
difficult, or can I, as Ulrich said, just pass the whole bunch of data
into MultiByteToWideChar?
Thanks
Jack

David Wilkinson

2009-08-05 01:07:35 UTC

Post by Jack
Hello,
Thanks to both of you. It seems this is going to be hard;
I have been trying to convert the whole thing for two days.
If it is a cocktail of 8-bit and 16-bit characters, would that be more
difficult, or can I, as Ulrich said, just pass the whole bunch of data
into MultiByteToWideChar?

You can only use MultiByteToWideChar() if you know what the format of the source
is. If the source is a mixture of 16-bit and 8-bit characters (unlikely), I do
not see how you could make sense of it, unless the format of the contents were
very carefully documented.
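
To illustrate why the source format has to be known, here is a small portable sketch that decodes the same bytes under two different assumptions. It is a hand-written stand-in, not the Windows API: on Windows, MultiByteToWideChar() with the matching code page (for example CP_UTF8) does this job. The function names `decode_latin1` and `decode_utf8` are invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Assumption 1: the bytes are ISO 8859-1 (Latin-1). Each byte value
// 0x00..0xFF maps directly onto code point U+0000..U+00FF.
std::u16string decode_latin1(const std::vector<std::uint8_t>& bytes) {
    std::u16string out;
    for (std::uint8_t b : bytes) out.push_back(static_cast<char16_t>(b));
    return out;
}

// Assumption 2: the bytes are UTF-8. Produces UTF-16 code units and
// throws std::runtime_error on malformed input.
std::u16string decode_utf8(const std::vector<std::uint8_t>& bytes) {
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        std::uint8_t lead = bytes[i++];
        std::uint32_t cp = 0;      // code point being assembled
        std::uint32_t min_cp = 0;  // smallest value allowed for this length
        int extra = 0;             // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        for (int k = 0; k < extra; ++k) {
            if (i >= bytes.size() || (bytes[i] & 0xC0) != 0x80)
                throw std::runtime_error("truncated or invalid UTF-8 sequence");
            cp = (cp << 6) | (bytes[i++] & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("overlong or out-of-range UTF-8 sequence");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Feeding the two bytes C3 A9 to `decode_utf8` gives the single character U+00E9, while `decode_latin1` gives the two characters U+00C3 U+00A9. Both results are "valid"; only knowledge of the source format tells you which one is right.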

Jack

2009-08-05 05:52:28 UTC

Hello David,
For simplicity, let's assume all of the data is made up
of 16-bit characters.
What is the general method of converting the stream of BYTEs into foreign
characters?
Thanks
Jack

Scot T Brennecke

2009-08-05 06:21:08 UTC

Post by Jack
Hello David,
For simplicity, let's assume all of the data is made up
of 16-bit characters.
What is the general method of converting the stream of BYTEs into foreign
characters?
Thanks
Jack

If they are 16-bit characters, then they are either already in Unicode, or
DBCS, or some other non-ANSI string format. There is no one-size-fits-all
answer here. You MUST know something about the current format of the BYTE
stream, or conversion is a crapshoot.

If they are ANSI or MBCS strings in the BYTE stream, then you have already
been told the answer: MultiByteToWideChar.

Ulrich Eckhardt

2009-08-05 07:14:15 UTC

Post by Jack
For simplicity, let's assume all of the data is made up of 16-bit
characters. What is the general method of converting the stream of BYTEs
into foreign characters?

Question up front: what is a foreign character? Anything but Latin letters
and Arabic numerals? Hebrew is much older than Latin, so isn't Latin the
foreign one then?

Anyway, back from philosophy to programming. Without knowing the encoding of
the byte data, there is no way to convert it to anything. Further, and that
is something you might have missed, Unicode (not UNICODE) is not a file
format or anything like that. Rather, it is a standard that also defines
several encodings for files or transfer (UTF = Unicode Transformation
Format). Now, MS Windows internally uses WCHAR, which encodes Unicode code
points using UTF-16.

If your file happens to be UTF-16 encoded (little-endian), too, you can just
memcpy() the file content into a WCHAR array and you're done. If it happens
to be the big-endian variant of UTF-16, you will further have to swap the
two bytes of every 16-bit unit. Also, if you want your code to run on
anything non-MS Windows or on a big-endian machine (which MS Windows doesn't
run on anyway), you will have to perform further conversions.
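
A portable variant of the above can be sketched by assembling each 16-bit unit from its two bytes explicitly instead of using memcpy(), so the result does not depend on the byte order of the machine running the code. This is only a sketch: `ByteOrder` and `utf16_from_bytes` are hypothetical names, and std::u16string stands in for a WCHAR array.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Byte order of the UTF-16 data in the input, not of the host machine.
enum class ByteOrder { Little, Big };

// Combine pairs of bytes into UTF-16 code units.
// A leading BOM (U+FEFF), if present, is kept as the first code unit.
std::u16string utf16_from_bytes(const std::vector<std::uint8_t>& bytes,
                                ByteOrder order) {
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("UTF-16 data must have an even number of bytes");
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        unsigned first = bytes[i];
        unsigned second = bytes[i + 1];
        // Little-endian stores the low byte first, big-endian the high byte.
        unsigned unit = (order == ByteOrder::Little)
                            ? ((second << 8) | first)
                            : ((first << 8) | second);
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}
```

For example, the bytes 48 00 69 00 read as little-endian and the bytes 00 48 00 69 read as big-endian both yield the text "Hi".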

BTW: It would help if you showed a short snippet of a hexdump of the file in
question including the text it should represent. It should then be easy to
figure out if it is one of the common encodings.
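
If no hexdump tool is at hand, a few lines of code are enough to produce such a snippet. This is a minimal sketch with an invented helper name, `hex_preview`; reading the file into the vector (for instance with std::ifstream opened in binary mode) is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Format up to max_bytes of data as space-separated two-digit hex values.
std::string hex_preview(const std::vector<std::uint8_t>& data,
                        std::size_t max_bytes = 32) {
    std::string out;
    std::size_t n = data.size() < max_bytes ? data.size() : max_bytes;
    char buf[4];
    for (std::size_t i = 0; i < n; ++i) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(data[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

A file that starts with FF FE followed by alternating text bytes and 00 bytes, for example, is very likely little-endian UTF-16.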

Good luck!

Uli

Jack

2009-08-05 08:01:03 UTC

That is the most constructive advice I have ever had.
Thanks, Ulrich.
Jack