On Sun, 7 Jun 2009 12:49:09 -0400, "Igor Tandetnik" <***@mvps.org> wrote:
|Vincent Fatica wrote:
|> On Sun, 7 Jun 2009 11:44:47 -0400, "Igor Tandetnik"
|> <***@mvps.org> wrote:
|>
|>> For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
|>> CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A
|>> WITH ACUTE.
|>
|> What's the problem? When I convert each Unicode argv back to MBCS
|> with
|>
|> while ( *p++ = (CHAR) *wp++ );
|>
|> won't it go back to 193 (and again be interpreted as GREEK CAPITAL
|> LETTER ALPHA)? I don't think CommandLineToArgvW cares whether it's
|> GREEK CAPITAL LETTER ALPHA or LATIN CAPITAL LETTER A WITH ACUTE. I'm
|> assuming CommandLineToArgvW only **interprets** whitespace,
|> backslashes, and double-quotes.
|
|Ah, I didn't realize you were going to Unicode and back. Anyway, you'd
|still have problems with true double-byte encodings, like Chinese BIG-5
|or Japanese Shift-JIS. In these encodings, some characters are
|represented by two bytes, called the lead byte and the trailing byte. The
|lead byte always has the high bit set, but the trailing byte can fall in
|the ASCII range (0x40-0x7E in both encodings), which includes 0x5C, the
|ASCII code for backslash.
|
|Your naive algorithm will convert such a double-byte character to two
|independent Unicode codepoints. The codepoint corresponding to the
|trailing byte could then be interpreted by CommandLineToArgvW as a
|backslash, which changes how a following double quote is parsed. As a
|result, a) some parameter may be split or joined in the wrong place, and
|b) when your algorithm converts back from Unicode to MBCS, you'll end up
|with a lead byte not followed by a trailing byte (or followed by an
|unrelated ASCII character that will be misinterpreted as a trailing
|byte).
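
To make that concrete, here is a minimal sketch of the byte-by-byte widening.
It assumes Shift-JIS (CP932), where katakana SO is the byte pair 0x83 0x5C.
The names widen_naive() and count_backslashes() are made up for illustration,
and unsigned short stands in for WCHAR so the sketch builds without windows.h.

```c
#include <assert.h>
#include <stddef.h>

/* Naive widening: every byte becomes one 16-bit code unit, with no
   regard for lead/trail byte pairs. Returns the number of units written
   (not counting the terminator). */
static size_t widen_naive(const unsigned char *src, unsigned short *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (src[n] != '\0') {
        dst[n] = src[n];
        n++;
    }
    dst[n] = 0;
    return n;
}

/* Count code units that a parser would see as an ASCII backslash. */
static size_t count_backslashes(const unsigned short *w, size_t n)
{
    size_t i, hits = 0;
    for (i = 0; i < n; i++) {
        if (w[i] == 0x5C)
            hits++;
    }
    return hits;
}
```

Feeding it the bytes 'a', 0x83, 0x5C, '"' (one letter, one double-byte
character, one quote) gives four code units, and the third one compares equal
to a backslash even though the original string contained none.
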
Yes, I see. STDARGV.C deals with this (if (_ismbblead(c)) ...). Do you think
that I could (possibly with some effort) include STDARGV.C in my project and use
its parse_cmdline()?
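
For reference, the idea behind a lead-byte check can be sketched as follows.
This is not the CRT source, only an illustration of the technique:
is_cp932_lead() is a hypothetical stand-in for _ismbblead(), hard-wired to the
Shift-JIS (CP932) lead-byte ranges, and count_real_specials() is an invented
name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for _ismbblead(), valid only for CP932:
   lead bytes are 0x81-0x9F and 0xE0-0xFC. */
static int is_cp932_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Count the bytes a command-line parser should treat as space,
   backslash or double quote. The trail byte of a double-byte
   character is skipped, so it is never mistaken for one of them. */
static size_t count_real_specials(const unsigned char *s)
{
    size_t hits = 0;
    assert(s != NULL);
    while (*s != '\0') {
        if (is_cp932_lead(*s) && s[1] != '\0') {
            s += 2;     /* lead + trail form one character */
            continue;
        }
        if (*s == ' ' || *s == '\\' || *s == '"')
            hits++;
        s++;
    }
    return hits;
}
```

For the bytes 'a', 0x83, 0x5C, '"' this counts only the quote; the 0x5C is
consumed together with its lead byte.
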
--
- Vince