Discussion: isupper and islower for wstring
Rahul
2010-12-09 07:38:32 UTC
Hi,

I have a std::wstring and I want to find which characters are uppercase
and which ones are lowercase. std::isupper and std::islower seem to work
on ASCII characters only, but I want to be able to detect all kinds of
uppercase and lowercase characters,
e.g. á is "Latin small letter a with acute"
and Á is "Latin capital letter A with acute".

Is there any function (in MFC, Boost, or any other library) that I can
use to make this distinction? My application is a native VC++ program.

Thanks in advance
Rahul
David Lowndes
2010-12-09 08:09:00 UTC
Post by Rahul
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and Á is an "Latin capital letter A with acute"
Is there any function (mfc, boost or in any other library) which I can
use to find out the above said difference? My application is a native
VC++ program.
I suspect you may have to revert to using the Windows API IsCharUpper.
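Something along these lines should do it (untested sketch, assuming the
wstring holds UTF-16 and ignoring surrogate pairs and combining sequences):

#include <windows.h>
#include <string>

void Classify(const std::wstring& s)
{
    for (std::wstring::size_type i = 0; i != s.size(); ++i)
    {
        wchar_t ch = s[i];
        if (::IsCharUpperW(ch))        // TRUE for Á and other uppercase letters the OS knows about
            ;   // uppercase
        else if (::IsCharLowerW(ch))   // TRUE for á etc.
            ;   // lowercase
        else
            ;   // neither (digits, punctuation, ...)
    }
}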

Dave
Rahul
2010-12-09 08:24:51 UTC
Post by David Lowndes
Post by Rahul
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and Á is an "Latin capital letter A with acute"
Is there any function (mfc, boost or in any other library) which I can
use to find out the above said difference? My application is a native
VC++ program.
I suspect you may have to revert to using the Windows API IsCharUpper.
Dave
Hi Dave,

Basically I want a function to differentiate between "Unicode
lowercase characters" and rest all unicode chars.
on reading the msdn description I feel IsCharUpper is not the one, but
let me try it.
Rahul
2010-12-09 08:58:49 UTC
Post by Rahul
Post by David Lowndes
Post by Rahul
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and Á is an "Latin capital letter A with acute"
Is there any function (mfc, boost or in any other library) which I can
use to find out the above said difference? My application is a native
VC++ program.
I suspect you may have to revert to using the Windows API IsCharUpper.
Dave
Hi Dave,
Basically I want a function to differentiate between  "Unicode
lowercase characters" and rest all unicode chars.
on reading the msdn description I feel IsCharUpper is not the one, but
let me try it.
Yes, it is exactly what I wanted (IsCharUpperW/IsCharLowerW).
Thanks Dave.
Joseph M. Newcomer
2010-12-09 20:49:19 UTC
Note that IsCharUpper in a Unicode app becomes IsCharUpperW. I presume you have a Unicode
app. If you don't, you are already in trouble. If you have a Unicode app, there is no
need to explicitly call IsCharUpperW because that is what the IsCharUpper macro expands
to.
joe
Post by Rahul
Post by Rahul
Post by David Lowndes
Post by Rahul
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and Á is an "Latin capital letter A with acute"
Is there any function (mfc, boost or in any other library) which I can
use to find out the above said difference? My application is a native
VC++ program.
I suspect you may have to revert to using the Windows API IsCharUpper.
Dave
Hi Dave,
Basically I want a function to differentiate between  "Unicode
lowercase characters" and rest all unicode chars.
on reading the msdn description I feel IsCharUpper is not the one, but
let me try it.
Yes it is exactly what I wanted (IsCharUpperW/IsCharLowerW ).
Thanks Dave.
Joseph M. Newcomer [MVP]
email: ***@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Paavo Helde
2010-12-10 03:13:07 UTC
Post by Joseph M. Newcomer
Note that IsCharUpper in a Unicode app becomes IsCharUpperW. I
presume you have a Unicode app. If you don't, you are already in
trouble. If you have a Unicode app, there is no need to explicitly
call IsCharUpperW because that is what the IsCharUpper macro expands
to.
This is kind of circular. About the only meaning of "Unicode app" in MSVC
is that the 'W' variants are used for string-specific functions. This is
done by some macro trickery with all the pitfalls of macros. So if I always
call the 'W' or 'A' variants explicitly as needed there is no need to mark
the application as Unicode (or not) and as a bonus I can #undef all the
conflicting macros. Another bonus is that the code will not change meaning
silently if somebody defines or undefines UNICODE.
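For example, the kind of thing I mean (just a sketch; GetUserName is one
of the usual offenders):

#include <windows.h>
#include <lmcons.h>   // UNLEN
#include <string>
#undef GetUserName    // drop the A/W macro so e.g. my own GetUserName() member keeps its name

std::wstring CurrentUser()        // link with Advapi32.lib
{
    wchar_t buf[UNLEN + 1];
    DWORD len = UNLEN + 1;
    // call the wide variant explicitly; the meaning does not change if UNICODE is (un)defined
    return ::GetUserNameW(buf, &len) ? std::wstring(buf) : std::wstring();
}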

Cheers
Paavo
Joseph M. Newcomer
2010-12-10 04:27:49 UTC
See below...
Post by Paavo Helde
Post by Joseph M. Newcomer
Note that IsCharUpper in a Unicode app becomes IsCharUpperW. I
presume you have a Unicode app. If you don't, you are already in
trouble. If you have a Unicode app, there is no need to explicitly
call IsCharUpperW because that is what the IsCharUpper macro expands
to.
This is kind of circular. About the only meaning of "Unicode app" in MSVC
is that the 'W' variants are used for string-specific functions. This is
done by some macro trickery with all the pitfalls of macros. So if I always
call the 'W' or 'A' variants explicitly as needed there is no need to mark
the application as Unicode (or not) and as a bonus I can #undef all the
conflicting macros. Another bonus is that the code will not change meaning
silently if somebody defines or undefines UNICODE.
****
But there is no need to put the suffix on if you are just calling the API. And I'm not
sure what "pitfalls" occur since these macros are not particularly sophisticated--not
like, for example, the min and max macros, which really are dangerous. I'm not sure why
you think explicitly calling the W form is going to make sense if the app is not Unicode,
because you have not said explicitly that you are worrying about exceptions in the case of
an ANSI app. I'm not sure why anyone would want to #undef all the API calls just because
you harbor some mythical fear about macros. I wouldn't consider that a bonus, I'd
consider that silly. And the point of the APIs is that they retain meaning in both ANSI
and Unicode.
joe
****
Post by Paavo Helde
Cheers
Paavo
Joseph M. Newcomer [MVP]
email: ***@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Pete Becker
2010-12-10 14:53:09 UTC
Post by Joseph M. Newcomer
But there is no need to put the suffix on if you are just calling the API. And I'm not
sure what "pitfalls" occur since these macros are not particularly sophisticated--not
like, for example, the min and max macros, which really are dangerous.
The one I like is when you use a name like IsCharUpper (or any other
name that resorts to this trickery) as the name of a member function.
Then some source file starts out with #include <windows.h>, followed by
your header, then calls your member function, whose name has been
changed by the macro. You get a link error...
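A contrived illustration (hypothetical Session class, purely for example):

// session.h -- deliberately does not include <windows.h>
#include <string>
class Session {
public:
    std::string GetUserName() const;   // looks innocent on its own
};

// user.cpp
#include <windows.h>   // defines GetUserName as GetUserNameA or GetUserNameW
#include "session.h"   // the member declaration above is silently renamed by the macro
                       // callers here compile against Session::GetUserNameA/W...

// session.cpp -- built without <windows.h>, so it defines plain Session::GetUserName.
// The two translation units disagree on the name, and the linker reports an unresolved symbol.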
--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)
Paavo Helde
2010-12-10 21:59:15 UTC
Post by Joseph M. Newcomer
See below...
On Thu, 09 Dec 2010 21:13:07 -0600, Paavo Helde
Post by Paavo Helde
Post by Joseph M. Newcomer
Note that IsCharUpper in a Unicode app becomes IsCharUpperW. I
presume you have a Unicode app. If you don't, you are already in
trouble. If you have a Unicode app, there is no need to explicitly
call IsCharUpperW because that is what the IsCharUpper macro expands
to.
This is kind of circular. About the only meaning of "Unicode app" in
MSVC is that the 'W' variants are used for string-specific functions.
This is done by some macro trickery with all the pitfalls of macros.
So if I always call the 'W' or 'A' variants explicitly as needed there
is no need to mark the application as Unicode (or not) and as a bonus
I can #undef all the conflicting macros. Another bonus is that the
code will not change meaning silently if somebody defines or undefines
UNICODE.
****
But there is no need to put the suffix on if you are just calling the
API. And I'm not sure what "pitfalls" occur since these macros are
not particularly sophisticated--not like, for example, the min and max
macros, which really are dangerous.
I have several times inadvertently defined class methods like GetUserName()
which conflicted with Windows macros. There are hundreds of such names and
I'm not good at remembering them all. To be honest, Perl headers are even
worse in this regard, as they define very common words as macros.
Post by Joseph M. Newcomer
I'm not sure why you think
explicitly calling the W form is going to make sense if the app is not
Unicode, because you have not said explicitly that you are worrying
about exceptions in the case of an ANSI app. I'm not sure why anyone
would want to #undef all the API calls just because you harbor some
mythical fear about macros. I wouldn't consider that a bonus, I'd
consider that silly. And the point of the APIs is that they retain
meaning in both ANSI and Unicode.
No they don't. As the "multibyte codesets" in Windows are capable of
encoding only very small part of Unicode the ANSI versions cannot retain
the same functionality. The macro system might once have been a clever
hack to get the ANSI version uplifted to Unicode with a minimal effort,
but I do not need such a hack as I'm coding for Unicode from the start.

Cheers
Paavo
Goran
2010-12-10 05:16:10 UTC
Post by Paavo Helde
About the only meaning of "Unicode app" in MSVC
is that the 'W' variants are used for string-specific functions.
"Unicode app" rather means "my application supports text in any
language you throw at it, and can mix several languages in one single
text". The "W" APIs are a tool to get Unicode (through UTF-16LE, as
that's what "W" APIs work with), a consequence, if you will. Hardly
"about the only meaning".
Post by Paavo Helde
This is
done by some macro trickery with all the pitfalls of macros.
Pretty much ANYTHING in programming has pitfalls, but advantages, too.
In the case of the A/W variants of the APIs, the advantage is that you
don't see gibberish at the end of function names, and you can go from MBCS to
Unicode more easily. And if your own code is correct in the first
place, there is no "macro pitfall" with A/W functions. If you think
there is, show it.
Post by Paavo Helde
So if I always
call the 'W' or 'A' variants explicitly as needed there is no need to mark
the application as Unicode (or not) and as a bonus I can #undef all the
conflicting macros. Another bonus is that the code will not change meaning
silently if somebody defines or undefines UNICODE.
Many consider this ability to change a bonus (admittedly, a much smaller
one today; everything should just be compiled with UNICODE/_UNICODE,
reaching for the MBCS function variants only in rare language-specific
cases).

All in all, I think that you're doing it wrong, you are making it more
complicated for yourself, and you're making it strange for other
people who are used to A/W APIs. If your code is your personal
project, OK, but if you're in a team, I think you should step back and
reconsider your opinions.

Goran.
Paavo Helde
2010-12-10 22:30:47 UTC
Post by Goran
Post by Paavo Helde
About the only meaning of "Unicode app" in MSVC
is that the 'W' variants are used for string-specific functions.
"Unicode app" rather means "my application supports text in any
language you throw at it, and can mix several languages in one single
text". The "W" APIs are a tool to get Unicode (through UTF-16LE, as
that's what "W" APIs work with), a consequence, if you will. Hardly
"about the only meaning".
I was specifically referring to the "Use Unicode character set" setting
present in MSVC project configurations. For supporting Unicode in general
there are of course multiple ways, I am preferring UTF-8 myself.
Post by Goran
Post by Paavo Helde
This is
done by some macro trickery with all the pitfalls of macros.
Pretty much ANYTHING in programming has pitfalls, but advantages, too.
In case of A/W variants of APIs, advantage is that you don't see
gibberish at the end of function names, and you can go from MBCS to
Unicode more easily. And if your own code is correct in the first
place, there is no "macro pitfall" with A/W functions. If you think
there is, show it.
Post by Paavo Helde
So if I always
call the 'W' or 'A' variants explicitly as needed there is no need to mark
the application as Unicode (or not) and as a bonus I can #undef all the
conflicting macros. Another bonus is that the code will not change meaning
silently if somebody defines or undefines UNICODE.
Many consider this ability to change a bonus (admittedly, much smaller
today; anything should just be compiled with UNICODE/_UNICODE, and
reach for MBCS function variants only in rare language-specific
cases).
All in all, I think that you're doing it wrong, you are making it more
complicated for yourself, and you're making it strange for other
people who are used to A/W APIs. If your code is your personal
project, OK, but if you're in a team, I think you should step back and
reconsider your opinions.
My programs are portable to Linux/MacOSX and all strings are internally
in UTF-8. Interfacing with Windows system calls typically looks something
like:

std::string name = ...;

HANDLE f = ::CreateFileW(Utf2Win(name).c_str(), GENERIC_READ,
FILE_SHARE_READ|FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL, NULL);

-- or --
//...
} catch(const std::exception& e) {
    ::MessageBoxW(NULL, Utf2Win(e.what()).c_str(), L"Error in baseutil load", MB_OK);
}

The Utf2Win() function converts from UTF-8 to UTF-16 for the Windows
wide-character APIs. This is done only at the application boundary. I do
not see any gain in defining UNICODE and relying on macro trickery to
achieve exactly the same code as what I have written here.
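On the Windows side the conversion itself boils down to MultiByteToWideChar,
roughly like this (a sketch; error handling omitted):

#include <windows.h>
#include <string>

std::wstring Utf2Win(const std::string& s)
{
    if (s.empty()) return std::wstring();
    int n = ::MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
    std::wstring w(n, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &w[0], n);
    return w;
}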

Note that I'm not using MFC; that would probably change the rules of the
game.

Cheers
Paavo
Joseph M. Newcomer
2010-12-11 05:47:57 UTC
See below...
Post by Paavo Helde
Post by Goran
Post by Paavo Helde
About the only meaning of "Unicode app" in MSVC
is that the 'W' variants are used for string-specific functions.
"Unicode app" rather means "my application supports text in any
language you throw at it, and can mix several languages in one single
text". The "W" APIs are a tool to get Unicode (through UTF-16LE, as
that's what "W" APIs work with), a consequence, if you will. Hardly
"about the only meaning".
I was specifically referring to the "Use Unicode character set" setting
present in MSVC project configurations. For supporting Unicode in general
there are of course multiple ways, I am preferring UTF-8 myself.
*****
UTF-8 is great at the edges, but it sucks internally.
*****
Post by Paavo Helde
Post by Goran
Post by Paavo Helde
This is
done by some macro trickery with all the pitfalls of macros.
Pretty much ANYTHING in programming has pitfalls, but advantages, too.
In case of A/W variants of APIs, advantage is that you don't see
gibberish at the end of function names, and you can go from MBCS to
Unicode more easily. And if your own code is correct in the first
place, there is no "macro pitfall" with A/W functions. If you think
there is, show it.
Post by Paavo Helde
So if I always
call the 'W' or 'A' variants explicitly as needed there is no need to mark
the application as Unicode (or not) and as a bonus I can #undef all the
conflicting macros. Another bonus is that the code will not change meaning
silently if somebody defines or undefines UNICODE.
Many consider this ability to change a bonus (admittedly, much smaller
today; anything should just be compiled with UNICODE/_UNICODE, and
reach for MBCS function variants only in rare language-specific
cases).
All in all, I think that you're doing it wrong, you are making it more
complicated for yourself, and you're making it strange for other
people who are used to A/W APIs. If your code is your personal
project, OK, but if you're in a team, I think you should step back and
reconsider your opinions.
My programs are portable to Linux/MacOSX and all strings are internally
in UTF-8. Interfacing with Windows system calls typically looks something
****
Most people have found that trying to use UTF-8 internally is a nightmare.
*****
Post by Paavo Helde
std::string name = ...;
HANDLE f = ::CreateFileW(Utf2Win(name).c_str(), GENERIC_READ,
FILE_SHARE_READ|FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL, NULL);
-- or --
//...
} catch(const std::exception& e) {
::MessageBoxW(NULL, Utf2Win(e.what()).c_str(), L"Error in baseutil load", MB_OK);
}
The Utf2Win() function converts from UTF-8 to UTF-16 for Windows wide
character API-s. This is done only on application boundary. I do not see
any gain in defining UNICODE and relying on macro trickery to achieve the
exact same code what I have written here.
Note that I'm not using MFC, this would probably change the rules of the
game.
****
Probably not much, if you can use UTF-8 comfortably. But most of us prefer the simplicity
of UTF-16, which for the bulk of locales is perfectly fine. If Microsoft introduced UTF-32
I could move to that without blinking.
joe
****
Post by Paavo Helde
Cheers
Paavo
Joseph M. Newcomer [MVP]
email: ***@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Miles Bader
2010-12-11 06:46:53 UTC
Post by Joseph M. Newcomer
Most people have found trying to use UTF-8 internally is a nightmare.
Not true, of course.

Given MS's full-court press to try and get people to use UTF-16, I can
understand why Windows programmers might feel that way in some cases,
though...

-Miles
--
"An atheist doesn't have to be someone who thinks he has a proof that
there can't be a god. He only has to be someone who believes that the
evidence on the God question is at a similar level to the evidence on
the werewolf question." [John McCarthy]
Paavo Helde
2010-12-11 08:22:02 UTC
Post by Joseph M. Newcomer
Probably not much, if you can use UTF-8 comfortably. But most of us
prefer the simplicity of UTF-16 which for the bulk of locales is
perfectly fine.
I cannot see any benefits of UTF-16 over UTF-8. Both are packed formats and
need special care for accessing individual characters. And how is this
related to locales? Locale should specify how I want to see the dates or
numbers formatted, not which characters I can process or see on my screen.
Post by Joseph M. Newcomer
If Microsoft introduced UTF-32 I could move to that
without blinking.
If Microsoft introduced decent support of UTF-8 I could move to that
without blinking.

Cheers
Paavo
Joseph M. Newcomer
2010-12-11 21:38:50 UTC
The problem with 8-bit apps is that they use localized code pages to display text, which can be
a problem in certain locales. And locale determines some very important parameters, such
as collating sequence (sort order), and what is a "lower case" and "upper case" letter for
those languages which have the notion of case. These functions work on the UTF-16 encoding
but not on UTF-8. So if you need to compare two strings, you can't compare the UTF-8
encodings to determine collating sequence, and you can't do "bitwise" comparison of
UTF-16, but you can call the functions (e.g., CompareString, lstrcmp) which are
locale-aware and will sort the strings in accordance with the rules of the locale. Did
you know that CompareString does the right thing for Chinese symbols?
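For example (a minimal sketch, using the current user's locale and ignoring failure):

#include <windows.h>

// returns <0, 0 or >0 like strcmp, but ordered by the rules of the user's locale
int LocaleCompare(const wchar_t* a, const wchar_t* b)
{
    // CompareStringW returns CSTR_LESS_THAN / CSTR_EQUAL / CSTR_GREATER_THAN (1/2/3), or 0 on failure
    return ::CompareStringW(LOCALE_USER_DEFAULT, 0, a, -1, b, -1) - CSTR_EQUAL;
}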
joe
Post by Paavo Helde
Post by Joseph M. Newcomer
Probably not much, if you can use UTF-8 comfortably. But most of us
prefer the simplicity of UTF-16 which for the bulk of locales is
perfectly fine.
I cannot see any benefits of UTF-16 over UTF-8. Both are packed formats and
need special care for accessing individual characters. And how is this
related to locales? Locale should specify how I want to see the dates or
numbers formatted, not which characters I can process or see on my screen.
Post by Joseph M. Newcomer
If Microsoft introduced UTF-32 I could move to that
without blinking.
If Microsoft introduced decent support of UTF-8 I could move to that
without blinking.
Cheers
Paavo
Joseph M. Newcomer [MVP]
email: ***@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
James Kanze
2010-12-13 10:05:01 UTC
Post by Joseph M. Newcomer
The problem with 8-bit apps is they use localized code pages
to display text, which can be a problem in certain locales.
Use a UTF-8 locale:-).
Post by Joseph M. Newcomer
And locale determines some very important parameters, such as
collating sequence (sort order), and what is a "lower case"
and "upper case" letter for those languages which have the
notion of case. These functions work on UTF-16 encoding but
not UTF-8.
Yes and no. These functions don't work with multi-byte
characters (UTF-8), surrogates (UTF-16) or characters composed
of multiple code points (Unicode in general, regardless of the
encoding format used). And they are always locale dependent.
Post by Joseph M. Newcomer
So if you need to compare two strings, you can't
compare the UTF-8 encodings to determine collating sequence,
You can *if* you are using the "native" collating sequence. And
otherwise, it's a question of implementation.
Post by Joseph M. Newcomer
and you can't do "bitwise" comparison of UTF-16, but you can
call the functions (e.g., CompareString, lstrcmp) which are
locale-aware and will sort the strings in accordance with the
rules of the locale. Did you know that CompareString does the
right thing for Chinese symbols?
Which is a question of quality of implementation. There's no
reason for it to do the right thing with wchar_t, and not with
char in a UTF-8 locale. std::locale has a templated
operator()(basic_string<> const& basic_string<> const&) which
does the right thing when used with
std::lexicographical_compare. This is the standard way of
handling the problem, and should work for both UTF-8 (with
std::string) and UTF-16 or UTF-32 (with std::wstring, depending
on the actual type of wchar_t), providing you have the
corresponding locales (which is a requirement anyway).
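For example (a sketch; it assumes the user's default locale is set up
with the appropriate collate facet):

#include <algorithm>
#include <locale>
#include <string>
#include <vector>

int main()
{
    std::locale loc("");                 // the user's preferred locale
    std::vector<std::wstring> v;
    v.push_back(L"zebra");
    v.push_back(L"\u00C1pfel");          // starts with Á
    v.push_back(L"apple");
    std::sort(v.begin(), v.end(), loc);  // std::locale::operator() compares via the collate facet
}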

--
James Kanze

James Kanze
2010-12-13 09:53:09 UTC
Post by Paavo Helde
Post by Joseph M. Newcomer
Probably not much, if you can use UTF-8 comfortably. But most of us
prefer the simplicity of UTF-16 which for the bulk of locales is
perfectly fine.
I cannot see any benefits of UTF-16 over UTF-8. Both are
packed formats and need special care for accessing individual
characters.
I'm not sure about the meaning of "packed" here, but all Unicode
encoding formats, including UTF-32, require special care for
accessing individual characters. The question is more or less
one of the cases when that special care is needed: I've dealt
with more than one application where no special care is needed
for UTF-8. And there are a lot of applications in which special
care will be needed for UTF-8, but not for UTF-16.
Post by Paavo Helde
And how is this
related to locales? Locale should specify how I want to see the dates or
numbers formatted, not which characters I can process or see on my screen.
Locale also determines the behavior of such functions as isalpha
or toupper (or iswalpha or towupper). And rather obviously,
such functions depend on the encoding, so locales encompass
encodings.
Post by Paavo Helde
Post by Joseph M. Newcomer
If Microsoft introduced UTF-32 I could move to that
without blinking.
If Microsoft introduced decent support of UTF-8 I could move to that
without blinking.
Define "decent support":-). If any system introduced some sort
of comprehensive support for any of the Unicode encoding
formats, I'd jump at it. At present, if you need comprehensive
support, your best bet is ICU (which uses UTF-16). (The fact
that the only comprehensive support for Unicode uses UTF-16
suggests that programs requiring such support use UTF-16, even
on platforms where wchar_t is 32 bits. But judging from what
I've seen, such programs are the exception.)

At one time (up until about a year ago), I'd been experimenting
with implementing more or less comprehensive support for UTF-8,
but that's on hold at present.

--
James Kanze
Goran
2010-12-11 11:12:38 UTC
Post by Paavo Helde
My programs are portable to Linux/MacOSX and all strings are internally
in UTF-8. Interfacing with Windows system calls typically looks something
std::string name = ...;
HANDLE f = ::CreateFileW(Utf2Win(name).c_str(), GENERIC_READ,
FILE_SHARE_READ|FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL, NULL);
That's not all that good as a general approach. As soon as you enter
Windows-specific code, that string should be UTF-16. Your mainline code
does not look like the snippets you're showing here. Or at least it
shouldn't look like that. (Otherwise, you're doing inline-#define-based
platform independence, a silly thing to do.)

Instead, you should have platform-agnostic wrappers for platform-
specific stuff, and these will have platform-specific implementations.
Now... The implementation on Windows needs UTF-16, so it's most expedient
to convert your UTF-8 to UTF-16 either at wrapper entry, or when you
first pass your text to the Win API (see e.g. the implementation of _bstr_t
for an equivalent example). If you don't do this, you are carrying
UTF-8 around and converting it to UTF-16 at least once and, as soon
as the wrapper isn't trivial, multiple times.

But of course, if platform independence is the goal, a lot of code
would be better off using the likes of Qt and having platform
independence cut out for it (albeit, funnily, Qt's strings actually
use UTF-16)^^^.

And finally, once you do have said wrappers, you compile them with
UNICODE and look at Func1, Func2 instead of finger-sore-inducing and
eye-sore-inducing Func1W, Func2W.

Goran.

^^^ Use of UTF-16 (and not e.g. UTF-8) is IMO a good sign of platform
maturity when it comes to Unicode. Why? Because it kinda-sorta shows
that the platform has been around since Unicode meant BMP and UCS-2. And
indeed, Windows, Java, Qt, ICU all picked UTF-16. That's not accident
or ignorance, it's historical convenience.
Paavo Helde
2010-12-12 18:25:24 UTC
Post by Goran
Instead, you should have platform-agnostic wrappers for platform-
specific stuff, and these will have platform-specific implementation.
Now... Implementation on windows needs UTF-16, so it's most expedient
to convert your UTF-8 to UTF-16 either at wrapper entry, or when you
first pass your text to win api (see e.g. implementation of _bstr_t
for an equivalent example). If you don't do this, you are carrying
UTF-8 around and converting it it to UTF-16 at least once, but as soon
as the wrapper isn't trivial, multiple times).
I try to keep platform-dependent code in separate .cpp files, to be
compiled only for the given platform. Currently my I/O wrapper
functions are quite trivial for Linux/MacOSX, as there the UTF-8 locales
are the de facto standard, and the Windows-specific wrappers contain UTF-8
to/from UTF-16 conversions. If I used UTF-16 internally this would be the
other way around; no improvement in my mind.
Post by Goran
But of course, if platform-independence is the goal, a lot of code
would be better off using the likes of Qt and have platform-
independence cut-out for them (albeit, funnily, Qt's strings actually
use UTF-16)^^^.
I have the impression that Qt is mostly about GUI interfaces, but that's
not so interesting for us. Our GUIs are mostly done via web interfaces
(sometimes as embedded browser windows), and there UTF-8 is quite
widespread, again.

Cheers
Paavo
Goran
2010-12-12 20:49:34 UTC
Post by Paavo Helde
Post by Goran
Now... Implementation on windows needs UTF-16, so it's most expedient
to convert your UTF-8 to UTF-16 either at wrapper entry, or when you
first pass your text to win api (see e.g. implementation of _bstr_t
for an equivalent example). If you don't do this, you are carrying
UTF-8 around and converting it it to UTF-16 at least once, but as soon
as the wrapper isn't trivial, multiple times).
Currently I try to keep platform-dependent code in separate .cpp files,
to be compiled only for the given platform. Currently my I/O wrapper
functions are quite trivial for Linux/MacOSX as there the UTF-8 locales
are the de facto standard, and Windows specific wrappers contain UTF-8
to/from UTF-16 conversions. If I used UTF-16 internally this would be the  
other way around, no improvement in my mind.
I wanted to say that you should convert your text to UTF-16 when you
enter your win-wrapper, or at the latest at the first use of a Windows
function that takes the text, __not__ that you should use UTF-16 for all
of your text.

Goran.
Joseph M. Newcomer
2010-12-13 02:23:21 UTC
See below...
Post by Paavo Helde
Post by Goran
Instead, you should have platform-agnostic wrappers for platform-
specific stuff, and these will have platform-specific implementation.
Now... Implementation on windows needs UTF-16, so it's most expedient
to convert your UTF-8 to UTF-16 either at wrapper entry, or when you
first pass your text to win api (see e.g. implementation of _bstr_t
for an equivalent example). If you don't do this, you are carrying
UTF-8 around and converting it it to UTF-16 at least once, but as soon
as the wrapper isn't trivial, multiple times).
Currently I try to keep platform-dependent code in separate .cpp files,
to be compiled only for the given platform. Currently my I/O wrapper
functions are quite trivial for Linux/MacOSX as there the UTF-8 locales
are the de facto standard, and Windows specific wrappers contain UTF-8
to/from UTF-16 conversions. If I used UTF-16 internally this would be the
other way around, no improvement in my mind.
****
I well and truly detest #ifdefs scattered throughout the source files--if you've ever been
a victim^H^H^H^H^H^Huser of anything from the Free Software Foundation, trying to figure
out what the 8-deep collection of complex #ifdef/#if defined(...) && defined(...) code
actually generates, or ported it to a new platform, you realize that keeping it as separate
sources in separate directories is about the only sane approach. We adopted this
technique when porting code across six platforms, having six subdirectories (Unix, Ultrix,
Mac, Win16, VMS, and a couple of years later, Win32), and it was really a sane thing. We had
a few #ifdefs in one of the header files to define types; e.g., we would have used
something like "VCHAR", which might be defined as 'char' or 'TCHAR' or 'WCHAR'; the hardest
definitions dealt with how to represent 32-bit address arithmetic on Win16 (we had to make
them HUGE, for those of you unfortunate enough to remember Win16). I highly recommend
this technique, and I still use it for porting.
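Something in the spirit of the following (a sketch of the idea, not the original header):

/* vtypes.h -- the only place that needs per-platform #ifdefs */
#if defined(_WIN32)
  #include <tchar.h>
  typedef _TCHAR VCHAR;   /* char in an ANSI build, wchar_t in a Unicode build */
#else
  typedef char VCHAR;     /* the Unix/Ultrix/Mac/VMS side */
#endif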
****
Post by Paavo Helde
Post by Goran
But of course, if platform-independence is the goal, a lot of code
would be better off using the likes of Qt and have platform-
independence cut-out for them (albeit, funnily, Qt's strings actually
use UTF-16)^^^.
I have got an impression Qt is mostly about GUI interfaces, but that's
not so interesting for us. Our GUI-s are mostly done via web interfaces
(sometimes as embedded browser windows) and there UTF-8 is quite
widespread, again.
Cheers
Paavo
Joseph M. Newcomer [MVP]
email: ***@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Goran
2010-12-09 12:59:45 UTC
Post by Rahul
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and Á is an "Latin capital letter A with acute"
Try GetStringType. C and C++ are horrible when it comes to Unicode.
You may also want to try the ICU library.
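Something like this (a sketch, with error handling mostly omitted):

#include <windows.h>
#include <string>
#include <vector>

void Classify(const std::wstring& s)
{
    if (s.empty()) return;
    std::vector<WORD> types(s.size());
    if (!::GetStringTypeW(CT_CTYPE1, s.c_str(), (int)s.size(), &types[0]))
        return;                               // error handling omitted in this sketch
    for (std::wstring::size_type i = 0; i != s.size(); ++i)
    {
        if (types[i] & C1_UPPER)      { /* uppercase */ }
        else if (types[i] & C1_LOWER) { /* lowercase */ }
    }
}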

Goran.
Joseph M. Newcomer
2010-12-09 20:48:01 UTC
The problem is that isupper defaults to the "C" locale, which makes it compatible with the
1975 PDP-11 implementation of the C language. You must call setlocale to choose the
target locale.

As already pointed out, the API IsCharUpper will be more robust, but note that it works in
the current user locale.
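For example (a sketch; the classification still depends on the CRT's locale tables):

#include <clocale>
#include <cwctype>

int main()
{
    std::setlocale(LC_CTYPE, "");                  // the user's locale instead of the default "C" locale
    bool upper = std::iswupper(L'\u00C1') != 0;    // Á -- classified according to the chosen locale
    return upper ? 0 : 1;
}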
joe
Post by Rahul
Hi,
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and Á is an "Latin capital letter A with acute"
Is there any function (mfc, boost or in any other library) which I can
use to find out the above said difference? My application is a native
VC++ program.
Thanks in advance
Rahul
Joseph M. Newcomer [MVP]
email: ***@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
James Kanze
2010-12-10 10:43:01 UTC
Post by Rahul
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and Á is an "Latin capital letter A with acute"
Is there any function (mfc, boost or in any other library) which I can
use to find out the above said difference? My application is a native
VC++ program.
What's wrong with iswupper (in wctype.h)? (Like isupper, it is
locale dependent.) Or using the equivalent functionality in
<locale>?

Also, be aware that concepts such as upper case don't
necessarily have any meaning in non-Latin alphabets.
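The <locale> version would be something like (a sketch):

#include <locale>

bool IsUpperCase(wchar_t ch)
{
    static const std::locale loc("");   // the user's default locale; assumed to be available on the system
    return std::isupper(ch, loc);       // the overload from <locale>, not the one in <cctype>
}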

--
James Kanze
Goran
2010-12-10 13:15:03 UTC
Post by James Kanze
Post by Rahul
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and     Á is an "Latin capital letter A with acute"
Is there any function (mfc, boost or in any other library) which I can
use to find out the above said difference? My application is a native
VC++ program.
What's wrong with iswupper (in wctype.h)?  (Like isupper, it is
locale dependent.)  Or using the equivalent functionality in
<locale>?
If a character is outside the basic multilingual plane, how do you plan to
put it in a wint_t? (I don't know if case matters for languages
outside the BMP, but why wouldn't it?)

That's why I proposed GetStringType or ICU.

Goran.
James Kanze
2010-12-10 15:26:11 UTC
Post by Goran
Post by James Kanze
Post by Rahul
I have a std::wstring and I want to find which character are upper
case and which ones are lowercase. the std::isupper and islower seems
to work on ASCII characters only but I want to be able to find out all
kinds of uppercase and lowercase characters
e.g. á is an "Latin small letter a with acute"
and Á is an "Latin capital letter A with acute"
Is there any function (mfc, boost or in any other library) which I can
use to find out the above said difference? My application is a native
VC++ program.
What's wrong with iswupper (in wctype.h)? (Like isupper, it is
locale dependent.) Or using the equivalent functionality in
<locale>?
If character is outside basic multilingual plane, how do you plan to
put it in a wint_t? (I don't know if case matters for languages
outside BMP, but why wouldn't it?).
That is a general problem with all such functions which take
a single code point (even in UTF-32, although it probably only
affects very, very few characters with UTF-32).
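For what it's worth, reassembling a surrogate pair before classification
is simple enough (a sketch; unpaired surrogates are not validated):

#include <string>

// returns the code point starting at position i of a UTF-16 wstring (Windows wchar_t)
unsigned long CodePointAt(const std::wstring& s, std::wstring::size_type i)
{
    unsigned long hi = s[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.size())
    {
        unsigned long lo = s[i + 1];
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
    return hi;     // BMP character (or an unpaired surrogate, returned as-is)
}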
Post by Goran
That's why I proposed GetStringType or ICU.
GetStringType seems to have a fairly complicated interface; I'm
not too sure about ICU. But you're right. And the complicated
interface is due to the fact that the problem itself is more
complicated than it might appear at first glance. (Somewhere
floating around, I've got code which implements the functions in
ctype for UTF-8. Obviously, it takes two iterators to bytes,
rather than a single int, as argument. And the actual tables it
uses are generated from the UnicodeData.txt file. But one of
the things I learned while doing it is that some obvious
definitions, like isupper, are far from obvious once you leave
the usual Western European conventions. And that it still isn't
really correct, because it ignores composed characters, and only
treats single code points.)

--
James Kanze