How to read Unicode characters?

Hi all,

I am currently working on a project that reads Unicode text files and shows the characters in a ListCtrl. I got it working using CFile and the API functions WideCharToMultiByte and MultiByteToWideChar. I read the BOM of each .txt file and then get the bytes for the respective characters.

But I am not sure whether those functions work properly when a character takes more than two bytes, which of course means surrogates. Is there any deficiency or limitation in those functions?

I know it is possible to use wistream instead, but which is the better way? So far I have not been able to get wistream working. I would like to use only library functions; previously I wrote my own function.

Guys, I need your help implementing wistream/wifstream so that I get the correct result. If that is not possible using only the library functions provided by VC, what kind of function do I need to write?

Thanks...
Sajal




  • ppz

    Thank you, Caves, for your suggestion. Now at least I will not try to use wistream/wifstream, because I have to handle all Unicode characters.

    Can you tell me whether WideCharToMultiByte and the reverse function work correctly with all Unicode characters? And what is the correct way to read and represent all Unicode characters?

    Thanks..

  • Dion Le Roux

    Hi: unfortunately wistream/wifstream were not designed for Unicode: they assume a fixed wchar_t character size, and on Windows wchar_t is 2 bytes. If you know that you will never encounter a surrogate then you could use these types, but if you are going to encounter surrogates then you will hit problems.
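    To make the surrogate problem concrete, here is a minimal sketch in portable C++11. It uses char16_t as a stand-in for the 2-byte Windows wchar_t, and the helper names (is_high_surrogate, is_low_surrogate, count_code_points) are hypothetical, not part of any VC library. It only shows that the number of 16-bit units and the number of characters can differ.

```cpp
#include <cstddef>
#include <string>

// Code units in 0xD800..0xDBFF are high (lead) surrogates.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Code units in 0xDC00..0xDFFF are low (trail) surrogates.
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: counts code points in a UTF-16 string,
// treating each valid high+low surrogate pair as a single character.
// An unpaired surrogate is simply counted as one unit.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the trail unit of the pair
        }
        ++n;
    }
    return n;
}
```

    For example, 'A' followed by U+1D11E (stored as the pair 0xD834 0xDD1E) occupies three 16-bit units but is only two characters, which is exactly the case a fixed-size assumption gets wrong.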

    Having said that I am not sure exactly how to deal with this. On the C++ compiler team we decided just to ignore this issue for Visual C++ 2005 but I am certain that we will have to deal with it in a future release.

    One option would be to move to your own 32-bit character type: but this solution has its own problems (interfacing with the rest of the system being one).
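    As a rough illustration of the 32-bit option, the sketch below converts UTF-16 to a string of 32-bit code points using standard C++11 types. The function name utf16_to_utf32 is made up for this example, and replacing unpaired surrogates with U+FFFD is my own assumption about error handling, not something the Windows API promises.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-16 to UTF-32.
// Valid surrogate pairs are combined into one code point;
// unpaired surrogates are replaced with U+FFFD (an assumption).
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()) {
            char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine the two 10-bit halves and add the 0x10000 offset.
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                ++i;  // the trail unit has been consumed
                continue;
            }
        }
        if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD;  // lone surrogate: substitute the replacement character
        }
        out.push_back(u);
    }
    return out;
}
```

    After the conversion every element is one code point, but the result still has to be converted back to 16-bit strings before it can be handed to the rest of the system.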

    Another is to stick with a 16-bit character type but to only use the PSDK functions for string manipulation as these functions do handle surrogates.

    If you haven't seen this already here's the link to the MSDN articles on Unicode:

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp

    It includes a small article on surrogates:

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp

    Sorry for the lack of a clear answer but I think that the need to deal with surrogates is an issue that most people are just starting to find out about.

  • Amberite

    In practice I find "floating accents" to cause much more trouble than surrogate pairs. Not even UTF-32 can take care of them.
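
    A small sketch of why this is so, in portable C++11. The same visible letter can be one code point or two even in UTF-32. The helper names are hypothetical, and the range check covers only the Combining Diacritical Marks block (U+0300 to U+036F), which is a deliberate simplification; real code would need proper Unicode normalization.

```cpp
#include <cstddef>
#include <string>

// "e with acute" as one precomposed code point: U+00E9.
const std::u32string precomposed = {char32_t(0x00E9)};

// The same letter as 'e' (U+0065) followed by
// COMBINING ACUTE ACCENT (U+0301), a "floating accent".
const std::u32string decomposed = {char32_t(0x0065), char32_t(0x0301)};

// Hypothetical helper: true only for the Combining Diacritical Marks
// block U+0300..U+036F. Other combining ranges are ignored here.
inline bool is_combining_diacritic(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Hypothetical helper: counts code points that are not combining marks.
std::size_t count_base_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_diacritic(c)) ++n;
    }
    return n;
}
```

    The two strings compare unequal and have different lengths although they display the same, so "one array element per character" does not hold even with a 32-bit type.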
  • shamita

    As long as the code page is set correctly, WideCharToMultiByte can handle all Unicode characters, though how it handles illegal Unicode characters depends on the version of Windows you are using as well as on the flags you pass into the function.

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp

    I would read in a file using ReadFile and then pick a canonical representation - personally I would probably pick UTF-16 - but the choice is up to you.
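
    Before converting to a canonical representation you need to know what the file contains. The sketch below is one possible way to inspect the first bytes of a buffer (for instance, one filled by ReadFile) for a BOM. The Encoding enum, the BomInfo struct and detect_bom are names invented for this example. Note the assumption that FF FE 00 00 means UTF-32LE: a UTF-16LE file whose first character is U+0000 would look the same, so treat the result as a hint.

```cpp
#include <cstddef>

// Hypothetical types for this sketch.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomInfo {
    Encoding encoding;       // encoding suggested by the BOM
    std::size_t bom_length;  // number of bytes to skip before the text
};

// Hypothetical helper: inspects the start of a buffer for a byte order mark.
// The 4-byte UTF-32 marks are checked first because the UTF-32LE mark
// begins with the same two bytes as the UTF-16LE mark.
BomInfo detect_bom(const unsigned char* data, std::size_t size) {
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00)
        return {Encoding::Utf32LE, 4};
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF)
        return {Encoding::Utf32BE, 4};
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return {Encoding::Utf8, 3};
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return {Encoding::Utf16LE, 2};
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return {Encoding::Utf16BE, 2};
    return {Encoding::Unknown, 0};  // no BOM: the caller must guess or ask
}
```

    Once the encoding is known, skip bom_length bytes and convert the rest to whichever representation you picked.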
