An STL Problem: Reading Unicode File in VC++ 2005

Hi,

I wrote some code to load a Unicode text file:

#include <tchar.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int _tmain(int /*argc*/, _TCHAR* /*argv*/[])
{
    wstring szresource;
    wstringstream strstm;
    wifstream file;
    file.open(L"E:\\Page1.xml");

    // Read whitespace-delimited tokens until the stream fails or reaches EOF.
    while (!file.fail() && !file.eof())
    {
        file >> szresource;

        wcout << szresource.c_str() << endl;

        strstm << szresource;
    }

    // other operation

    return 0;
}

The file "Page1.xml" is not a standard XML file, just plain text. However, wcout printed only the first character stored in szresource, and after the while loop strstm also seemed to hold only the first character.

I debugged the code and found something strange. Page1.xml contained only one word:

Dialog

Since it was handled as wchar_t, I expected the string in memory to look like

44 00 69 00 61 00 6c 00 6f 00 67 00

However, in the corresponding memory map, it was

44 00 00 00 69 00 00 00 61 00 00 00 6c 00 00 00 6f 00 00 00 67 00 00 00

I am really confused. Could anybody help? By the way, the solution was built with the Unicode character set, which is the default.

Thanks a lot!





  • lovy0000

    Taka Muraoka's article still applies

    http://www.codeproject.com/vcpp/stl/upgradingstlappstounicode.asp




  • Rafal Szul

    Pomelo wrote:

    Thank you!

    Are you suggesting that this behavior is by design (so not a bug), and that to handle Unicode we have to do the conversions described in Taka Muraoka's article?

    BTW, I doubt this is how ANSI C++ is meant to handle wchar_t with Unicode.

    I checked the relevant part of the C++ standard (27.8), but couldn't understand it. Therefore, I'll quote from Taka's article (in the section Wide File I/O):

    "

    It turns out that the C++ standard dictates that wide-streams are required to convert double-byte characters to single-byte when writing to a file. So in the example above, the wide string L"ABC" (which is 6 bytes long) gets converted to a narrow string (3 bytes) before it is written to the file. And if that wasn't bad enough, how this conversion is done is implementation-dependent.

    "

    (but I'm not sure how he came to this conclusion).
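One way to check the claim in the quote is to write a wide string through a wofstream and look at the size of the file afterwards. This is only a minimal sketch: the helper name written_size and the file name wide_demo.txt are made up for the example, and it assumes the default "C" locale is in effect.

```cpp
#include <fstream>
#include <string>

// Writes `text` through a wide stream and returns the resulting file size in
// bytes. The default file name is a placeholder chosen for this example.
long written_size(const std::wstring& text, const char* path = "wide_demo.txt")
{
    {
        std::wofstream out(path, std::ios::binary);
        out << text;   // the wchar_t -> char conversion happens inside the filebuf
    }                  // leaving the scope closes and flushes the stream

    // Reopen as raw bytes, positioned at the end, so tellg() gives the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::streamoff size = in.tellg();
    return static_cast<long>(size);
}
```

On typical implementations with the default locale, written_size(L"ABC") should come back as 3 rather than 6, which matches what the article describes for plain ASCII text.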



  • highgameapple

    I can only find the following info from the c++ standard

    Multibyte character and Files: A File provides byte sequences. So the streambuf (or its derived classes) treats a file as the external source/sink byte sequence. In a large character set environment, multibyte character sequences are held in files. In order to provide the contents of a file as wide character sequences, wide-oriented filebuf, namely wfilebuf should convert wide character sequences.

    And something about codecvt

    The class codecvt<internT,externT,stateT> is for use when converting from one codeset to another, such as from
    wide characters to multibyte characters or between wide character encodings such as Unicode and EUC

    BTW: In my opinion, a text file "encoded in Unicode" can still use one of several encodings, such as UTF-16 little endian (which Windows Notepad calls "Unicode"), UTF-16 big endian ("Unicode big endian"), or UTF-8, all of which Notepad supports. So there is no portable way to define a default encoding for the file. C++ provides the codecvt class to customize the conversion, and I think it is up to the programmer to tell the fstream which encoding the file really uses.
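If writing a custom codecvt facet is more than is needed, one alternative is to bypass the wide stream, read the file as raw bytes, and decode it by hand. The following is a rough sketch, not the approach from the article: decode_utf16 and read_utf16_file are hypothetical names, a missing BOM is assumed to mean little endian, and surrogate pairs are left as two separate code units (which is how a 16-bit wchar_t on Windows stores them anyway).

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Combines byte pairs into 16-bit code units. A leading BOM (FF FE or FE FF)
// selects the byte order and is dropped; without one, little endian is assumed.
// A trailing odd byte is ignored.
std::wstring decode_utf16(const std::string& bytes)
{
    std::size_t pos = 0;
    bool big_endian = false;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) {
            pos = 2;                      // UTF-16 little endian BOM
        } else if (b0 == 0xFE && b1 == 0xFF) {
            pos = 2;                      // UTF-16 big endian BOM
            big_endian = true;
        }
    }

    std::wstring result;
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned first  = static_cast<unsigned char>(bytes[pos]);
        const unsigned second = static_cast<unsigned char>(bytes[pos + 1]);
        const unsigned unit = big_endian ? ((first << 8) | second)
                                         : ((second << 8) | first);
        result.push_back(static_cast<wchar_t>(unit));
    }
    return result;
}

// Reads the whole file as raw bytes (binary mode), then decodes it.
std::wstring read_utf16_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_utf16(bytes);
}
```

Opening the file in binary mode matters here, because text mode on Windows may translate line endings and damage the byte pairs.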


  • somerandomperson

    Thank you!

    Are you suggesting that this behavior is by design (so not a bug), and that to handle Unicode we have to do the conversions described in Taka Muraoka's article?

    BTW, I doubt this is how ANSI C++ is meant to handle wchar_t with Unicode.



  • imbat

    It seems that the STL fstream assumes ANSI files, so it automatically converts ANSI -> Unicode for you, whether or not the file really is ANSI encoded.

    Please see the URL OShah gave (http://www.codeproject.com/vcpp/stl/upgradingstlappstounicode.asp) for the workaround.
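The memory dump in the question is consistent with this. As an illustration only (widen_each_byte and printable_length are made-up helpers that imitate the apparent behavior, not the library's actual code), turning every byte of a UTF-16 file into its own wchar_t produces a string with a null after each letter:

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Imitates what the default conversion appears to do here: every byte of the
// file becomes one wchar_t, regardless of the file's real encoding.
std::wstring widen_each_byte(const std::string& bytes)
{
    std::wstring result;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(bytes[i]);
        result.push_back(static_cast<wchar_t>(byte));
    }
    return result;
}

// Number of characters that survive a trip through c_str(), because a
// C-style string ends at the first null character.
std::size_t printable_length(const std::wstring& text)
{
    return std::wcslen(text.c_str());
}
```

Feeding it the twelve bytes 44 00 69 00 61 00 6c 00 6f 00 67 00 gives twelve wide characters, D, null, i, null and so on, which with a 16-bit wchar_t is laid out as 44 00 00 00 69 00 00 00 ... in memory. Printing through c_str() then stops right after the "D".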

