Posted to xindice-users@xml.apache.org by Adrian Petru Dimulescu <ad...@free.fr> on 2002/07/01 00:07:38 UTC
UTF8 encoding
hello,
I recently posted some messages about iso-8859-2 encoding problems.
trying to solve that problem, I re-encoded the latin2 xml document as UTF-8
and added it to xindice with AddDocument.
the behaviour is similar: the characters which also exist in iso-8859-1
(è, â, î) are stored correctly. the ones that are specific to 8859-2 are
replaced by "?". this happens in the very file where XIndice holds its
database, so the characters are already lost when the document is written.
this is probably caused by a Writer being opened somewhere in the I/O part of
XIndice (I have not yet found the code which actually does this) without
specifying an encoding. such a Writer falls back to the JVM's default encoding.
as that default is often iso-8859-1, the characters outside it cannot be
encoded and the latin2 texts are improperly handled.
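
to illustrate the suspected mechanism, here is a minimal sketch. it is not
XIndice code: the class EncodingLossDemo and its methods are made up for this
example, and iso-8859-1 is passed explicitly to stand in for a platform default,
so that the result does not depend on the machine it runs on. it assumes a
reasonably modern Java (7 or later).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of XIndice.
class EncodingLossDemo {

    // Encodes text through a Writer bound to the given charset and returns the raw bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E2 (a circumflex) and U+00EE (i circumflex) exist in iso-8859-1;
        // U+015F (s cedilla) and U+0163 (t cedilla) exist only in iso-8859-2.
        String text = "\u00E2\u00EE\u015F\u0163";

        // The two unmappable characters are silently replaced by 0x3f, i.e. "?".
        System.out.println(hex(write(text, StandardCharsets.ISO_8859_1)));
        // prints "e2 ee 3f 3f"
    }
}
```

the Writer raises no error for the unmappable characters, which would explain
why the loss goes unnoticed until the stored file is inspected.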
indeed, one workaround is to change Java's file.encoding property. for
instance, if I call java this way:
java -Dfile.encoding=utf-8
the problem disappears: the latin2 text is stored as utf-8 in the xindice db,
which is ok for me.
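
to check which default a given JVM actually ends up with, something like the
following can be run with and without the -Dfile.encoding option. this is only
a sketch: DefaultCharsetCheck is a made-up name, and how a particular Java
version treats a file.encoding value given on the command line is an assumption
worth verifying for that version.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of XIndice.
class DefaultCharsetCheck {

    // Reports the file.encoding system property and the charset that Writers
    // and Readers use when no encoding is given explicitly.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

the default charset is fixed when the JVM starts, so the property has to be
given on the command line; calling System.setProperty later in the program does
not change what already-defaulted Writers use.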
I wonder whether it would not be more proper to let the user choose the
encoding in which the text will be stored, and do something like:
Writer writer = new ...Writer(outputStream, "my-encoding-here")
in the I/O code of XIndice.
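
spelled out a bit more, the idea could look like the sketch below. the class
ExplicitEncodingWriter and its store/load methods are invented for the example
and say nothing about how XIndice's I/O code is really organised. it also
assumes the JVM ships the ISO-8859-2 charset, which mainstream JDKs do but the
Java specification does not require.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical sketch, not XIndice's actual I/O code.
class ExplicitEncodingWriter {

    // Writes text with an explicitly named encoding instead of the platform default.
    static byte[] store(String text, String encodingName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, Charset.forName(encodingName))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decodes stored bytes with the same explicitly named encoding.
    static String load(byte[] bytes, String encodingName) {
        return new String(bytes, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // a breve, s cedilla, t cedilla: all specific to latin2.
        String text = "\u0103\u015F\u0163";

        byte[] latin2 = store(text, "ISO-8859-2"); // one byte per character
        byte[] utf8 = store(text, "UTF-8");        // two bytes per character here

        System.out.println(latin2.length + " " + utf8.length);
        System.out.println(load(latin2, "ISO-8859-2").equals(text));
        System.out.println(load(utf8, "UTF-8").equals(text));
    }
}
```

with the encoding named on both the write and the read side, the text survives
the round trip whatever the machine's default encoding is.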
even better would be to read the <?xml version="1.0" encoding="my-encoding-here"?>
declaration and use the given encoding when storing the document into XIndice.
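
as a rough sketch of how the declared encoding can be picked up: if the
document is handed to a standard XML parser as bytes rather than as an
already-decoded String or Reader, the parser reads the declaration itself. the
example uses the JAXP DOM parser bundled with the JDK. XmlDeclaredEncoding is a
made-up class, and nothing here is taken from XIndice.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch, not XIndice code.
class XmlDeclaredEncoding {

    // Parses raw bytes so that the parser decodes them according to the XML declaration.
    static Document parse(byte[] xmlBytes) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String content = "\u0103\u015F\u0163"; // latin2-specific characters
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?><doc>" + content + "</doc>";
        byte[] bytes = xml.getBytes(Charset.forName("ISO-8859-2"));

        Document doc = parse(bytes);

        // Encoding named in the declaration, usable later when writing the document out.
        System.out.println(doc.getXmlEncoding());
        // The text was decoded correctly without the caller naming any encoding.
        System.out.println(doc.getDocumentElement().getTextContent().equals(content));
    }
}
```

the value returned by getXmlEncoding() could then be passed to the Writer that
stores the document, instead of relying on the platform default.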
otherwise, most users will end up using, without knowing it, the default
encoding of their machines.
best regards,
adrian.