Posted to xindice-users@xml.apache.org by Adrian Petru Dimulescu <ad...@free.fr> on 2002/07/01 00:07:38 UTC

UTF8 encoding

hello,

i have recently posted some messages on iso-8859-2 encoding problems.

trying to solve that problem I encoded the latin2 xml document as UTF-8 and 
did an AddDocument to xindice. 

the behaviour is similar: the characters which also exist in iso-8859-1 
(è, â, î) are alright. the ones that are specific to iso-8859-2 are replaced by 
"?". this happens in the very file where Xindice holds its database.

this is probably caused by opening a Writer somewhere in the I/O part of 
Xindice (i have not yet found the code which actually does this) without 
specifying an encoding.

as the platform default encoding is often iso-8859-1 (it depends on the 
machine's locale), the latin2 characters which have no latin1 equivalent are 
lost when written.
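
A minimal sketch of the suspected effect, not taken from the Xindice source: the class and method names are made up for illustration. It encodes a string to iso-8859-1 and decodes it back, which is roughly what a Writer and Reader bound to a latin1 default would do. Characters without a latin1 mapping come back as "?".

```java
import java.nio.charset.StandardCharsets;

public class Latin2Loss {
    // Hypothetical helper: encode to ISO-8859-1 and decode again, mimicking
    // a Writer/Reader pair that silently uses a latin1 default encoding.
    static String roundTripLatin1(String text) {
        // getBytes(Charset) substitutes the charset's replacement byte ('?')
        // for every character it cannot map.
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e2 (a with circumflex) exists in latin1 and latin2;
        // \u0103 (a with breve) and \u015f (s with cedilla) are latin2 only.
        String original = "\u00e2\u0103\u015f";
        String stored = roundTripLatin1(original);
        System.out.println("first char survived: " + (stored.charAt(0) == '\u00e2'));
        System.out.println("second char replaced: " + (stored.charAt(1) == '?'));
        System.out.println("third char replaced: " + (stored.charAt(2) == '?'));
    }
}
```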

indeed, a workaround is to change the file.encoding property for Java. for 
instance, if i call java this way:

java -Dfile.encoding=utf-8

the problem disappears: the latin2 text is stored as utf-8 in the xindice db, 
which is ok for me.
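
To check which default a given JVM actually picked up, something like the following could be run with and without the -Dfile.encoding flag. This is only a sketch; the class name is made up. The result depends on the JVM version and the locale of the machine, and newer JDKs (18 and later, as far as I know) default to UTF-8 on their own.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Reports the file.encoding system property and the charset that
    // Readers/Writers created without an explicit encoding will use.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```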

I wonder whether it would not be more proper to let the user choose the 
encoding in which the text will be stored, and do something like:

	Writer writer = new ...Writer(outputStream, "my-encoding-here")

in the I/O code of Xindice.
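
As a concrete illustration of that idea, here is a small sketch using java.io.OutputStreamWriter with an explicit charset. It is not the Xindice I/O code (which i have not located); the class and method names are assumptions made for the example, and a ByteArrayOutputStream stands in for the real database file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

public class ExplicitEncodingWriter {
    // Hypothetical helper: writes text through a Writer whose encoding is
    // chosen by the caller instead of being inherited from the platform.
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Charset.forName throws an unchecked exception for unknown names.
        try (Writer writer = new OutputStreamWriter(out, Charset.forName(encoding))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // \u0103 (a with breve) is a latin2 character with no latin1 mapping.
        byte[] utf8 = write("\u0103", "UTF-8");
        byte[] latin1 = write("\u0103", "ISO-8859-1");
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("latin1 wrote '?': " + (latin1[0] == (byte) '?'));
    }
}
```

With "UTF-8" the character is kept as a two-byte sequence; with "ISO-8859-1" the writer substitutes "?", which matches what shows up in the database file.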

or, even better, look at the <?xml version="1.0" encoding="my-encoding-here"?> 
declaration and use the given encoding when storing the document into Xindice. 
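
One possible way to read the declared encoding, sketched with the standard StAX API (javax.xml.stream) rather than whatever parser Xindice uses internally. The class name, method name and the fallback parameter are assumptions for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DeclaredEncoding {
    // Hypothetical helper: returns the encoding named in the XML declaration,
    // or the given fallback when the document does not declare one.
    static String declaredEncoding(byte[] xml, String fallback) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                // Encoding from the <?xml ... encoding="..."?> declaration,
                // or null if none was declared.
                String declared = reader.getCharacterEncodingScheme();
                return declared != null ? declared : fallback;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?><doc/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(doc, "UTF-8"));
    }
}
```

The returned name could then be passed to the Writer that stores the document, so that the stored bytes agree with the document's own declaration.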

otherwise, most users will end up, without knowing it, with the default 
encoding of their machine.


best regards,
adrian.