Posted to xindice-dev@xml.apache.org by James Bates <ja...@amplexor.com> on 2002/02/15 13:12:25 UTC

Unicode issues: submitting an update...

Finally I've gotten inside the Xindice source code and had a more
serious look at how it works, and am beginning to understand the overall
organisation of the code, as well as some more detailed parts of it.

I had wanted to understand why XML documents with Greek or other
non-western characters in them were not stored correctly by Xindice:
it stored a collection of question marks instead. Forgetting the
command-line tools for the moment, I noticed many places in the source
code where characters and bytes are interchanged without any regard for
encoding schemes. This causes Java to use what it calls the "default
encoding scheme", which on my computer happens to be ISO-8859-1. I
located these "dangerous" instructions and converted them as best
I could to store bytes as UTF-8 inside the Xindice data files. This does
of course make any data-files with non-ASCII characters (any character
above U+007F) backwards INCOMPATIBLE!
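
To illustrate the kind of conversion described above, here is a minimal
sketch. The class and method names are hypothetical and are not taken from
the Xindice sources. It uses java.nio.charset.StandardCharsets, which only
exists in later Java versions; on a 2002-era JDK the equivalent would be
String.getBytes("UTF-8"). ISO-8859-1 is named explicitly to stand in for a
platform default encoding such as the one mentioned above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of Xindice.
class EncodingRoundTrip {

    // Encode with an explicit charset instead of relying on the platform default.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset that was used for encoding.
    static String fromUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Stands in for a platform whose default encoding is ISO-8859-1:
    // characters the charset cannot represent are replaced during encoding.
    static String lossyLatin1RoundTrip(String text) {
        byte[] data = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // Greek alpha, beta, gamma

        // The UTF-8 round trip preserves the text.
        System.out.println(fromUtf8(toUtf8(greek)).equals(greek));

        // The ISO-8859-1 round trip loses it: each letter comes back as '?'.
        System.out.println(lossyLatin1RoundTrip(greek));
    }
}
```

The second call shows the symptom reported above: Greek text comes back as
a run of question marks once it has passed through an encoding that cannot
represent it.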

On the DOM Compressor, I saw that care had already been taken to do this
correctly, but in classes such as org.apache.xindice.core.data.Value and
org.apache.xindice.util.ByteBuffer, conversions were still taking place
with the "default" character sets.

I have tried out my patches here, and the patched Xindice now DOES
correctly store XML documents with, for example, mixed Greek and Arabic
content, as UTF-8, and upon retrieval reproduces the data correctly. This
was using the Java XML:DB API to communicate with the database.

In all I have modified 15 source files from a CVS copy of Xindice dated
12.2.2001. As far as I can make out, only Paged.java and BTree.java have
changed since then (In fact I did a 'cvs update' this morning and that's
the only change I saw).

The changes do mean, however, that any non-ASCII data in existing Xindice
datafiles makes those datafiles useless. Also, as I don't FULLY understand
the function of every bit of the source code, I may have introduced
dangerous code myself, e.g. by breaking assumptions about the lengths of
items (indeed strings and their UTF-8 representations in bytes do NOT in
general have the same length!). In particular I am very unsure about the
workings of:

- ValueIndexer.java
- HTTPServer.java

Both do suspect things when converting between bytes and characters, but
I understand them too poorly to know how to fix them.
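
The point about lengths can be made concrete with a small sketch. The class
name is hypothetical and nothing here is taken from ValueIndexer.java or
HTTPServer.java; it only shows that String.length() (a count of UTF-16 code
units) and the size of the UTF-8 byte array diverge as soon as the text
leaves ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of Xindice.
class Utf8Length {

    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";                // 3 chars, 1 byte each in UTF-8
        String greek = "\u03b1\u03b2\u03b3"; // 3 chars, 2 bytes each in UTF-8
        String euro  = "\u20ac";             // 1 char, 3 bytes in UTF-8

        for (String s : new String[] { ascii, greek, euro }) {
            System.out.println(s.length() + " chars -> " + utf8Length(s) + " bytes");
        }
    }
}
```

Any code that sizes a buffer or computes an offset from String.length()
and then writes UTF-8 bytes would therefore need to be rechecked.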

For these reasons, I'd appreciate it if someone who knows the internals
of Xindice would take a look at my modifications, as such a person would
probably understand better the full implications of the changes. I've
included a comment /* UTF8FIXED */ at the top of each such file, and a
/* UTF8FIX */ comment near each patched instruction, to locate them easily.
Possibly Kimbro Staken or Tom Bradford could take a look? I've attached a
zip with the patched sources.

There's also an issue with the command-line tools: they assume without
checking that all XML documents are in the "platform default encoding" (because they
read/write the XML files converting bytes to Strings without supplying a
character encoding scheme). The situation here is more complicated, as the
XML specification states (section 4.3.3) that the XML parser should detect
whether the document is UTF-16, UTF-8 or some other encoding specified in the
"encoding" "attribute" (not really an attribute I know) of the <?xml?>
declaration...

Possibly the best way to fix this would be to have Xerces read in the document,
and feed a DOM tree or SAX events to the Collection.setContentAsXXX methods. (Xerces
has code to perform this auto-detection).
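
A rough sketch of that approach follows, assuming the standard JAXP API
(javax.xml.parsers) rather than Xerces-specific classes; the class name is
hypothetical. The key point is that the parser is handed raw bytes through
an InputStream, not a String or Reader, so that it can apply the encoding
detection from section 4.3.3 itself. How the resulting DOM would then be
passed on to Xindice is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; it is not part of Xindice.
class EncodingAwareParse {

    // Parse from a byte stream so the parser, not our code, decides how
    // to turn bytes into characters.
    static Document parse(InputStream in) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(in);
    }

    // Convenience helper for the demo: the text content of the root element.
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // A document that declares ISO-8859-1 and is stored in that encoding.
        String declared =
                "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>caf\u00e9</doc>";
        byte[] latin1 = declared.getBytes(StandardCharsets.ISO_8859_1);

        // A document without an encoding declaration, stored as UTF-8.
        String undeclared = "<?xml version=\"1.0\"?><doc>\u03b1\u03b2\u03b3</doc>";
        byte[] utf8 = undeclared.getBytes(StandardCharsets.UTF_8);

        System.out.println(rootText(latin1).equals("caf\u00e9"));
        System.out.println(rootText(utf8).equals("\u03b1\u03b2\u03b3"));
    }
}
```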

When writing documents out again, UTF-8 would be the safest to use. Indeed without
an "encoding" "attribute" in the XML declaration, other software MUST assume
that the document is UTF-8 or UTF-16. A better way would be to let the user choose
the encoding he wants with, say, a command-line switch, and then produce the XML
file in the encoding he'd like.
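
One way this could look is sketched below, assuming the standard JAXP
transformer API (javax.xml.transform) as the serializer. The class name and
the use of the first program argument as the encoding switch are
hypothetical and do not describe the existing command-line tools.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it is not part of Xindice.
class EncodedXmlWriter {

    // Serialize the document to bytes in the requested encoding. The
    // serializer also writes that encoding into the XML declaration.
    static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Writing to an OutputStream (not a Writer) lets the serializer
        // perform the character-to-byte conversion itself.
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Build a small DOM from character data for the demo.
    static Document fromString(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical switch: first argument names the encoding, UTF-8 otherwise.
        String encoding = args.length > 0 ? args[0] : "UTF-8";
        Document doc = fromString("<doc>\u03b1\u03b2\u03b3</doc>");
        byte[] bytes = serialize(doc, encoding);
        System.out.println(encoding + ": " + bytes.length + " bytes");
    }
}
```

Defaulting to UTF-8 keeps the behaviour safe when no encoding is given,
while still letting a user ask for something else.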

For the moment I have hard-coded UTF-8 into the command-line tools, but this
obviously needs to be further developed.

Anyway I really hope someone with knowledge of Xindice internals can help me
resolve these issues: Xindice could then become one of the first ever (I kid
you not!) XML databases to fully support the complete XML specification.

Kind regards,
James Bates
<<xml-xindice.zip>>

Re: Unicode issues: submitting an update...

Posted by Stefano Mazzocchi <st...@apache.org>.

James Bates wrote:

> Possibly Kimbro Staken or Tom Bradford could take a look? I've attached a
> zip with the patched sources.

James, 

you have submitted 4.5 MB of attachment, add 15% for MIME encoding,
multiply by more than a hundred (the people subscribed here) and you
get the amount of frustration that such a post yields in the community.

All for sending a few patches that would probably take a few KB if you
just sent the diffs.

Next time, please, understand that "being polite" on a mailing list has
several forms and one of them is being respectful of others' bandwidth.

Thanks.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------

Re: Unicode issues: submitting an update...

Posted by Kimbro Staken <ks...@xmldatabases.org>.

Hi James,

This looks like some good work and something we definitely need. As 
Stefano mentioned a patch would be easier to work with so that we can see 
exactly what changed. Thanks for tackling this, it will really improve the 
system.

Kimbro Staken
XML Database Software, Consulting and Writing
http://www.xmldatabases.org/