Posted to xindice-users@xml.apache.org by Gianugo Rabellino <gi...@rabellino.it> on 2002/01/07 23:47:17 UTC

Encoding... giving up :(

Ciao,

I spent quite a lot of time and effort trying to track down the new 
encoding bug that appeared today on my Solaris installation (Sparc/Sol8, 
tried JDK 1.3.1.01, 1.3.1.0, 1.4b3) to no avail. Tracing the flow 
of the XML data, it seems to me that the problem ultimately lies in 
DOMParser.java or, rather, in the underlying SAX parser. Up to that point, 
AFAIK, the data flowing through are good: the byte arrays look fine 
and, if converted to Strings, a conversion via getBytes("UTF-8") always did 
the trick. After the Sp.parse() method is called, though, the char[]s 
received in the characters() method look screwed up, with all the 
encoding information lost.
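
One way to check what the parser actually hands to the SAX character
callback, independent of how a terminal renders it, is to dump each char
as a Unicode code point. The following is a minimal sketch under
assumptions, not Xindice code: the class SaxCharDumpSketch and the method
codePointsSeenByHandler are hypothetical names, and the standard JAXP
SAXParser stands in for whatever parser DOMParser.java wraps.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper, for illustration only.
final class SaxCharDumpSketch {

    // Parses the given XML bytes and returns every char delivered to
    // characters() as a space-separated list of U+XXXX values.
    static String codePointsSeenByHandler(byte[] xmlBytes) {
        final StringBuilder seen = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may call this several times per text node,
                // so the values are appended rather than replaced.
                for (int i = start; i < start + length; i++) {
                    seen.append(String.format("U+%04X ", (int) ch[i]));
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Passing raw bytes lets the parser detect the encoding itself.
            parser.parse(new ByteArrayInputStream(xmlBytes), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return seen.toString().trim();
    }

    public static void main(String[] args) {
        byte[] xml = "<doc>\u00e9</doc>".getBytes(StandardCharsets.UTF_8);
        System.out.println(codePointsSeenByHandler(xml));
    }
}
```

If the bytes are decoded as UTF-8, an e-acute shows up as the single value
U+00E9. Two values such as U+00C3 U+00A9 would suggest that the bytes were
decoded with a single-byte charset somewhere before or inside the parser.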

I hope someone can shed some light on it. Meanwhile, being in bed with a 
flu, I'll give up hacking and wait for your help :)

Ciao,

-- 
Gianugo


Re: Encoding... giving up :(

Posted by Joel Rosi-Schwartz <jo...@btconnect.com>.

Gianugo Rabellino wrote:

> Oh well, I can have fun even from the most boring code around :)
> Actually I was thinking of reformatting code so that there are no more
> import.*... go figure ;) I sure hope I can be of some help pretty soon.

You might want to try Eclipse for this, as it has an automated "Organize imports"
feature.  You can fix all of Xindice in about 15 minutes. Now, a smart Perl
script that can run across all of the files from one command line would be better,
but I have not run into one yet ;-)

Just a thought.

Joel


Re: Encoding... giving up :(

Posted by Tom Bradford <br...@dbxmlgroup.com>.
Tom Bradford wrote:
> 
> Gianugo Rabellino wrote:
> > > Not working yet, but probably almost there. I checked and it seems like
> > > my code modifications succeed in a valid document insertion, but
> > > retrieval still fails. I'll clean up my messy attempts a bit and send you
> > > a diff: maybe we both forgot some quirk elsewhere in the source.
> >
> > YAY! Now it works. :-)
> 
> Cool.
> 
> > Attached is the diff file; please review it carefully, as I might have gone
> > too far in converting Strings and encoding stuff (hopefully properly).
> > But now, in my setup, it works like a charm. I haven't tested it
> > on W2K; I'll do that tomorrow.
> 
> I'll try it on Linux, OS X and Solaris this evening.

Just tried it on Linux and am seeing some strangeness while bulk loading
800 standard ANSI docs (7-bit characters only).  About 15 of them won't
load properly.  The server isn't able to parse them even though they
look perfectly fine and parsed previously.  It may be an issue with
AddDocument using InputStreams instead of Readers.
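
The difference between handing a parser bytes and handing it characters
can be sketched as follows. This is an illustrative sketch, not the
AddDocument code: ParseInputSketch, textViaStream and textViaReader are
hypothetical names, and the standard JAXP DocumentBuilder stands in for
Xindice's own parser. With an InputStream the parser detects the encoding
itself (from the XML declaration or a byte-order mark); with a Reader the
caller has already chosen a charset, and the parser reads the characters
directly, disregarding any encoding declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical comparison of byte-stream and character-stream parsing.
final class ParseInputSketch {

    // Byte stream: the parser decides how to decode the bytes.
    static String textViaStream(byte[] xmlBytes) {
        return parse(new InputSource(new ByteArrayInputStream(xmlBytes)));
    }

    // Character stream: the caller decodes first, using the charset it assumes.
    static String textViaReader(byte[] xmlBytes, Charset assumed) {
        String decoded = new String(xmlBytes, assumed);
        return parse(new InputSource(new StringReader(decoded)));
    }

    // Returns the text content of the document element.
    private static String parse(InputSource source) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc>plain text</doc>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(textViaStream(xml));
        System.out.println(textViaReader(xml, StandardCharsets.UTF_8));
        System.out.println(textViaReader(xml, StandardCharsets.ISO_8859_1));
    }
}
```

For 7-bit-only input the three calls in main should yield the same text,
so a failure on such documents probably points somewhere other than the
charset choice itself.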

-- 
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)

Re: Encoding... giving up :(

Posted by Tom Bradford <br...@dbxmlgroup.com>.
Gianugo Rabellino wrote:
> > Not working yet, but probably almost there. I checked and it seems like
> > my code modifications succeed in a valid document insertion, but
> > retrieval still fails. I'll clean up my messy attempts a bit and send you
> > a diff: maybe we both forgot some quirk elsewhere in the source.
> 
> YAY! Now it works. :-)

Cool.

> Attached is the diff file; please review it carefully, as I might have gone
> too far in converting Strings and encoding stuff (hopefully properly).
> But now, in my setup, it works like a charm. I haven't tested it
> on W2K; I'll do that tomorrow.

I'll try it on Linux, OS X and Solaris this evening.

> Thanks again Tom for bearing with me,

No problem.

-- 
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)

Re: Encoding... giving up :(

Posted by Gianugo Rabellino <gi...@apache.org>.
Gianugo Rabellino wrote:


> 
> Not working yet, but probably almost there. I checked and it seems like 
> my code modifications succeed in a valid document insertion, but 
> retrieval still fails. I'll clean up my messy attempts a bit and send you 
> a diff: maybe we both forgot some quirk elsewhere in the source.


YAY! Now it works. :-)

Attached is the diff file; please review it carefully, as I might have gone 
too far in converting Strings and encoding stuff (hopefully properly). 
But now, in my setup, it works like a charm. I haven't tested it 
on W2K; I'll do that tomorrow.

I'm a happy man again: my site will go into production tomorrow, my flu 
is almost over, I've been hacking for a couple of days and learning 
about XIndice internals... life is beautiful. :)

Thanks again Tom for bearing with me,

-- 
Gianugo Rabellino




Re: Encoding... giving up :(

Posted by Gianugo Rabellino <gi...@apache.org>.
Tom Bradford wrote:


>> Any news about this, Tom? I'm still striving to understand what exactly 
>> is happening, but it's a bit troublesome since my only access to a 
>> Solaris machine is via a slow line, so it's a bit painful to 
>> code and debug with mere shell access and no debugging tools. However, 
>> now it *seems* that I can manage to store the documents correctly, 
>> but I still have problems with retrieval...
> 
> Try out the source code that I just checked into CVS.  You'll have to 
> wipe and recreate your collections, but I believe it should work.


Not working yet, but probably almost there. I checked and it seems like 
my code modifications succeed in a valid document insertion, but 
retrieval still fails. I'll clean up my messy attempts a bit and send you 
a diff: maybe we both forgot some quirk elsewhere in the source.

Thanks again,

-- 
Gianugo


Re: Encoding... giving up :(

Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Wednesday, January 9, 2002, at 12:12 PM, Gianugo Rabellino wrote:
> Any news about this, Tom? I'm still striving to understand what exactly 
> is happening, but it's a bit troublesome since my only access to a 
> Solaris machine is via a slow line, so it's a bit painful to code and debug 
> with mere shell access and no debugging tools. However, now it *seems* 
> that I can manage to store the documents correctly, but I still have 
> problems with retrieval...

Gianugo,

Try out the source code that I just checked into CVS.  You'll have to 
wipe and recreate your collections, but I believe it should work.

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)


Re: Encoding... giving up :(

Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Wednesday, January 9, 2002, at 12:12 PM, Gianugo Rabellino wrote:
> Any news about this, Tom? I'm still striving to understand what exactly 
> is happening, but it's a bit troublesome since my only access to a 
> Solaris machine is via a slow line, so it's a bit painful to code and debug 
> with mere shell access and no debugging tools. However, now it *seems* 
> that I can manage to store the documents correctly, but I still have 
> problems with retrieval...

Another user pointed out some issues in how we serialize our strings; 
fixing them may address the problems.  I was calling getBytes and basing 
the array size on the string.length() method, when I should be calling 
getBytes with a proper encoding and basing the array size on the byte 
array that it yields.  I'll try to get this all fixed this afternoon.
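
The difference between the two approaches can be illustrated with a short
sketch. It is not the Xindice serialization code: the class
StringSerializationSketch, its method names and the length-prefixed record
layout are assumptions made for illustration, and UTF-8 is assumed as the
"proper encoding".

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of char-count versus byte-count sizing.
final class StringSerializationSketch {

    // Problematic pattern: the buffer is sized from the char count, so any
    // character that needs more than one byte in UTF-8 causes truncation.
    // (UTF-8 is used here instead of the platform default so that the
    // behaviour is the same on every machine.)
    static byte[] encodeSizedByCharCount(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[s.length()];
        System.arraycopy(encoded, 0, out, 0, Math.min(encoded.length, out.length));
        return out;
    }

    // Safer pattern: encode with an explicit charset, then take every size
    // from the resulting byte array. The record is a 4-byte length followed
    // by the encoded bytes.
    static byte[] writeRecord(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + encoded.length);
        buf.putInt(encoded.length);
        buf.put(encoded);
        return buf.array();
    }

    // Reads back a record produced by writeRecord.
    static String readRecord(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        byte[] encoded = new byte[buf.getInt()];
        buf.get(encoded);
        return new String(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "perch\u00e9"; // 6 chars, 7 bytes in UTF-8
        System.out.println("chars: " + s.length());
        System.out.println("bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("round trip ok: " + readRecord(writeRecord(s)).equals(s));
        System.out.println("truncated length: " + encodeSizedByCharCount(s).length);
    }
}
```

For pure 7-bit text the char count and the UTF-8 byte count coincide, so
the first pattern only misbehaves once a non-ASCII character appears.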

> Oh well, I can have fun even from the most boring code around :) 
> Actually I was thinking of reformatting code so that there are no more 
> import.*... go figure ;) I sure hope I can be of some help pretty soon.

But then I'll be forced to not be lazy anymore.

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)


Re: Encoding... giving up :(

Posted by Gianugo Rabellino <gi...@apache.org>.
Tom Bradford wrote:


>> various stages: everything seems fine until DOMParser.getDocument() 
>> is called. I might well be wrong, but I have a strong feeling that the 
>> problem is there.
> 
> Ok.. I think I may know where the problem is.  I'll try to fix it tonight.


Any news about this, Tom? I'm still striving to understand what exactly is 
happening, but it's a bit troublesome since my only access to a Solaris 
machine is via a slow line, so it's a bit painful to code and debug with 
mere shell access and no debugging tools. However, now it *seems* that I 
can manage to store the documents correctly, but I still have 
problems with retrieval...

>> Thanks for supporting this. I hope I can be of any help soon (well, 
>> there is a good side after all: I'm starting to get a good grasp on 
>> the XIndice internals :)).
> 
> Fine by me.  If you want to take over, be my guest :-)  I'll work on the 
> fun stuff instead.


Oh well, I can have fun even from the most boring code around :) 
Actually I was thinking of reformatting code so that there are no more 
import.*... go figure ;) I sure hope I can be of some help pretty soon.

Ciao,

-- 
Gianugo


Re: Encoding... giving up :(

Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Monday, January 7, 2002, at 04:40 PM, Gianugo Rabellino wrote:
> This time I would say that it happens when the document is stored: from 
> a quick look at the .tbl files I can see the difference between W2K and 
> Solaris. I also tried to debug and dump the document at various stages: 
> everything seems fine until DOMParser.getDocument() is called. I might 
> well be wrong, but I have a strong feeling that the problem is there.

Ok.. I think I may know where the problem is.  I'll try to fix it 
tonight.

> Thanks for supporting this. I hope I can be of any help soon (well, 
> there is a good side after all: I'm starting to get a good grasp on the 
> XIndice internals :)).

Fine by me.  If you want to take over, be my guest :-)  I'll work on the 
fun stuff instead.

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)


Re: Encoding... giving up :(

Posted by Gianugo Rabellino <gi...@rabellino.it>.
Tom Bradford wrote:


>> always did the trick. After the Sp.parse() method is called, though, 
>> the char[]s received in the characters() method look screwed up, with 
>> all the encoding information lost.
> 

> I'll continue to try to track this down.  Are the errors you're seeing 
> only happening as the document comes out of the server or as it goes in 
> as well?


This time I would say that it happens when the document is stored: from 
a quick look at the .tbl files I can see the difference between W2K and 
Solaris. I also tried to debug and dump the document at various stages: 
everything seems fine until DOMParser.getDocument() is called. I might 
well be wrong, but I have a strong feeling that the problem is there.

Thanks for supporting this. I hope I can be of any help soon (well, 
there is a good side after all: I'm starting to get a good grasp on the 
XIndice internals :)).

Ciao,

-- 
Gianugo





Re: Encoding... giving up :(

Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Monday, January 7, 2002, at 03:47 PM, Gianugo Rabellino wrote:
> I spent quite a lot of time and effort trying to track down the new 
> encoding bug that appeared today on my Solaris installation 
> (Sparc/Sol8, tried JDK 1.3.1.01, 1.3.1.0, 1.4b3) to no avail. Tracing 
> the flow of the XML data, it seems to me that the problem ultimately 
> lies in DOMParser.java or, rather, in the underlying SAX parser. Up to 
> that point, AFAIK, the data flowing through are good: the byte arrays look 
> fine and, if converted to Strings, a conversion via getBytes("UTF-8") 
> always did the trick. After the Sp.parse() method is called, though, 
> the char[]s received in the characters() method look screwed up, with 
> all the encoding information lost.

I'll continue to try to track this down.  Are the errors you're seeing 
only happening as the document comes out of the server or as it goes in 
as well?

> I hope someone can shed some light on it. Meanwhile, being in bed with 
> a flu, I'll give up hacking and wait for your help :)

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database)
Creator - Project Labrador (XML Object Broker)