Posted to dev@cocoon.apache.org by Stefano Mazzocchi <st...@apache.org> on 2002/02/23 16:00:56 UTC

[bug] encoding problems

[cross posted because people on the cocoon list might hit this as well]

I've always tested xindice with english documents, so I didn't notice
this behavior until today when I imported an italian XML document.

The document is encoded using UTF-8 and looks like this:

 <?xml version="1.0" encoding="UTF-8"?>
 ...
  <subtitle>
   In sempre più film il computer con la Mela è l'arma 
   dei giusti contro criminali di ogni specie che invece 
   preferiscono i pc
  </subtitle>
 ...

[this is a news document taken from an italian on-line newspaper]

 Ã¹ -> ù
 Ã¨ -> è

are the UTF-8 encodings of the two non-ASCII characters (since UTF-8 is
backward compatible with ASCII, you don't notice any difference until
you use non-ASCII letters such as these).
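
A minimal sketch of that compatibility, in Python (the sample word is
illustrative, not taken from the imported file): ASCII-only text encodes
to the same bytes in ASCII and UTF-8, while each accented letter becomes
a two-byte sequence.

```python
# ASCII-only text: UTF-8 and ASCII produce identical bytes.
plain = "film"
same_bytes = plain.encode("utf-8") == plain.encode("ascii")

# Non-ASCII letters: each one becomes a two-byte UTF-8 sequence.
u_grave = "\u00f9".encode("utf-8")  # the letter u with grave accent
e_grave = "\u00e8".encode("utf-8")  # the letter e with grave accent

print(same_bytes)
print([f"{b:02X}" for b in u_grave])
print([f"{b:02X}" for b in e_grave])
```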

Opening the document in Explorer or XML-Spy yields the correct
characters.

When I then import it into the database and access it through the Cocoon
XML:DB source, I get (in the Explorer window):

  <?xml version="1.0" encoding="UTF-8" ?> 
   ...
  <subtitle>
   In sempre piÃ¹ film il computer con la Mela Ã¨ l'arma dei giusti 
   contro criminali di ogni specie che invece preferiscono i pc
  </subtitle> 

I get the same thing when opening the source in Notepad. But Notepad in
Win2k is Unicode-aware... so I saved the source to disk and opened it
with UltraEdit (which is Unicode-aware but has a nice binary view) and
voilà:

  ...
  <subtitle>
   In sempre piÃƒÂ¹ film il computer con la Mela ÃƒÂ¨ 
   l'arma dei giusti contro criminali di ogni specie 
   che invece preferiscono i pc
  </subtitle>
  ...

where I believe that

 Ãƒ -> Ã
 Â¹ -> ¹

This similarity in encoding probably shows why nobody noticed this
before.

So I went directly into the news.tbl and got the same bytes:

   n sempre piÃƒÂ¹ film il compu
   ter con la Mela ÃƒÂ¨ l'arma d
   ei giusti 

which clearly indicates that the 'xindice' command-line import tool is
somehow ignoring the 'UTF-8' encoding declaration and performing UTF-8
encoding on something that is *already* UTF-8 encoded.
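
A minimal sketch of that double encoding, in Python. It assumes the
already-encoded bytes are misread one character per byte (Latin-1 here,
which is a guess at the tool's behavior, not something confirmed from its
source); the helper names double_encode and repair are hypothetical.

```python
def double_encode(text: str) -> bytes:
    """Simulate the suspected import bug: UTF-8 applied twice.

    Assumption: the UTF-8 bytes are misread as Latin-1 characters
    before being encoded to UTF-8 a second time.
    """
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")
    return misread.encode("utf-8")


def repair(stored: bytes) -> str:
    """Undo one extra round of encoding (valid only under the same assumption)."""
    return stored.decode("utf-8").encode("latin-1").decode("utf-8")


stored = double_encode("pi\u00f9")  # the Italian word from the subtitle
print(stored)                       # raw bytes as they would sit on disk
print(repair(stored) == "pi\u00f9")
```

Under this assumption the two original bytes C3 B9 turn into the four
bytes C3 83 C2 B9, which matches the shape of what the binary view shows.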

My perception is that there is nothing wrong with the way Xindice or
Cocoon get the information *out* of the database: the problem lies in
how the information gets *into* the database.

I would suggest that the Xindice dev community consider this bug a
showstopper for the 1.0 final release.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: [bug] encoding problems

Posted by Gianugo Rabellino <gi...@apache.org>.
Kimbro Staken wrote:
>> Both of us worked on it and came up with a working patch: James's 
>> patches are more complete than mine (well... mine came first ;-P) so 
>> now it's just a matter of having them included into CVS (though I 
>> think that Kimbro and James, as a new committer, are already working on 
>> this).
> 
> 
> I thought Tom added your patch already?

Not AFAIK. I guess it was kinda forgotten; Tom probably didn't have time 
to look at it thoroughly. My work, however, has been superseded by James's 
patch, which is more complete and accurate than my hacks. :)

Ciao,

-- 
Gianugo



Re: [bug] encoding problems

Posted by Kimbro Staken <ks...@xmldatabases.org>.
On Saturday, February 23, 2002, at 08:38 AM, Gianugo Rabellino wrote:

> Stefano Mazzocchi wrote:
>> [cross posted because people on the cocoon list might hit this as well]
>> I've always tested xindice with english documents, so I didn't notice
>> this behavior until today when I imported an italian XML document.
>
> Stefano,
>
> I did:
>
> http://marc.theaimsgroup.com/?l=xindice-users&m=101061862020149&w=2
>
> and so did James Bates lately:
>
> http://marc.theaimsgroup.com/?t=101377536900001&r=1&w=2
>
> Both of us worked on it and came up with a working patch: James's patches 
> are more complete than mine (well... mine came first ;-P) so now it's 
> just a matter of having them included into CVS (though I think that 
> Kimbro and James, as a new committer, are already working on this).

I thought Tom added your patch already?

>
> Ciao,
>
> -- Gianugo
>
>
Kimbro Staken - http://www.kstaken.org - http://www.xmldatabases.org
Apache Xindice native XML database http://xml.apache.org
XML:DB Initiative http://www.xmldb.org
Senior Technologist (Your company name here)


Re: [bug] encoding problems

Posted by Gianugo Rabellino <gi...@apache.org>.
Stefano Mazzocchi wrote:
> [cross posted because people on the cocoon list might hit this as well]
> 
> I've always tested xindice with english documents, so I didn't notice
> this behavior until today when I imported an italian XML document.

Stefano,

I did:

http://marc.theaimsgroup.com/?l=xindice-users&m=101061862020149&w=2

and so did James Bates lately:

http://marc.theaimsgroup.com/?t=101377536900001&r=1&w=2

Both of us worked on it and came up with a working patch: James's patches 
are more complete than mine (well... mine came first ;-P) so now it's 
just a matter of having them included into CVS (though I think that 
Kimbro and James, as a new committer, are already working on this).

Ciao,

-- 
Gianugo


Re: [bug] encoding problems

Posted by Kimbro Staken <ks...@xmldatabases.org>.
On Saturday, February 23, 2002, at 11:49 AM, Tom Bradford wrote:
>
> Hmmm...  In theory, there shouldn't be any special considerations, but
> I'll think about it more.
>

Ok, it would really suck to have to keep recreating your collection config.
There needs to be a way to back it up and restore it regardless.

> One other thing that will potentially require a rebuild will be the 
> changes we'll need to make to work around DOS reserved filenames.

This is a pretty minor bug too, but if we're going to require a rebuild 
anyway we should probably go ahead and fix it now. Are you still just 
thinking of adding a prefix to the filename?
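
A rough sketch of the prefix idea, in Python. The reserved device names
are the standard DOS/Windows ones; the function name safe_filename and
the "_" prefix are hypothetical and do not reflect how Xindice actually
names its files.

```python
# DOS/Windows reserved device names (matched case-insensitively,
# with or without an extension).
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def safe_filename(name: str, prefix: str = "_") -> str:
    """Return name unchanged unless its stem is a reserved device name.

    Hypothetical helper: the prefix value is an assumption.
    """
    stem = name.split(".", 1)[0].upper()
    return prefix + name if stem in RESERVED else name


print(safe_filename("aux.tbl"))
print(safe_filename("news.tbl"))
```

Prefixing only the reserved names could collide with a collection that is
really called "_aux"; prefixing every filename would avoid that, at the
cost of renaming all existing files, which is presumably why a rebuild
comes up.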

>
> --
> Tom Bradford - http://www.tbradford.org
> Architect - XQRL (XQuery Engine) - http://www.xqrl.com
> Apache Xindice (Native XML Database) - http://xml.apache.org
> Project Labrador (Web Services Framework) - http://notdotnet.org
>
>
Kimbro Staken - http://www.kstaken.org - http://www.xmldatabases.org
Apache Xindice native XML database http://xml.apache.org
XML:DB Initiative http://www.xmldb.org
Senior Technologist (Your company name here)


Re: [bug] encoding problems

Posted by Tom Bradford <to...@xqrl.com>.
On Saturday, February 23, 2002, at 11:41 AM, Kimbro Staken wrote:
>> We should encourage a db rebuild anyway because of the flush changes.
>>
>
> This is a different story though. If we need to do this, then we don't 
> really have any choice. We need some docs on how to go about it though. 
> Is there anything that prevents adding the database configuration as a 
> file? So we can tell people to backup their database by dumping the 
> entire contents using export and then restoring using import? Would the 
> database config need to be handled in any special manner?

Hmmm...  In theory, there shouldn't be any special considerations, but 
I'll think about it more.

One other thing that will potentially require a rebuild will be the 
changes we'll need to make to work around DOS reserved filenames.

--
Tom Bradford - http://www.tbradford.org
Architect - XQRL (XQuery Engine) - http://www.xqrl.com
Apache Xindice (Native XML Database) - http://xml.apache.org
Project Labrador (Web Services Framework) - http://notdotnet.org


Re: [bug] encoding problems

Posted by Kimbro Staken <ks...@xmldatabases.org>.
On Saturday, February 23, 2002, at 11:31 AM, Tom Bradford wrote:

> On Saturday, February 23, 2002, at 11:26 AM, Kimbro Staken wrote:
>> Is there any real issue there though? Other than consistency? If not
>> let's wait.
>
> Other than keeping people from referring to the binary contents of the 
> file, none.

Ok, that's a really minor bug. :-)

> We should encourage a db rebuild anyway because of the flush changes.
>

This is a different story though. If we need to do this, then we don't 
really have any choice. We need some docs on how to go about it though. Is 
there anything that prevents adding the database configuration as a file? 
So we can tell people to back up their database by dumping the entire 
contents using export and then restoring using import? Would the database 
config need to be handled in any special manner?

> --
> Tom Bradford - http://www.tbradford.org
> Architect - XQRL (XQuery Engine) - http://www.xqrl.com
> Apache Xindice (Native XML Database) - http://xml.apache.org
> Project Labrador (Web Services Framework) - http://notdotnet.org
>
>
Kimbro Staken - http://www.kstaken.org - http://www.xmldatabases.org
Apache Xindice native XML database http://xml.apache.org
XML:DB Initiative http://www.xmldb.org
Senior Technologist (Your company name here)


Re: [bug] encoding problems

Posted by Tom Bradford <to...@xqrl.com>.
On Saturday, February 23, 2002, at 11:26 AM, Kimbro Staken wrote:
> Is there any real issue there though? Other than consistency? If not 
> let's wait.

Other than keeping people from referring to the binary contents of the 
file, none.  We should encourage a db rebuild anyway because of the 
flush changes.

--
Tom Bradford - http://www.tbradford.org
Architect - XQRL (XQuery Engine) - http://www.xqrl.com
Apache Xindice (Native XML Database) - http://xml.apache.org
Project Labrador (Web Services Framework) - http://notdotnet.org


Re: [bug] encoding problems

Posted by Kimbro Staken <ks...@xmldatabases.org>.
On Saturday, February 23, 2002, at 11:22 AM, Tom Bradford wrote:

> On Saturday, February 23, 2002, at 11:11 AM, Kimbro Staken wrote:
>> Why did you need to make this change now? I was really hoping that we 
>> could not require full database rebuilds at this point. 1.1 is going to 
>> require it anyway, so unless there's a really good reason I say this 
>> waits.
>
> Easy to yank, but it should be considered a minor bug since no other 
> collection in the system stores its documents as plain text.
>

Is there any real issue there though? Other than consistency? If not let's 
wait.

> --
> Tom Bradford - http://www.tbradford.org
> Architect - XQRL (XQuery Engine) - http://www.xqrl.com
> Apache Xindice (Native XML Database) - http://xml.apache.org
> Project Labrador (Web Services Framework) - http://notdotnet.org
>
>
Kimbro Staken - http://www.kstaken.org - http://www.xmldatabases.org
Apache Xindice native XML database http://xml.apache.org
XML:DB Initiative http://www.xmldb.org
Senior Technologist (Your company name here)


Re: [bug] encoding problems

Posted by Tom Bradford <to...@mac.com>.
On Saturday, February 23, 2002, at 11:11 AM, Kimbro Staken wrote:
> Why did you need to make this change now? I was really hoping that we 
> could not require full database rebuilds at this point. 1.1 is going to 
> require it anyway, so unless there's a really good reason I say this 
> waits.

Easy to yank, but it should be considered a minor bug since no other 
collection in the system stores its documents as plain text.

--
Tom Bradford - http://www.tbradford.org
Architect - XQRL (XQuery Engine) - http://www.xqrl.com
Apache Xindice (Native XML Database) - http://xml.apache.org
Project Labrador (Web Services Framework) - http://notdotnet.org


Re: [bug] encoding problems

Posted by Kimbro Staken <ks...@xmldatabases.org>.
On Saturday, February 23, 2002, at 10:34 AM, Tom Bradford wrote:
>
> I agree.  Let's ship a 1.0 that does some XML encodings in a more robust 
> fashion, rather than potentially introduce instability by attempting to 
> support all XML encodings.
>
> BTW, I'm going to check in a minor change to SystemCollection that stores 
> the SysConfig collection as a standard compressed collection rather than 
> an uncached, uncompressed Collection.  The latter was required when the 
> SysConfig collection pointed to a directory, but it's no longer
> necessary.  This change will require a complete db rebuild though.
>

Why did you need to make this change now? I was really hoping that we 
could not require full database rebuilds at this point. 1.1 is going to 
require it anyway, so unless there's a really good reason I say this waits.

> Name for RC2 will be Pepper.  If it's 1.0, then it's Birthday
>
> --
> Tom Bradford - http://www.tbradford.org
> Architect - XQRL (XQuery Engine) - http://www.xqrl.com
> Apache Xindice (Native XML Database) - http://xml.apache.org
> Project Labrador (Web Services Framework) - http://notdotnet.org
>
>
Kimbro Staken - http://www.kstaken.org - http://www.xmldatabases.org
Apache Xindice native XML database http://xml.apache.org
XML:DB Initiative http://www.xmldb.org
Senior Technologist (Your company name here)


Re: [bug] encoding problems

Posted by Tom Bradford <to...@mac.com>.
On Saturday, February 23, 2002, at 09:04 AM, Kimbro Staken wrote:
> I don't think we should do this. I'd really rather get 1.0 out with 
> this limitation noted and then make this an extremely high priority for 
> 1.1. Just to be clear I do realize this is a potential issue for many 
> people, but fixing it would delay a release by at least a month or 
> more. We've had 1.0 on the verge of release for almost five months, and 
> right now it's stalling the project overall. We have enough other 
> issues that we really need to just ship 1.0, clear the blockage and get 
> back to making rapid progress.

I agree.  Let's ship a 1.0 that does some XML encodings in a more robust 
fashion, rather than potentially introduce instability by attempting to 
support all XML encodings.

BTW, I'm going to check in a minor change to SystemCollection that stores 
the SysConfig collection as a standard compressed collection rather than 
an uncached, uncompressed Collection.  The latter was required when the 
SysConfig collection pointed to a directory, but it's no longer 
necessary.  This change will require a complete db rebuild though.

Name for RC2 will be Pepper.  If it's 1.0, then it's Birthday

--
Tom Bradford - http://www.tbradford.org
Architect - XQRL (XQuery Engine) - http://www.xqrl.com
Apache Xindice (Native XML Database) - http://xml.apache.org
Project Labrador (Web Services Framework) - http://notdotnet.org


Re: [bug] encoding problems

Posted by Kimbro Staken <ks...@xmldatabases.org>.
On Saturday, February 23, 2002, at 08:00 AM, Stefano Mazzocchi wrote:

> [cross posted because people on the cocoon list might hit this as well]
>
> I've always tested xindice with english documents, so I didn't notice
> this behavior until today when I imported an italian XML document.
>

This is a known problem. We have a patch in the queue for some of it, but 
it won't resolve everything.

<snip>

>
> I would suggest the XIndice dev community to consider this bug a
> showstopper for the 1.0 final release.
>

I don't think we should do this. I'd really rather get 1.0 out with this 
limitation noted and then make this an extremely high priority for 1.1. 
Just to be clear I do realize this is a potential issue for many people, 
but fixing it would delay a release by at least a month or more. We've had 
1.0 on the verge of release for almost five months, and right now it's 
stalling the project overall. We have enough other issues that we really 
need to just ship 1.0, clear the blockage and get back to making rapid 
progress.

I'm hoping we can ship a 1.1 within a month or so after 1.0. So given that, 
I'd rather just ship 1.0 and follow with 1.1 after. This gets a 1.0 into 
the hands of people who can use it sooner and gets the encoding issues 
resolved in roughly the same amount of time overall. Is this agreeable? I 
know this is critical to get fixed, but for the health of the project (and 
my sanity) I feel it's better to release 1.0 with the existing flaws than 
to continue to hold it.

> --
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <st...@apache.org>                             Friedrich Nietzsche
> --------------------------------------------------------------------
>
>
>
Kimbro Staken - http://www.kstaken.org - http://www.xmldatabases.org
Apache Xindice native XML database http://xml.apache.org
XML:DB Initiative http://www.xmldb.org
Senior Technologist (Your company name here)

