Posted to xindice-dev@xml.apache.org by James Bates <ja...@amplexor.com> on 2002/05/07 19:38:57 UTC

utf-8 working code... caution with existing data files.

Boys (and girls?),

I have a patch for the current Xindice CVS that supports reading/writing files in UTF-8 containing any Unicode characters you want into and out of Xindice. This means Greek, Hebrew, Korean, Chinese, Arabic, Russian, etc... To allow for this, I have had to modify the internal data format of Xindice files, meaning that existing Xindice databases will appear corrupt to Xindice patched with this new code...

It is however necessary in my opinion, as discussed in earlier posts, to migrate toward this.

In reality, this will affect ONLY databases that contain XML documents with NON-ASCII characters. ASCII characters are: English letters, digits, punctuation marks, whitespace, as well as some control characters like delete and backspace. There are 128 ASCII characters in all. So as long as your databases use documents with only these characters, the patch won't affect your datafiles.

Typical non-ASCII characters, which will cause incompatibilities between old and new database files, include: French, Spanish, etc. accented characters, such as à, ò, ç, ü; currencies like £, EUR, ¥; non-breaking spaces (&nbsp; in HTML); fancy quotes «, »; the copyright sign ©; etc...
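
The ASCII versus non-ASCII distinction described here can be checked mechanically. The following is a minimal Java sketch, not code from the patch: the class and method names are hypothetical, and ISO-8859-1 is used only as an assumed stand-in for a single-byte legacy encoding (the thread does not say which encoding the old data files used). It shows that ASCII-only text yields the same bytes in both encodings, while text with accented characters does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCheck {
    /** Returns true if every character is in the 7-bit ASCII range (0 to 127). */
    public static boolean isAsciiOnly(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if the text encodes to the same bytes in UTF-8 and in
     * ISO-8859-1 (an assumed stand-in for the old single-byte format).
     */
    public static boolean sameBytesInBothEncodings(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        String plain = "<doc>hello</doc>";
        // \u00E9 is e-acute, \u00A3 is the pound sign, \u00A9 is the copyright sign.
        String accented = "<doc>caf\u00E9 \u00A3 \u00A9</doc>";

        System.out.println("plain is ASCII only:    " + isAsciiOnly(plain));
        System.out.println("plain bytes match:      " + sameBytesInBothEncodings(plain));
        System.out.println("accented is ASCII only: " + isAsciiOnly(accented));
        System.out.println("accented bytes match:   " + sameBytesInBothEncodings(accented));
    }
}
```

Running a check like `isAsciiOnly` over exported documents would be one way to estimate whether an existing database is at risk before switching to the patched code.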

Because of these possible incompatibilities, I'd like to WARN people and try to co-ordinate applying the changes so as to cause as little disruption as possible. You can check them out already at
http://lambiek.amplexor.be/downloads/xindice/new-utf8-patch.

I don't believe you NEED to use the XML-RPC client for just reading/writing documents, though I haven't really tested the CORBA client anymore... Using XPaths and XUpdates with non-ASCII characters will definitely not work in CORBA, but should now work with the XML-RPC interface. (Need to test some more myself though.)

Anyway, let me know how and when I can commit this patch...

James

Re: utf-8 working code... caution with existing data files.

Posted by Gianugo Rabellino <gi...@apache.org>.

Stefano Mazzocchi wrote:
> I have resources to dedicate to metadata, security and versioning, but I
> don't want to get stuck in the middle of the "XML:DB API" mud. This is
> my reason for wrapping instead of patching.

Makes sense.

> 
> This said, I'm very willing to reconsider these ideas based on what this
> community finds appropriate... the only thing that is slowly killing the
> evolution of this project is an API that was created *before* the
> mile-long TODO design list was written. An API that not many seem to be
> endorsing.

Well... yes. At least looking at the traffic on Xapi-dev, this is the 
right conclusion. Yet I think there is room to revive the API, and this 
target might be left aside for a while, not forgotten.

> My suggestion is: forget the API and concentrate on functionality. The
> API will follow.
> 
> But I'm pretty sure the xindice committers don't agree with my vision,
> or am I wrong?

IIRC this was already discussed and approved, so I don't see any problem 
in going this way. However, I'd also investigate if the extension 
mechanism contained in the API (the Service) can be exploited to extend 
Xindice with the functionality we all need. We have an XPathQueryService 
and an XUpdateQueryService, it might be easy to add a MetadataService, a 
VersioningService, and so on. If this is not enough, let's move on and 
not let the XML:DB API stop Xindice evolution.

How about it?

Ciao,

-- 
Gianugo Rabellino

Re: utf-8 working code... caution with existing data files.

Posted by Kimbro Staken <ks...@xmldatabases.org>.

On Saturday, May 11, 2002, at 12:51 PM, Stefano Mazzocchi wrote:
>>
>
> I have resources to dedicate to metadata, security and versioning, but I
> don't want to get stuck in the middle of the "XML:DB API" mud. This is
> my reason for wrapping instead of patching.
>
> This said, I'm very willing to reconsider these ideas based on what this
> community finds appropriate... the only thing that is slowly killing the
> evolution of this project is an API that was created *before* the
> mile-long TODO design list was written. An API that not many seem to be
> endorsing.
>
> My suggestion is: forget the API and concentrate on functionality. The
> API will follow.
>
This is what I've been saying for a couple months. I don't want to worry 
about the XML:DB API right now either. I just want to figure out what 
functionality we want in this area and then see what we do with the API.
I've just been looking to other people to step up and decide what should 
happen here. I don't have any personal interest in this area (nor the time 
to really do much right now, hopefully that will change in a month or so).

Once we get a clear picture on what functionality is being exposed by the 
server, then we can worry about the java API and whether we stick with the 
XML:DB API or not. In the short term we can just work the functionality at 
the XML-RPC level.


> But I'm pretty sure the xindice committers don't agree with my vision,
> or am I wrong?
>
> --
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <st...@apache.org>                             Friedrich Nietzsche
> --------------------------------------------------------------------
>
>
Kimbro Staken
Java and XML Software, Consulting and Writing http://www.xmldatabases.org/
Apache Xindice native XML database http://xml.apache.org/xindice
XML:DB Initiative http://www.xmldb.org

Re: utf-8 working code... caution with existing data files.

Posted by Stefano Mazzocchi <st...@apache.org>.

Gianugo Rabellino wrote:
> 
> Stefano Mazzocchi wrote:
> 
> >
> > I agree that since Xindice has namespace support, most application level
> > metadata can be stored without Xindice knowing about it, but there is
> > some that simply cannot, without a serious and complex wrapper around
> > Xindice (which is something I'm going to seriously consider, since I need
> > stuff that most people here would think belongs not in a DB core but in
> > higher-level middleware: things like validation and security).
> >
> > I'm seriously considering writing a big fat wrapper around Xindice to
> > provide all the functionality I need around a document-oriented XML
> > datastore, also without the problem of having an API to bind me.
> >
> > What would you people think of such a thing?
> 
> It's another way to solve the problem. Yet I tend to think that without
> metadata, security, validation, linking and so on, Xindice is at risk of
> becoming just a distributed XML filesystem. Which is way too little to
> become a viable solution for anything. 

I surely tend to agree (no offense to anyone here).

> I wish I had more time ATM to
> help this community find a way out of what you correctly call a
> stagnation that scares me quite a bit. And I hope to be proven wrong :-)

I have resources to dedicate to metadata, security and versioning, but I
don't want to get stuck in the middle of the "XML:DB API" mud. This is
my reason for wrapping instead of patching.

This said, I'm very willing to reconsider these ideas based on what this
community finds appropriate... the only thing that is slowly killing the
evolution of this project is an API that was created *before* the
mile-long TODO design list was written. An API that not many seem to be
endorsing.

My suggestion is: forget the API and concentrate on functionality. The
API will follow.

But I'm pretty sure the xindice committers don't agree with my vision,
or am I wrong?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------

Re: utf-8 working code... caution with existing data files.

Posted by Gianugo Rabellino <gi...@apache.org>.

Stefano Mazzocchi wrote:

> 
> I agree that since Xindice has namespace support, most application level
> metadata can be stored without Xindice knowing about it, but there is
> some that simply cannot, without a serious and complex wrapper around
> Xindice (which is something I'm going to seriously consider, since I need
> stuff that most people here would think belongs not in a DB core but in
> higher-level middleware: things like validation and security).
> 
> I'm seriously considering writing a big fat wrapper around Xindice to
> provide all the functionality I need around a document-oriented XML
> datastore, also without the problem of having an API to bind me.
> 
> What would you people think of such a thing?

It's another way to solve the problem. Yet I tend to think that without 
metadata, security, validation, linking and so on, Xindice is at risk of 
becoming just a distributed XML filesystem. Which is way too little to 
become a viable solution for anything. I wish I had more time ATM to 
help this community find a way out of what you correctly call a 
stagnation that scares me quite a bit. And I hope to be proven wrong :-)

Ciao,

-- 
Gianugo

Re: utf-8 working code... caution with existing data files.

Posted by Stefano Mazzocchi <st...@apache.org>.

David Viner wrote:
> 
> "proper encoding and lack of available metadata"
> am i correct in assuming that you mean the correct value for the encoding
> attribute?

No, I'm referring to the fact that Xindice stored text using a
specific encoding which was not general enough to handle all of Unicode,
which is a pretty serious issue for an XML database given that XML can
contain any Unicode character.

But it sounds like James just fixed this by using UTF-8 as the default
encoding for what Xindice stores, so I'm happy on that side (I'm going to
test and try it out real soon).

> what metadata are you referring to?  do you mean things like a DOCTYPE
> element or a schema?
> or do you mean something more like "last modified time"?

Sorry, I have to explain further: I'm talking about datastore-specific
metadata. Things like "last modified" and such.

There has been no agreement on this list about "what data is metadata"
since metadata is data anyway. 

I agree that since Xindice has namespace support, most application level
metadata can be stored without Xindice knowing about it, but there is
some that simply cannot, without a serious and complex wrapper around
Xindice (which is something I'm going to seriously consider, since I need
stuff that most people here would think belongs not in a DB core but in
higher-level middleware: things like validation and security).

I'm seriously considering writing a big fat wrapper around Xindice to
provide all the functionality I need around a document-oriented XML
datastore, also without the problem of having an API to bind me.

What would you people think of such a thing?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------

RE: utf-8 working code... caution with existing data files.

Posted by David Viner <dv...@yahoo-inc.com>.

"proper encoding and lack of available metadata"
am i correct in assuming that you mean the correct value for the encoding
attribute?
what metadata are you referring to?  do you mean things like a DOCTYPE
element or a schema?
or do you mean something more like "last modified time"?

dave

-----Original Message-----
From: Stefano Mazzocchi [mailto:stefano@apache.org]
Sent: Thursday, May 09, 2002 5:11 AM
To: xindice-dev@xml.apache.org
Subject: Re: utf-8 working code... caution with existing data files.

James Bates wrote:
>
> Boys (and girls?),
>
> I have a patch for the current Xindice CVS that supports reading/writing
> files in UTF-8 containing any Unicode characters you want into and out of
> Xindice. This means Greek, Hebrew, Korean, Chinese, Arabic, Russian, etc...
> To allow for this, I have had to modify the internal data format of Xindice
> files, meaning that existing Xindice databases will appear corrupt to
> Xindice patched with this new code...
>
> It is however necessary in my opinion, as discussed in earlier posts, to
> migrate toward this.
>
> In reality, this will affect ONLY databases that contain XML documents
> with NON-ASCII characters. ASCII characters are: English letters, digits,
> punctuation marks, whitespace, as well as some control characters like
> delete and backspace. There are 128 ASCII characters in all. So as long as
> your databases use documents with only these characters, the patch
> won't affect your datafiles.
>
> Typical non-ASCII characters, which will cause incompatibilities between
> old and new database files, include: French, Spanish, etc. accented
> characters, such as à, ò, ç, ü; currencies like £, EUR, ¥; non-breaking
> spaces (&nbsp; in HTML); fancy quotes «, »; the copyright sign ©; etc...
>
> Because of these possible incompatibilities, I'd like to WARN people and
> try to co-ordinate applying the changes so as to cause as little disruption
> as possible. You can check them out already at
> http://lambiek.amplexor.be/downloads/xindice/new-utf8-patch.
>
> I don't believe you NEED to use the XML-RPC client for just
> reading/writing documents, though I haven't really tested the CORBA client
> anymore... Using XPaths and XUpdates with non-ASCII characters will
> definitely not work in CORBA, but should now work with the XML-RPC
> interface. (Need to test some more myself though.)
>
> Anyway, let me know how and when I can commit this patch...
>
> James

+1 for committing as early as possible.

The trick would be a way to write a client that serializes the entire
database into a big XML file and another one in the new version that
allows import thru this XML dump file (which can use namespaces to
indicate xindice-specific data along the tree).

What do you think?

[NOTE: XIndice is totally useless to me today exactly because of proper
encoding and lack of available metadata... and I've met tons of people
that believe the exact same, so I'd suggest to patch these two things
then do a 1.1 release ASAP... this is very likely the reason why this
community is stagnating, so this might be a good thing to patch]

I volunteer to work on the metadata since I badly need it in the future.
I just don't know how to do it, and I think the XML:DB API is slowing us
down rather than helping us in any way.

Comments?

--
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------

Re: utf-8 working code... caution with existing data files.

Posted by Stefano Mazzocchi <st...@apache.org>.

James Bates wrote:
> 
> Boys (and girls?),
> 
> I have a patch for the current Xindice CVS that supports reading/writing files in UTF-8 containing any Unicode characters you want into and out of Xindice. This means Greek, Hebrew, Korean, Chinese, Arabic, Russian, etc... To allow for this, I have had to modify the internal data format of Xindice files, meaning that existing Xindice databases will appear corrupt to Xindice patched with this new code...
> 
> It is however necessary in my opinion, as discussed in earlier posts, to migrate toward this.
> 
> In reality, this will affect ONLY databases that contain XML documents with NON-ASCII characters. ASCII characters are: English letters, digits, punctuation marks, whitespace, as well as some control characters like delete and backspace. There are 128 ASCII characters in all. So as long as your databases use documents with only these characters, the patch won't affect your datafiles.
> 
> Typical non-ASCII characters, which will cause incompatibilities between old and new database files, include: French, Spanish, etc. accented characters, such as à, ò, ç, ü; currencies like £, EUR, ¥; non-breaking spaces (&nbsp; in HTML); fancy quotes «, »; the copyright sign ©; etc...
> 
> Because of these possible incompatibilities, I'd like to WARN people and try to co-ordinate applying the changes so as to cause as little disruption as possible. You can check them out already at
> http://lambiek.amplexor.be/downloads/xindice/new-utf8-patch.
> 
> I don't believe you NEED to use the XML-RPC client for just reading/writing documents, though I haven't really tested the CORBA client anymore... Using XPaths and XUpdates with non-ASCII characters will definitely not work in CORBA, but should now work with the XML-RPC interface. (Need to test some more myself though.)
> 
> Anyway, let me know how and when I can commit this patch...
> 
> James

+1 for committing as early as possible.

The trick would be a way to write a client that serializes the entire
database into a big XML file and another one in the new version that
allows import thru this XML dump file (which can use namespaces to
indicate xindice-specific data along the tree).
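
The dump-and-import idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing Xindice tool: the class, the `dump` prefix and the namespace URI `http://example.org/xindice-dump` are hypothetical, and an in-memory map stands in for whatever the old client would read out of the database. Each document is wrapped in a namespaced element so that dump-specific data stays separate from the documents' own markup, and the import side parses the dump with a namespace-aware parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DumpSketch {
    // Hypothetical namespace URI for the dump's own wrapper elements.
    public static final String DUMP_NS = "http://example.org/xindice-dump";

    /** Escapes characters that are special inside a double-quoted XML attribute. */
    public static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace("\"", "&quot;");
    }

    /**
     * Builds one dump document from key -> XML text pairs. Each value is
     * assumed to be a well-formed XML fragment without an XML declaration
     * or DOCTYPE. Write the result to disk as UTF-8 bytes.
     */
    public static String buildDump(Map<String, String> documents) {
        StringBuilder out = new StringBuilder();
        out.append("<dump:database xmlns:dump=\"").append(DUMP_NS).append("\">\n");
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            out.append("  <dump:document key=\"")
               .append(escapeAttribute(entry.getKey()))
               .append("\">")
               .append(entry.getValue())
               .append("</dump:document>\n");
        }
        out.append("</dump:database>\n");
        return out.toString();
    }

    /** Import side: parses the dump and counts the wrapped documents. */
    public static int countDocuments(String dump) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document parsed = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(dump)));
            return parsed.getElementsByTagNameNS(DUMP_NS, "document").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("dump is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("menu", "<menu><item>caf\u00E9</item></menu>");
        documents.put("price", "<price currency=\"GBP\">\u00A3 5</price>");

        String dump = buildDump(documents);
        System.out.print(dump);
        System.out.println("documents found on import: " + countDocuments(dump));
    }
}
```

A real importer would go on to store each wrapped document under its key in the new-format database; that step depends on the client API and is left out here.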

What do you think?

[NOTE: XIndice is totally useless to me today exactly because of proper
encoding and lack of available metadata... and I've met tons of people
that believe the exact same, so I'd suggest to patch these two things
then do a 1.1 release ASAP... this is very likely the reason why this
community is stagnating, so this might be a good thing to patch]

I volunteer to work on the metadata since I badly need it in the future.
I just don't know how to do it, and I think the XML:DB API is slowing us
down rather than helping us in any way.

Comments?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------

Re: utf-8 working code... caution with existing data files.

Posted by Michael Westbay <we...@users.sourceforge.net>.

Bates-san wrote:

> I have a patch for the current Xindice CVS that supports
> reading/writing files in UTF-8 containing any Unicode characters you
> want into and out of Xindice.

I just checked out the latest CVS code, and confirmed that the latest check-in works with Japanese.  My original document was in EUC-JP encoding, and it was properly converted to UTF-8 for storage.
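
For readers unfamiliar with this kind of conversion, here is a minimal Java sketch of re-encoding EUC-JP bytes as UTF-8. It is not Xindice code, and the class and method names are hypothetical; inside Xindice the conversion presumably happens when the XML parser decodes the document, which this sketch does not reproduce. It assumes the Java runtime ships the EUC-JP charset, which standard JDK distributions do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    /** Decodes bytes from the named source charset and re-encodes them as UTF-8. */
    public static byte[] toUtf8(byte[] input, String sourceCharset) {
        String decoded = new String(input, Charset.forName(sourceCharset));
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Nihongo" (the word for the Japanese language), written with Unicode
        // escapes so the source file itself stays ASCII.
        String text = "\u65E5\u672C\u8A9E";

        byte[] eucJp = text.getBytes(Charset.forName("EUC-JP"));
        byte[] utf8 = toUtf8(eucJp, "EUC-JP");

        System.out.println("EUC-JP length: " + eucJp.length + " bytes");
        System.out.println("UTF-8 length:  " + utf8.length + " bytes");
        System.out.println("round trip ok: "
                + new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```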

I've also tested retrieval with the command line tools (the retrieved file is in UTF-8 encoding), and retrieving and updating files in Japanese with YAP.  They all now work.  Very nice job.

Now that Japanese documents work, I can start really pounding on this.  Great job!

-- 
Michael Westbay
Work: Beacon-IT http://www.beacon-it.co.jp/
Home:           http://www.seaple.icc.ne.jp/~westbay
Commentary:     http://www.japanesebaseball.com/forum/

Re: utf-8 working code... caution with existing data files.

Posted by Kimbro Staken <ks...@xmldatabases.org>.

My opinion is go ahead and commit. The tree is unstable right now anyway 
and this is something we definitely need in place.

On Tuesday, May 7, 2002, at 10:38 AM, James Bates wrote:

> Boys (and girls?),
>
> I have a patch for the current Xindice CVS that supports reading/writing
> files in UTF-8 containing any Unicode characters you want into and out of
> Xindice. This means Greek, Hebrew, Korean, Chinese, Arabic, Russian,
> etc... To allow for this, I have had to modify the internal data format of
> Xindice files, meaning that existing Xindice databases will appear
> corrupt to Xindice patched with this new code...
>
> It is however necessary in my opinion, as discussed in earlier posts, to
> migrate toward this.
>
> In reality, this will affect ONLY databases that contain XML documents
> with NON-ASCII characters. ASCII characters are: English letters, digits,
> punctuation marks, whitespace, as well as some control characters like
> delete and backspace. There are 128 ASCII characters in all. So as long
> as your databases use documents with only these characters, the
> patch won't affect your datafiles.
>
> Typical non-ASCII characters, which will cause incompatibilities between
> old and new database files, include: French, Spanish, etc. accented
> characters, such as à, ò, ç, ü; currencies like £, EUR, ¥; non-breaking
> spaces (&nbsp; in HTML); fancy quotes «, »; the copyright sign ©; etc...
>
> Because of these possible incompatibilities, I'd like to WARN people and
> try to co-ordinate applying the changes so as to cause as little disruption
> as possible. You can check them out already at
> http://lambiek.amplexor.be/downloads/xindice/new-utf8-patch.
>
> I don't believe you NEED to use the XML-RPC client for just
> reading/writing documents, though I haven't really tested the CORBA
> client anymore... Using XPaths and XUpdates with non-ASCII characters will
> definitely not work in CORBA, but should now work with the XML-RPC
> interface. (Need to test some more myself though.)
>
> Anyway, let me know how and when I can commit this patch...
>
> James
>
>
Kimbro Staken
Java and XML Software, Consulting and Writing http://www.xmldatabases.org/
Apache Xindice native XML database http://xml.apache.org/xindice
XML:DB Initiative http://www.xmldb.org