Posted to xindice-users@xml.apache.org by Thomas Sempf <xi...@chaosmachine.de> on 2002/01/14 12:53:10 UTC
Speed issue
Hello,
I have a question concerning the speed of Xindice. I have some XML files
in the db, about 1200 files totaling 8.5 MB. It's only text, with some
attributes and tags. When I start an XPath query for a specific word, it
takes very long, about 1 minute. Is it possible to set an index for a
specific word? Or is it only possible for tags and attributes?
I use the CVS version of Xindice, and during the build I saw some message
about debugging code and no optimization. Is there something to optimize?
With regards
Thomas Sempf
Re: Speed issue
Posted by Murray Altheim <mu...@sun.com>.
Tom Bradford wrote:
[...]
> > I use the CVS version of Xindice, and during the build I saw some
> > message about debugging code and no optimization. Is there something to
> > optimize?
>
> We've approached this first version of Xindice abiding by the basic
> rules of software optimization, so there has been absolutely no
> optimization done to any of the code. Our goal with this first release
> is to nail down functionality; we'll worry about performance later on.
> Unfortunately, there's only so much you can do with Java, so we'll be
> somewhat limited. I do have some thoughts on this, and will get around
> to jotting them down some day soon.
Speaking as a developer, not as a Sun employee (people who know me know
I'm not exactly "corporate" despite liking where I work), I've not found
Java's performance to be lagging much at all when certain principles are
followed. There are a number of good Java performance books on the market
(e.g., "Java 2 Performance and Idiom Guide" by Larman and Guthrie) which
not only help you improve your application's speed but your coding
practices as well.
I've noted significant differences by just minimizing String creation (!),
fixing my misunderstanding of how arrays and StringBuffers work in Java,
and using various other tricks (such as JIT compilers and playing with
the VM settings). With intelligent use of indices in Xindice, you should
also be able to improve the performance. In something as String intensive
as Xindice I expect there's lots of room for optimization.
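To illustrate the String-creation point, here is a minimal sketch of the two styles. It is not taken from the Xindice source; the class and method names are made up for the example, and it uses StringBuilder, the unsynchronized sibling of StringBuffer added in later Java releases.

```java
// Hypothetical illustration; nothing here comes from the Xindice code base.
final class StringBuildingSketch {

    // Creates a new String on every iteration: each += copies all the
    // characters accumulated so far into a fresh object.
    static String joinWithConcat(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    // Appends into one growable buffer and creates a single String at the end.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"<lname>", "Bradford", "</lname>"};
        System.out.println(joinWithConcat(parts));
        System.out.println(joinWithBuilder(parts));
    }
}
```

Both methods return the same text; the difference is only in how many intermediate objects are created along the way, which matters most inside loops over large inputs.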
Murray
...........................................................................
Murray Altheim <mailto:murray.altheim@sun.com>
XML Technology Center, Java and XML Software
Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo Park, CA 94025
Corporations do not have human rights, despite the
altogether too-human opinions of the US Supreme Court.
Re: Speed issue
Posted by David BERNARD <dw...@java-fan.com>.
Hi,
> If an index is available for lname, it will be used, but *not* for a
> contains query because a contains query is essentially a substring
> search which can't benefit from index retrieval, so you're essentially
> doing a collection scan. Also, because you're using //*, you're telling
> the XPath processor to look at every single element of every document to
> determine if it has a lname attribute (for the first one) or an lname
> element child (for the second one). This is incredibly slow, even for a
> single document because the XPath processor is forced to visit every
> node instead of searching from the root.
I've recently worked with search engines, and for your kind of search I think
a dedicated one (like Lucene, http://jakarta.apache.org/lucene/) would be better.
For example, with Lucene you code your own Document attributes and index definitions.
In my current project, I will integrate a search engine (for similarity search) as a service
for XML:DB (I plan to use Xindice to store the data).
--
--------------------------------------------------------------
David "Dwayne" Bernard Freelance Developer (Java)
mailto:dwayne@java-fan.com
\|/ http://dwayne.java-fan.com
--o0O @.@ O0o-------------------------------------------------
Re: Speed issue
Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Wednesday, January 16, 2002, at 05:40 PM, Mark J. Stang wrote:
> Then using "contains" will be "incredibly slow", what do you advise for
> quicker searches? Could you elaborate a bit on how slow/fast
> Xindice XPath searches are? My favorites are Contains, Starts with and
> equals.
starts-with and boolean expressions are supported without any problems,
and are generally resolved rather quickly using B+Tree indexing.
Obviously, the overall performance depends on how large your collections
are and how much data you're indexing. The reason that contains() will
never be quick is this:
          F     J     P     V
       |     |     |     |     |
       C     H     L     T     Y
    |     |     |     |     |     |
   And   End   Jet   May   Why   Zap
If you performed a contains('nd'), you're really saying that you want
any value that contains the character sequence 'nd' anywhere in the
value. In this case, And and End match the query, but And and End are
in two different pages. The indexing system collates starting at the
beginning of the value, so there's no way to quickly find a value based
on a substring. Even if you were using a full-text index, this wouldn't
be a quick operation unless you performed the query:
contains(' nd ')
In that case, the query processor might use logic that identifies 'nd'
as being a whole word, and some optimizations might be performed.
This is also the reason SQL queries that use LIKE "%value%" are so slow.
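The difference can be sketched with an ordinary sorted set standing in for the B+Tree. This is only an illustration of the ordering argument, not Xindice's indexer; the class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration: a TreeSet stands in for a sorted index.
final class PrefixVsContains {

    // The leaf values from the diagram above, kept in sorted order.
    static TreeSet<String> sampleIndex() {
        return new TreeSet<>(Arrays.asList("And", "End", "Jet", "May", "Why", "Zap"));
    }

    // starts-with: jump to the first candidate, then stop as soon as a
    // value no longer carries the prefix.
    static List<String> startsWith(TreeSet<String> index, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String value : index.tailSet(prefix, true)) {
            if (!value.startsWith(prefix)) {
                break; // sorted order: no later value can match
            }
            hits.add(value);
        }
        return hits;
    }

    // contains: matches can sit anywhere in the ordering, so every
    // value has to be tested.
    static List<String> contains(TreeSet<String> index, String fragment) {
        List<String> hits = new ArrayList<>();
        for (String value : index) {
            if (value.contains(fragment)) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> index = sampleIndex();
        System.out.println(startsWith(index, "E")); // [End]
        System.out.println(contains(index, "nd"));  // [And, End]
    }
}
```

The prefix search touches only the values at and just after its starting point, while the substring search walks all six values to find the two matches.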
--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
Creator - Project Labrador (Web Services Framework) -
http://notdotnet.org
Re: Speed issue
Posted by Steven Noels <st...@outerthought.org>.
On Wed, 16 Jan 2002, Mark J. Stang wrote:
> Tom,
> Then using "contains" will be "incredibly slow", what do you advise for
> quicker searches? Could you elaborate a bit on how slow/fast
> Xindice XPath searches are? My favorites are Contains, Starts with and
> equals.
>
If these constructs make up the majority of your intended searches, I'm not so
sure whether you should use Xindice...
Lucene (http://jakarta.apache.org/lucene/) is a Java full-text indexing &
retrieval engine that is known to work with XML (packaged as a
search-engine with Cocoon).
Regards,
</Steven>
Re: Speed issue
Posted by "Mark J. Stang" <ma...@earthlink.net>.
Tom,
Then using "contains" will be "incredibly slow", what do you advise for
quicker searches? Could you elaborate a bit on how slow/fast
Xindice XPath searches are? My favorites are Contains, Starts with and
equals.
thanks,
Mark
Re: Speed issue
Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Tuesday, January 15, 2002, at 08:28 AM, Mark J. Stang wrote:
> What Steven wrote helps and it creates a new question.
> Examples:
> //*[contains(@lname,'Bradford')]
> //*/lname[contains(self::*,'Staken')]
If an index is available for lname, it will be used, but *not* for a
contains query because a contains query is essentially a substring
search which can't benefit from index retrieval, so you're essentially
doing a collection scan. Also, because you're using //*, you're telling
the XPath processor to look at every single element of every document to
determine if it has a lname attribute (for the first one) or an lname
element child (for the second one). This is incredibly slow, even for a
single document because the XPath processor is forced to visit every
node instead of searching from the root.
--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
Creator - Project Labrador (Web Services) - http://xml-labrador.sf.net
Re: Speed issue
Posted by "Mark J. Stang" <ma...@earthlink.net>.
What Steven wrote helps and it creates a new question.
If I create indexes on my tags and attributes, when I use an
XPath query that specifies one of those tags, will the index
be used?
Examples:
//*[contains(@lname,'Bradford')]
//*/lname[contains(self::*,'Staken')]
In these examples, "lname" is in one case a tag and in another case an
attribute. If I index on the tag and the attribute, then Xindice
shouldn't have to search every piece of text in the document, only those
tags and attributes for each document in the collection.
Tom/Kimbro/et al., is this correct? Does Xindice work this way?
thanks,
Mark
RE: Speed issue
Posted by Steven Noels <st...@outerthought.org>.
> -----Original Message-----
> From: Mark J. Stang [mailto:markstang@earthlink.net]
> Sent: Tuesday, January 15, 2002 7:46
> To: xindice-users@xml.apache.org
> Subject: Re: Speed issue
>
>
> Tom,
> Two questions:
> 1) What is an "assistant predicate"?
>
> 2)I have been doing searches using the following format:
>
> "//*[contains(@"+getFieldName()+",'"+getText()+"')]";
> "//*/"+getFieldName()+"[contains(self::*,'"+getText()+"')]";
>
Seems like your XPath expressions are the culprit: what you are
basically asking is to check each & every element in the collection
(//*) for a certain substring.
Indices will not help much, I'm afraid.
Is it not possible to indicate the element (or path to the element) in
which the substring should occur?
/root/some_path_towards/element[contains()]
Using // basically short-circuits any optimization possible.
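As a rough sketch of that rewrite, the two forms can be compared with the XPath engine that ships with the JDK (javax.xml.xpath). This is not Xindice, and the sample document and the /people/person path are invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Uses the JDK's built-in XPath engine, not Xindice; the document is made up.
final class ExplicitPathSketch {

    static final String XML =
        "<people>"
        + "<person lname='Bradford'><note>memo</note></person>"
        + "<person lname='Staken'/>"
        + "</people>";

    // Returns how many nodes the expression selects in the sample document.
    static int count(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Tests every element in the document, including <people> and <note>.
        System.out.println(count("//*[contains(@lname,'Brad')]"));
        // Tests only the <person> children of the root.
        System.out.println(count("/people/person[contains(@lname,'Brad')]"));
    }
}
```

Both expressions select the same single element here; the explicit path simply gives the processor fewer nodes to test.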
Hope this helps,
Steven Noels
http://outerthought.org/
(+32)478 292900
Re: Speed issue
Posted by "Mark J. Stang" <ma...@earthlink.net>.
Tom,
Two questions:
1) What is an "assistant predicate"?
2) I have been doing searches using the following format:
"//*[contains(@"+getFieldName()+",'"+getText()+"')]";
"//*/"+getFieldName()+"[contains(self::*,'"+getText()+"')]";
Are these done via a collection scan also? And if so, will indexing
attributes and tags help? What do you suggest to speed up searches?
Sorry, turns out to be more than two questions.
thanks,
Mark
Re: Speed issue
Posted by Tom Bradford <br...@dbxmlgroup.com>.
On Monday, January 14, 2002, at 04:53 AM, Thomas Sempf wrote:
> I have a question concerning the speed of Xindice. I have some
> XML files in the db, about 1200 files totaling 8.5 MB. It's only text,
> with some attributes and tags. When I start an XPath query for a
> specific word, it takes very long, about 1 minute. Is it possible to
> set an index for a specific word? Or is it only possible for tags and
> attributes?
When you do a contains() query, the resolver has no other choice but to
do a collection scan. This means inspecting every document in the
collection in sequence.
I was working on a full text indexer in the past, and may commit it, but
it wouldn't be very useful right away. The problem is that XPath
doesn't support the concept of issuing full text queries unless we were
to develop our own extension function. Using the contains function is a
substring search and wouldn't be able to operate against a full text
index in most cases.
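A toy word index shows why. This is a minimal sketch of a generic inverted index, not the unreleased Xindice indexer; the class name, the tokenization rule, and the document ids are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a word-level (full text) index.
final class WordIndexSketch {

    // word -> ids of the documents that contain that word
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Splits the text into lower-cased words and records each one.
    void add(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Whole-word lookup: a single map access, no scan of the documents.
    Set<String> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        WordIndexSketch index = new WordIndexSketch();
        index.add("doc1", "And the End");
        index.add("doc2", "Jet lag in May");
        System.out.println(index.lookup("end")); // [doc1]
        System.out.println(index.lookup("nd"));  // []
    }
}
```

The index is keyed by whole words, so "end" is found directly while the fragment "nd" is not a key at all; answering a substring query would still mean testing every entry.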
For now, you'll either have to index on entire values, or you can add an
assistant predicate to your queries to narrow the set that the
contains() query will be performed against.
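One way to read "assistant predicate" (the term is not defined in this thread, so this is an assumption) is an extra exact-match condition that an index can resolve first, leaving contains() to run over a much smaller set. In XPath that might look like /people/person[@dept='eng'][contains(@lname,'Brad')], where the dept attribute and the path are invented for the example. The same idea in plain Java, with a map standing in for the value index:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the data and the "dept" field are made up.
final class AssistantPredicateSketch {

    // Exact-value index: department -> last names filed under it.
    static final Map<String, List<String>> BY_DEPT = new HashMap<>();
    static {
        BY_DEPT.put("eng", Arrays.asList("Bradford", "Staken"));
        BY_DEPT.put("ops", Arrays.asList("Stang", "Brady"));
    }

    // Step 1: the exact match narrows the candidates with one index lookup.
    // Step 2: the substring test runs only over that narrowed set.
    static List<String> search(String dept, String fragment) {
        List<String> hits = new ArrayList<>();
        for (String lname : BY_DEPT.getOrDefault(dept, Collections.emptyList())) {
            if (lname.contains(fragment)) {
                hits.add(lname);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("eng", "Brad")); // [Bradford]
        System.out.println(search("ops", "Brad")); // [Brady]
    }
}
```

The substring test is no cheaper per value than before; the gain comes entirely from how few values reach it.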
> I use the CVS version of Xindice, and during the build I saw some
> message about debugging code and no optimization. Is there something to
> optimize?
We've approached this first version of Xindice abiding by the basic
rules of software optimization, so there has been absolutely no
optimization done to any of the code. Our goal with this first release
is to nail down functionality; we'll worry about performance later on.
Unfortunately, there's only so much you can do with Java, so we'll be
somewhat limited. I do have some thoughts on this, and will get around
to jotting them down some day soon.
--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
Creator - Project Labrador (Web Services) - http://xml-labrador.sf.net