Posted to xindice-users@xml.apache.org by Thomas Sempf <xi...@chaosmachine.de> on 2002/01/14 12:53:10 UTC

Speed issue

Hello,

I have a question concerning the speed of Xindice. I have some XML files 
in the db, about 1200 files totalling 8.5 MB. It's only text, with some 
attributes and tags. When I start an XPath query for a specific word, it 
takes very long, about 1 minute. Is it possible to set an index for a 
specific word? Or is it only possible for tags and attributes?
I use the CVS version of Xindice, and during the build I saw some messages 
about debugging code and no optimization. Is there something to optimize?

With regards
Thomas Sempf

Re: Speed issue

Posted by Murray Altheim <mu...@sun.com>.

Tom Bradford wrote:
[...]
> > I use the CVS version of Xindice, and during the build I saw some
> > messages about debugging code and no optimization. Is there something to
> > optimize?
> 
> We've approached this first version of Xindice abiding by the basic
> rules of software optimization, so there has been absolutely no
> optimization done to any of the code.  Our goal with this first release
> is to nail down functionality; we'll worry about performance later on.
> Unfortunately, there's only so much you can do with Java, so we'll be
> somewhat limited.  I do have some thoughts on this, and will get around
> to jotting them down some day soon.

Speaking as a developer, not as a Sun employee (people who know me know
I'm not exactly "corporate" despite liking where I work), I've not found 
Java's performance to be lagging much at all when certain principles are 
followed. There are a number of good Java performance books on the market 
(e.g. "Java 2 Performance and Idiom Guide" by Larman and Guthrie) which 
not only help you improve your application's speed but your coding
practices as well.

I've noted significant differences by just minimizing String creation (!),
fixing my misunderstanding of how arrays and StringBuffers work in Java, 
and using various other tricks (such as JIT compilers and playing with 
the VM settings). With intelligent use of indices in Xindice, you should
also be able to improve the performance. In something as String intensive
as Xindice I expect there's lots of room for optimization.
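
To illustrate the String-creation point, here is a minimal sketch. It is not taken from the Xindice sources, and the class and method names are made up for the example. It contrasts repeated concatenation, which builds a new String on every pass, with appending into a single buffer. StringBuffer is used because it is the class mentioned above; later JDKs also offer the unsynchronized StringBuilder with the same append/toString calls.

```java
public class JoinDemo {
    // Builds a new String object on every iteration of the loop.
    static String joinSlow(String[] parts) {
        String out = "";
        for (int i = 0; i < parts.length; i++) {
            out += parts[i];
        }
        return out;
    }

    // Appends into one growing buffer and creates a single String at the end.
    static String joinFast(String[] parts) {
        StringBuffer buf = new StringBuffer();
        for (int i = 0; i < parts.length; i++) {
            buf.append(parts[i]);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"<lname>", "Bradford", "</lname>"};
        // Both methods produce the same text; only the allocation pattern differs.
        System.out.println(joinSlow(parts).equals(joinFast(parts)));
    }
}
```

Whether the difference matters in practice depends on how hot the loop is, so it is worth measuring before rewriting anything.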

Murray

...........................................................................
Murray Altheim                         <mailto:murray.altheim@sun.com>
XML Technology Center, Java and XML Software
Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo Park, CA 94025

            Corporations do not have human rights, despite the 
          altogether too-human opinions of the US Supreme Court.

Re: Speed issue

Posted by David BERNARD <dw...@java-fan.com>.

Hi,

> If an index is available for lname, it will be used, but *not* for a
> contains query because a contains query is essentially a substring
> search which can't benefit from index retrieval, so you're essentially
> doing a collection scan.  Also, because you're using //*, you're telling
> the XPath processor to look at every single element of every document to
> determine if it has a lname attribute (for the first one) or an lname
> element child (for the second one).  This is incredibly slow, even for a
> single document because the XPath processor is forced to visit every
> node instead of searching from the root.

I've recently worked with search engines, and for this kind of search I think
a dedicated one (such as Lucene, http://jakarta.apache.org/lucene/) would be a better fit.
For example, with Lucene you define your own Document fields and index definitions in code.

In my current project, I will integrate a search engine (for similarity search) as a service
for XML:DB (I plan to use Xindice to store the data).

-- 
--------------------------------------------------------------
David "Dwayne" Bernard             Freelance Developer (Java)
                                   mailto:dwayne@java-fan.com
      \|/                          http://dwayne.java-fan.com
--o0O @.@ O0o-------------------------------------------------

Re: Speed issue

Posted by Tom Bradford <br...@dbxmlgroup.com>.

On Wednesday, January 16, 2002, at 05:40 PM, Mark J. Stang wrote:
> Then using "contains" will be "incredibly slow", what do you advise for
> quicker searches?   Could you elaborate a bit on how slow/fast
> Xindice XPath searches are?   My favorites are Contains, Starts with and
> equals.

starts-with and boolean expressions are supported without any problems, 
and are generally resolved rather quickly using B+Tree indexing.  
Obviously, the overall performance depends on how large your collections 
are and how much data you're indexing.  The reason contains() will 
never be quick is this:

       F   J   P   V
     |   |   |   |   |
     C   H   L   T   Y
   |   |   |   |   |   |
  And End Jet May Why Zap

If you performed a contains('nd'), you're really saying that you want 
any value that contains the character sequence 'nd' anywhere in the 
value.  In this case, And and End match the query, but And and End are 
in two different pages.  The indexing system collates starting at the 
beginning of the value, so there's no way to quickly find a value based 
on a substring.  Even with a full-text index this wouldn't be a quick 
operation unless you performed the query:

    contains(' nd ')

In which case, the query processor might use logic that identifies 'nd' 
as being a whole word, in which case, some optimizations might be 
performed.

This is also the reason SQL queries that use LIKE "%value%" are so slow.
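
The difference can be sketched with a plain sorted set standing in for the index. This is only an analogy under stated assumptions: java.util.TreeSet is not a B+Tree and is not what Xindice uses, and the class and method names are invented for the example. The point is that a prefix lookup can seek to one place in the sorted order and stop at the first non-match, while a substring lookup has to look at every value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class PrefixVsSubstring {
    // starts-with: seek to the first value >= prefix, stop at the first non-match.
    static List<String> startsWith(TreeSet<String> index, String prefix) {
        List<String> hits = new ArrayList<String>();
        for (String value : index.tailSet(prefix, true)) {
            if (!value.startsWith(prefix)) {
                break; // sorted order guarantees no later value can match
            }
            hits.add(value);
        }
        return hits;
    }

    // contains: matches can sit anywhere in the ordering, so every value is visited.
    static List<String> contains(TreeSet<String> index, String fragment) {
        List<String> hits = new ArrayList<String>();
        for (String value : index) {
            if (value.contains(fragment)) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<String>(
            Arrays.asList("And", "End", "Jet", "May", "Why", "Zap"));
        System.out.println(startsWith(index, "E")); // prints [End]
        System.out.println(contains(index, "nd"));  // prints [And, End]
    }
}
```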

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
Creator - Project Labrador (Web Services Framework) - 
http://notdotnet.org

Re: Speed issue

Posted by Steven Noels <st...@outerthought.org>.

On Wed, 16 Jan 2002, Mark J. Stang wrote:

> Tom,
> Then using "contains" will be "incredibly slow", what do you advise for
> quicker searches?   Could you elaborate a bit on how slow/fast
> Xindice XPath searches are?   My favorites are Contains, Starts with and
> equals.
> 

If these constructs make up the majority of your intended searches, I'm not so
sure whether you should use Xindice...

Lucene (http://jakarta.apache.org/lucene/) is a Java full-text indexing &
retrieval engine that is known to work with XML (packaged as a
search-engine with Cocoon).

Regards,

</Steven>

Re: Speed issue

Posted by "Mark J. Stang" <ma...@earthlink.net>.

Tom,
Then using "contains" will be "incredibly slow", what do you advise for
quicker searches?   Could you elaborate a bit on how slow/fast
Xindice XPath searches are?   My favorites are Contains, Starts with and
equals.

thanks,

Mark

Tom Bradford wrote:

> On Tuesday, January 15, 2002, at 08:28 AM, Mark J. Stang wrote:
> > What Steven wrote helps and it creates a new question.
> > Examples:
> > //*[contains(@lname,'Bradford')]
> > //*/lname[contains(self::*,'Staken')]
>
> If an index is available for lname, it will be used, but *not* for a
> contains query because a contains query is essentially a substring
> search which can't benefit from index retrieval, so you're essentially
> doing a collection scan.  Also, because you're using //*, you're telling
> the XPath processor to look at every single element of every document to
> determine if it has a lname attribute (for the first one) or an lname
> element child (for the second one).  This is incredibly slow, even for a
> single document because the XPath processor is forced to visit every
> node instead of searching from the root.
>
> --
> Tom Bradford - http://www.tbradford.org
> Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
> Creator - Project Labrador (Web Services) - http://xml-labrador.sf.net

Re: Speed issue

Posted by Tom Bradford <br...@dbxmlgroup.com>.

On Tuesday, January 15, 2002, at 08:28 AM, Mark J. Stang wrote:
> What Steven wrote helps and it creates a new question.
> Examples:
> //*[contains(@lname,'Bradford')]
> //*/lname[contains(self::*,'Staken')]

If an index is available for lname, it will be used, but *not* for a 
contains query because a contains query is essentially a substring 
search which can't benefit from index retrieval, so you're essentially 
doing a collection scan.  Also, because you're using //*, you're telling 
the XPath processor to look at every single element of every document to 
determine if it has a lname attribute (for the first one) or an lname 
element child (for the second one).  This is incredibly slow, even for a 
single document because the XPath processor is forced to visit every 
node instead of searching from the root.
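
A rough way to see how many nodes each form has to consider is to count the candidates before the predicate is applied. The sketch below uses the JDK's javax.xml.xpath package rather than Xindice's own query service, so it says nothing about how the Xindice resolver uses indexes. The sample document, the /people/person path, and the class name are assumptions made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PathDemo {
    // Hypothetical sample document: 1 root, 2 person elements, 2 city elements.
    static final String XML =
        "<people>"
      + "<person lname='Bradford'><city>Phoenix</city></person>"
      + "<person lname='Staken'><city>Tempe</city></person>"
      + "</people>";

    // Returns how many nodes the expression selects in the sample document.
    static int count(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Candidate elements each form must test before applying contains().
        System.out.println(count("//*"));            // prints 5
        System.out.println(count("/people/person")); // prints 2
        // Both forms select the same single match in this document.
        System.out.println(count("//*[contains(@lname,'Brad')]"));
        System.out.println(count("/people/person[contains(@lname,'Brad')]"));
    }
}
```

On a real collection the gap between the two candidate counts grows with the number of elements per document, which is why naming the path matters even before indexes come into play.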

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
Creator - Project Labrador (Web Services) - http://xml-labrador.sf.net

Re: Speed issue

Posted by "Mark J. Stang" <ma...@earthlink.net>.

What Steven wrote helps and it creates a new question.

If I create indexes on my tags and attributes, when I use an
XPath query that specifies one of those tags, will the index
be used?

Examples:
//*[contains(@lname,'Bradford')]
//*/lname[contains(self::*,'Staken')]

In these examples, "lname" is in one case a tag and in another case an
attribute.   If I index on the tag and the attribute, then Xindice
shouldn't have to search every piece of text in the document, only those
tags and attributes for each document in the collection.

Tom/Kimbro/et al., is this correct? Does Xindice work this way?

thanks,

Mark

Steven Noels wrote:

> > -----Original Message-----
> > From: Mark J. Stang [mailto:markstang@earthlink.net]
> > Sent: Tuesday 15 January 2002 7:46
> > To: xindice-users@xml.apache.org
> > Subject: Re: Speed issue
> >
> >
> > Tom,
> > Two questions:
> > 1) What is an "assistant predicate"?
> >
> > 2)I have been doing searches using the following format:
> >
> > "//*[contains(@"+getFieldName()+",'"+getText()+"')]";
> > "//*/"+getFieldName()+"[contains(self::*,'"+getText()+"')]";
> >
>
> Seems like your XPath expressions are the culprit: what you are
> basically asking is to check each & every element in the collection
> (//*) for a certain substring.
>
> Indices will not help much, I'm afraid.
>
> Is it not possible to indicate the element (or path to the element) in
> which the substring should occur?
>
> /root/some_path_towards/element[contains()]
>
> Using // basically short-circuits any optimization possible.
>
> Hope this helps,
>
> Steven Noels
> http://outerthought.org/
> (+32)478 292900

RE: Speed issue

Posted by Steven Noels <st...@outerthought.org>.

> -----Original Message-----
> From: Mark J. Stang [mailto:markstang@earthlink.net]
> Sent: Tuesday 15 January 2002 7:46
> To: xindice-users@xml.apache.org
> Subject: Re: Speed issue
>
>
> Tom,
> Two questions:
> 1) What is an "assistant predicate"?
>
> 2)I have been doing searches using the following format:
>
> "//*[contains(@"+getFieldName()+",'"+getText()+"')]";
> "//*/"+getFieldName()+"[contains(self::*,'"+getText()+"')]";
>

Seems like your XPath expressions are the culprit: what you are
basically asking is to check each & every element in the collection
(//*) for a certain substring.

Indices will not help much, I'm afraid.

Is it not possible to indicate the element (or path to the element) in
which the substring should occur?

/root/some_path_towards/element[contains()]

Using // basically short-circuits any optimization possible.

Hope this helps,

Steven Noels
http://outerthought.org/
(+32)478 292900

Re: Speed issue

Posted by "Mark J. Stang" <ma...@earthlink.net>.

Tom,
Two questions:
1) What is an "assistant predicate"?

2)I have been doing searches using the following format:

"//*[contains(@"+getFieldName()+",'"+getText()+"')]";
"//*/"+getFieldName()+"[contains(self::*,'"+getText()+"')]";

Are these done via a collection scan also?  And if so, will indexing
attributes and tags help?   What do you suggest to speed up searches?

Sorry, turns out to be more than two questions.

thanks,

Mark


Tom Bradford wrote:

> On Monday, January 14, 2002, at 04:53 AM, Thomas Sempf wrote:
> > I have a question concerning the speed of Xindice. I have some
> > XML files in the db, about 1200 files totalling 8.5 MB. It's only text,
> > with some attributes and tags. When I start an XPath query for a
> > specific word, it takes very long, about 1 minute. Is it possible to
> > set an index for a specific word? Or is it only possible for tags and
> > attributes?
>
> When you do a contains() query, the resolver has no other choice but to
> do a collection scan.  This means inspecting every document in the
> collection in sequence.
>
> I was working on a full text indexer in the past, and may commit it, but
> it wouldn't be very useful right away.  The problem is that XPath
> doesn't support the concept of issuing full text queries unless we were
> to develop our own extension function.  Using the contains function is a
> substring search and wouldn't be able to operate against a full text
> index in most cases.
>
> For now, you'll either have to index on entire values, or you can add an
> assistant predicate to your queries to narrow the set that the
> contains() query will be performed against.
>
> > I use the CVS version of Xindice, and during the build I saw some
> > messages about debugging code and no optimization. Is there something to
> > optimize?
>
> We've approached this first version of Xindice abiding by the basic
> rules of software optimization, so there has been absolutely no
> optimization done to any of the code.  Our goal with this first release
> is to nail down functionality; we'll worry about performance later on.
> Unfortunately, there's only so much you can do with Java, so we'll be
> somewhat limited.  I do have some thoughts on this, and will get around
> to jotting them down some day soon.
>
> --
> Tom Bradford - http://www.tbradford.org
> Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
> Creator - Project Labrador (Web Services) - http://xml-labrador.sf.net

Re: Speed issue

Posted by Tom Bradford <br...@dbxmlgroup.com>.

On Monday, January 14, 2002, at 04:53 AM, Thomas Sempf wrote:
> I have a question concerning the speed of Xindice. I have some 
> XML files in the db, about 1200 files totalling 8.5 MB. It's only text, 
> with some attributes and tags. When I start an XPath query for a 
> specific word, it takes very long, about 1 minute. Is it possible to 
> set an index for a specific word? Or is it only possible for tags and 
> attributes?

When you do a contains() query, the resolver has no other choice but to 
do a collection scan.  This means inspecting every document in the 
collection in sequence.

I was working on a full text indexer in the past, and may commit it, but 
it wouldn't be very useful right away.  The problem is that XPath 
doesn't support the concept of issuing full text queries unless we were 
to develop our own extension function.  Using the contains function is a 
substring search and wouldn't be able to operate against a full text 
index in most cases.
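
The mismatch between a word index and contains() can be shown with a toy inverted index. This is a sketch only, not the unreleased indexer mentioned above. The class name, the tokenizing rule (split on non-word characters, lower-cased), and the sample texts are all assumptions for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class WordIndexDemo {
    // Maps each lower-cased whole word to the ids of the documents containing it.
    static Map<String, Set<Integer>> build(String[] docs) {
        Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();
        for (int id = 0; id < docs.length; id++) {
            for (String word : docs[id].toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                Set<Integer> ids = index.get(word);
                if (ids == null) {
                    ids = new TreeSet<Integer>();
                    index.put(word, ids);
                }
                ids.add(id);
            }
        }
        return index;
    }

    // Whole-word lookup: a single map access, no scan of the documents.
    static Set<Integer> lookup(Map<String, Set<Integer>> index, String word) {
        Set<Integer> ids = index.get(word.toLowerCase());
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }

    public static void main(String[] args) {
        String[] docs = {"The end of the road", "And so on", "Bend the rules"};
        Map<String, Set<Integer>> index = build(docs);
        System.out.println(lookup(index, "end")); // prints [0]
        // "nd" occurs inside end, And and Bend, but it is not a word,
        // so the index has no entry for it.
        System.out.println(lookup(index, "nd"));  // prints []
    }
}
```

A substring search for "nd" would have to match all three sample texts, which is exactly the answer the word index cannot give without scanning its whole vocabulary.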

For now, you'll either have to index on entire values, or you can add an 
assistant predicate to your queries to narrow the set that the 
contains() query will be performed against.
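
The term "assistant predicate" is not defined further in this thread. One plausible reading is an extra, exact-match condition that shrinks the candidate set before the substring test runs, for example something like /people/person[@state='AZ'][contains(@lname,'Brad')], where the path and the state attribute are hypothetical. The sketch below models that idea with plain Java collections. It is not Xindice code, and every name in it is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NarrowThenScan {
    // Hypothetical record standing in for two fields of a stored document.
    static final class Person {
        final String state;
        final String lname;
        Person(String state, String lname) {
            this.state = state;
            this.lname = lname;
        }
    }

    static List<Person> sample() {
        return Arrays.asList(
            new Person("AZ", "Bradford"),
            new Person("AZ", "Staken"),
            new Person("CA", "Bradley"));
    }

    // Exact-match index on an entire value: state -> records with that state.
    static Map<String, List<Person>> indexByState(List<Person> all) {
        Map<String, List<Person>> index = new HashMap<String, List<Person>>();
        for (Person p : all) {
            List<Person> bucket = index.get(p.state);
            if (bucket == null) {
                bucket = new ArrayList<Person>();
                index.put(p.state, bucket);
            }
            bucket.add(p);
        }
        return index;
    }

    // Narrow with the exact-match lookup first, then substring-scan only that bucket.
    static List<String> search(Map<String, List<Person>> byState,
                               String state, String fragment) {
        List<Person> candidates = byState.get(state);
        if (candidates == null) {
            candidates = Collections.<Person>emptyList();
        }
        List<String> hits = new ArrayList<String>();
        for (Person p : candidates) {
            if (p.lname.contains(fragment)) {
                hits.add(p.lname);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<Person>> byState = indexByState(sample());
        System.out.println(search(byState, "AZ", "Brad")); // prints [Bradford]
    }
}
```

The substring test still runs, but only over the records the exact-match lookup returned, so the cost scales with the bucket rather than the whole collection.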

> I use the CVS version of Xindice, and during the build I saw some 
> messages about debugging code and no optimization. Is there something to 
> optimize?

We've approached this first version of Xindice abiding by the basic 
rules of software optimization, so there has been absolutely no 
optimization done to any of the code.  Our goal with this first release 
is to nail down functionality; we'll worry about performance later on.  
Unfortunately, there's only so much you can do with Java, so we'll be 
somewhat limited.  I do have some thoughts on this, and will get around 
to jotting them down some day soon.

--
Tom Bradford - http://www.tbradford.org
Developer - Apache Xindice (Native XML Database) - http://xml.apache.org
Creator - Project Labrador (Web Services) - http://xml-labrador.sf.net