Posted to java-user@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/02 12:55:49 UTC

Lucene performance bottlenecks

Hi,

I'm doing some performance profiling of a Nutch installation, working 
with relatively large individual indexes (10 mln docs), and I'm puzzled 
with the results.

Here's the listing of the index:
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f0
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f1
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f2
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f3
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f4
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f5
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f6
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f7
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f8
-rw-r--r--  1 andrzej andrzej  2494445020 Dec  2 04:58 _0.fdt
-rw-r--r--  1 andrzej andrzej    78424800 Dec  2 04:58 _0.fdx
-rw-r--r--  1 andrzej andrzej          92 Dec  2 04:55 _0.fnm
-rw-r--r--  1 andrzej andrzej  7436259508 Dec  2 05:24 _0.frq
-rw-r--r--  1 andrzej andrzej 12885589796 Dec  2 05:24 _0.prx
-rw-r--r--  1 andrzej andrzej     3483642 Dec  2 05:24 _0.tii
-rw-r--r--  1 andrzej andrzej   280376933 Dec  2 05:24 _0.tis
-rw-r--r--  1 andrzej andrzej           4 Dec  2 05:25 deletable
-rw-r--r--  1 andrzej andrzej          27 Dec  2 05:25 segments


I run it on an AMD Opteron 246, 2Ghz, 4GB RAM, java -version says:

Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_05-b05, mixed mode)

I run it with a heap of 1.5-2.5 GB, which doesn't make any difference 
(see below). I'm using the latest SVN code (from yesterday) + 
performance enhancements to ConjunctionScorer and BooleanScorer2 from JIRA.

The performance is less than impressive, response times being more than 
1 sec. Nutch produces complex queries for phrases, so the user query 
"term1 term2" gets rewritten like this:

+(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 
host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 
title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0 
anchor:"term1 term2"~4^2.0 content:"term1 term2"~2147483647 title:"term1 
term2"~2147483647^1.5 host:"term1 term2"~2147483647^2.0
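
For reference, a rough sketch of building an equivalent query programmatically with the Lucene API (term1/term2 are placeholders; the pre-1.9 BooleanQuery.add(Query, required, prohibited) form is used here only for brevity, and the Integer.MAX_VALUE slop mirrors the ~2147483647 phrases above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NutchLikeQuery {

  // Builds a boosted term disjunction like (url:term^4.0 anchor:term^2.0 ...).
  static Query termClause(String term) {
    BooleanQuery bq = new BooleanQuery();
    bq.add(boosted(new TermQuery(new Term("url", term)), 4.0f), false, false);
    bq.add(boosted(new TermQuery(new Term("anchor", term)), 2.0f), false, false);
    bq.add(new TermQuery(new Term("content", term)), false, false);
    bq.add(boosted(new TermQuery(new Term("title", term)), 1.5f), false, false);
    bq.add(boosted(new TermQuery(new Term("host", term)), 2.0f), false, false);
    return bq;
  }

  // Builds a sloppy phrase clause like url:"term1 term2"~2147483647^4.0.
  static Query phraseClause(String field, String t1, String t2, int slop, float boost) {
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term(field, t1));
    pq.add(new Term(field, t2));
    pq.setSlop(slop);
    pq.setBoost(boost);
    return pq;
  }

  static Query boosted(Query q, float boost) {
    q.setBoost(boost);
    return q;
  }

  public static void main(String[] args) {
    String t1 = "term1", t2 = "term2";
    BooleanQuery q = new BooleanQuery();
    q.add(termClause(t1), true, false);   // required
    q.add(termClause(t2), true, false);   // required
    q.add(phraseClause("url", t1, t2, Integer.MAX_VALUE, 4.0f), false, false);
    q.add(phraseClause("anchor", t1, t2, 4, 2.0f), false, false);
    q.add(phraseClause("content", t1, t2, Integer.MAX_VALUE, 1.0f), false, false);
    q.add(phraseClause("title", t1, t2, Integer.MAX_VALUE, 1.5f), false, false);
    q.add(phraseClause("host", t1, t2, Integer.MAX_VALUE, 2.0f), false, false);
    System.out.println(q.toString("content"));
  }
}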

For a simple TermQuery, if the DF(term) is above 10%, the response time 
from IndexSearcher.search() is around 400ms (repeatable, after warm-up). 
For such complex phrase queries the response time is around 1 sec or 
more (again, after warm-up).

Initially I thought the process was I/O or heap/GC bound - this is a large 
index, after all - but the profiler shows it's purely CPU bound. I tracked 
the bottleneck to the scorers (see my previous email on this), but also 
to IndexInput.readVInt. What's even more curious, most of the heap is 
unused - I had the impression that Lucene tries to read as much of the 
index as it can into memory in order to speed up access, but 
apparently that's not the case. The heap consumption was always on the 
order of 100-200 MB, no matter how large a heap I set (and I tried values 
between 1 and 4 GB).

For those interested in profiler info, look here:

http://www.getopt.org/lucene/20051202/

Here's an example of elapsed times [ms] for IndexSearcher.search, and 
for getting the first 100 docs using Hits.doc(i):

19. Complex search1:
 search: 1309
 hits.doc: 4
19. Complex search2:
 search: 2492
 hits.doc: 5
19. Simple search:
 search: 392
 hits.doc: 5
20. Complex search1:
 search: 1307
 hits.doc: 5
20. Complex search2:
 search: 2499
 hits.doc: 5
20. Simple search:
 search: 391
 hits.doc: 5
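
For context, the numbers above come from a measurement loop roughly along the following lines - a sketch using the IndexSearcher.search(Query)/Hits API, not the actual test code:

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchTimer {
  // Times IndexSearcher.search() and fetching the first 100 docs via Hits.doc(i).
  static void time(IndexSearcher searcher, Query query) throws Exception {
    long t0 = System.currentTimeMillis();
    Hits hits = searcher.search(query);
    long t1 = System.currentTimeMillis();
    int n = Math.min(100, hits.length());
    for (int i = 0; i < n; i++) {
      Document doc = hits.doc(i);   // forces the stored fields to be loaded
    }
    long t2 = System.currentTimeMillis();
    System.out.println(" search: " + (t1 - t0));
    System.out.println(" hits.doc: " + (t2 - t1));
  }
}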


I would appreciate any suggestions on how to proceed with this...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: DistributingMultiFieldQueryParser and DisjunctionMaxQuery

Posted by Yonik Seeley <ys...@gmail.com>.
On 12/14/05, Chuck Williams <ch...@manawiz.com> wrote:
> If there is some specific reason it is not deemed suitable
> to commit, please let me know.  It is much harder to use
> DisjunctionMaxQuery without this parser.

Hey Chuck,
  I committed DisjunctionMaxQuery after I took the time to understand
it, and realized how much it was needed in certain scenarios.  I
haven't committed DistributingMultiFieldQueryParser because I honestly
haven't had a chance to look at it yet.

-Yonik



Re: DistributingMultiFieldQueryParser and DisjunctionMaxQuery

Posted by Chuck Williams <ch...@manawiz.com>.
----- Original Message -----
*From:* Miles Barr <mi...@runtime-collective.com>
*To:* java-user@lucene.apache.org
*Sent:* 12/14/2005 12:43:04 AM
*Subject:* DistributingMultiFieldQueryParser and DisjunctionMaxQuery


>On Tue, 2005-12-13 at 11:51 -0800, Chris Hostetter wrote:
>  
>
>>As I mentioned in the comments for LUCENE-323,
>>DistributingMultiFieldQueryParser seems to be more of a demo of what's
>>possible with DisjunctionMaxQuery -- not necessarily a full-fledged
>>QueryParser.  I think that's why it wasn't committed (even though
>>DisjunctionMaxQuery was), and the issue was left open.
>>    
>>
It is not intended to be a demo.  I use it for real and believe it is
complete and correct.  At the moment, it uses a QueryParser API that is
deprecated in the latest Lucene source, but it still works.  There are a
couple of To Do's marked where it could be improved, but its current
behavior is acceptable.

>I've only had a quick play with it so this problem is probably down to
>my misuse of the class but I found that negations weren't handled
>properly. e.g.
>
>fruit AND -apples
>
>The DistributingMultiFieldQueryParser would correctly generate a query
>that would find fruit in one of the fields, but would only ensure that
>apples did not appear in one field, not that it did not appear in any of
>the fields, which was the behaviour I wanted. Hence negations didn't
>really work if the term appeared in more than one field.
>  
>
>
>I just tested putting together a prohibited boolean query with a
>DisjunctionMaxQuery programmatically rather than via the
>DistributingMultiFieldQueryParser and it works fine. 
>  
>
Would you mind submitting a test case that shows the problem, as I cannot
replicate it?  E.g., the attached test case runs an equivalent query,
"fruit AND -plum", and works properly.  Negation should work fine in
general.  The transformation performed on BooleanQuery's is this:
  BooleanQuery (q1.occur1 ... qn.occurn) applied to fields (f1 ... fm)
==> ((q1 applied to f1...fm).occur1 ... (qn applied to f1...fm).occurn)
So for MUST_NOT clauses, the NOT is scoped around the OR over the fields,
and so a document is excluded if the value is found in any of the fields.
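
For illustration, a sketch of that transformation for the "fruit AND -plum" case from the test output below (not the parser source itself; the field names, boosts and 0.1 tie breaker are simply taken from the sample output):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DistributeNegation {
  // Distributes a single prohibited term over several fields:
  // the MUST_NOT wraps the whole disjunction, so a match in any field excludes the doc.
  static BooleanQuery fruitAndNotPlum() {
    String[] fields = {"title", "body"};
    float[] boosts = {5.0f, 1.0f};

    BooleanQuery bq = new BooleanQuery();
    bq.add(distribute("fruit", fields, boosts), true, false);  // required
    bq.add(distribute("plum", fields, boosts), false, true);   // prohibited
    return bq;
  }

  static Query distribute(String text, String[] fields, float[] boosts) {
    DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f);   // tie breaker
    for (int i = 0; i < fields.length; i++) {
      TermQuery tq = new TermQuery(new Term(fields[i], text));
      tq.setBoost(boosts[i]);
      dmq.add(tq);
    }
    return dmq;
  }

  public static void main(String[] args) {
    System.out.println(fruitAndNotPlum().toString("body"));
  }
}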

If there are bugs in DistributingMultiFieldQueryParser, I will be happy
to fix them.  If there is some specific reason it is not deemed suitable
to commit, please let me know.  It is much harder to use
DisjunctionMaxQuery without this parser.

FYI, here is the output I get from the attached test case (running my
version of DistributingMultiFieldQueryParser, which is also attached in
case it is different than what you have):

------------- Standard Output ---------------
Collection:
  uid:{doc1}    title:{fruit}    body:{apple, pear, plum}
  uid:{doc2}    title:{plum fruit}    body:{delicious ripe plum}
  uid:{doc3}    title:{fruit medley}    body:{peach, banana, pear, cherry}

testParse
  Query:{fruit AND -plum} ==> +(title:fruit^5.0 | body:fruit)~0.1
-(title:plum^5.0 | body:plum)~0.1

  title:{fruit medley}    body:{peach, banana, pear, cherry}

------------- ---------------- ---------------

Chuck


Re: DistributingMultiFieldQueryParser and DisjunctionMaxQuery

Posted by Miles Barr <mi...@runtime-collective.com>.
On Tue, 2005-12-13 at 11:51 -0800, Chris Hostetter wrote:
> As I mentioned in the comments for LUCENE-323,
> DistributingMultiFieldQueryParser seems to be more of a demo of what's
> possible with DisjunctionMaxQuery -- not necessarily a full-fledged
> QueryParser.  I think that's why it wasn't committed (even though
> DisjunctionMaxQuery was), and the issue was left open.

I just tested putting together a prohibited boolean query with a
DisjunctionMaxQuery programmatically rather than via the
DistributingMultiFieldQueryParser and it works fine. 


Thanks for the heads up.


Miles




Re: DistributingMultiFieldQueryParser and DisjunctionMaxQuery

Posted by Chris Hostetter <ho...@fucit.org>.
: The DistributingMultiFieldQueryParser would correctly generate a query
: that would find fruit in one of the fields, but would only ensure that
: apples did not appear in one field, not that it did not appear in any of
: the fields, which was the behaviour I wanted. Hence negations didn't
: really work if the term appeared in more than one field.
:
: Has anyone else experienced this problem?

As I mentioned in the comments for LUCENE-323,
DistributingMultiFieldQueryParser seems to be more of a demo of what's
possible with DisjunctionMaxQuery -- not necessarily a full-fledged
QueryParser.  I think that's why it wasn't committed (even though
DisjunctionMaxQuery was), and the issue was left open.


-Hoss




DistributingMultiFieldQueryParser and DisjunctionMaxQuery

Posted by Miles Barr <mi...@runtime-collective.com>.
On Mon, 2005-12-12 at 15:35 -0800, Chris Hostetter wrote:
> : Oh, BTW:  I just found the DisjunctionMaxQuery class, recently added it
> : seems. Do you think this query structure could benefit from using it
> : instead of the BooleanQuery?
> 
> DisjunctionMaxQuery kicks ass (in my opinion), and it certainly seems like
> (from your query structure) it's something you might want to consider
> using, but I don't know that it will solve the performance problems you're
> having -- I can't think of any situations in which DisjunctionMaxScorer
> could skip more docs/terms than DisjunctionSumScorer.

I've only had a quick play with it so this problem is probably down to
my misuse of the class but I found that negations weren't handled
properly. e.g.

fruit AND -apples

The DistributingMultiFieldQueryParser would correctly generate a query
that would find fruit in one of the fields, but would only ensure that
apples did not appear in one field, not that it did not appear in any of
the fields, which was the behaviour I wanted. Hence negations didn't
really work if the term appeared in more than one field.

Has anyone else experienced this problem?



Cheers,
Miles





Re: Lucene performance bottlenecks

Posted by Chris Hostetter <ho...@fucit.org>.
: Oh, BTW:  I just found the DisjunctionMaxQuery class, recently added it
: seems. Do you think this query structure could benefit from using it
: instead of the BooleanQuery?

DisjunctionMaxQuery kicks ass (in my opinion), and it certainly seems like
(from your query structure) it's something you might want to consider
using, but I don't know that it will solve the performance problems you're
having -- I can't think of any situations in which DisjunctionMaxScorer
could skip more docs/terms than DisjunctionSumScorer.

I believe Paul already suggested writing a completely new type of
specialized Query/Scorer for your purposes -- one possibility I can
imagine is a specialized version of BooleanQuery that skips past documents
in the optional scorers once it's gotten a score from at least one other
optional scorer (or N other optional scorers).  Then as long as you're
careful to list the really useful sub-clauses (i.e., the shorter fields,
with the higher boosts) first, you can save a lot of calculation with
(what I expect would be) little loss in functionality.

for example, looking at your sample query...

+(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0)
+(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0)
url:"term1 term2"~2147483647^4.0
anchor:"term1 term2"~4^2.0
content:"term1 term2"~2147483647
title:"term1 term2"~2147483647^1.5
host:"term1 term2"~2147483647^2.0

...if you reordered those first two clauses like this...

+(url:term1^4.0 anchor:term1^2.0 host:term1^2.0 title:term1^1.5 content:term1)

...once the Scorers from the url term query and the anchor term query
have chimed in with a non-zero score for a document, the scores from the
other sub-clauses may not be very relevant in the final score calculation
... so why not skip them?
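
To make the idea concrete, here is a toy sketch written against a made-up SubScorer interface (deliberately not the real Lucene Scorer API), just to show the skipping strategy:

public class EarlyExitExample {

  // Hypothetical stand-in for a per-clause scorer; not a Lucene class.
  interface SubScorer {
    boolean matches(int doc);
    float score(int doc);
  }

  // Scores a doc against optional sub-scorers ordered by expected usefulness,
  // and stops once maxContributions of them have contributed.
  static float scoreOptional(int doc, SubScorer[] ordered, int maxContributions) {
    float sum = 0.0f;
    int contributions = 0;
    for (int i = 0; i < ordered.length && contributions < maxContributions; i++) {
      if (ordered[i].matches(doc)) {
        sum += ordered[i].score(doc);
        contributions++;
      }
    }
    return sum;  // the remaining, lower-priority clauses are skipped entirely
  }
}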





-Hoss




Re: Lucene performance bottlenecks

Posted by Andrzej Bialecki <ab...@getopt.org>.
Paul Elschot wrote:

>There is one indexing parameter that might help performance
>for BooleanScorer2, it is the skip interval in Lucene's TermInfosWriter.
>The current value is 16, and there was a question about it
>on 16 Oct 2005 on java-dev with title "skipInterval".
>I don't know how the value of skipInterval was initially determined.
>It's possible that a larger value gives somewhat better query
>performance in this case.
>Changing the skip interval might require reindexing, though.
>  
>

In Nutch the default is 128. And yes, changing this requires re-creating 
the index (actually, it's enough to optimize it, so that the .tii file 
is re-written).

>I considered a specialised scorer for the earlier query:
>
>+(url:term1^4.0 anchor:term1^2.0 content:term1
>   title:term1^1.5  host:term1^2.0)
>+(url:term2^4.0 anchor:term2^2.0 content:term2
>   title:term2^1.5 host:term2^2.0)
>url:"term1 term2"~2147483647^4.0 
>anchor:"term1 term2"~4^2.0
>content:"term1 term2"~2147483647
>title:"term1 term2"~2147483647^1.5
>host:"term1 term2"~2147483647^2.0
>  
>
[...]

Thank you for the detailed analysis. Currently we are pursuing a totally 
different approach: limiting the size of the index by clever selection 
of the most promising postings, and re-sorting the posting lists so that 
they are ordered according to a "pagerank"-like value, so that we can 
skip postings coming from less significant docs. Please see the 
nutch-dev discussion for more details.

Oh, BTW:  I just found the DisjunctionMaxQuery class, recently added it 
seems. Do you think this query structure could benefit from using it 
instead of the BooleanQuery?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Lucene performance bottlenecks

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 07 December 2005 10:51, Andrzej Bialecki wrote:
> Paul Elschot wrote:
> >On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote:
> >>Paul Elschot wrote:
> >>
...
> >>>This is one of the cases in which BooleanScorer2 can be faster
> >>>than the 1.4 BooleanScorer because the 1.4 BooleanScorer does
> >>>not use skipTo() for the optional clauses.
> >>>Could you try this by calling the static method
> >>>BooleanQuery.setUseScorer14(true) and repeating the test?
> >>>      
> >>>
> 
> 
> As far as I can tell it doesn't make any statistically significant 
> difference - all search times remain nearly the same. If anything the 
> test runs with useScorer14 == true are fractionally faster.
> 

There is one indexing parameter that might help performance
for BooleanScorer2, it is the skip interval in Lucene's TermInfosWriter.
The current value is 16, and there was a question about it
on 16 Oct 2005 on java-dev with title "skipInterval".
I don't know how the value of skipInterval was initially determined.
It's possible that a larger value gives somewhat better query
performance in this case.
Changing the skip interval might require reindexing, though.

I considered a specialised scorer for the earlier query:

+(url:term1^4.0 anchor:term1^2.0 content:term1
   title:term1^1.5  host:term1^2.0)
+(url:term2^4.0 anchor:term2^2.0 content:term2
   title:term2^1.5 host:term2^2.0)
url:"term1 term2"~2147483647^4.0 
anchor:"term1 term2"~4^2.0
content:"term1 term2"~2147483647
title:"term1 term2"~2147483647^1.5
host:"term1 term2"~2147483647^2.0

In this query, term1 and term2 are each used in 5 fields,
and each such combination is used in 2 clauses,
one required and one optional.
In the required clause the term scorer uses a TermDocs,
in the optional clause the phrase scorer uses a TermPositions.

For each combination of a query term and a field, a TermDocs
and a TermPositions is currently used.
TermPositions inherits from TermDocs, so there is
redundancy here, and one could try and reduce this by using
only a TermPositions. The redundancy consists, among other things, of
some duplicate readVInt calls on the index files for the TermDocs.
I don't know how much duplicate work is actually going on
behind the scenes in the IndexReader. This depends, among other
things, on how effective skipTo() on a TermPositions is.
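
For illustration, a single TermPositions already exposes both views, so one enumerator can drive both the doc/freq part and the positions part - a minimal sketch, with a hypothetical index path:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class OnePostingsPass {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/path/to/index");   // hypothetical path
    TermPositions tp = reader.termPositions(new Term("content", "term1"));
    while (tp.next()) {
      int doc = tp.doc();        // TermDocs part: document number
      int freq = tp.freq();      // TermDocs part: within-document frequency
      for (int i = 0; i < freq; i++) {
        int pos = tp.nextPosition();   // TermPositions part: proximity data
      }
    }
    tp.close();
    reader.close();
  }
}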

On the top level of the query, the optional phrases are
skipTo()'ed when the two required clauses match.
The required clauses only use the TermDocs info, so
it should be straightforward to implement the optional phrase.
There is a catch here in that the advanceAfterCurrent() method
for the disjunctions inside the required clauses does what it says:
it advances the scorers to documents after the current match,
and this makes it impossible to skipTo() a possible matching
phrase in the document being scored.
That means that to use only a TermPositions (and not also a
TermDocs) it will be necessary to rewrite DisjunctionSumScorer
so that it never advances after the current doc when the doc
matches.

For the rest one could try and refactor the existing scorers for
terms and phrases to use given TermDocs/TermPositions instead of
private ones.
A constructor for a specialized scorer for this query could be passed
term1 and term2, an array of the 5 field names and some
more arrays for weights and slops.

But there still are some unknowns: the cost of the method calls in
the current trees of scorers, how effective skipTo() is on a
TermPositions, and how much influence the skip interval has.

Regards,
Paul Elschot

P.S. Please feel free to use this on the nutch list(s).




Re: Lucene performance bottlenecks

Posted by Doug Cutting <cu...@apache.org>.
Paul Elschot wrote:
> Querying the host field like this in a web page index can be dangerous
> business. For example when term1 is "wikipedia" and term2 is "org",
> the query will match at least all pages from wikipedia.org.

Note that if you search for wikipedia.org in Nutch this is interpreted 
as an implicit phrase query and is expanded differently, as:

+(url:"wikipedia org"^4.0
   anchor:"wikipedia org"^2.0
   content:"wikipedia org"
   title:"wikipedia org"^1.5
   host:"wikipedia org"^2.0)

Note also that Nutch can use N-grams for common terms.  One can thus 
configure things so that this would be instead:

+(url:"wikipedia-org"^4.0
   anchor:"wikipedia org"^2.0
   content:"wikipedia org"
   title:"wikipedia org"^1.5
   host:"wikipedia-org"^2.0)

where wikipedia-org is a bigram term indexed when org occurs in the url 
or host field.

Doug



Re: Lucene performance bottlenecks

Posted by Andrzej Bialecki <ab...@getopt.org>.
Paul Elschot wrote:

>On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote:
>  
>
>>Paul Elschot wrote:
>>
>>    
>>
>>>In somewhat more readable layout:
>>>
>>>+(url:term1^4.0 anchor:term1^2.0 content:term1
>>>  title:term1^1.5  host:term1^2.0)
>>>+(url:term2^4.0 anchor:term2^2.0 content:term2
>>>  title:term2^1.5 host:term2^2.0)
>>>url:"term1 term2"~2147483647^4.0 
>>>anchor:"term1 term2"~4^2.0
>>>content:"term1 term2"~2147483647
>>>title:"term1 term2"~2147483647^1.5
>>>host:"term1 term2"~2147483647^2.0
>>>
>>>The first two clauses with + prefixes are required, and I would guess
>>>that the 5 way disjunctions inside these clauses take most of the cpu time
>>>during search.
>>> 
>>>
>>>      
>>>
>>That's an interesting observation. This suggests that it could pay off 
>>to glue these fields together and change this to a query on a single 
>>combined field, right? I.e. to trade off space for speed.
>>    
>>
>
>Yes.
>  
>

Unfortunately, that's not an option... Nutch uses these clauses to 
affect the final score value, i.e. it has to use different fields in 
order to apply different boost values per field, both in the query and 
in the encoded fieldNorms.

>Querying the host field like this in a web page index can be dangerous
>business. For example when term1 is "wikipedia" and term2 is "org",
>the query will match at least all pages from wikipedia.org.
>
>  
>

Well, that's the idea - all pages in wikipedia.org are somehow relevant 
to a query "wikipedia org". How relevant depends on the weights of 
individual clauses (in addition to the usual tf / idf / fieldNorm).

>>>The remaining clauses will be skipped to only when the two required
>>>clauses are both present, so these are probably not the problem.
>>>In the call tree for scoring these can be identified by the skipTo()
>>>being called inside the score() method at the top level.
>>>
>>>This is one of the cases in which BooleanScorer2 can be faster
>>>than the 1.4 BooleanScorer because the 1.4 BooleanScorer does
>>>not use skipTo() for the optional clauses.
>>>Could you try this by calling the static method
>>>BooleanQuery.setUseScorer14(true) and repeating the test?
>>>      
>>>


As far as I can tell it doesn't make any statistically significant 
difference - all search times remain nearly the same. If anything the 
test runs with useScorer14 == true are fractionally faster.

>From the hot spots output of the profiler info I see that the following methods:
>>>- PhrasePositions.nextPosition
>>>- SegmentTermDocs.read
>>>have a larger portion of cpu time spent in the method.
>>>Looking at the expanded query, seeing PhrasePositions.nextPosition
>>>here is not a surprise.
>>>
>>>The fact that SegmentTermDocs.read shows up might explain
>>>the reason why only a little heap is used:  Lucene normally leaves
>>>the file buffering to the operating system, and
>>>when the file buffers are read the index data is decompressed
>>>mostly by the readVInt method.
>>> 
>>>
>>>      
>>>
>>Yes, I understand it now. But perhaps it's time to contest this 
>>approach: if there is so much heap available, does it still make sense 
>>to rely so much on OS caching, if we have the space to do the caching 
>>inside JVM (at least for a large portion of the index)?
>>    
>>
>
>You can try and use Lucene's RAMDirectory when you have enough RAM.
>However, for larger indexes, it is easier to fit the index files in RAM
>(in OS buffers) than the decompressed index data (inside the JVM).
>
>Also, caching query results by query text is probably more effective
>than caching the JVM version of the searched index data.
>  
>

The problem here is not with the disk I/O, so I don't think RAMDirectory 
would help, even if I had the required 30GB of RAM - the problem is with 
the number of invocations of readVInt(), which for my test scenarios was 
called on average 2.5 times for a single doc in the index, per query (in 
this case, for a single query run against a 10 mln docs index it was 
invoked ~25 mln times). I tried changing the TermIndexInterval (I tested 
it with values ranging from 16 to 512), but there were no significant 
differences in speed, although of course the memory consumption was very 
different. I would happily trade a lot of heap space in order to 
increase performance of such complex queries. But at the moment there is 
no way to do it that I can see...
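
For reference, a sketch of how such an interval can be set at indexing time - assuming the setTermIndexInterval setter in the current development code (otherwise it means patching the constant in the Lucene source); this is just an illustration, not the exact code I used:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class ReindexWithInterval {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/path/to/new-index",   // hypothetical path
                                         new StandardAnalyzer(), true);
    writer.setTermIndexInterval(128);  // larger interval -> smaller .tii, less RAM
    // ... addDocument() calls for the collection go here ...
    writer.optimize();
    writer.close();
  }
}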

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Lucene performance bottlenecks

Posted by Yonik Seeley <ys...@gmail.com>.
I checked out readVInt() to see if I could optimize it any...
For a random distribution of integers <200 I was able to speed it up a
little bit, but nothing to write home about:

                   old      new   percent
Java14-client:   13547    12468        8%
Java14-server:    6047     5266       14%
Java15-client:   11688    11234        4%
Java15-server:    5813     4875       19%
Java16-client:   11125    10719        4%
Java16-server:    6031     4859       24%

Then I tested it with integers <128, and it was slower (up to 25%) for
Java15-server, Java16-server, Java16-client.  Since <128 could be an
important case, I stopped there.

On a P4 2.8GHz, I was getting around 180M readVInt() calls per second
for single byte VInts (including loop and method call overhead).

Here is the fastest version I could come up with on a P4.  It's faster
with variable-length vInts, slower with single bytes.

  public int readVInt() throws IOException {
    byte b = readByte();
    if ((b & 0x80) == 0) return b;              // high bit clear: single-byte value
    b &= 0x7f;
    byte b2 = readByte();
    if ((b2 & 0x80) == 0) return (b2 << 7) | b; // check the byte just read for the continuation bit
    b2 &= 0x7f;
    byte b3 = readByte();
    if ((b3 & 0x80) == 0) return (b3 << 14) | (b2 << 7) | b;
    b3 &= 0x7f;
    byte b4 = readByte();
    if ((b4 & 0x80) == 0) return (b4 << 21) | (b3 << 14) | (b2 << 7) | b;
    b4 &= 0x7f;
    byte b5 = readByte();
    return (b5 << 28) | (b4 << 21) | (b3 << 14) | (b2 << 7) | b;
  }


-Yonik



Re: Lucene performance bottlenecks

Posted by Paul Elschot <pa...@xs4all.nl>.
On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote:
> Paul Elschot wrote:
> 
> >In somewhat more readable layout:
> > 
> >+(url:term1^4.0 anchor:term1^2.0 content:term1
> >   title:term1^1.5  host:term1^2.0)
> >+(url:term2^4.0 anchor:term2^2.0 content:term2
> >   title:term2^1.5 host:term2^2.0)
> >url:"term1 term2"~2147483647^4.0 
> >anchor:"term1 term2"~4^2.0
> >content:"term1 term2"~2147483647
> >title:"term1 term2"~2147483647^1.5
> >host:"term1 term2"~2147483647^2.0
> >
> >The first two clauses with + prefixes are required, and I would guess
> >that the 5 way disjunctions inside these clauses take most of the cpu time
> >during search.
> >  
> >
> 
> That's an interesting observation. This suggests that it could pay off 
> to glue these fields together and change this to a query on a single 
> combined field, right? I.e. to trade off space for speed.

Yes.
Querying the host field like this in a web page index can be dangerous
business. For example when term1 is "wikipedia" and term2 is "org",
the query will match at least all pages from wikipedia.org.

> 
> >The remaining clauses will be skipped to only when the two required
> >clauses are both present, so these are probably not the problem.
> >In the call tree for scoring these can be identified by the skipTo()
> >being called inside the score() method at the top level.
> >
> >This is one of the cases in which BooleanScorer2 can be faster
> >than the 1.4 BooleanScorer because the 1.4 BooleanScorer does
> >not use skipTo() for the optional clauses.
> >Could you try this by calling the static method
> >BooleanQuery.setUseScorer14(true) and repeating the test?
> >  
> >
> 
> Ok, I'll be able to do this on Monday.
> 
> >
> >>For those interested in profiler info, look here:
> >>
> >>http://www.getopt.org/lucene/20051202/
> >>    
> >>
> >
> >Thanks for making this available.
> >I had a short look and noticed that the times in the call tree are cumulative.
> >Would it be possible to record cpu time per method by subtracting the
> >times of the called methods?
> >  
> >
> 
> I think so, I'll check it back on Monday.
> 
> >Another interesting statistic is the actual numbers of times the methods get
> >called. The scorers used by BooleanScorer2 can use rather deep
> >call trees for scoring. These method calls are not free, and I would like
> >to know whether these form a bottleneck. 
> >  
> >
> 
> These statistics are available when I run with high-detail profiling 
> (which slows down the JVM considerably, like 10x). I can collect these too, 
> if it's interesting.

One measurement at a time is more than enough for me, and the slow
down does not seem to make this a good candidate to do first.

> >From the hot spots output of the profiler info I see that the following methods:
> >- PhrasePositions.nextPosition
> >- SegmentTermDocs.read
> >have a larger portion of cpu time spent in the method.
> >Looking at the expanded query, seeing PhrasePositions.nextPosition
> >here is not a surprise.
> >
> >The fact that SegmentTermDocs.read shows up might explain
> >the reason why only a little heap is used:  Lucene normally leaves
> >the file buffering to the operating system, and
> >when the file buffers are read the index data is decompressed
> >mostly by the readVInt method.
> >  
> >
> 
> Yes, I understand it now. But perhaps it's time to contest this 
> approach: if there is so much heap available, does it still make sense 
> to rely so much on OS caching, if we have the space to do the caching 
> inside JVM (at least for a large portion of the index)?

You can try and use Lucene's RAMDirectory when you have enough RAM.
However, for larger indexes, it is easier to fit the index files in RAM
(in OS buffers) than the decompressed index data (inside the JVM).
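
A minimal sketch of the RAMDirectory option, assuming the whole index fits in the heap (the path is just a placeholder):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class RamSearcher {
  public static void main(String[] args) throws Exception {
    // Copies the on-disk index into the JVM heap, then searches it from there.
    RAMDirectory ramDir = new RAMDirectory("/path/to/index");  // hypothetical path
    IndexSearcher searcher = new IndexSearcher(ramDir);
    // ... run queries against searcher ...
    searcher.close();
  }
}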

Also, caching query results by query text is probably more effective
than caching the JVM version of the searched index data.

Regards,
Paul Elschot




Re: Lucene performance bottlenecks

Posted by Andrzej Bialecki <ab...@getopt.org>.
Paul Elschot wrote:

>In somewhat more readable layout:
> 
>+(url:term1^4.0 anchor:term1^2.0 content:term1
>   title:term1^1.5  host:term1^2.0)
>+(url:term2^4.0 anchor:term2^2.0 content:term2
>   title:term2^1.5 host:term2^2.0)
>url:"term1 term2"~2147483647^4.0 
>anchor:"term1 term2"~4^2.0
>content:"term1 term2"~2147483647
>title:"term1 term2"~2147483647^1.5
>host:"term1 term2"~2147483647^2.0
>
>The first two clauses with + prefixes are required, and I would guess
>that the 5 way disjunctions inside these clauses take most of the cpu time
>during search.
>  
>

That's an interesting observation. This suggests that it could pay off 
to glue these fields together and change this to a query on a single 
combined field, right? I.e. to trade off space for speed.

>The remaining clauses will be skipped to only when the two required
>clauses are both present, so these are probably not the problem.
>In the call tree for scoring these can be identified by the skipTo()
>being called inside the score() method at the top level.
>
>This is one of the cases in which BooleanScorer2 can be faster
>than the 1.4 BooleanScorer because the 1.4 BooleanScorer does
>not use skipTo() for the optional clauses.
>Could you try this by calling the static method
>BooleanQuery.setUseScorer14(true) and repeating the test?
>  
>

Ok, I'll be able to do this on Monday.

>
>>For those interested in profiler info, look here:
>>
>>http://www.getopt.org/lucene/20051202/
>>    
>>
>
>Thanks for making this available.
>I had a short look and noticed that the times in the call tree are cumulative.
>Would it be possible to record cpu time per method by subtracting the
>times of the called methods?
>  
>

I think so, I'll check it back on Monday.

>Another interesting statistic is the actual numbers of times the methods get
>called. The scorers used by BooleanScorer2 can use rather deep
>call trees for scoring. These method calls are not free, and I would like
>to know whether these form a bottleneck. 
>  
>

These statistics are available when I run with high-detail profiling 
(which slows down the JVM considerably, like 10x). I can collect these too, 
if it's interesting.

>From the hot spots output of the profiler info I see that the following methods:
>- PhrasePositions.nextPosition
>- SegmentTermDocs.read
>have a larger portion of cpu time spent in the method.
>Looking at the expanded query, seeing PhrasePositions.nextPosition
>here is not a surprise.
>
>The fact that SegmentTermDocs.read shows up might explain
>the reason why only a little heap is used:  Lucene normally leaves
>the file buffering to the operating system, and
>when the file buffers are read the index data is decompressed
>mostly by the readVInt method.
>  
>

Yes, I understand it now. But perhaps it's time to contest this 
approach: if there is so much heap available, does it still make sense 
to rely so much on OS caching, if we have the space to do the caching 
inside JVM (at least for a large portion of the index)?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Lucene performance bottlenecks

Posted by Paul Elschot <pa...@xs4all.nl>.
Andrzej,

On Friday 02 December 2005 12:55, Andrzej Bialecki wrote:
> Hi,
> 
> I'm doing some performance profiling of a Nutch installation, working 
> with relatively large individual indexes (10 mln docs), and I'm puzzled 
> with the results.
> 
> Here's the listing of the index:
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f0
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f1
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f2
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f3
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f4
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f5
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f6
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f7
> -rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f8
> -rw-r--r--  1 andrzej andrzej  2494445020 Dec  2 04:58 _0.fdt
> -rw-r--r--  1 andrzej andrzej    78424800 Dec  2 04:58 _0.fdx
> -rw-r--r--  1 andrzej andrzej          92 Dec  2 04:55 _0.fnm
> -rw-r--r--  1 andrzej andrzej  7436259508 Dec  2 05:24 _0.frq
> -rw-r--r--  1 andrzej andrzej 12885589796 Dec  2 05:24 _0.prx
> -rw-r--r--  1 andrzej andrzej     3483642 Dec  2 05:24 _0.tii
> -rw-r--r--  1 andrzej andrzej   280376933 Dec  2 05:24 _0.tis
> -rw-r--r--  1 andrzej andrzej           4 Dec  2 05:25 deletable
> -rw-r--r--  1 andrzej andrzej          27 Dec  2 05:25 segments
> 
> 
> I run it on an AMD Opteron 246, 2Ghz, 4GB RAM, java -version says:
> 
> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_05-b05, mixed mode)
> 
> I run it with a heap of 1.5-2.5 GB, which doesn't make any difference 
> (see below). I'm using the latest SVN code (from yesterday) + 
> performance enhancements to ConjunctionScorer and BooleanScorer2 from JIRA.
> 
> The performance is less than impressive, response times being more than 
> 1 sec. Nutch produces complex queries for phrases, so the user query 
> "term1 term2" gets rewritten like this:

In somewhat more readable layout:
 
+(url:term1^4.0 anchor:term1^2.0 content:term1
   title:term1^1.5  host:term1^2.0)
+(url:term2^4.0 anchor:term2^2.0 content:term2
   title:term2^1.5 host:term2^2.0)
url:"term1 term2"~2147483647^4.0 
anchor:"term1 term2"~4^2.0
content:"term1 term2"~2147483647
title:"term1 term2"~2147483647^1.5
host:"term1 term2"~2147483647^2.0

The first two clauses with + prefixes are required, and I would guess
that the 5 way disjunctions inside these clauses take most of the cpu time
during search.
The remaining clauses will be skipped to only when the two required
clauses are both present, so these are probably not the problem.
In the call tree for scoring these can be identified by the skipTo()
being called inside the score() method at the top level.

This is one of the cases in which BooleanScorer2 can be faster
than the 1.4 BooleanScorer because the 1.4 BooleanScorer does
not use skipTo() for the optional clauses.
Could you try this by calling the static method
BooleanQuery.setUseScorer14(true) and repeating the test?
 
> For a simple TermQuery, if the DF(term) is above 10%, the response time 
> from IndexSearcher.search() is around 400ms (repeatable, after warm-up). 
> For such complex phrase queries the response time is around 1 sec or 
> more (again, after warm-up).
> 
> Initially I thought the process is I/O or heap/GC bound, this is a large 
> index after all, but the profiler shows it's purely CPU bound. I tracked 
> the bottleneck to the scorers (see my previous email on this), but also 
> to IndexInput.readVInt.. What's even more curious, most of the heap is 
> unused - I had the impression that Lucene tries to read as much of the 
> index as it can to memory in order to speed up the access, but 
> apparently that's not the case. The heap consumption was always in the 
> order of 100-200MB, no matter how large heap I set (and I tried values 
> between 1-4GB).
> 
> For those interested in profiler info, look here:
> 
> http://www.getopt.org/lucene/20051202/

Thanks for making this available.
I had a short look and noticed that the times in the call tree are cumulative.
Would it be possible to record cpu time per method by subtracting the
times of the called methods?
Another interesting statistic is the actual numbers of times the methods get
called. The scorers used by BooleanScorer2 can use rather deep
call trees for scoring. These method calls are not free, and I would like
to know whether these form a bottleneck. 

From the hot spots output of the profiler info I see that the following methods:
- PhrasePositions.nextPosition
- SegmentTermDocs.read
have a larger portion of cpu time spent in the method.
Looking at the expanded query, seeing PhrasePositions.nextPosition
here is not a surprise.

The fact that SegmentTermDocs.read shows up might explain
the reason why only a little heap is used:  Lucene normally leaves
the file buffering to the operating system, and
when the file buffers are read the index data is decompressed
mostly by the readVInt method.

Regards,
Paul Elschot




Re: Lucene performance bottlenecks

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> For a simple TermQuery, if the DF(term) is above 10%, the response 
>> time from IndexSearcher.search() is around 400ms (repeatable, after 
>> warm-up). For such complex phrase queries the response time is around 
>> 1 sec or more (again, after warm-up).
>
>
> Are you specifying -server to the JVM?


Yes.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Lucene performance bottlenecks

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> For a simple TermQuery, if the DF(term) is above 10%, the response time 
> from IndexSearcher.search() is around 400ms (repeatable, after warm-up). 
> For such complex phrase queries the response time is around 1 sec or 
> more (again, after warm-up).

Are you specifying -server to the JVM?

> I tracked 
> the bottleneck to the scorers (see my previous email on this), but also 
> to IndexInput.readVInt.. 

It might be interesting to benchmark GCJ-compiled Lucene, since 
IndexInput.readVInt is highly optimized there.

> What's even more curious, most of the heap is 
> unused - I had the impression that Lucene tries to read as much of the 
> index as it can to memory in order to speed up the access, but 
> apparently that's not the case. The heap consumption was always in the 
> order of 100-200MB, no matter how large heap I set (and I tried values 
> between 1-4GB).

Lucene relies on the kernel to cache I/O, outside of the Java heap.

Doug
