Posted to dev@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/11/22 12:49:45 UTC
Performance issues with ConjunctionScorer
Hi,
I've been profiling a Nutch installation, and to my surprise the largest
amount of throwaway allocations and the most time spent was not in Nutch
specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
This method operates on a LinkedList, which seems to be a huge
bottleneck. Perhaps it would be possible to replace LinkedList with a table?
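To make the suggestion concrete, here is a rough, hypothetical sketch in plain Java (not the actual Lucene code; the names Cursor and conjunction are illustrative only) of how a doNext()-style conjunction loop can run over a plain array of cursors instead of a LinkedList, avoiding per-pass iterator allocations:

```java
import java.util.Arrays;

public class ConjunctionSketch {
    /** Minimal stand-in for a scorer's cursor over a sorted posting list. */
    static final class Cursor {
        final int[] docs; int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return pos < docs.length ? docs[pos] : Integer.MAX_VALUE; }
        /** Advance to the first doc id >= target. */
        void skipTo(int target) { while (doc() < target) pos++; }
    }

    /** Collect all doc ids present in every posting list. */
    public static int[] conjunction(int[][] postings) {
        if (postings.length == 0) return new int[0];
        Cursor[] cursors = new Cursor[postings.length];
        for (int i = 0; i < postings.length; i++) cursors[i] = new Cursor(postings[i]);
        java.util.List<Integer> hits = new java.util.ArrayList<Integer>();
        int target = 0;
        while (target != Integer.MAX_VALUE) {
            // The doNext()-style step: march every cursor up to the current
            // target, raising the target whenever a cursor overshoots it.
            int max = target;
            for (Cursor c : cursors) { c.skipTo(max); max = Math.max(max, c.doc()); }
            if (max == Integer.MAX_VALUE) break;      // some list is exhausted
            boolean allMatch = true;
            for (Cursor c : cursors) if (c.doc() != max) { allMatch = false; break; }
            if (allMatch) { hits.add(max); max++; }   // record the match, move past it
            target = max;
        }
        int[] out = new int[hits.size()];
        for (int i = 0; i < out.length; i++) out[i] = hits.get(i);
        return out;
    }

    public static void main(String[] args) {
        int[] r = conjunction(new int[][] {{1,3,5,7,9}, {3,4,5,9}, {2,3,5,9,11}});
        System.out.println(Arrays.toString(r)); // prints [3, 5, 9]
    }
}
```

The array version walks the sub-scorers in place; nothing is allocated per advance, which is exactly the throwaway-allocation pattern the profile showed for the LinkedList-based loop.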
Nutch Summarizer also needlessly re-tokenizes the text over and over
again - perhaps it would be better to save already tokenized text in
parse_text, instead of the raw plain text? After all, the only use for
that text is to index it and then build the summaries.
Please see the profiles here:
http://www.getopt.org/nutch/profile/index.html
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Performance issues with ConjunctionScorer
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Further input into this: after replacing the ConjunctionScorer with the
> fixed version from JIRA, now the bottleneck seems to be ... in
> Summarizer, of all things. :-)
While making the summarizer faster would of course be good, keep in mind
that the cost of summarizing ten hits is constant as the size of the
collection grows. In search running on a single node, ten summaries are
computed per query. On ten nodes, one summary is computed per query.
On 100 nodes, one summary is computed per ten queries.
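The arithmetic behind that can be sketched in a few lines, assuming hits (and thus summary work) spread evenly across the search nodes:

```java
// Back-of-the-envelope check of the scaling argument above: if the ten
// hits shown per query are spread evenly over N search nodes, each node
// computes 10/N summaries per query on average.
public class SummaryScaling {
    public static double summariesPerNodePerQuery(int hitsShown, int nodes) {
        return (double) hitsShown / nodes;
    }
    public static void main(String[] args) {
        System.out.println(summariesPerNodePerQuery(10, 1));   // prints 10.0
        System.out.println(summariesPerNodePerQuery(10, 10));  // prints 1.0
        System.out.println(summariesPerNodePerQuery(10, 100)); // prints 0.1 (one per ten queries)
    }
}
```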
Also note that we must save the raw text in order to form the text
snippets of the summary. So we might store the token stream, but I
think we'd still have to store the raw text too.
Doug
Re: Performance issues with ConjunctionScorer
Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki wrote:
> Hi,
>
> I've been profiling a Nutch installation, and to my surprise the
> largest amount of throwaway allocations and the most time spent was
> not in Nutch specific code, or IPC, but in Lucene
> ConjunctionScorer.doNext() method. This method operates on a
> LinkedList, which seems to be a huge bottleneck. Perhaps it would be
> possible to replace LinkedList with a table?
>
> Nutch Summarizer also needlessly re-tokenizes the text over and over
> again - perhaps it would be better to save already tokenized text in
> parse_text, instead of the raw plain text? After all, the only use for
> that text is to index it and then build the summaries.
>
> Please see the profiles here:
>
> http://www.getopt.org/nutch/profile/index.html
>
Further input into this: after replacing the ConjunctionScorer with the
fixed version from JIRA, now the bottleneck seems to be ... in
Summarizer, of all things. :-)
I'm loading the DistributedSearch$Server to 100% CPU, and then the split
is as follows:
* 82% NutchBean.getSummary() -> Summarizer.getSummary() -> getTokens()
-> 65% NutchDocumentTokenizer.next()
* 14% NutchBean.search()
* 2% IPC
which is slightly ridiculous... I think this makes a good case for
storing pre-tokenized text in segments.
Regarding the allocation hot spots, we have the following top entries:
* 19.1% - 22,109 kB - 535,903 alloc.
org.apache.lucene.index.TermBuffer.toTerm
* 38.8% - 44,998 kB - 937,937 alloc.
org.apache.nutch.analysis.CommonGrams$Filter.next
-> 29.6% - 34,380 kB - 717,713 alloc.
org.apache.nutch.analysis.NutchDocumentTokenizer.next
* 13.8% - 15,989 kB - 12 alloc. org.apache.lucene.index.SegmentReader.norms
It seems that Nutch is uselessly re-tokenizing a lot of stuff - at this
stage we shouldn't need any re-tokenization except for the user query...
so I would argue that these parts should be redesigned to store and
retrieve pre-tokenized values.
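A minimal sketch of what "storing pre-tokenized values" could look like, in plain Java with illustrative names (none of this is existing Nutch API): serialize each token's term and character offsets once at parse time, then deserialize at summary time instead of re-running the tokenizer.

```java
import java.io.*;
import java.util.*;

public class PreTokenizedText {
    /** One token: its term plus character offsets into the raw text. */
    public static final class Tok {
        public final String term;
        public final int start, end;
        public Tok(String term, int start, int end) { this.term = term; this.start = start; this.end = end; }
    }

    /** Serialize the token stream once, at parse time. */
    public static byte[] write(List<Tok> toks) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(toks.size());
            for (Tok t : toks) { out.writeUTF(t.term); out.writeInt(t.start); out.writeInt(t.end); }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    /** Read the tokens back at summary time - no analyzer involved. */
    public static List<Tok> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int n = in.readInt();
            List<Tok> toks = new ArrayList<Tok>(n);
            for (int i = 0; i < n; i++) toks.add(new Tok(in.readUTF(), in.readInt(), in.readInt()));
            return toks;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(new Tok("hello", 0, 5), new Tok("world", 6, 11));
        List<Tok> back = read(write(toks));
        System.out.println(back.size() + " tokens, first=" + back.get(0).term); // prints "2 tokens, first=hello"
    }
}
```

The raw text would still need to be stored alongside this (the snippets need the original characters); the stored offsets let the summarizer slice matching passages out of it without tokenizing again.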
Re: Performance issues with ConjunctionScorer
Posted by Piotr Kosiorowski <pk...@gmail.com>.
You are right - it is still not committed but the patch is here:
http://issues.apache.org/jira/browse/LUCENE-443.
During tests of my patch - it was very, very similar to this one - I saw up to
a 5% performance increase, but it will probably mainly result in nicer GC
behaviour.
Piotr
On 11/22/05, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Piotr Kosiorowski wrote:
>
> >On 11/22/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> >
> >
> >>Hi,
> >>
> >>I've been profiling a Nutch installation, and to my surprise the largest
> >>amount of throwaway allocations and the most time spent was not in Nutch
> >>specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
> >>This method operates on a LinkedList, which seems to be a huge
> >>bottleneck. Perhaps it would be possible to replace LinkedList with a
> >>table?
> >>
> >>
> >>
> >>
> >I had exactly the same findings some time ago and even replaced LinkedList
> >with a table and started to prepare the patch and summarize my findings, as
> >at the same time this subject was raised on the lucene mailing list with a
> >patch doing exactly the same thing. I cannot find the link to the thread
> >right now - but as far as I remember it is already committed in SVN trunk.
> >
> >
>
> Can't be - I'm working with the latest revision of Lucene from trunk/
>
Re: Performance issues with ConjunctionScorer
Posted by Andrzej Bialecki <ab...@getopt.org>.
Piotr Kosiorowski wrote:
>On 11/22/05, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>
>>Hi,
>>
>>I've been profiling a Nutch installation, and to my surprise the largest
>>amount of throwaway allocations and the most time spent was not in Nutch
>>specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
>>This method operates on a LinkedList, which seems to be a huge
>>bottleneck. Perhaps it would be possible to replace LinkedList with a
>>table?
>>
>>
>>
>>
>I had exactly the same findings some time ago and even replaced LinkedList
>with a table and started to prepare the patch and summarize my findings, as
>at the same time this subject was raised on the lucene mailing list with a
>patch doing exactly the same thing. I cannot find the link to the thread
>right now - but as far as I remember it is already committed in SVN trunk.
>
>
Can't be - I'm working with the latest revision of Lucene from trunk/
Re: Performance issues with ConjunctionScorer
Posted by Piotr Kosiorowski <pk...@gmail.com>.
On 11/22/05, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Hi,
>
> I've been profiling a Nutch installation, and to my surprise the largest
> amount of throwaway allocations and the most time spent was not in Nutch
> specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
> This method operates on a LinkedList, which seems to be a huge
> bottleneck. Perhaps it would be possible to replace LinkedList with a
> table?
>
>
I had exactly the same findings some time ago and even replaced LinkedList
with a table and started to prepare the patch and summarize my findings, as at
the same time this subject was raised on the lucene mailing list with a patch
doing exactly the same thing. I cannot find the link to the thread right now -
but as far as I remember it is already committed in SVN trunk.
Regards
Piotr
Re: Performance issues with ConjunctionScorer
Posted by mark harwood <ma...@yahoo.co.uk>.
The Highlighter in the lucene "contrib" section has a
class called TokenSources which tries to find the best
way of getting a TokenStream.
It can build a TokenStream from either:
a) an Analyzer
b) TermPositionVector (if the field was created with
one in the index)
You may find that using TermPositionVectors in your
index gives you a speed up but it all depends on the
cost of processing done by your analyzer. Using
TermPositionVector incurs extra data reads to get the
list of tokens from disk whereas using Analyzer is
extra CPU load processing the document text you've
already read from disk.
Both approaches typically need to read the original
document text when highlighting in order to retain the
stop words that make it readable.
I have noticed before now that the StandardAnalyzer
is quite slow while other Analyzers are much
quicker, so it really depends on your choice.
Cheers
Mark
Re: Performance issues with ConjunctionScorer
Posted by Stefan Groschupf <sg...@media-style.com>.
Andrzej,
very interesting!!!
> Nutch Summarizer also needlessly re-tokenizes the text over and
> over again - perhaps it would be better to save already tokenized
> text in parse_text, instead of the raw plain text? After all, the
> only use for that text is to index it and then build the summaries.
Sounds like a good improvement suggestion.
Do you think it would require a lot of changes?
Stefan