Posted to dev@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/11/22 12:49:45 UTC
Performance issues with ConjunctionScorer
Hi,
I've been profiling a Nutch installation, and to my surprise the largest
amount of throwaway allocations and the most time spent was not in Nutch
specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
This method operates on a LinkedList, which seems to be a huge
bottleneck. Perhaps it would be possible to replace LinkedList with a table?
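To make the suggestion concrete, here is a rough, hypothetical sketch in plain Java (not the actual Lucene code; the names Cursor and conjunction are illustrative only) of how a doNext()-style conjunction loop can run over a plain array of cursors instead of a LinkedList, avoiding per-pass iterator allocations:

```java
import java.util.Arrays;

public class ConjunctionSketch {
    /** Minimal stand-in for a scorer's cursor over a sorted posting list. */
    static final class Cursor {
        final int[] docs; int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return pos < docs.length ? docs[pos] : Integer.MAX_VALUE; }
        /** Advance to the first doc id >= target. */
        void skipTo(int target) { while (doc() < target) pos++; }
    }

    /** Collect all doc ids present in every posting list. */
    public static int[] conjunction(int[][] postings) {
        if (postings.length == 0) return new int[0];
        Cursor[] cursors = new Cursor[postings.length];
        for (int i = 0; i < postings.length; i++) cursors[i] = new Cursor(postings[i]);
        java.util.List<Integer> hits = new java.util.ArrayList<Integer>();
        int target = 0;
        while (target != Integer.MAX_VALUE) {
            // The doNext()-style step: march every cursor up to the current
            // target, raising the target whenever a cursor overshoots it.
            int max = target;
            for (Cursor c : cursors) { c.skipTo(max); max = Math.max(max, c.doc()); }
            if (max == Integer.MAX_VALUE) break;      // some list is exhausted
            boolean allMatch = true;
            for (Cursor c : cursors) if (c.doc() != max) { allMatch = false; break; }
            if (allMatch) { hits.add(max); max++; }   // record the match, move past it
            target = max;
        }
        int[] out = new int[hits.size()];
        for (int i = 0; i < out.length; i++) out[i] = hits.get(i);
        return out;
    }

    public static void main(String[] args) {
        int[] r = conjunction(new int[][] {{1,3,5,7,9}, {3,4,5,9}, {2,3,5,9,11}});
        System.out.println(Arrays.toString(r)); // prints [3, 5, 9]
    }
}
```

The array version walks the sub-scorers in place; nothing is allocated per advance, which is exactly the throwaway-allocation pattern the profile showed for the LinkedList-based loop.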
Nutch Summarizer also needlessly re-tokenizes the text over and over
again - perhaps it would be better to save already tokenized text in
parse_text, instead of the raw plain text? After all, the only use for
that text is to index it and then build the summaries.
Please see the profiles here:
http://www.getopt.org/nutch/profile/index.html
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Performance issues with ConjunctionScorer
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Further input into this: after replacing the ConjunctionScorer with the
> fixed version from JIRA, now the bottleneck seems to be ... in
> Summarizer, of all things. :-)
While making the summarizer faster would of course be good, keep in mind
that the cost of summarizing ten hits is constant as the size of the
collection grows. In search running on a single node, ten summaries are
computed per query. On ten nodes, one summary is computed per query.
On 100 nodes, one summary is computed per ten queries.
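The arithmetic behind that can be sketched in a few lines, assuming hits (and thus summary work) spread evenly across the search nodes:

```java
// Back-of-the-envelope check of the scaling argument above: if the ten
// hits shown per query are spread evenly over N search nodes, each node
// computes 10/N summaries per query on average.
public class SummaryScaling {
    public static double summariesPerNodePerQuery(int hitsShown, int nodes) {
        return (double) hitsShown / nodes;
    }
    public static void main(String[] args) {
        System.out.println(summariesPerNodePerQuery(10, 1));   // prints 10.0
        System.out.println(summariesPerNodePerQuery(10, 10));  // prints 1.0
        System.out.println(summariesPerNodePerQuery(10, 100)); // prints 0.1 (one per ten queries)
    }
}
```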
Also note that we must save the raw text in order to form the text
snippets of the summary. So we might store the token stream, but I
think we'd still have to store the raw text too.
Doug
Re: Performance issues with ConjunctionScorer
Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki wrote:
> Hi,
>
> I've been profiling a Nutch installation, and to my surprise the
> largest amount of throwaway allocations and the most time spent was
> not in Nutch specific code, or IPC, but in Lucene
> ConjunctionScorer.doNext() method. This method operates on a
> LinkedList, which seems to be a huge bottleneck. Perhaps it would be
> possible to replace LinkedList with a table?
>
> Nutch Summarizer also needlessly re-tokenizes the text over and over
> again - perhaps it would be better to save already tokenized text in
> parse_text, instead of the raw plain text? After all, the only use for
> that text is to index it and then build the summaries.
>
> Please see the profiles here:
>
> http://www.getopt.org/nutch/profile/index.html
>
Further input into this: after replacing the ConjunctionScorer with the
fixed version from JIRA, now the bottleneck seems to be ... in
Summarizer, of all things. :-)
I'm loading the DistributedSearch$Server to 100% CPU, and then the split
is as follows:
* 82% NutchBean.getSummary() -> Summarizer.getSummary() -> getTokens()
-> 65% NutchDocumentTokenizer.next()
* 14% NutchBean.search()
* 2% IPC
which is slightly ridiculous... I think this makes a good case for
storing pre-tokenized text in segments.
Regarding the allocation hot spots, we have the following top entries:
* 19.1% - 22,109 kB - 535,903 alloc.
org.apache.lucene.index.TermBuffer.toTerm
* 38.8% - 44,998 kB - 937,937 alloc.
org.apache.nutch.analysis.CommonGrams$Filter.next
-> 29.6% - 34,380 kB - 717,713 alloc.
org.apache.nutch.analysis.NutchDocumentTokenizer.next
* 13.8% - 15,989 kB - 12 alloc. org.apache.lucene.index.SegmentReader.norms
It seems that Nutch is uselessly re-tokenizing a lot of stuff - at this
stage we shouldn't need any re-tokenization except for the user query...
so I would argue that these parts should be redesigned to store and
retrieve pre-tokenized values.
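A minimal sketch of what "storing pre-tokenized values" could look like, in plain Java with illustrative names (none of this is existing Nutch API): serialize each token's term and character offsets once at parse time, then deserialize at summary time instead of re-running the tokenizer.

```java
import java.io.*;
import java.util.*;

public class PreTokenizedText {
    /** One token: its term plus character offsets into the raw text. */
    public static final class Tok {
        public final String term;
        public final int start, end;
        public Tok(String term, int start, int end) { this.term = term; this.start = start; this.end = end; }
    }

    /** Serialize the token stream once, at parse time. */
    public static byte[] write(List<Tok> toks) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(toks.size());
            for (Tok t : toks) { out.writeUTF(t.term); out.writeInt(t.start); out.writeInt(t.end); }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    /** Read the tokens back at summary time - no analyzer involved. */
    public static List<Tok> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int n = in.readInt();
            List<Tok> toks = new ArrayList<Tok>(n);
            for (int i = 0; i < n; i++) toks.add(new Tok(in.readUTF(), in.readInt(), in.readInt()));
            return toks;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(new Tok("hello", 0, 5), new Tok("world", 6, 11));
        List<Tok> back = read(write(toks));
        System.out.println(back.size() + " tokens, first=" + back.get(0).term); // prints "2 tokens, first=hello"
    }
}
```

The raw text would still need to be stored alongside this (the snippets need the original characters); the stored offsets let the summarizer slice matching passages out of it without tokenizing again.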
Re: Performance issues with ConjunctionScorer
Posted by Piotr Kosiorowski <pk...@gmail.com>.
You are right - it is still not committed but the patch is here:
http://issues.apache.org/jira/browse/LUCENE-443.
During tests of my patch - it was very, very similar to this one - I saw up to
a 5% performance increase, but it will probably mainly result in nicer GC
behaviour.
Piotr
On 11/22/05, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Piotr Kosiorowski wrote:
>
> >On 11/22/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> >
> >
> >>Hi,
> >>
> >>I've been profiling a Nutch installation, and to my surprise the largest
> >>amount of throwaway allocations and the most time spent was not in Nutch
> >>specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
> >>This method operates on a LinkedList, which seems to be a huge
> >>bottleneck. Perhaps it would be possible to replace LinkedList with a
> >>table?
> >>
> >>
> >>
> >>
> >I had exactly the same findings some time ago and even replaced LinkedList
> >with a table and started to prepare the patch and summarize my findings, as
> >at the same time this subject was raised on the lucene mailing list with a
> >patch doing exactly the same thing. I cannot find the link to the thread
> >right now - but as far as I remember it is already committed in SVN trunk.
> >
> >
>
> Can't be - I'm working with the latest revision of Lucene from trunk/
>
Re: Performance issues with ConjunctionScorer
Posted by Andrzej Bialecki <ab...@getopt.org>.
Piotr Kosiorowski wrote:
>On 11/22/05, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>
>>Hi,
>>
>>I've been profiling a Nutch installation, and to my surprise the largest
>>amount of throwaway allocations and the most time spent was not in Nutch
>>specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
>>This method operates on a LinkedList, which seems to be a huge
>>bottleneck. Perhaps it would be possible to replace LinkedList with a
>>table?
>>
>>
>>
>>
>I had exactly the same findings some time ago and even replaced LinkedList
>with a table and started to prepare the patch and summarize my findings, as
>at the same time this subject was raised on the lucene mailing list with a
>patch doing exactly the same thing. I cannot find the link to the thread
>right now - but as far as I remember it is already committed in SVN trunk.
>
>
Can't be - I'm working with the latest revision of Lucene from trunk/
Re: Performance issues with ConjunctionScorer
Posted by Piotr Kosiorowski <pk...@gmail.com>.
On 11/22/05, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Hi,
>
> I've been profiling a Nutch installation, and to my surprise the largest
> amount of throwaway allocations and the most time spent was not in Nutch
> specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
> This method operates on a LinkedList, which seems to be a huge
> bottleneck. Perhaps it would be possible to replace LinkedList with a
> table?
>
>
I had exactly the same findings some time ago and even replaced LinkedList
with a table and started to prepare the patch and summarize my findings, as at
the same time this subject was raised on the lucene mailing list with a patch
doing exactly the same thing. I cannot find the link to the thread right now -
but as far as I remember it is already committed in SVN trunk.
Regards
Piotr
Re: Performance issues with ConjunctionScorer
Posted by mark harwood <ma...@yahoo.co.uk>.
The Highlighter in the lucene "contrib" section has a
class called TokenSources which tries to find the best
way of getting a TokenStream.
It can build a TokenStream from either:
a) an Analyzer
b) TermPositionVector (if the field was created with
one in the index)
You may find that using TermPositionVectors in your
index gives you a speed up but it all depends on the
cost of processing done by your analyzer. Using
TermPositionVector incurs extra data reads to get the
list of tokens from disk whereas using Analyzer is
extra CPU load processing the document text you've
already read from disk.
Both approaches typically need to read the original
document text when highlighting in order to retain the
stop words that make it readable.
I have noticed before now that the StandardAnalyzer
is quite slow while other Analyzers are much
quicker, so it really depends on your choice.
Cheers
Mark
Re: Performance issues with ConjunctionScorer
Posted by Stefan Groschupf <sg...@media-style.com>.
Andrzej,
very interesting!!!
> Nutch Summarizer also needlessly re-tokenizes the text over and
> over again - perhaps it would be better to save already tokenized
> text in parse_text, instead of the raw plain text? After all, the
> only use for that text is to index it and then build the summaries.
Sounds like a good improvement suggestion.
Do you think it would require a lot of changes?
Stefan