
Posted to solr-user@lucene.apache.org by Bryan Loofbourrow <bl...@knowledgemosaic.com> on 2012/02/17 04:06:30 UTC

Improving proximity search performance

Here’s my use case. I expect to set up a Solr index that is approximately
1.4GB. That is a real number from the proof-of-concept using the real data,
which consists of about 10 million documents, many of significant size. The
index makes use of the FastVectorHighlighter to do highlighting on the body
text field, which is of course stored, with termVectors, termPositions, and
termOffsets on.



I no longer have the proof-of-concept Solr core available (our live site
uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical
answer to this question: Will storing that extra information about the
location of terms help the performance of proximity searches?



A significant and important subset of my users make extensive use of
proximity searches. These sophisticated users have found that they are best
able to locate what they want by searching for THISWORD within 5
words of THATWORD, or much more sophisticated variants on that theme,
including plenty of booleans and wildcards. The problem I’m facing is
performance. Some of these searches, when common words are used, can take
many minutes, even with the index on an SSD.
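
As a toy illustration of what such a search has to do, the sketch below
checks whether two words occur within a given number of positions of each
other in one document. It is not how Lucene or Solr implements proximity
matching; the function name `within` and the "difference in word positions"
definition of distance are assumptions made for this example. It does
suggest why common words hurt: the more positions each word has, the more
position comparisons a proximity match requires.

```python
from bisect import bisect_left


def within(text, first, second, distance):
    """Return True if `first` and `second` occur at most `distance`
    word positions apart in `text`, in either order (toy definition)."""
    # Build a tiny positional index: token -> sorted list of positions.
    positions = {}
    for i, token in enumerate(text.lower().split()):
        positions.setdefault(token, []).append(i)

    a = positions.get(first.lower(), [])
    b = positions.get(second.lower(), [])

    # For each position of `first`, binary-search the nearest candidate
    # position of `second` that is not more than `distance` to its left.
    for p in a:
        k = bisect_left(b, p - distance)
        if k < len(b) and b[k] <= p + distance:
            return True
    return False


doc = "the quick brown fox jumps over the lazy dog"
print(within(doc, "quick", "lazy", 5))
print(within(doc, "quick", "lazy", 6))
```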



The question is how to improve the performance. It occurred to me that all
of that term vector information, stored for the benefit of the
FastVectorHighlighter, might significantly aid the performance of these
searches.



First question: is that already the case? Will storing this extra
information automatically improve my proximity search performance?



Second question: If not, I’m very willing to dive into the code and come up
with a patch that would do this. Can someone with knowledge of the
internals comment on whether this is a plausible strategy for improving
performance, and, if so, give tips about the outlines of what a successful
approach to the problem might look like?



Third question: Any tips in general for improving the performance of these
proximity searches? I have explored the question of whether the customers
might be weaned off of them, and that does not appear to be an option.



Thanks,



-- Bryan Loofbourrow

Re: Improving proximity search performance

Posted by Jack Krupansky <ja...@basetechnology.com>.

Add &debugQuery=true to your query request and look at the timings for the 
various search components. That should be the first step in figuring out 
where to focus your attention for performance improvement.
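
A minimal sketch of that first step, under stated assumptions: the base URL
(`http://localhost:8983/solr/select`), the field name `body`, and the helper
names are hypothetical, and the sample response only approximates the shape
of the `debug.timing` section in Solr's JSON output, which may differ between
versions. The query uses the standard Lucene phrase-slop syntax,
`field:"a b"~N`.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, field, word_a, word_b, slop):
    """Build a select URL for a proximity query with debugQuery=true."""
    # Phrase-slop syntax: field:"word_a word_b"~slop
    query = f'{field}:"{word_a} {word_b}"~{slop}'
    params = {"q": query, "debugQuery": "true", "wt": "json"}
    return f"{base_url}?{urlencode(params)}"


def component_timings(response):
    """Return (phase, component, milliseconds) tuples, slowest first.

    Assumes the parsed JSON has debug -> timing -> prepare/process, each
    mapping component names to {"time": ...}."""
    timing = response.get("debug", {}).get("timing", {})
    rows = []
    for phase in ("prepare", "process"):
        for name, value in timing.get(phase, {}).items():
            if isinstance(value, dict):  # skip the phase-level "time" total
                rows.append((phase, name, value.get("time", 0.0)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hand-written sample, not real measurements.
SAMPLE = {
    "debug": {
        "timing": {
            "time": 991.0,
            "prepare": {"time": 1.0, "query": {"time": 1.0}},
            "process": {
                "time": 990.0,
                "query": {"time": 950.0},
                "highlight": {"time": 40.0},
            },
        }
    }
}

url = build_debug_url("http://localhost:8983/solr/select",
                      "body", "thisword", "thatword", 5)
print(url)
for phase, name, millis in component_timings(SAMPLE):
    print(f"{phase:8s} {name:10s} {millis:8.1f} ms")
```

If most of the time shows up under the query component rather than under
highlighting, the proximity matching itself is the place to look.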

-- Jack Krupansky

-----Original Message----- 
From: 蒋明原
Sent: Saturday, September 15, 2012 6:27 AM
To: solr-user@lucene.apache.org
Subject: RE: Improving proximity search performance

I have the same problem. Did you find a good solution? I hope you can share
it. Thanks.
On 2012-2-18 at 8:52 AM, "Bryan Loofbourrow" <bl...@knowledgemosaic.com> wrote:

> Apologies. I meant to type “1.4 TB” and somehow typed “1.4 GB.” Little
> wonder that no one thought the question was interesting, or figured I must
> be using Sneakernet to run my searches.
>
>
>
> -- Bryan Loofbourrow
>
>
>   ------------------------------
>
> *From:* Bryan Loofbourrow [mailto:bloofbourrow@knowledgemosaic.com]
> *Sent:* Thursday, February 16, 2012 7:07 PM
> *To:* 'solr-user@lucene.apache.org'
> *Subject:* Improving proximity search performance
>
>
>

RE: Improving proximity search performance

Posted by 蒋明原 <ma...@gmail.com>.

I have the same problem. Did you find a good solution? I hope you can share
it. Thanks.
On 2012-2-18 at 8:52 AM, "Bryan Loofbourrow" <bl...@knowledgemosaic.com> wrote:

> Apologies. I meant to type “1.4 TB” and somehow typed “1.4 GB.” Little
> wonder that no one thought the question was interesting, or figured I must
> be using Sneakernet to run my searches.
>
>
>
> -- Bryan Loofbourrow
>
>
>   ------------------------------
>
> *From:* Bryan Loofbourrow [mailto:bloofbourrow@knowledgemosaic.com]
> *Sent:* Thursday, February 16, 2012 7:07 PM
> *To:* 'solr-user@lucene.apache.org'
> *Subject:* Improving proximity search performance
>
>
>

RE: Improving proximity search performance

Posted by Bryan Loofbourrow <bl...@knowledgemosaic.com>.

Apologies. I meant to type “1.4 TB” and somehow typed “1.4 GB.” Little
wonder that no one thought the question was interesting, or figured I must
be using Sneakernet to run my searches.



-- Bryan Loofbourrow


  ------------------------------

*From:* Bryan Loofbourrow [mailto:bloofbourrow@knowledgemosaic.com]
*Sent:* Thursday, February 16, 2012 7:07 PM
*To:* 'solr-user@lucene.apache.org'
*Subject:* Improving proximity search performance


