Posted to solr-user@lucene.apache.org by jchen2000 <jc...@yahoo.com> on 2012/11/10 02:00:47 UTC

customize solr search/scoring for performance

Hi,

We have 20 million short docs (about 60 terms, less than 1 KB in total
each) on each box, and we want to rank results solely by how many terms
matched. In particular, we are only interested in the top N results with the
best scores (N being small, say 5).

With some help from the forum users (thanks to Otis), we chose to use
edismax with mm set appropriately (something like 80% or 85%, as we wanted
reasonable recall). Recall seems good, but performance is way off: query
times vary from 30 ms to 2 s, but we need 200-300 ms for 99% of searches.
Since our search requirement is really straightforward, we don't need tf,
idf, positions, etc., nor do we need fancy tokenizers, since our terms are
all pre-processed. In addition, we don't need to compute scores or sort
over a large doc set, as long as we know the top N docs that have the most
terms matched.

Any advice on how to customize the process to make it faster? What could be
the potential perf bottlenecks (searching the index, scoring, or sorting)?
Could this be done with a plugin, or do we need deeper hacking?

Some facts:
1) The machines we use are good, so adding hardware is not a solution.
2) dismax doesn't seem to work, but edismax does (I thought dismax could
have an edge in perf, but I couldn't get it to run).
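
The ranking described above (count matched terms, require a minimum number
of matches, keep only the best N) can be sketched in plain Java as follows.
This is only an illustration of the intended semantics over in-memory term
sets, not Solr or Lucene code; the class OverlapTopN, its Hit type, and the
minMatch parameter (standing in for mm) are hypothetical names. A bounded
min-heap keeps just N candidates, so no full sort of all matching docs is
needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

public class OverlapTopN {

    /** A hit: a document id plus the number of query terms it matched. */
    public static final class Hit implements Comparable<Hit> {
        public final int doc;
        public final int overlap;

        public Hit(int doc, int overlap) {
            this.doc = doc;
            this.overlap = overlap;
        }

        // "Smaller" means worse: fewer matched terms, then higher doc id on ties.
        @Override
        public int compareTo(Hit other) {
            if (overlap != other.overlap) {
                return Integer.compare(overlap, other.overlap);
            }
            return Integer.compare(other.doc, doc);
        }
    }

    /**
     * Returns the best n documents, ranked by number of matched query terms.
     * docs.get(i) is the term set of document i; minMatch plays the role of mm.
     */
    public static List<Hit> topN(List<Set<String>> docs, Set<String> query,
                                 int minMatch, int n) {
        PriorityQueue<Hit> heap = new PriorityQueue<Hit>(); // worst kept hit on top
        for (int doc = 0; doc < docs.size(); doc++) {
            int overlap = 0;
            for (String term : query) {
                if (docs.get(doc).contains(term)) {
                    overlap++;
                }
            }
            if (overlap < minMatch) {
                continue; // fails the mm-style threshold
            }
            heap.add(new Hit(doc, overlap));
            if (heap.size() > n) {
                heap.poll(); // drop the current worst candidate
            }
        }
        List<Hit> best = new ArrayList<Hit>(heap);
        Collections.sort(best, Collections.reverseOrder()); // best first
        return best;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = new ArrayList<Set<String>>();
        docs.add(new HashSet<String>(Arrays.asList("a", "b", "c")));
        docs.add(new HashSet<String>(Arrays.asList("a", "b")));
        docs.add(new HashSet<String>(Arrays.asList("a", "b", "c", "d")));
        Set<String> query = new HashSet<String>(Arrays.asList("a", "b", "c"));
        for (Hit hit : topN(docs, query, 2, 2)) {
            System.out.println("doc " + hit.doc + " matched " + hit.overlap);
        }
    }
}
```

The sketch scans every document, which a real index avoids by walking
posting lists; it is meant only to pin down what "top N by matched terms"
means.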



--
View this message in context: http://lucene.472066.n3.nabble.com/customize-solr-search-scoring-for-performance-tp4019444.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: customize solr search/scoring for performance

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Jeremy,

The "what's expected" is not really possible to answer precisely without
seeing the cluster.  100 QPS may be a lot for a 1-core server with 4 GB RAM
and a 20M+ docs index, but would be a joke for a 32-core system with 96 GB
RAM, for example.

The coord reference still stands.
See:
http://search-lucene.com/?q=coord
http://search-lucene.com/jd/lucene/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#coord(int, int)
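
For readers unfamiliar with coord: it is the scoring factor that rewards a
document for matching more of the query's terms. A minimal sketch of the
idea, assuming the usual definition overlap / maxOverlap (the class name
CoordDemo is made up, and this is plain Java, not a Solr plugin):

```java
public class CoordDemo {

    /** Fraction of the query's terms that a document matched. */
    public static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // A document matching 51 of 60 query terms versus one matching 48.
        System.out.println(coord(51, 60));
        System.out.println(coord(48, 60));
    }
}
```

If every other scoring factor were held constant, ordering by this value
would be the same as ordering by the number of matched terms.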

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Sun, Nov 11, 2012 at 11:50 PM, jchen2000 <jc...@yahoo.com> wrote:

> Yes, we only need term overlap information to choose top candidates (we may
> incorporate boost factors for different terms later, but that's another
> story).
>
> We are quite new to Solr, so we haven't really profiled the process. Is
> there a rough guess of the expected latency in such cases? Our throughput
> is only around 100 QPS, so that might not be a significant factor here.
>
> Thanks,
>
> Jeremy

Re: customize solr search/scoring for performance

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Robert,

I also wonder why it always requests that the doclist be collected in order:
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1469
Do you think it makes sense to raise a JIRA to allow out-of-order
collecting?



On Tue, Nov 13, 2012 at 6:34 AM, Robert Muir <rc...@gmail.com> wrote:

> Whenever I look at solr users' stacktraces for disjunctions, I always
> notice they get BooleanScorer2.
>
> Is there some reason for this, or is it not intentional (e.g. maybe an
> in-order collector is always being used when it's possible, at least in
> simple cases, to allow for out-of-order hits)?
>
> When I examine test contributions from clover reports (e.g.
> https://builds.apache.org/job/Lucene-Solr-Clover-4.x/49/clover-report/),
> I notice that only lucene tests, and solr spellchecking tests actually
> hit BooleanScorer's collect. All other solr tests hit BooleanScorer2.
>
> If it's possible to allow for an out-of-order collector in some common
> cases (e.g. large disjunctions w/ minShouldMatch generated by Solr
> query parsers), it could be a nice performance improvement.
>
> On Mon, Nov 12, 2012 at 3:48 PM, jchen2000 <jc...@yahoo.com> wrote:
> > The following was generated with jvisualvm. It looks like much of the
> > time is spent in scoring. Any ideas or pointers on how to customize that
> > part?
> >
> > <http://lucene.472066.n3.nabble.com/file/n4019850/profilingResult.png>

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: customize solr search/scoring for performance

Posted by Robert Muir <rc...@gmail.com>.
Whenever I look at solr users' stacktraces for disjunctions, I always
notice they get BooleanScorer2.

Is there some reason for this, or is it not intentional (e.g. maybe an
in-order collector is always being used when it's possible, at least in
simple cases, to allow for out-of-order hits)?

When I examine test contributions from clover reports (e.g.
https://builds.apache.org/job/Lucene-Solr-Clover-4.x/49/clover-report/),
I notice that only lucene tests, and solr spellchecking tests actually
hit BooleanScorer's collect. All other solr tests hit BooleanScorer2.

If it's possible to allow for an out-of-order collector in some common
cases (e.g. large disjunctions w/ minShouldMatch generated by Solr
query parsers), it could be a nice performance improvement.
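
A rough sketch of why out-of-order collection can be cheaper for large
disjunctions: instead of merging all posting lists in doc-id order, matches
can be counted term by term into an array covering a window of doc ids, and
the window's hits are then handed to the collector in whatever order is
convenient. The code below is a simplified illustration of that general
technique using plain arrays. It is not Lucene's BooleanScorer code; the
class WindowedOverlap, the collect method, and the window size are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowedOverlap {

    static final int WINDOW = 8; // hypothetical window size, tiny for illustration

    /**
     * postings[t] is the ascending doc-id list for query term t, with all
     * doc ids in [0, maxDoc). Returns {doc, count} pairs for docs matching
     * at least minMatch terms. Hits within a window are emitted back to
     * front, so the output is not in doc-id order.
     */
    public static List<int[]> collect(int[][] postings, int maxDoc, int minMatch) {
        List<int[]> hits = new ArrayList<int[]>();
        int[] pos = new int[postings.length]; // read position per posting list
        int[] counts = new int[WINDOW];       // matched-term count per doc in window
        for (int base = 0; base < maxDoc; base += WINDOW) {
            int end = Math.min(base + WINDOW, maxDoc);
            Arrays.fill(counts, 0);
            // Term at a time: no heap merge across posting lists is needed.
            for (int t = 0; t < postings.length; t++) {
                while (pos[t] < postings[t].length && postings[t][pos[t]] < end) {
                    counts[postings[t][pos[t]] - base]++;
                    pos[t]++;
                }
            }
            // Emit the window's qualifying docs, highest doc id first.
            for (int i = end - base - 1; i >= 0; i--) {
                if (counts[i] >= minMatch) {
                    hits.add(new int[] {base + i, counts[i]});
                }
            }
        }
        return hits;
    }
}
```

A top-N collector that ranks purely by score (with a doc-id tie-break) gives
the same result whatever order hits arrive in, which is why in-order
delivery is not strictly required for this kind of query.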

On Mon, Nov 12, 2012 at 3:48 PM, jchen2000 <jc...@yahoo.com> wrote:
> The following was generated with jvisualvm. It looks like much of the time
> is spent in scoring. Any ideas or pointers on how to customize that part?
>
> <http://lucene.472066.n3.nabble.com/file/n4019850/profilingResult.png>

Re: customize solr search/scoring for performance

Posted by jchen2000 <jc...@yahoo.com>.
The following was generated with jvisualvm. It looks like much of the time is
spent in scoring. Any ideas or pointers on how to customize that part?

<http://lucene.472066.n3.nabble.com/file/n4019850/profilingResult.png> 



--
View this message in context: http://lucene.472066.n3.nabble.com/customize-solr-search-scoring-for-performance-tp4019444p4019850.html

Re: customize solr search/scoring for performance

Posted by jchen2000 <jc...@yahoo.com>.
Yes, we only need term overlap information to choose top candidates (we may
incorporate boost factors for different terms later, but that's another
story).

We are quite new to Solr, so we haven't really profiled the process. Is
there a rough guess of the expected latency in such cases? Our throughput
is only around 100 QPS, so that might not be a significant factor here.

Thanks,

Jeremy
  

--
View this message in context: http://lucene.472066.n3.nabble.com/customize-solr-search-scoring-for-performance-tp4019444p4019675.html

Re: customize solr search/scoring for performance

Posted by Otis Gospodnetic <ot...@gmail.com>.
Fuzzy "answer":
Can you verify that the bottleneck, especially in the slow cases, is indeed
scoring? With a profiler?
Not sure if the coord method in Similarity is still around... Are you saying
you need just term overlap for scoring/ordering?
20M small docs and 2 s queries on good hardware sounds suspicious... Do the
slow queries correspond to GC or something else?
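
One way to check the GC question is to compare time spent in garbage
collection against the elapsed time of a slow operation. The sketch below
uses the standard java.lang.management API; the class GcProbe and its
methods are hypothetical names. It only sees GC activity in the JVM it runs
in, so it would have to run inside the Solr process, and the GC figure is
JVM-wide, so under concurrent load it is only an approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {

    /** Total ms this JVM has spent in GC so far, summed over all collectors. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the task and returns {elapsedMillis, gcMillisDuringTask}. */
    public static long[] timeWithGc(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long elapsed = (System.nanoTime() - start) / 1000000L;
        return new long[] {elapsed, totalGcMillis() - gcBefore};
    }
}
```

If the GC share of the slow requests is large, the problem is more likely
heap sizing or allocation rate than scoring itself. Comparing slow-query
timestamps against the JVM's GC log gives the same information without any
code.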

Otis
--
Performance Monitoring - http://sematext.com/spm