
Posted to solr-user@lucene.apache.org by Fuad Efendi <fu...@efendi.ca> on 2017/02/21 02:28:37 UTC

CPU Intensive Scoring Alternatives

Hello,


Default TF-IDF performs poorly with 200 million indexed documents.
The query "Michael Jackson" may run for 300 ms, and "Michael The Jackson"
for over 3 seconds, using eDisMax. Because the default operator is "OR" and
"The" is a stopword, we get 50-70 million documents as a query result, and
scoring is CPU intensive. What to do? Our typical queries return over a
million documents, and response times of simple queries range from 50
milliseconds to 5-10 seconds, depending on the size of the result set.

This was just an exaggerated example with the stopword “the”, but even the
simplest query, “Michael Jackson”, runs 300 ms instead of 3 ms just because
of the huge number of hits and TF-IDF calculations. Solr 6.3.


Thanks,

--

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems

Re: CPU Intensive Scoring Alternatives

Posted by Fuad Efendi <fu...@efendi.ca>.

Walter, I use BM25, which is the default for Solr 6.3, and I clearly saw in
the Solr logs a correlation between the number of hits and response times;
it is almost linear. This is on an underloaded system.
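
A rough way to check that relationship is to pull the hit count and query time out of the request log and compute a correlation, along with the median and 95th percentile of the query times. This is only a minimal sketch: it assumes each request line carries "hits=<n>" and "QTime=<ms>" fields, and the sample lines and function names are made up for illustration, not taken from the system described here.

```python
import re
import statistics

# Assumed log format: each request line carries "hits=<n>" and "QTime=<ms>".
LINE_RE = re.compile(r"hits=(\d+).*?QTime=(\d+)")


def parse_log(lines):
    """Return (hits, qtime_ms) pairs for the lines that contain both fields."""
    pairs = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summarize(pairs):
    """Median, 95th percentile, and hits/QTime correlation for parsed pairs."""
    hits = [h for h, _ in pairs]
    qtimes = [q for _, q in pairs]
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    cuts = statistics.quantiles(qtimes, n=20, method="inclusive")
    return {
        "median_ms": statistics.median(qtimes),
        "p95_ms": cuts[18],
        "hits_qtime_correlation": pearson(hits, qtimes),
    }


# Hypothetical sample lines, not real measurements.
SAMPLE = [
    "path=/select params={q=a} hits=1000000 status=0 QTime=60",
    "path=/select params={q=b} hits=2000000 status=0 QTime=110",
    "path=/select params={q=c} hits=5000000 status=0 QTime=260",
    "path=/select params={q=d} hits=10000000 status=0 QTime=510",
    "path=/select params={q=e} hits=50000000 status=0 QTime=2510",
]

print(summarize(parse_log(SAMPLE)))
```

A correlation close to 1 on real log data would support the "almost linear" observation; it would not by itself say whether scoring or posting-list traversal is where the time goes.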

With “solrmeter” at 10 requests per second, CPU goes to 400% on a 12-core
hyperthreaded machine, and at 20 requests per second it goes to 1100%.
No issues with GC. Java 8 update 121 from Oracle, 64-bit. 20 requests per
second to Solr 6, kidding? I never expected that for the simplest queries.

Doug, I have never been able to make the “mm” parameter work for me; I
cannot understand how it works. I use eDisMax and a few “text_general”
fields, with the Solr default operator “OR” and the default “mm” (which
should be “1” for “OR”).
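
For what it is worth, the idea behind “mm” (minimum should match) can be modelled in a few lines: count how many of the optional query terms a document contains and keep the document only if the count reaches the threshold. The sketch below is a toy model over in-memory sets, not Solr code. It handles only plain integers and percentages (rounded down), ignores conditional specs such as "2<75%", and the documents and function names are invented for illustration.

```python
def min_should_match(mm, optional_clauses):
    """Resolve a simple mm spec to a required clause count.

    Handles plain integers ("2", "-1") and percentages ("75%", "-25%").
    Conditional specs such as "2<75%" are out of scope for this sketch.
    """
    spec = str(mm).strip()
    if spec.endswith("%"):
        pct = int(spec[:-1])
        if pct < 0:
            # Negative percentage: that share of the clauses may be missing.
            needed = optional_clauses - (optional_clauses * -pct) // 100
        else:
            needed = (optional_clauses * pct) // 100
    else:
        n = int(spec)
        # Negative integer: that many clauses may be missing.
        needed = optional_clauses + n if n < 0 else n
    return max(0, min(needed, optional_clauses))


def search(docs, query_terms, mm):
    """Return sorted ids of docs containing at least mm of the query terms."""
    # A query made only of optional terms still needs one matching term.
    needed = max(1, min_should_match(mm, len(query_terms)))
    return sorted(
        doc_id
        for doc_id, terms in docs.items()
        if sum(1 for t in query_terms if t in terms) >= needed
    )


# Invented toy documents, each reduced to a set of terms.
DOCS = {
    1: {"michael", "jackson"},
    2: {"the", "jackson", "five"},
    3: {"the", "weather"},
    4: {"michael", "the", "jackson"},
}
QUERY = ["michael", "the", "jackson"]

for mm in (1, 2, "100%"):
    print(mm, search(DOCS, QUERY, mm))
```

With a threshold of 1, every document containing “the” is a hit, which is the situation described in the original post; raising the threshold shrinks the candidate set before any scoring happens.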

From: Walter Underwood <wu...@wunderwood.org>
Reply: solr-user@lucene.apache.org
Date: February 21, 2017 at 5:24:23 PM
To: solr-user@lucene.apache.org
Subject: Re: CPU Intensive Scoring Alternatives

300 ms seems pretty good for 200 million documents. Is that average?
Median? 95th percentile?

Why are you sure it is because of the huge number of hits? That would be
unusual. The size of the posting lists is a more common cause.

Why do you think it is caused by tf.idf? That should be faster than BM25.

Does the host have enough RAM to hold most or all of the index in file
buffers?

What are the hit rates on your caches?

Are you using fuzzy matches? N-gram prefix matching? Phrase matching?
Shingles?

What version of Java are you running? What garbage collector?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)


Re: CPU Intensive Scoring Alternatives

Posted by Walter Underwood <wu...@wunderwood.org>.

300 ms seems pretty good for 200 million documents. Is that average? Median? 95th percentile?

Why are you sure it is because of the huge number of hits? That would be unusual. The size of the posting lists is a more common cause.

Why do you think it is caused by tf.idf? That should be faster than BM25.

Does the host have enough RAM to hold most or all of the index in file buffers?

What are the hit rates on your caches?

Are you using fuzzy matches? N-gram prefix matching? Phrase matching? Shingles?

What version of Java are you running? What garbage collector?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 21, 2017, at 10:42 AM, Doug Turnbull <dturnbull@opensourceconnections.com> wrote:
> 
> With that many documents, why not start with an AND search and reissue an
> OR query if there are no results? My strategy is to prefer an AND for large
> collections (or a higher mm than 1) and prefer closer to an OR for smaller
> collections.
> 
> -Doug

Re: CPU Intensive Scoring Alternatives

Posted by Doug Turnbull <dt...@opensourceconnections.com>.

With that many documents, why not start with an AND search and reissue an
OR query if there are no results? My strategy is to prefer an AND for large
collections (or a higher mm than 1) and prefer closer to an OR for smaller
collections.

-Doug
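
The AND-first strategy described above could be sketched on the client side roughly as follows. The parameter names (defType, q, qf, q.op, rows) are standard Solr request parameters, but everything else is an assumption made for illustration: run_query stands for whatever function the application uses to send parameters to Solr and parse the JSON response, the field name "text" is hypothetical, and fake_run_query is a stand-in with invented hit counts.

```python
def search_with_fallback(run_query, user_query, qf="text"):
    """Try a strict AND query first; fall back to OR only if nothing matches.

    run_query is a caller-supplied function (hypothetical here) that takes a
    dict of Solr request parameters and returns the parsed JSON response.
    Returns the response together with the operator that produced it.
    """
    base = {"defType": "edismax", "q": user_query, "qf": qf, "rows": 10}

    strict = dict(base, **{"q.op": "AND"})
    response = run_query(strict)
    if response["response"]["numFound"] > 0:
        return response, "AND"

    # Nothing matched every term, so relax to OR and reissue the query.
    relaxed = dict(base, **{"q.op": "OR"})
    return run_query(relaxed), "OR"


def fake_run_query(params):
    """Stand-in for a real Solr call, with invented hit counts."""
    if params["q.op"] == "AND":
        found = 0 if "zzzz" in params["q"] else 1200
    else:
        found = 60000000
    return {"response": {"numFound": found, "docs": []}}


for q in ("Michael The Jackson", "Michael zzzz Jackson"):
    response, op = search_with_fallback(fake_run_query, q)
    print(q, "->", op, response["response"]["numFound"])
```

The second request is only paid for when the strict query comes back empty, so the common case scores the smaller AND result set.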

On Tue, Feb 21, 2017 at 1:39 PM Fuad Efendi <fu...@efendi.ca> wrote:

> Thank you Ahmet, I will try it; sounds reasonable

Re: CPU Intensive Scoring Alternatives

Posted by Fuad Efendi <fu...@efendi.ca>.

Thank you Ahmet, I will try it; sounds reasonable

From: Ahmet Arslan <io...@yahoo.com.invalid>
Reply: solr-user@lucene.apache.org, Ahmet Arslan <io...@yahoo.com>
Date: February 21, 2017 at 3:02:11 AM
To: solr-user@lucene.apache.org
Subject: Re: CPU Intensive Scoring Alternatives

Hi,

The new default similarity is BM25.
Maybe explicitly set the similarity to TF-IDF and see how it goes?

Ahmet


Re: CPU Intensive Scoring Alternatives

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi,

The new default similarity is BM25.
Maybe explicitly set the similarity to TF-IDF and see how it goes?

Ahmet
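
To give a feel for what the two similarities compute per matching document, here is a simplified sketch of the per-term weights. It follows the general shape of the classic TF-IDF and BM25 formulas (square-root term frequency with a logarithmic IDF, and BM25 with the usual k1 = 1.2 and b = 0.75), but it leaves out norms, boosts, and query normalization, so it is an illustration rather than a reproduction of Lucene's scoring. The document frequencies below are invented; only the 200 million collection size comes from this thread.

```python
import math


def classic_tf_idf(freq, doc_freq, doc_count):
    """Simplified classic TF-IDF weight for one term in one document.

    tf = sqrt(freq), idf = 1 + ln((doc_count + 1) / (doc_freq + 1)).
    Norms, boosts, and query normalization are deliberately left out.
    """
    tf = math.sqrt(freq)
    idf = 1.0 + math.log((doc_count + 1) / (doc_freq + 1))
    return tf * idf


def bm25(freq, doc_freq, doc_count, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Simplified BM25 weight for one term in one document."""
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * freq * (k1 + 1.0) / (freq + k1 * length_norm)


DOC_COUNT = 200_000_000  # collection size mentioned in the thread

# Invented document frequencies for a rare term and a very common one.
for label, doc_freq in (("rare term", 40_000), ("common term", 60_000_000)):
    print(
        label,
        round(classic_tf_idf(1, doc_freq, DOC_COUNT), 3),
        round(bm25(1, doc_freq, DOC_COUNT, doc_len=100, avg_doc_len=100), 3),
    )
```

Either formula is a handful of arithmetic operations per term per matching document, so in both cases the total work grows with the number of documents that have to be visited; switching similarity is worth measuring, but the sketch gives no reason to expect an order-of-magnitude change.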

