
Posted to solr-user@lucene.apache.org by Haishan Chen <ha...@msn.com> on 2007/11/05 20:41:07 UTC

RE: Phrase Query Performance Question and score threshold

Hoss,
 
If I limit the documents returned based on a score threshold (filter by score), will that improve query performance? My intuition is that it won't, because the score still has to be calculated and then compared to the threshold.
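
The intuition about score thresholds can be sketched in a few lines. This is only a toy illustration in plain Python, not Solr or Lucene code: the `score` function (a simple term count) and `search_with_threshold` are hypothetical names invented for the sketch. It shows that a threshold only trims the result list after every matching document has already been scored.

```python
def score(doc_terms, query_terms):
    # Hypothetical toy score: total occurrences of the query terms in the document.
    return sum(doc_terms.count(t) for t in query_terms)


def search_with_threshold(docs, query, threshold):
    """Return (hits, scored_count); hits are (doc_id, score) pairs, best first."""
    query_terms = query.lower().split()
    scored_count = 0
    hits = []
    for doc_id, text in docs.items():
        terms = text.lower().split()
        if not any(t in terms for t in query_terms):
            continue  # non-matching documents are skipped without scoring
        s = score(terms, query_terms)  # the expensive step runs for every match
        scored_count += 1
        if s >= threshold:  # the threshold is applied only after scoring
            hits.append((doc_id, s))
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits, scored_count
```

In this sketch `scored_count` is the same whatever threshold is passed, so the threshold makes the result set smaller but does not reduce the scoring work.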
 
I know it may not be meaningful to do so, based on the following explanation:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
 
But it might work for me because the documents are not "natural language" but are constructed following certain rules. If I really want to try this, can you offer advice on the best way to implement a score threshold in SOLR with minimum overhead?
 
I would appreciate it if anyone can help.
 
Thank you
Haishan
 
 
 



> Date: Fri, 2 Nov 2007 12:31:29 -0700
> From: hossman_lucene@fucit.org
> To: solr-user@lucene.apache.org
> Subject: Re: Phrase Query Performance Question
>
> : It still feels to me that you are trying doing something unique with your
> : phrase queries. Unfortunately, you still haven't said what you are trying to
> : do in general terms, which makes it very difficult for people to help you.
>
> Agreed. This seems very special case, but we don't know what the case is.
>
> If there are specific phrases you know in advance that you will care
> about, and those phrases occur as frequently as the individual
> "words", then the best way to deal with them is to index each "phrase" as
> a single Term (and ignore the individual words)
>
> Speaking more generally to mike's point...
>
> http://people.apache.org/~hossman/#xyproblem
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue. Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> -Hoss

Re: Phrase Query Performance Question and score threshold

Posted by Yonik Seeley <yo...@apache.org>.

On 11/5/07, Haishan Chen <ha...@msn.com> wrote:
> As for the first issue: I have found about 10 different phrase queries with performance issues so far.

If these are normal phrase queries (no slop), a good solution might be
to simply index and query these phrases as a single token.  One could
do this with a SynonymFilter.
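
As a rough illustration of the single-token idea, here is a minimal sketch in plain Python. It does not use Solr or SynonymFilter; the phrase table `KNOWN_PHRASES` and the helpers `merge_phrases`, `build_index`, and `search_phrase` are hypothetical names made up for the example. The same merge runs at index time and at query time, so an exact phrase query becomes a single term lookup with no position checks.

```python
# Hypothetical phrase table; in Solr the mappings would come from configuration.
KNOWN_PHRASES = {
    ("new", "york", "city"): "new_york_city",
    ("new", "york"): "new_york",
}
MAX_LEN = max(len(p) for p in KNOWN_PHRASES)


def merge_phrases(tokens):
    """Greedily replace the longest known phrase at each position with one token."""
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in KNOWN_PHRASES:
                out.append(KNOWN_PHRASES[key])
                i += n
                break
        else:
            # No known phrase starts here; keep the word as-is.
            out.append(tokens[i])
            i += 1
    return out


def build_index(docs):
    """Build a toy inverted index: token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for tok in merge_phrases(text.lower().split()):
            index.setdefault(tok, set()).add(doc_id)
    return index


def search_phrase(index, phrase):
    """Look up a known phrase as a single term."""
    tokens = merge_phrases(phrase.lower().split())
    if len(tokens) != 1:
        raise ValueError("not a single known phrase: %r" % phrase)
    return index.get(tokens[0], set())
```

One trade-off of the greedy longest match in this sketch: a document containing "new york city" is indexed only under `new_york_city`, so a query for "new york" will not find it.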

Oh, and no, a score threshold won't help performance.

> I believe there will be a lot more that I just haven't tried.  It can be solved by using faster
> hardware, though.  Also, I believe it would help if SOLR had a distributed search
> architecture similar to NUTCH's, so that it can scale out instead of scaling up.

It's coming...

-Yonik

RE: Phrase Query Performance Question and score threshold

Posted by Haishan Chen <ha...@msn.com>.

> Date: Mon, 5 Nov 2007 14:55:21 -0500
> From: yonik@apache.org
> To: solr-user@lucene.apache.org
> Subject: Re: Phrase Query Performance Question and score threshold
>
> On 11/5/07, Haishan Chen <ha...@msn.com> wrote:
> > If I limit the documents returned based on a score threshold (filter by score) will it be able to improve query performance?
>
> No.
>
> Taking a different approach can really speed up queries though.
> To figure out what approach you should take, we need to know what you
> are trying to do.
> As Hoss said: http://people.apache.org/~hossman/#xyproblem
>
> How many different phrase queries are you having performance issues with?
>
> -Yonik

Thanks for replying, Yonik.

Out of strong curiosity, I was trying to implement a search application that my colleague has already built very successfully. I tried to use SOLR to build the same application and see if it works. Basically, there are millions of documents. They are categorized, and the content of each document is constructed by a program using its category as input. A search application searches the content and brings up the document. The way of constructing the documents has proven to be excellent in terms of relevancy. Of course, it relies on using slop phrase queries. Now I want to build something that is able to search the content and bring up the documents fast. That is basically what I want to do.

I can't go into any more detail on how the document content was constructed because the company I work for has a patent pending on it, so I dare not discuss it in public. But the way it was constructed seems to be the reason why document frequency is so high (for many phrases) and why a search usually brings up a large result set. The top-scoring documents have very good relevancy, though. So I am facing two issues: the first is to make the slop phrase queries faster, and the second is to make the result set smaller.

Using a score threshold may solve the second issue. It would be great if you could point me to how to achieve that.

As for the first issue: I have found about 10 different phrase queries with performance issues so far. I believe there will be a lot more that I just haven't tried. It can be solved by using faster hardware, though. Also, I believe it would help if SOLR had a distributed search architecture similar to NUTCH's, so that it can scale out instead of scaling up.

Thanks a lot
Haishan

Re: Phrase Query Performance Question and score threshold

Posted by Yonik Seeley <yo...@apache.org>.

On 11/5/07, Haishan Chen <ha...@msn.com> wrote:
> If I limit the documents returned based on a score threshold (filter by score) will it be able to improve query performance?

No.

Taking a different approach can really speed up queries though.
To figure out what approach you should take, we need to know what you
are trying to do.
As Hoss said: http://people.apache.org/~hossman/#xyproblem


How many different phrase queries are you having performance issues with?

-Yonik