Posted to solr-user@lucene.apache.org by anuvenk <an...@hotmail.com> on 2008/11/23 04:53:20 UTC

Question about Query Phrase Slop (qs) in dismax

From the Solr wiki, it sounded like if qs is set to 5, for example, & if the
search term is 'child custody', only docs with 'child' & 'custody' within 5
words of one another would be returned in results. Is this correct? If so,
it doesn't seem to be working for me. I see docs with 'child' & 'custody'
more than 5 words apart (excluding stop words), which results in a bad
user experience because those docs are not very relevant. What more could I
do to improve the quality of the results?
-- 
View this message in context: http://www.nabble.com/Question-about-Query-Phrase-Slop-%28qs%29-in-dismax-tp20643003p20643003.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Please Help !! Question about Query Phrase Slop (qs) in dismax

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Re: Please Help !! Question about Query Phrase Slop (qs) in dismax
: 
: 
: Please help someone...i've been waiting for an answer for the last couple of
: days & no one seems to be helping out here. I did search the wiki & this

Please don't send messages like this.  

This is a volunteer community -- no one (that I know of) is paid to 
read/reply to questions on the solr-user list.  Many of us do our best to 
make sure that all user questions get addressed, but this is a fairly high 
volume list, and sometimes other things in life (work, health, 
relationships, family, etc...) make that take a little longer than we 
would like -- sometimes questions don't get answered for a few days, it's 
just the way it is, please be patient.  Sending multiple "please help, 
still no reply" type messages just adds noise to the list, and gives people 
who *do* want to help more to read, which means it takes that much longer 
to actually reply.

If you need an answer to a question in a hurry: read the archives and the 
docs, experiment, read the code (if you know java), or hire a consultant 
to help you figure it out.

In this specific case, debugQuery=true would have quickly shown you that 
your qs=5 value wasn't making its way into the "parsedquery" at all, 
which might have helped you understand what was happening.
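
A minimal sketch of that check, not taken from the thread: the Python below only builds a request URL with debugQuery=true so that the "parsedquery" entry in the response can be inspected. The host, the qf field names, and the use of defType=dismax to select the parser are assumptions; adjust them to your own setup.

```python
from urllib.parse import urlencode

def debug_query_url(base_url, user_query, qs, qf):
    """Build a dismax request URL that asks Solr to include debug output."""
    params = {
        "defType": "dismax",   # assumption: how the dismax parser is selected
        "q": user_query,
        "qs": qs,
        "qf": qf,              # assumption: hypothetical field names
        "debugQuery": "true",  # adds debug info, including "parsedquery"
    }
    return base_url + "?" + urlencode(params)

url = debug_query_url("http://localhost:8983/solr/select", "child custody", 5, "title body")
print(url)
```

Fetching that URL and looking for the slop value in "parsedquery" shows whether qs was applied to the query at all.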



-Hoss


Re: Query for Distributed search -

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Query for Distributed search -
: In-Reply-To: <c6...@mail.gmail.com>



http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss


Re: Query for Distributed search -

Posted by James liu <li...@gmail.com>.
Up to your solr client.

On Mon, Nov 24, 2008 at 1:24 PM, souravm <SO...@infosys.com> wrote:

> Hi,
>
> Looking for some insight on distributed search.
>
> Say I have an index distributed across 3 boxes, and the index contains time
> and text data (typical log file). Each box holds the index for a different
> time range - say Box 1 for Jan to April, Box 2 for May to August and Box 3
> for Sep to Dec.
>
> Now if I try to search for a text string, will the search happen in
> parallel on all 3 boxes or sequentially?
>
> Regards,
> Sourav
>
> **************** CAUTION - Disclaimer *****************
> This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended
> solely
> for the use of the addressee(s). If you are not the intended recipient,
> please
> notify the sender by e-mail and delete the original message. Further, you
> are not
> to copy, disclose, or distribute this e-mail or its contents to any other
> person and
> any such actions are unlawful. This e-mail may contain viruses. Infosys has
> taken
> every reasonable precaution to minimize this risk, but is not liable for
> any damage
> you may sustain as a result of any virus in this e-mail. You should carry
> out your
> own virus checks before opening the e-mail or attachment. Infosys reserves
> the
> right to monitor and review the content of all messages sent to or from
> this e-mail
> address. Messages sent to or from this e-mail address may be stored on the
> Infosys e-mail system.
> ***INFOSYS******** End of Disclaimer ********INFOSYS***
>



-- 
regards
j.L

RE: Query for Distributed search -

Posted by souravm <SO...@infosys.com>.
Hi,

I understand your point about how I could do this myself in my Java code.

However, I'm more interested in how DistributedSearch behaves by default when I issue a command like "curl 'http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr'", as mentioned in the wiki.

Regards,
Sourav

-----Original Message-----
From: Aleksander M. Stensby [mailto:aleksander.stensby@integrasco.no] 
Sent: Monday, November 24, 2008 12:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Query for Distributed search -

If you use SolrJ and the HttpSolrServer, for instance, you could add
logic to your querying to make your searches more efficient!
That is partially the idea of sharding, right? :) So if the user wants to
search for a log file in June, your application knows that June logs are
stored on the second box, and hence will direct the search to that box.
Alternatively, if he wants to search for logs spanning two boxes, you
merely add the shards parameter to your query and include the paths to
the two shards in question. I'm not really sure how Solr handles
the merging of results etc. and whether the requests are done in
parallel or sequentially, but I do know that you could easily manage this
on your own through Java if you want to. (Simply set up one
HttpSolrServer in your code for each shard, search them in
parallel in separate threads, and then reduce the results afterwards.)

Have a look at http://wiki.apache.org/solr/DistributedSearch for more info.
You could also take a look at Hadoop. (http://hadoop.apache.org/)

regards,
  Aleks


Re: Query for Distributed search -

Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
If you use SolrJ and the HttpSolrServer, for instance, you could add
logic to your querying to make your searches more efficient!
That is partially the idea of sharding, right? :) So if the user wants to
search for a log file in June, your application knows that June logs are
stored on the second box, and hence will direct the search to that box.
Alternatively, if he wants to search for logs spanning two boxes, you
merely add the shards parameter to your query and include the paths to
the two shards in question. I'm not really sure how Solr handles
the merging of results etc. and whether the requests are done in
parallel or sequentially, but I do know that you could easily manage this
on your own through Java if you want to. (Simply set up one
HttpSolrServer in your code for each shard, search them in
parallel in separate threads, and then reduce the results afterwards.)
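
A rough sketch of the do-it-yourself approach described above (route by month, query the relevant shards in separate threads, then merge). It is written in Python rather than SolrJ to stay self-contained; the host names, the month-to-box mapping, and fake_search (a stand-in for a real per-shard HTTP call) are all hypothetical. Merging by raw score is a simplification, since scores computed on separate indexes are not strictly comparable.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout from the question: three boxes, four months each.
SHARDS = {
    "box1:8983/solr": range(1, 5),    # Jan-Apr
    "box2:8983/solr": range(5, 9),    # May-Aug
    "box3:8983/solr": range(9, 13),   # Sep-Dec
}

def shards_for_months(first_month, last_month):
    """Return the shards whose month range overlaps [first_month, last_month]."""
    wanted = set(range(first_month, last_month + 1))
    return [shard for shard, months in SHARDS.items() if wanted & set(months)]

def search_all(shards, search_one, query):
    """Call search_one(shard, query) for each shard in parallel, then merge by score."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = list(pool.map(lambda shard: search_one(shard, query), shards))
    merged = [hit for hits in per_shard for hit in hits]
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)

# Stand-in for a real request to one shard; returns canned hits only.
CANNED_SCORES = {"box1:8983/solr": 0.4, "box2:8983/solr": 0.9, "box3:8983/solr": 0.6}

def fake_search(shard, query):
    return [{"id": shard + "#1", "score": CANNED_SCORES[shard]}]

june_only = shards_for_months(6, 6)
april_to_may = shards_for_months(4, 5)
hits = search_all(april_to_may, fake_search, "error")
print(june_only, [hit["id"] for hit in hits])
```

The same shard list could instead be joined with commas and passed as the shards parameter, leaving the merging to Solr.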

Have a look at http://wiki.apache.org/solr/DistributedSearch for more info.
You could also take a look at Hadoop. (http://hadoop.apache.org/)

regards,
  Aleks

On Mon, 24 Nov 2008 06:24:51 +0100, souravm <SO...@infosys.com> wrote:

> Hi,
>
> Looking for some insight on distributed search.
>
> Say I have an index distributed across 3 boxes, and the index contains time
> and text data (typical log file). Each box holds the index for a different
> time range - say Box 1 for Jan to April, Box 2 for May to August and
> Box 3 for Sep to Dec.
>
> Now if I try to search for a text string, will the search happen
> in parallel on all 3 boxes or sequentially?
>
> Regards,
> Sourav
>



-- 
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no

Query for Distributed search -

Posted by souravm <SO...@infosys.com>.
Hi,

Looking for some insight on distributed search.

Say I have an index distributed across 3 boxes, and the index contains time and text data (typical log file). Each box holds the index for a different time range - say Box 1 for Jan to April, Box 2 for May to August and Box 3 for Sep to Dec.

Now if I try to search for a text string, will the search happen in parallel on all 3 boxes or sequentially?

Regards,
Sourav


Re: Please Help !! Question about Query Phrase Slop (qs) in dismax

Posted by Yonik Seeley <yo...@apache.org>.
If you boost the phrase queries by enough, you could tell when you hit
the less relevant documents by the score.
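
One way to read that suggestion, sketched in Python under loud assumptions: the hits are already sorted by descending score, each hit is a dict with a "score" key, and the cutoff is a value you would have to find by experiment for your own boosts (scores are not normalized, so no particular number is implied here).

```python
from itertools import takewhile

def cut_at_score(hits, min_score):
    """Keep the leading hits until the score first drops below min_score."""
    return list(takewhile(lambda hit: hit["score"] >= min_score, hits))

# Hypothetical result list: two phrase-boosted hits, then a weak match.
hits = [
    {"id": "a", "score": 12.4},
    {"id": "b", "score": 9.8},
    {"id": "c", "score": 0.7},
]
kept = cut_at_score(hits, 5.0)
print([hit["id"] for hit in kept])
```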

-Yonik

On Mon, Nov 24, 2008 at 12:07 AM, anuvenk <an...@hotmail.com> wrote:
>
> Thanks for the response. Well my current ps setting works great for most
> search terms. But say this typical example, north dakota 1031 exchange
> lawyers - we don't have any relevant docs in the index. Solr is returning
> the irrelevant doc, just because it found 'lawyer', exchange, north & dakota
> somewhere. I thought if there is a way to just not return any results if
> they are not within close proximity, it would be great.
>

Re: Please Help !! Question about Query Phrase Slop (qs) in dismax

Posted by anuvenk <an...@hotmail.com>.
Thanks for the response. Well my current ps setting works great for most
search terms. But say this typical example, north dakota 1031 exchange
lawyers - we don't have any relevant docs in the index. Solr is returning
the irrelevant doc, just because it found 'lawyer', exchange, north & dakota
somewhere. I thought if there is a way to just not return any results if
they are not within close proximity, it would be great. 


-- 
View this message in context: http://www.nabble.com/Please-Help-%21%21-Question-about-Query-Phrase-Slop-%28qs%29-in-dismax-tp20643003p20655014.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Please Help !! Question about Query Phrase Slop (qs) in dismax

Posted by Yonik Seeley <yo...@apache.org>.
On Sun, Nov 23, 2008 at 11:51 PM, anuvenk <an...@hotmail.com> wrote:
> Please help, someone... I've been waiting for an answer for the last couple of
> days & no one seems to be helping out here. I did search the wiki & this
> forum but couldn't find an answer. I know that if ps is set to 5,
> words within 5 words of one another receive a boost in score. But is there a
> way to not return results that have the search terms more than 5
> words apart?

Not with dismax.  I'm not sure why it's a problem, given that with
enough boost you should be able to ensure that all of the results with
a slop less than 5 appear before other results.
Anyway, if you want to restrict results to those with a slop of 5, use
the standard query parser with an explicit sloppy phrase query:

"north dakota 1031 exchange lawyers"~5
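
A small sketch of how such a query string could be assembled, assuming the request goes to the standard query parser and that a default search field is configured; the host is hypothetical, and the helper does not escape special characters inside the terms.

```python
from urllib.parse import urlencode

def sloppy_phrase(terms, slop):
    """Wrap the terms in quotes and append ~slop, e.g. "a b"~5."""
    return '"{}"~{}'.format(" ".join(terms), slop)

def standard_query_url(base_url, query):
    """Build a select URL carrying the already-formed query string."""
    return base_url + "?" + urlencode({"q": query})

phrase = sloppy_phrase(["north", "dakota", "1031", "exchange", "lawyers"], 5)
url = standard_query_url("http://localhost:8983/solr/select", phrase)
print(phrase)
print(url)
```

Docs in which the terms cannot be matched within the given slop are simply not returned, which is the restriction being asked for.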

-Yonik


> Typical example: north dakota 1031 exchange lawyers
> My first result is absolutely irrelevant. It returned a north dakota doc,
> but one that just had an occurrence of attorney somewhere & an occurrence of
> exchange (not related to 1031 exchange though). They were not within 5 words
> of one another. My guys have been hammering me regarding this relevancy issue.
> Please help, someone.
>

Re: Please Help !! Question about Query Phrase Slop (qs) in dismax

Posted by anuvenk <an...@hotmail.com>.
Please help, someone... I've been waiting for an answer for the last couple of
days & no one seems to be helping out here. I did search the wiki & this
forum but couldn't find an answer. I know that if ps is set to 5,
words within 5 words of one another receive a boost in score. But is there a
way to not return results that have the search terms more than 5
words apart?
Typical example: north dakota 1031 exchange lawyers
My first result is absolutely irrelevant. It returned a north dakota doc,
but one that just had an occurrence of attorney somewhere & an occurrence of
exchange (not related to 1031 exchange though). They were not within 5 words
of one another. My guys have been hammering me regarding this relevancy issue.
Please help, someone.


-- 
View this message in context: http://www.nabble.com/Please-Help-%21%21-Question-about-Query-Phrase-Slop-%28qs%29-in-dismax-tp20643003p20654906.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question about Query Phrase Slop (qs) in dismax

Posted by anuvenk <an...@hotmail.com>.
Somebody please help clear up this doubt. What more could I do with the dismax
handler so that docs in which 'word1', 'word2', 'word3', etc. from a search
phrase are not within 5 words of one another do not come up in the
results?



-- 
View this message in context: http://www.nabble.com/Question-about-Query-Phrase-Slop-%28qs%29-in-dismax-tp20643003p20648109.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question about Query Phrase Slop (qs) in dismax

Posted by Chris Hostetter <ho...@fucit.org>.
: From the solr wiki, it sounded like if qs is set to 5 for example, & if the
: search term is 'child custody', only docs with 'child' & 'custody' within 5
: words of one another would be returned in results. Is this correct? If so,

No.  as explained on the wiki...

>> Amount of slop on phrase queries explicitly included in the 
>> user's query string

note the "explicitly included" part ... if the query string doesn't 
contain any quotation marks, 'qs' isn't used at all.  (as opposed to 'ps' 
which is "Amount of slop on phrase queries built for 'pf' fields")

in a query like this...

   q=child+custody&qs=5&qf=...

...the 'qs' is ignored.  if you want to require that the input words all 
appear within a set slop of each other (in at least one 'qf' field) you 
need to quote the user's input...

   q="child+custody"&qs=5&qf=...
  
: in bad user experience as those docs are not so relevant. What more could i
: do to improve quality in the results?

use 'pf' with very high boosts (compared to the 'qf' boosts) so that phrase 
matching docs appear before non phrase matching docs.
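
A hedged sketch of both options in Python: quoting the user's input so that 'qs' applies, and setting 'pf' boosts well above the 'qf' boosts. The field names, boost values, and defType=dismax are assumptions for illustration only, not recommended settings.

```python
from urllib.parse import urlencode

def dismax_params(user_input, require_phrase, qs=5, ps=5):
    """Build dismax parameters; quote the input only when a phrase match is required."""
    text = user_input.replace('"', "")            # drop any quotes the user typed
    q = '"{}"'.format(text) if require_phrase else text
    return {
        "defType": "dismax",       # assumption: how the dismax parser is selected
        "q": q,
        "qf": "title^2 body",      # hypothetical fields and boosts
        "pf": "title^50 body^20",  # hypothetical, deliberately much higher than qf
        "qs": qs,                  # only used when q contains an explicit phrase
        "ps": ps,                  # slop for the phrase queries built for pf
    }

strict = dismax_params("child custody", require_phrase=True)
ranked = dismax_params("child custody", require_phrase=False)
print(urlencode(strict))
print(urlencode(ranked))
```

With require_phrase=True, docs must match the phrase within the 'qs' slop to be returned; with False, all matching docs come back and 'pf' only pushes the phrase matches toward the top.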



-Hoss