Posted to solr-user@lucene.apache.org by al...@aim.com on 2011/11/04 00:30:45 UTC

how to achieve google.com like results for phrase queries

Hello,

I index nutch-1.3 crawl results in solr-3.4. I noticed that for two-word queries like newspaper latimes, latimes.com is not in the results at all.
This may be due to the dismax defType that I use in the request handler:

<str name="defType">dismax</str>
<str name="qf">url^1.5 id^1.5 content^ title^1.2</str>
<str name="pf">url^1.5 id^1.5 content^0.5 title^1.2</str>


with mm set to
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str> 

However, changing it to 
<str name="mm">1&lt;-1 2&lt;-1 5&lt;-2 6&lt;90%</str> 

and setting q.op to OR or AND

does not solve the problem. In this case latimes.com is ranked higher, but it is still not in first place.
Also, in this case results containing both words are ranked very low, almost at the end.
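
For reference, here is a rough walk-through of how the two mm values above would be evaluated. It assumes the documented dismax syntax, where a condition n<value applies only when the query has more than n optional clauses. The two entries are alternatives, so only one of them would appear in a real handler.

```xml
<!-- Original setting. The condition with the largest n that is still
     below the clause count is the one that applies.
       1 or 2 clauses : all required (no condition applies)
       3 to 5 clauses : all but one required
       6 clauses      : all but two required
       7 or more      : 90% required
     A two-word query such as "newspaper latimes" therefore only matches
     documents that contain both words. -->
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

<!-- Relaxed setting. The extra 1<-1 condition now covers two-clause
     queries.
       1 clause       : required
       2 to 5 clauses : all but one required
       6 clauses      : all but two required
       7 or more      : 90% required
     A two-word query now also matches documents that contain only one
     of the words. -->
<str name="mm">1&lt;-1 2&lt;-1 5&lt;-2 6&lt;90%</str>
```

Note that mm only decides which documents match. The order among the matching documents comes from the qf and pf scores, so relaxing mm by itself does not push documents that contain both words towards the top.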

We need latimes.com to be ranked first, followed by results that contain both words, and so on.

Any ideas on how to modify the config to this end?

Thanks in advance.
Alex.


Re: how to achieve google.com like results for phrase queries

Posted by al...@aim.com.
Solr can also query link (URL) text and rank those matches higher if we specify url in the qf field. The only thing I do not understand is why it does not rank pages with both words higher when mm is set to 1<-1. It seems to me that this is a bug.

Thanks.
Alex.

-----Original Message-----
From: Ted Dunning <te...@gmail.com>
To: solr-user <so...@lucene.apache.org>
Sent: Sat, Nov 5, 2011 8:59 pm
Subject: Re: how to achieve google.com like results for phrase queries


Google achieves their results by using data not found in the web pages
themselves.  This additional data critically includes link text, but also
is derived from behavioral information.



Re: how to achieve google.com like results for phrase queries

Posted by Ted Dunning <te...@gmail.com>.
Google achieves their results by using data not found in the web pages
themselves.  This additional data critically includes link text, but also
is derived from behavioral information.
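
If inbound link text were available in the index, one way to use it would be to add it to the dismax fields. The fragment below is only a sketch: the field name anchor, the assumption that it is tokenized and filled with link text by the crawler, and the boost values are all hypothetical and not taken from the configuration in this thread.

```xml
<!-- Hypothetical: "anchor" is assumed to be a tokenized field holding the
     text of inbound links. All boost values are illustrative only. -->
<str name="defType">dismax</str>
<str name="qf">anchor^2.0 url^1.5 id^1.5 title^1.2 content</str>
<str name="pf">anchor^2.0 url^1.5 id^1.5 title^1.2 content^0.5</str>
```

With something like this, a page that other sites link to with the words "newspaper" or "latimes" could match and score on those words even when its own content does not contain them. Behavioral data has no equivalent in this configuration.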



On Sat, Nov 5, 2011 at 5:07 PM, <al...@aim.com> wrote:

> Hi Erick,
>
> The term  "newspaper latimes" is not found in latimes.com. However,
> google places it in the first place. My guess is that mm parameter must
> not be set as 2<-1 in order to achieve google.com like ranking for
> two word phrase queries.
>
> My goal is to set mm parameter in such a way that latimes.com will be
> ranked in 1-3rd places and sites with both words will be placed after them.
> As I wrote in my previous letter
> setting mm as 1<-1 solves this issue partially. Problem in this case is
> that sites with both words are placed at the bottom or are not in the
> search results at all.
>
> Thanks.
> Alex.

Re: how to achieve google.com like results for phrase queries

Posted by al...@aim.com.
Hi Erick,

The phrase "newspaper latimes" is not found on latimes.com. However, Google ranks latimes.com first for it. My guess is that the mm parameter must not be set to 2<-1 in order to achieve google.com-like ranking for two-word queries.

My goal is to set the mm parameter so that latimes.com is ranked in the top three and sites with both words are placed after it. As I wrote in my previous message, setting mm to 1<-1 solves this issue partially. The problem in this case is that sites with both words are placed at the bottom or are not in the search results at all.

Thanks.
Alex.

-----Original Message-----
From: Erick Erickson <er...@gmail.com>
To: solr-user <so...@lucene.apache.org>
Sent: Sat, Nov 5, 2011 9:01 am
Subject: Re: how to achieve google.com like results for phrase queries


First, the default query operator (q.op) is ignored by dismax and edismax here, so that's
not doing anything; mm decides how many terms must match.

Why would you expect "newspaper latimes" to be found at all in "latimes.com"?
What proof do you have that the two terms are even in the "latimes.com" document?

You can look at the Query Elevation Component to force certain known
documents to the top of the results based on the search terms, but that's
not a very elegant solution.

What business requirement are you trying to accomplish here? Because as
asked, there's really not enough information to provide a meaningful
suggestion.

Best
Erick


Re: how to achieve google.com like results for phrase queries

Posted by Erick Erickson <er...@gmail.com>.
First, the default query operator (q.op) is ignored by dismax and edismax here, so that's
not doing anything; mm decides how many terms must match.

Why would you expect "newspaper latimes" to be found at all in "latimes.com"?
What proof do you have that the two terms are even in the "latimes.com" document?

You can look at the Query Elevation Component to force certain known
documents to the top of the results based on the search terms, but that's
not a very elegant solution.
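
A minimal sketch of what that could look like follows. It assumes the uniqueKey of the index is the page URL; the handler name and the document id are placeholders rather than values taken from this thread.

```xml
<!-- solrconfig.xml: register the component and attach it to a handler.
     The handler name "/elevate" is an assumption; use your own handler. -->
<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <!-- Field type used to analyze the incoming query text before lookup -->
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

<requestHandler name="/elevate" class="solr.SearchHandler">
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>

<!-- elevate.xml: pin one document to the top for one query text.
     The id below is a placeholder for the document's uniqueKey value. -->
<elevate>
  <query text="newspaper latimes">
    <doc id="http://www.latimes.com/" />
  </query>
</elevate>
```

Every query text has to be listed by hand, and with a string queryFieldType the match is on the exact text, which is why this does not scale beyond a short list of known queries.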

What business requirement are you trying to accomplish here? Because as
asked, there's really not enough information to provide a meaningful
suggestion.

Best
Erick

On Thu, Nov 3, 2011 at 7:30 PM,  <al...@aim.com> wrote:
> Hello,
>
> I use nutch-1.3 crawled results in solr-3.4. I noticed that for two word phrases like newspaper latimes, latimes.com is not in results at all.
> This may be due to the dismax def type that I use in  request handler
>
> <str name="defType">dismax</str>
> <str name="qf">url^1.5 id^1.5 content^ title^1.2</str>
> <str name="pf">url^1.5 id^1.5 content^0.5 title^1.2</str>
>
>
>  with mm as
> <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
>
> However, changing it to
> <str name="mm">1&lt;-1 2&lt;-1 5&lt;-2 6&lt;90%</str>
>
> and q.op to OR or AND
>
> do not solve the problem. In this case latimes.com is ranked higher, but still is not in the first place.
> Also in this case results with both words are ranked very low, almost at the end.
>
> We need to be able to achieve the case when latimes.com is placed in the first place then results with both words and etc.
>
> Any ideas how to modify config to this end?
>
> Thanks in advance.
> Alex.
>
>