Posted to user@nutch.apache.org by Johan Svensson <jo...@euroling.se> on 2011/08/31 10:22:34 UTC

Weight servers differently

I want to assign different weights to different domains, so that I can push up
results from my main site. Say, for example, I have www.example.com with a
few but important pages, and blog.example.com with zillions of
not-really-that-good pages. When searching, I'd like the hits at
www.example.com to be weighted up by some factor over those from
blog.example.com. Of course, other weights must be considered as usual; it is
just that the blog pages are generally not as important as the main pages.
Is this possible, and how? I am using Nutch + Solr.

Re: Weight servers differently

Posted by Johan Svensson <jo...@euroling.se>.
This works magically well. So does &bq=site:www.example.com, once I got
that one right. :) Thank you!

2011/8/31 Markus Jelsma <ma...@openindex.io>

> Hmm. Better to use a function query instead:
>
> bf=query($qq)^20&qq=site:example.com
>
> This will boost all documents from example.com in the result set.

Re: Weight servers differently

Posted by Markus Jelsma <ma...@openindex.io>.
Hmm. Better to use a function query instead:

bf=query($qq)^20&qq=site:example.com

This will boost all documents from example.com in the result set.
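
As a rough sketch of how the parameters above might be sent from a client, the following Python builds the query string. Only `bf`, `qq`, and the `site` field come from this message; the `/select` path, `defType=dismax`, and the `qf` field `content` are assumptions made for illustration and may not match your setup.

```python
from urllib.parse import urlencode, parse_qs

def build_bf_query(user_query, site, boost=20):
    """Return a Solr query string that adds a function-based boost for one site."""
    params = {
        "defType": "dismax",          # assumption: the DisMax parser is in use
        "q": user_query,              # what the user typed
        "qf": "content",              # hypothetical field to search in
        "bf": f"query($qq)^{boost}",  # boost function from the message above
        "qq": f"site:{site}",         # sub-query that $qq dereferences
    }
    return urlencode(params)

query_string = build_bf_query("foobar", "www.example.com")
print("/select?" + query_string)  # hypothetical handler path
```

Because `bf` only adds to the score, documents from other sites should still be returned, just ranked lower when all else is equal.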

On Wednesday 31 August 2011 15:58:18 Johan Svensson wrote:
> Thank you, Markus,
> 
> At this point, I just want this to work. I have no idea whether I want to
> omitNorms or not. Right now I don't feel like I do, anyway.
> More importantly, I want to boost pages whose site field is www.example.com
> over blog.example.com, but without dropping hits from blog.example.com.
> 
> The boost query seems to filter out hits from blog.example.com completely,
> so that is not what I want.
> 
> Abusing the boost field might be a nice idea. Can you please show me an
> example, presuming I don't really understand the connection between all the
> XML files and binaries? I don't even really know which of Solr and Nutch is
> responsible for which task... :)

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Weight servers differently

Posted by Johan Svensson <jo...@euroling.se>.
Thank you, Markus,

At this point, I just want this to work. I have no idea whether I want to
omitNorms or not. Right now I don't feel like I do, anyway.
More importantly, I want to boost pages whose site field is www.example.com
over blog.example.com, but without dropping hits from blog.example.com.

The boost query seems to filter out hits from blog.example.com completely,
so that is not what I want.

Abusing the boost field might be a nice idea. Can you please show me an
example, presuming I don't really understand the connection between all the
XML files and binaries? I don't even really know which of Solr and Nutch is
responsible for which task... :)

2011/8/31 Markus Jelsma <ma...@openindex.io>

> Index-time boosting is not very common and raises issues if you want
> to omitNorms in Solr, because index-time boosts are stored in the norms.
>
> In Solr DisMax you can use a bq (boost query) to boost site:example.com^10.
> All results that match the boost query receive a ^10 boost. This is done
> purely on the client side, at query time.
>
> You can also abuse the boost field Nutch is writing. By default this is 1.0f.
> You can write a simple scoring filter, or even an indexing filter, that checks
> the site field for your site and sets the boost field accordingly.

Re: Weight servers differently

Posted by Markus Jelsma <ma...@openindex.io>.
Index-time boosting is not very common and raises issues if you want
to omitNorms in Solr, because index-time boosts are stored in the norms.

In Solr DisMax you can use a bq (boost query) to boost site:example.com^10.
All results that match the boost query receive a ^10 boost. This is done
purely on the client side, at query time.

You can also abuse the boost field Nutch is writing. By default this is 1.0f.
You can write a simple scoring filter, or even an indexing filter, that checks
the site field for your site and sets the boost field accordingly.
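
The decision such a filter would make can be sketched as follows. This is only an illustration of the logic: real Nutch scoring and indexing filters are Java plugins, and nothing below is Nutch API. The `SITE_BOOSTS` table, its value of 5.0, and the dict-based document are all hypothetical.

```python
# Hypothetical per-site boost table; the value 5.0 is illustrative only.
SITE_BOOSTS = {"www.example.com": 5.0}
DEFAULT_BOOST = 1.0  # the default boost mentioned in the message above

def apply_site_boost(doc):
    """Return a copy of doc whose 'boost' is chosen from its 'site' field."""
    boosted = dict(doc)  # leave the caller's document untouched
    boosted["boost"] = SITE_BOOSTS.get(doc.get("site"), DEFAULT_BOOST)
    return boosted

docs = [
    {"url": "http://www.example.com/page42.html", "site": "www.example.com"},
    {"url": "http://blog.example.com/post1", "site": "blog.example.com"},
]
for boosted_doc in map(apply_site_boost, docs):
    print(boosted_doc["site"], boosted_doc["boost"])
```

Sites missing from the table keep the default, so blog pages are still indexed and can still be found; they simply get no extra weight.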

On Wednesday 31 August 2011 15:30:08 Johan Svensson wrote:
> I guess this is the solution. However, I have been trying to implement this
> the whole afternoon with no success. I have a field "site" in my
> schema.xml, stored and indexed. I'm using nutch solrindex to tell Solr to
> index what Nutch has crawled. How can I tell Nutch to tell Solr to boost
> all documents with the value "www.example.com" in the "site" field? An
> example would be perfect for a loser like myself. I've googled all the
> Internets over and over.


Re: Weight servers differently

Posted by Johan Svensson <jo...@euroling.se>.
I guess this is the solution. However, I have been trying to implement this
the whole afternoon with no success. I have a field "site" in my schema.xml,
stored and indexed. I'm using nutch solrindex to tell Solr to index what
Nutch has crawled. How can I tell Nutch to tell Solr to boost all documents
with the value "www.example.com" in the "site" field? An example would be
perfect for a loser like myself. I've googled all the Internets over and
over.

2011/8/31 Gora Mohanty <go...@mimirtech.com>

> Presumably, you know the domain name from which the
> document originates at indexing time. If so, you can use
> index-time boosting:
> http://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts
> E.g., this can be used to boost all documents from www.example.com
> over those from blog.example.com.
>
> Regards,
> Gora
>

Re: Weight servers differently

Posted by Gora Mohanty <go...@mimirtech.com>.
On Wed, Aug 31, 2011 at 2:51 PM, Johan Svensson
<jo...@euroling.se> wrote:
> Thank you! This looks interesting. However, I wonder if it really can solve
> this problem. No part of the search query is necessarily part of the
> domain name. Let's say, for example, that we search for "foobar". This word
> is found on www.example.com/page42.html, as well as on lots of pages
> with different names at blog.example.com/. Can you apply boosting magic to
> the hit at www.example.com although the search term is not part of the
> URL?

Presumably, you know the domain name from which the
document originates at indexing time. If so, you can use
index-time boosting:
http://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts
E.g., this can be used to boost all documents from www.example.com
over those from blog.example.com.
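
To illustrate what an index-time boost looks like on the wire, here is a minimal Python sketch that builds a Solr XML `<add>` message with a `boost` attribute on `<doc>`. It assumes a Solr version whose XML update format accepts that attribute, as the FAQ linked above describes. The field names `id` and `site` and the value 2.5 are assumptions for illustration; with Nutch, the solrindex job sends the documents itself, so this only shows the shape of the message.

```python
import xml.etree.ElementTree as ET

def build_add_message(fields, boost=1.0):
    """Build a Solr XML <add> message for one document with a document boost."""
    add = ET.Element("add")
    # Assumption: this Solr version honours the boost attribute on <doc>.
    doc = ET.SubElement(add, "doc", boost=str(boost))
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

xml_message = build_add_message(
    {"id": "http://www.example.com/page42.html", "site": "www.example.com"},
    boost=2.5,  # illustrative value, not a recommendation
)
print(xml_message)
```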

Regards,
Gora

Re: Weight servers differently

Posted by Johan Svensson <jo...@euroling.se>.
Thank you! This looks interesting. However, I wonder if it really can solve
this problem. No part of the search query is necessarily part of the
domain name. Let's say, for example, that we search for "foobar". This word
is found on www.example.com/page42.html, as well as on lots of pages
with different names at blog.example.com/. Can you apply boosting magic to
the hit at www.example.com although the search term is not part of the
URL?

2011/8/31 Gora Mohanty <go...@mimirtech.com>

> Not sure how far this is possible within the Nutch search
> engine itself, but if you are pushing the results to Solr,
> you can use boosting, and other techniques. This might
> be a good starting point:
> http://wiki.apache.org/solr/SolrRelevancyCookbook
>
> Regards,
> Gora
>

Re: Weight servers differently

Posted by Gora Mohanty <go...@mimirtech.com>.
On Wed, Aug 31, 2011 at 1:52 PM, Johan Svensson
<jo...@euroling.se> wrote:
> I want to assign different weights to different domains, so that I can push up
> results from my main site. Say, for example, I have www.example.com with a
> few but important pages, and blog.example.com with zillions of
> not-really-that-good pages. When searching, I'd like the hits at
> www.example.com to be weighted up by some factor over those from
> blog.example.com. Of course, other weights must be considered as usual; it is
> just that the blog pages are generally not as important as the main pages.
> Is this possible, and how? I am using Nutch + Solr.
[...]

Not sure how far this is possible within the Nutch search
engine itself, but if you are pushing the results to Solr,
you can use boosting, and other techniques. This might
be a good starting point:
http://wiki.apache.org/solr/SolrRelevancyCookbook

Regards,
Gora