You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by Piotr Kosiorowski <pk...@gmail.com> on 2005/04/19 23:23:43 UTC

Configurable boost

Hello,

During my experiment with adding title and host to scoring I have 
changed nutch source to allow setting boost values for fields in nutch 
config file. I am attaching patch for latest SVN version.

I was testing many values of these properties for our site and finally 
we use: 2.0 for host, 1.0 for all others. It was helping a little when 
our index contained a lot of pages from spam sites. When we removed 
those bad pages - it really does not matter as much as I hoped. Anyway I 
am attaching a patch so anyone can play with it.
Regards
Piotr


Re: Configurable boost

Posted by Stefan Groschupf <sg...@media-style.com>.
Piotr,

the combination of things improve ranking, so you will get my
+1 for the patch.
However since nutch allows adding custom field via index filter it  
would be great to see the patch more generic for fields in general.

Thanks.
Stefan

Am 19.04.2005 um 23:23 schrieb Piotr Kosiorowski:

> Hello,
>
> During my experiment with adding title and host to scoring I have  
> changed nutch source to allow setting boost values for fields in nutch  
> config file. I am attaching patch for latest SVN version.
>
> I was testing many values of these properties for our site and finally  
> we use: 2.0 for host, 1.0 for all others. It was helping a little when  
> our index contained a lot of pages from spam sites. When we removed  
> those bad pages - it really does not matter as much as I hoped. Anyway  
> I am attaching a patch so anyone can play with it.
> Regards
> Piotr
>
> Index: conf/nutch-default.xml
> ===================================================================
> --- conf/nutch-default.xml	(revision 161968)
> +++ conf/nutch-default.xml	(working copy)
> @@ -669,4 +669,43 @@
>    </description>
>  </property>
>
> +<!-- query-basic plugin properties -->
> +
> +  <property>
> +    <name>query.url.boost</name>
> +    <value>4.0</value>
> +    <description> Used as a boost for url field in Lucene query.
> +    </description>
> +  </property>
> +
> +  <property>
> +    <name>query.anchor.boost</name>
> +    <value>2.0</value>
> +    <description> Used as a boost for anchor field in Lucene query.
> +    </description>
> +  </property>
> +
> +
> +  <property>
> +    <name>query.title.boost</name>
> +    <value>1.5</value>
> +    <description> Used as a boost for title field in Lucene query.
> +    </description>
> +  </property>
> +
> +  <property>
> +    <name>query.host.boost</name>
> +    <value>2.0</value>
> +    <description> Used as a boost for host field in Lucene query.
> +    </description>
> +  </property>
> +
> +  <property>
> +    <name>query.phrase.boost</name>
> +    <value>1.0</value>
> +    <description> Used as a boost for phrase in Lucene query.
> +    Multiplied by boost for field phrase is matched in.
> +    </description>
> +  </property>
> +
>  </nutch-conf>
> Index:  
> src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/ 
> BasicQueryFilter.java
> ===================================================================
> ---  
> src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/ 
> BasicQueryFilter.java	(revision 161968)
> +++  
> src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/ 
> BasicQueryFilter.java	(working copy)
> @@ -26,6 +26,7 @@
>  import org.apache.nutch.searcher.QueryFilter;
>  import org.apache.nutch.searcher.Query;
>  import org.apache.nutch.searcher.Query.*;
> +import org.apache.nutch.util.NutchConf;
>
>  import java.io.IOException;
>  import java.util.HashSet;
> @@ -33,15 +34,24 @@
>  /** The default query filter.  Query terms in the default query field  
> are
>   * expanded to search the url, anchor and content document fields.*/
>  public class BasicQueryFilter implements QueryFilter {
> +
> +    private static float URL_BOOST = NutchConf.get().getFloat(
> +            "query.url.boost", 4.0f);
>
> -  private static float URL_BOOST = 4.0f;
> -  private static float ANCHOR_BOOST = 2.0f;
> -  private static float TITLE_BOOST = 1.5f;
> -  private static float HOST_BOOST = 2.0f;
> +    private static float ANCHOR_BOOST = NutchConf.get().getFloat(
> +            "query.anchor.boost", 2.0f);
>
> -  private static int SLOP = Integer.MAX_VALUE;
> -  private static float PHRASE_BOOST = 1.0f;
> +    private static float TITLE_BOOST = NutchConf.get().getFloat(
> +            "query.title.boost", 1.5f);
>
> +    private static float HOST_BOOST = NutchConf.get().getFloat(
> +            "query.host.boost", 2.0f);
> +
> +    private static int SLOP = Integer.MAX_VALUE;
> +
> +    private static float PHRASE_BOOST = NutchConf.get().getFloat(
> +            "query.phrase.boost", 1.0f);
> +
>    private static final String[] FIELDS =
>    { "url", "anchor", "content", "title", "host" };
>
>
---------------------------------------------------------------
company:		http://www.media-style.com
forum:		http://www.text-mining.org
blog:			http://www.find23.net
-------------------------------------------------------------
Hommingberger Gepardenforelle
http://wiki.media-style.com/display/~hommingbergergepardenforelle


Re: Configurable boost

Posted by Doug Cutting <cu...@nutch.org>.
Piotr Kosiorowski wrote:
> During my experiment with adding title and host to scoring I have 
> changed nutch source to allow setting boost values for fields in nutch 
> config file. I am attaching patch for latest SVN version.

Thanks!  This has long been needed.

I just committed your patch, with a few indentation changes.

Cheers,

Doug