Posted to dev@nutch.apache.org by Piotr Kosiorowski <pk...@gmail.com> on 2005/04/19 23:23:43 UTC
Configurable boost
Hello,
While experimenting with adding title and host to scoring, I changed the
Nutch source so that boost values for fields can be set in the Nutch
config file. I am attaching a patch against the latest SVN version.
I tested many values of these properties for our site, and we finally
settled on 2.0 for host and 1.0 for all others. This helped a little while
our index contained a lot of pages from spam sites. Once we removed
those bad pages, the boosts mattered much less than I had hoped. Anyway, I
am attaching the patch so anyone can play with it.
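To make the effect of those values concrete, here is a minimal sketch that writes one query term out across the five fields in Lucene's query-string notation (field:term^boost). The class and method names are hypothetical, and this is not how BasicQueryFilter builds its query objects; it only illustrates which field ends up weighted more heavily with 2.0 for host and 1.0 elsewhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of the patch or of Nutch. */
class BoostedQuerySketch {

    /** Writes one term against each field as field:term^boost, space separated. */
    static String expand(String term, Map<String, Float> boosts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : boosts.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(':').append(term)
              .append('^').append(e.getValue());
        }
        return sb.toString();
    }

    /** The values described above: 2.0 for host, 1.0 for every other field. */
    static Map<String, Float> siteBoosts() {
        Map<String, Float> boosts = new LinkedHashMap<>(); // keeps field order
        boosts.put("url", 1.0f);
        boosts.put("anchor", 1.0f);
        boosts.put("content", 1.0f);
        boosts.put("title", 1.0f);
        boosts.put("host", 2.0f);
        return boosts;
    }

    public static void main(String[] args) {
        System.out.println(expand("nutch", siteBoosts()));
    }
}
```

Running main prints url:nutch^1.0 anchor:nutch^1.0 content:nutch^1.0 title:nutch^1.0 host:nutch^2.0, so a match in the host field counts twice as much as a match anywhere else.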
Regards
Piotr
Re: Configurable boost
Posted by Stefan Groschupf <sg...@media-style.com>.
Piotr,
it is the combination of factors that improves ranking, so you have my
+1 for the patch.
However, since Nutch allows adding custom fields via index filters, it
would be great to see the patch made more generic, covering fields in general.
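One way the more generic version could look is sketched below. It assumes that the key pattern query.<field>.boost from the patch is simply extended to any field name, and it uses java.util.Properties as a stand-in for NutchConf so that the example runs on its own. FieldBoosts, keyFor and boostFor are hypothetical names, not existing Nutch API.

```java
import java.util.Properties;

/** Hypothetical helper: resolves a boost for any field name, custom or built in. */
class FieldBoosts {

    /** Used when a field has no configured boost (an assumption, not from the patch). */
    static final float DEFAULT_BOOST = 1.0f;

    /** Builds the config key for a field, e.g. "host" gives "query.host.boost". */
    static String keyFor(String field) {
        return "query." + field + ".boost";
    }

    /** Returns the configured boost, or DEFAULT_BOOST if missing or not a number. */
    static float boostFor(Properties conf, String field) {
        String raw = conf.getProperty(keyFor(field));
        if (raw == null) {
            return DEFAULT_BOOST;
        }
        try {
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_BOOST;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("query.host.boost", "2.0");
        conf.setProperty("query.mytag.boost", "3.5"); // "mytag" is a made-up custom field
        for (String field : new String[] {"host", "mytag", "content"}) {
            System.out.println(field + " -> " + boostFor(conf, field));
        }
    }
}
```

With a lookup like this, a field added by an index filter would pick up a boost from the config file without further source changes, and fields without an entry would fall back to the default.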
Thanks.
Stefan
On 19.04.2005 at 23:23, Piotr Kosiorowski wrote:
> Index: conf/nutch-default.xml
> ===================================================================
> --- conf/nutch-default.xml (revision 161968)
> +++ conf/nutch-default.xml (working copy)
> @@ -669,4 +669,43 @@
> </description>
> </property>
>
> +<!-- query-basic plugin properties -->
> +
> + <property>
> + <name>query.url.boost</name>
> + <value>4.0</value>
> + <description> Used as a boost for url field in Lucene query.
> + </description>
> + </property>
> +
> + <property>
> + <name>query.anchor.boost</name>
> + <value>2.0</value>
> + <description> Used as a boost for anchor field in Lucene query.
> + </description>
> + </property>
> +
> +
> + <property>
> + <name>query.title.boost</name>
> + <value>1.5</value>
> + <description> Used as a boost for title field in Lucene query.
> + </description>
> + </property>
> +
> + <property>
> + <name>query.host.boost</name>
> + <value>2.0</value>
> + <description> Used as a boost for host field in Lucene query.
> + </description>
> + </property>
> +
> + <property>
> + <name>query.phrase.boost</name>
> + <value>1.0</value>
> + <description> Used as a boost for phrase in Lucene query.
> + Multiplied by boost for field phrase is matched in.
> + </description>
> + </property>
> +
> </nutch-conf>
> Index: src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java
> ===================================================================
> --- src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java (revision 161968)
> +++ src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java (working copy)
> @@ -26,6 +26,7 @@
> import org.apache.nutch.searcher.QueryFilter;
> import org.apache.nutch.searcher.Query;
> import org.apache.nutch.searcher.Query.*;
> +import org.apache.nutch.util.NutchConf;
>
> import java.io.IOException;
> import java.util.HashSet;
> @@ -33,15 +34,24 @@
> /** The default query filter. Query terms in the default query field are
> * expanded to search the url, anchor and content document fields.*/
> public class BasicQueryFilter implements QueryFilter {
> +
> + private static float URL_BOOST = NutchConf.get().getFloat(
> + "query.url.boost", 4.0f);
>
> - private static float URL_BOOST = 4.0f;
> - private static float ANCHOR_BOOST = 2.0f;
> - private static float TITLE_BOOST = 1.5f;
> - private static float HOST_BOOST = 2.0f;
> + private static float ANCHOR_BOOST = NutchConf.get().getFloat(
> + "query.anchor.boost", 2.0f);
>
> - private static int SLOP = Integer.MAX_VALUE;
> - private static float PHRASE_BOOST = 1.0f;
> + private static float TITLE_BOOST = NutchConf.get().getFloat(
> + "query.title.boost", 1.5f);
>
> + private static float HOST_BOOST = NutchConf.get().getFloat(
> + "query.host.boost", 2.0f);
> +
> + private static int SLOP = Integer.MAX_VALUE;
> +
> + private static float PHRASE_BOOST = NutchConf.get().getFloat(
> + "query.phrase.boost", 1.0f);
> +
> private static final String[] FIELDS =
> { "url", "anchor", "content", "title", "host" };
>
>
Re: Configurable boost
Posted by Doug Cutting <cu...@nutch.org>.
Piotr Kosiorowski wrote:
> During my experiment with adding title and host to scoring I have
> changed nutch source to allow setting boost values for fields in nutch
> config file. I am attaching patch for latest SVN version.
Thanks! This has long been needed.
I just committed your patch, with a few indentation changes.
Cheers,
Doug