Posted to user@nutch.apache.org by Dima Gritsenko <di...@ekreative.com> on 2006/10/03 15:35:13 UTC

category string gets matched as a term

Hi, 

I have categorized web sites during the crawl so that I can offer filtered results, similar to Google's Video and Images tabs.
 
But when I enter
category:video MySearchString
Nutch matches both "video" and MySearchString as terms. The filtering itself works (only links to pages categorized as video are displayed), but the results are not relevant because the string "video" is matched as a search term as well.

How do I keep the category string from being matched as a term during the search?

Great thanks. 
Dima. 

Re: category string gets matched as a term

Posted by Alvaro Cabrerizo <to...@gmail.com>.
It looks like your syntax is correct (category:video searchString). Try
adding a LOG.info line to
org.apache.nutch.searcher.LuceneQueryOptimizer (line 178), just at the
beginning of the optimize method:

public TopDocs optimize(BooleanQuery original,
                        Searcher searcher, int numHits,
                        String sortField, boolean reverse)
    throws IOException {
  // Log the query exactly as it reaches Lucene
  LOG.info("Query -> " + original.toString());

Recompile Nutch and run a query, for example category:video funny. If your
category plugin works correctly, you'll get an info line in hadoop.log similar
to this:

+(url:funny^0.0 anchor:funny^0.0 content:funny title:funny^0.0
host:funny^0.0) +category:video

The first part, +(url:funny^0.0 anchor:funny^0.0 content:funny
title:funny^0.0 host:funny^0.0), means that funny must appear in at least
one of those fields (url, anchor...). The second part filters the results
to keep only the ones tagged as video.
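
To make the expected split concrete, here is a minimal, self-contained Java sketch of the behaviour described above. It is only an illustrative model, not Nutch's actual query parser or the category plugin: the class name QuerySplitSketch, the render method, and the hard-coded field list are assumptions made for this example, and the ^0.0 boosts from the log line are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; this is NOT Nutch's query parser. */
final class QuerySplitSketch {
    // Assumption: default fields copied from the log line in this thread.
    static final String[] DEFAULT_FIELDS = {"url", "anchor", "content", "title", "host"};
    // Assumption: the single raw field, matching raw-fields="category".
    static final String RAW_FIELD = "category";

    /** Renders a user query as required clauses, plain terms first, filters last. */
    static String render(String userQuery) {
        List<String> termClauses = new ArrayList<>();
        List<String> filterClauses = new ArrayList<>();
        for (String token : userQuery.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields one empty token
            }
            if (token.startsWith(RAW_FIELD + ":")) {
                // A raw-field clause only restricts the result set;
                // its value is never expanded into the default fields.
                filterClauses.add("+" + token);
            } else {
                // A plain term must match at least one default field.
                StringBuilder sb = new StringBuilder("+(");
                for (int i = 0; i < DEFAULT_FIELDS.length; i++) {
                    if (i > 0) {
                        sb.append(' ');
                    }
                    sb.append(DEFAULT_FIELDS[i]).append(':').append(token);
                }
                termClauses.add(sb.append(')').toString());
            }
        }
        List<String> all = new ArrayList<>(termClauses);
        all.addAll(filterClauses);
        return String.join(" ", all);
    }

    public static void main(String[] args) {
        System.out.println(render("category:video funny"));
    }
}
```

In this model the word video ends up only in the +category:video clause. If the logged query instead shows something like content:video inside the first group, the category field is probably not being handled as a raw field.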

In your case it looks like the word video is being included in the first
part. Check that your plugin implementation is correct, and that plugin.xml
and build.xml are correct. Your plugin.xml should look similar to this:

...
<extension id="..."
           name="...."
           point="org.apache.nutch.searcher.QueryFilter">
   <implementation id="..." class="....">
      <parameter name="raw-fields" value="category"/>
   </implementation>
</extension>

Hope it helps.
