Posted to dev@nutch.apache.org by Tomas Ukkonen <to...@helsinki.fi> on 2009/03/24 18:04:35 UTC
Problems writing QueryFilter plugin
Hi,
I have been trying to write a URL exclusion filter plugin (QueryFilter)
that drops listed URLs from the search results, but for some reason it
doesn't seem to work.
The plugin is configured to process 'url' fields.
I have checked that it is correctly loaded by nutch(wax).
My filter() function is:
--->>--->>--->>--->>--->>--->>--->>--->>--->>--->>--->>--->>--->>--->>
public BooleanQuery filter(Query input, BooleanQuery output)
    throws QueryException
{
    // Add one MUST_NOT clause per URL in the exclusion list.
    ListIterator iter = exclusionList.listIterator();
    while (iter.hasNext()) {
        String url = (String) iter.next();
        Term term = new Term(URL_FIELD, url);
        org.apache.lucene.search.Query urlExclusion = new TermQuery(term);
        urlExclusion.setBoost(0.0f); // I have also tried to use 1.0
        debug("excluded term: '" + term.field() + "' : '" + term.text() + "'");
        output.add(urlExclusion, BooleanClause.Occur.MUST_NOT);
    }
    return output;
}
<<---<<---<<---<<---<<---<<---<<---<<---<<---<<---<<---<<---<<---<<---<<
Also, the debug() call tells me that the terms have the correct form
Term("url", "http://www.domain.com:1234/dir/file.txt"). But for some
reason the excluded URLs still appear in the search results.
* Can someone tell me the reason for this?
Currently the plugin appears to have no effect on the results at all.
If I exclude URLs explicitly by doing a query like:
'<some> <search> <terms> -url:"http://www.domain.com:1234/dir/file.txt"'
then the listed URLs are correctly filtered away from results.
The Nutch version I'm using is nutch-1.0-dev. I'm quite sure that
all related settings in nutch-site.xml and plugin.xml are correct.
Thanks in advance,
--
Tomas Ukkonen
Information Systems Specialist
Kansalliskirjasto /
The National Library of Finland
phone +358-50-4150557
email tomas.ukkonen@helsinki.fi
www http://www.kansalliskirjasto.fi
http://www.nationallibrary.fi