Posted to user@nutch.apache.org by Adelaida Lejarazu <al...@gmail.com> on 2011/07/29 14:51:20 UTC

Nutch filters

Hello,

I'm quite new to Nutch and, for the moment, I have successfully
integrated the Nutch crawling code into my Java application. I have the
configuration files under the /conf directory, and my problem is with the
*regex-urlfilter.txt* file, where I put the URL filtering rules. I want to
crawl a digital newspaper website, and the filter for it is:
*+^http://www.elcorreo.com/.*?/20110729/.*?\.html*
This regular expression changes every day (as the date is part of it :)). I
want to run the crawl every day, but I don't want to update this
file manually each time, since I have more filters like this.

I could use Java IO classes to update the file, but is there something in
Nutch like "getFilters" and "setFilters", or a way to write the current date
in a filter, i.e. something like
*+^http://www.elcorreo.com/.*?/getDate/.*?\.html*?
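
For illustration, the Java IO approach mentioned above could look roughly like
the sketch below. It does not use any Nutch API. The class name
DatedUrlFilterWriter, the {DATE} placeholder and the idea of keeping a separate
template file are all assumptions made for this example. It also assumes the
generated regex-urlfilter.txt is written before the crawl starts, so that the
filter picks up the new rules when it loads the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: expands a date placeholder in
// filter rules and writes the result to the file the crawler will read.
class DatedUrlFilterWriter {

    // Placeholder used in the template rules (an assumption of this sketch).
    static final String DATE_TOKEN = "{DATE}";

    // Date format used in the newspaper URLs, e.g. 20110729.
    static final DateTimeFormatter YYYYMMDD = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Replaces every {DATE} in one rule with the given date.
    // String.replace is a literal replacement, so regex backslashes are untouched.
    static String expand(String templateLine, LocalDate date) {
        return templateLine.replace(DATE_TOKEN, date.format(YYYYMMDD));
    }

    // Reads the template file, expands each line, and writes the target file.
    static void writeFilterFile(Path template, Path target, LocalDate date) throws IOException {
        List<String> expanded = new ArrayList<>();
        for (String line : Files.readAllLines(template, StandardCharsets.UTF_8)) {
            expanded.add(expand(line, date));
        }
        Files.write(target, expanded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints the rule from this message with today's date filled in.
        String rule = "+^http://www.elcorreo.com/.*?/{DATE}/.*?\\.html";
        System.out.println(expand(rule, LocalDate.now()));
    }
}
```

With a template file (say conf/regex-urlfilter.template.txt, a made-up name)
holding rules such as +^http://www.elcorreo.com/.*?/{DATE}/.*?\.html, the
application would call writeFilterFile(template, target, LocalDate.now()) once
per day before launching the crawl.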




Thanks