Posted to dev@nutch.apache.org by Gaurav Agarwal <ga...@yahoo.com> on 2007/04/08 20:44:04 UTC

Nutch HTMLParseFilters

Hi,

I recently started using Nutch for an academic research project that
involves crawling a particular kind of web page.

While crawling, I did not need to fetch bmp, jpeg, mp3, etc. files, so I
updated my url-filter property file to block them. However, this did not
stop these URLs (to jpeg etc.) from showing up as outlinks of a valid HTML
page. In fact, because I had limited the number of outgoing links per page
to 100, these useless URLs occupied the available slots and kept a few valid
HTML pages from being fetched (of course, this can be worked around by
raising the threshold on #outlinks/page).
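
The slot problem described above can be illustrated with a minimal sketch in plain Java. It does not use any Nutch API: the class name, the method names, and the suffix list (including the added ".jpg") are assumptions made up for the example. It only shows that applying the per-page limit before dropping the unwanted URLs can leave no room for valid pages, while dropping them first does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: plain Java, no Nutch classes involved.
final class OutlinkLimitSketch {

    // Hypothetical suffix list (bmp, jpeg, mp3 from the text, plus jpg).
    private static final String[] UNWANTED_SUFFIXES = {".bmp", ".jpeg", ".jpg", ".mp3"};

    // True if the URL, ignoring any query string or fragment, ends in an unwanted suffix.
    static boolean isUnwanted(String url) {
        String path = url.toLowerCase(Locale.ROOT);
        int cut = path.length();
        int query = path.indexOf('?');
        if (query >= 0) cut = Math.min(cut, query);
        int fragment = path.indexOf('#');
        if (fragment >= 0) cut = Math.min(cut, fragment);
        path = path.substring(0, cut);
        for (String suffix : UNWANTED_SUFFIXES) {
            if (path.endsWith(suffix)) return true;
        }
        return false;
    }

    // Returns a new list without the unwanted URLs, preserving order.
    static List<String> dropUnwanted(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (!isUnwanted(url)) kept.add(url);
        }
        return kept;
    }

    // Returns at most `limit` outlinks, taken from the front of the list.
    static List<String> firstN(List<String> outlinks, int limit) {
        int n = Math.max(0, Math.min(limit, outlinks.size()));
        return new ArrayList<>(outlinks.subList(0, n));
    }

    // The situation in the mail: the limit is applied first, so images use up slots.
    static List<String> limitThenFilter(List<String> outlinks, int limit) {
        return dropUnwanted(firstN(outlinks, limit));
    }

    // Pruning at parse time: unwanted URLs are gone before the limit is applied.
    static List<String> filterThenLimit(List<String> outlinks, int limit) {
        return firstN(dropUnwanted(outlinks), limit);
    }

    public static void main(String[] args) {
        List<String> outlinks = new ArrayList<>();
        outlinks.add("http://example.com/a.jpg");
        outlinks.add("http://example.com/b.bmp");
        outlinks.add("http://example.com/c.mp3");
        outlinks.add("http://example.com/page1.html");
        outlinks.add("http://example.com/page2.html");

        System.out.println(limitThenFilter(outlinks, 3)); // no HTML page survives
        System.out.println(filterThenLimit(outlinks, 3)); // both HTML pages survive
    }
}
```

With a limit of 3 standing in for the limit of 100, the first call returns an empty list and the second returns both HTML pages.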

I went ahead and created a filter for the HTMLParseFilter extension point to
throw away these unwanted URLs at parse time itself. I also modified the
HTMLParseFilter class to execute the filters in a particular order, given by
a new property introduced in nutch-site.xml. This was done because I wanted
the pruning to happen after all the other HTMLParseFilters have executed
(e.g. in the case of JSParseFilter).
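
The ordering idea can be sketched roughly as follows. This is not the code described in the mail and it does not use Nutch's plugin classes: filters are modelled as plain functions on a list of outlink URLs, and the class name, the filter names ("js", "prune"), and the whitespace-separated format of the order string are all assumptions for illustration. Filters named in the order string run first, in that order; any filter not named is appended afterwards in registration order, which is one possible design choice among several.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Illustration only: filters are plain functions, not Nutch plugin instances.
final class OrderedFilterChain {

    // Builds the execution order from a whitespace-separated list of filter names
    // (a hypothetical property value). Unknown names are ignored; filters that are
    // registered but not named are appended in registration order.
    static List<UnaryOperator<List<String>>> order(
            Map<String, UnaryOperator<List<String>>> registered, String orderProperty) {
        List<UnaryOperator<List<String>>> ordered = new ArrayList<>();
        Set<String> placed = new LinkedHashSet<>();
        if (orderProperty != null) {
            for (String name : orderProperty.trim().split("\\s+")) {
                UnaryOperator<List<String>> filter = registered.get(name);
                if (filter != null && placed.add(name)) {
                    ordered.add(filter);
                }
            }
        }
        for (Map.Entry<String, UnaryOperator<List<String>>> entry : registered.entrySet()) {
            if (placed.add(entry.getKey())) {
                ordered.add(entry.getValue());
            }
        }
        return ordered;
    }

    // Runs the filters one after another, feeding each the previous result.
    static List<String> run(List<UnaryOperator<List<String>>> chain, List<String> outlinks) {
        List<String> current = outlinks;
        for (UnaryOperator<List<String>> filter : chain) {
            current = filter.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<List<String>>> registered = new LinkedHashMap<>();
        // Stand-in for a filter that discovers extra outlinks (as a JS parser might).
        registered.put("js", links -> {
            List<String> out = new ArrayList<>(links);
            out.add("http://example.com/pop.mp3");
            return out;
        });
        // Stand-in for the pruning filter: drops .mp3 outlinks.
        registered.put("prune", links -> {
            List<String> out = new ArrayList<>();
            for (String url : links) {
                if (!url.endsWith(".mp3")) out.add(url);
            }
            return out;
        });

        List<String> outlinks = new ArrayList<>();
        outlinks.add("http://example.com/a.html");

        // Pruning last removes the link added by the "js" stand-in.
        System.out.println(run(order(registered, "js prune"), outlinks));
        // Pruning first lets that link slip through.
        System.out.println(run(order(registered, "prune js"), outlinks));
    }
}
```

Running the pruning step last is what makes it see outlinks contributed by the other filters; with the order reversed, the late-added .mp3 link survives.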

Now, I am posting this mail to ask whether this feature is already present
and I just did all this redundantly, or, if it is not present in the core,
whether it would make sense for anyone else to have it. I can send the code
back to the developers to put in the core if they find it useful (it was
trivially easy thanks to Nutch's very simple plugin architecture, and anyone
could implement it anyway).

Thanks,
Gaurav

--
View this message in context: http://www.nabble.com/Nutch-HTMLParseFilters-tf3544419.html#a9894610