Posted to dev@nutch.apache.org by naveen shukla <na...@gmail.com> on 2013/04/22 14:06:28 UTC

Regarding URL Filtering and Politeness

Hi All,

I am a developer who wants to write a Nutch plugin that uses JSOUP to parse
HTML files. To get a better feel for it, I first need to understand the
overall functionality.

So far I have found URLFilter, URLFilterChecker and URLFilters.java, but I
get confused when I also see files such as RegexURLFilter and PrefixURLFilter.

Could anybody tell me exactly which Java files handle URL filtering and
politeness in the crawler?

Awaiting a positive reply.

Thanks in advance.

From:

Naveen Shukla

Re: Regarding URL Filtering and Politeness

Posted by feng lu <am...@gmail.com>.
Hi
URLFilter is part of Nutch's plugin system. Its implementations limit the
URLs that Nutch attempts to fetch [0]. For an overview of the plugin system,
see [1]; for how to write a plugin, see [2]. You can set the plugin.includes
property in nutch-site.xml to include whichever plugins you want. The default
value of that property is in nutch-default.xml.
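
To make the relationship between the classes in the question a bit more
concrete: URLFilter is the interface, RegexURLFilter and PrefixURLFilter are
implementations of it, and URLFilters runs the active filters one after
another. Below is a minimal, self-contained Java sketch of that idea. It does
not use the Nutch classes; SketchUrlFilter, SketchPrefixFilter,
SketchRegexFilter and SketchFilterChain are hypothetical names made up for
illustration, and the convention assumed here is that a filter returns the
URL to accept it and null to reject it.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a URL filter: return the URL to keep it,
// or null to reject it. This is NOT the real Nutch interface.
interface SketchUrlFilter {
    String filter(String url);
}

// Accepts only URLs that start with one of the configured prefixes.
class SketchPrefixFilter implements SketchUrlFilter {
    private final List<String> prefixes;

    SketchPrefixFilter(List<String> prefixes) {
        this.prefixes = prefixes;
    }

    public String filter(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null;
    }
}

// Rejects any URL in which the exclusion pattern is found.
class SketchRegexFilter implements SketchUrlFilter {
    private final Pattern exclude;

    SketchRegexFilter(String excludeRegex) {
        this.exclude = Pattern.compile(excludeRegex);
    }

    public String filter(String url) {
        return exclude.matcher(url).find() ? null : url;
    }
}

// Applies every filter in order and stops at the first rejection.
class SketchFilterChain {
    private final List<SketchUrlFilter> filters;

    SketchFilterChain(List<SketchUrlFilter> filters) {
        this.filters = filters;
    }

    String filter(String url) {
        String current = url;
        for (SketchUrlFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // one rejection is enough to drop the URL
            }
        }
        return current;
    }
}
```

In this sketch, a chain built from a prefix filter for "http://example.org/"
and a regex filter excluding "\\.(gif|jpg|png)$" would keep
http://example.org/index.html and drop http://example.org/logo.png.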

For politeness, see the fetcher properties in nutch-default.xml. For example,
fetcher.server.delay controls the number of seconds the fetcher waits between
successive requests to the same server, and fetcher.threads.fetch controls
the number of FetcherThreads the fetcher uses.
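
To illustrate what a per-server delay means in practice, here is a small
Java sketch. It is not the Nutch fetcher code: PolitenessSketch and reserve
are hypothetical names, the 5.0 second delay in the usage note is only an
example value, and as a simplification the delay is counted from the start
of the previous request to the same host.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host politeness tracker; not part of Nutch.
class PolitenessSketch {
    private final long delayMillis;
    // For each host, the earliest time (in ms) the next request may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(double delaySeconds) {
        this.delayMillis = (long) (delaySeconds * 1000);
    }

    // Returns the time at which a request to the host may start and
    // reserves that slot. Synchronized so several threads can share it.
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + delayMillis);
        return start;
    }
}
```

With new PolitenessSketch(5.0), a second request to the same host made one
second after the first is pushed back to the five second mark, while a
request to a different host can start immediately.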


[0] http://wiki.apache.org/nutch/AboutPlugins
[1] http://wiki.apache.org/nutch/PluginCentral
[2] http://wiki.apache.org/nutch/WritingPluginExample




-- 
Don't Grow Old, Grow Up... :-)