Posted to dev@nutch.apache.org by Ferdy <fe...@kalooga.com> on 2010/07/20 14:30:06 UTC

Component fetching during parsing. (vertical crawling)

Hello,

We are currently using a heavily modified version of Nutch. The main 
reason for this is that we fetch not only the urls that the 
QueueFeeder submits, but also additional resources from urls that are 
constructed during parsing. For example, let's say the QueueFeeder 
submits an html page to the fetcher, and after the fetch the page gets 
parsed. Nothing special so far. However, the parser decides it also needs 
some images on the page. Perhaps these images link to other html pages, 
and we might want to fetch those too. All of this is needed to parse 
information about the particular url we started with. These extra fetch 
urls we like to call Components, because they are additional resources 
required to parse the initial html page that was selected for fetching.

At first we tried to solve this "vertical crawling" problem by using 
multiple crawl cycles. Each crawl simply selects the outlinks that are 
needed for the parsing of the initial html page. A single inspection can 
span 2, 3 or 4 cycles (depending on the inspection's graph depth). 
There are several problems with this approach: for one, the crawldb is 
cluttered with all these component urls, and secondly, inspection 
completion times can be very long.

As an alternative we decided to let the parser fetch the needed components 
on-the-fly, so that additional urls are instantly added to the fetcher 
lists. Every fetched url is either a non-component (the QueueFeeder 
fed it; start parsing this resource) or a component (the fetcher 
hands the resource over to the parser that requested it). In order to 
keep parsers alive we always try to fetch components first, while 
respecting fetch politeness. A downside of this solution is that the 
total running time of a fetch task is more difficult to anticipate. For 
example, if you inject and generate 100 urls and they are fetched in a 
single task, you might end up fetching a total of 1100 urls (assuming 
each inspection needs 10 components). We found this behaviour to be 
acceptable.
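
To sketch the idea (illustrative Java only, not our actual patch; the 
class and field names below are made up), a fetch item could carry a 
flag saying whether it is a component and which inspection requested 
it, and the per-host queue would simply serve component items first:

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Illustrative only: a queue entry that knows whether it is a component
  // and, if so, which inspection (initial page) is waiting for it.
  class FetchRequest {
    final String url;
    final boolean component;       // true if a parser requested it
    final String inspectionUrl;    // the page whose parse needs it, or null

    FetchRequest(String url, boolean component, String inspectionUrl) {
      this.url = url;
      this.component = component;
      this.inspectionUrl = inspectionUrl;
    }
  }

  // Illustrative queue policy: hand out component requests before regular
  // ones, so waiting parser threads are unblocked as soon as the usual
  // per-host politeness allows another fetch.
  class ComponentAwareQueue {
    private final Deque<FetchRequest> components = new ArrayDeque<FetchRequest>();
    private final Deque<FetchRequest> regular = new ArrayDeque<FetchRequest>();

    public synchronized void add(FetchRequest r) {
      (r.component ? components : regular).addLast(r);
    }

    public synchronized FetchRequest next() {
      return components.isEmpty() ? regular.pollFirst() : components.pollFirst();
    }
  }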

Because of our custom version of Nutch we cannot upgrade easily to newer 
versions (we're still using modified fetcher classes from Nutch 0.9). 
Often we end up fixing bugs that have already been fixed by the 
community. Conversely, other users might benefit from our changes.

Therefore we propose to redesign our vertical crawling system from 
scratch for the newer Nutch versions, should there be any interest from 
the community. Perhaps we are not the only ones to have implemented such 
a system with Nutch. So, what are your thoughts about this?

Ferdy.

Re: Component fetching during parsing. (vertical crawling)

Posted by Ferdy <fe...@kalooga.com>.
That's right. Most of the modifications are in the Fetcher code.

In our use case, components can be direct elements (images, for 
example) but can also be linked-to pages that contain their own 
components. This introduces yet another layer in the link graph. Also, 
we don't necessarily need to fetch ALL elements, merely the ones the 
parser is interested in. We implemented some business-specific logic 
that uses an html parser to extract the urls that need to be fetched in 
order to complete an inspection.
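
As a much simplified stand-in for that logic (the real code is business 
specific and uses a proper html parser), something like this captures 
the flavour:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Simplified illustration only: pull img src attributes out of a fetched
  // page; these would become component urls for the current inspection.
  class ComponentUrlExtractor {
    private static final Pattern IMG_SRC =
        Pattern.compile("<img[^>]+src=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractImageUrls(String html) {
      List<String> urls = new ArrayList<String>();
      Matcher m = IMG_SRC.matcher(html);
      while (m.find()) {
        urls.add(m.group(1));
      }
      return urls;
    }
  }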

Just for comparison, for a single url, normally you would have a data 
flow in the fetcher like this:

QueueFeeder ---> Fetcher ---> Parser [perhaps, if enabled] ---> 
OutputCommitter.

Whereas we introduced a flow that enables us to do:

QueueFeeder ---> Fetcher ---> Parser (inspect) ---> Fetcher (components 
in service of inspection) ---> Parser (handle components) ---> 
OutputCommitter.

To sum it up, we introduced a hook in the fetcher that is callable from 
within parsing code.
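
To give you an idea of its shape (the names below are illustrative, not 
the actual code), the hook is essentially:

  import java.io.IOException;

  // Illustrative only: the hook the fetcher hands to the parser. The parser
  // calls it whenever the inspection needs an extra resource; the call
  // returns once the component has been fetched (per-host politeness is
  // still handled by the fetcher), so the parsing thread simply continues
  // with the component content in hand.
  interface ComponentFetchHook {
    byte[] fetchComponent(String componentUrl) throws IOException;
  }

  // Inside parsing code, roughly:
  //   byte[] image = hook.fetchComponent("http://www.example.com/logo.png");
  //   ... use it to complete the inspection, possibly requesting more ...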

Andrzej Bialecki wrote:
> On 2010-07-20 14:30, Ferdy wrote:
>> [...]
>
> If I understand your use case properly, this is really a custom Fetcher
> that you are talking about - a strategy to fetch complete pages
> (together with its resources that relate to the display of the page)
> should be possible to implement in a custom fetcher without changing
> other Nutch areas.

Re: Component fetching during parsing. (vertical crawling)

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-07-20 14:30, Ferdy wrote:
> [...]

If I understand your use case properly, this is really a custom Fetcher
that you are talking about - a strategy to fetch complete pages
(together with the resources that relate to the display of the page)
should be possible to implement in a custom fetcher without changing
other Nutch areas.
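
As a rough illustration of that direction (plain Java, nothing 
Nutch-specific, names made up), such a strategy boils down to fetching 
the page and then its resources in one place, so the rest of the crawl 
never has to see the component urls:

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.InputStream;
  import java.net.URL;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;

  // Rough illustration only: fetch a page plus the resources it needs,
  // all inside the fetcher.
  class CompletePageFetcher {

    Map<String, byte[]> fetchCompletePage(String pageUrl, List<String> resourceUrls)
        throws IOException {
      Map<String, byte[]> contents = new LinkedHashMap<String, byte[]>();
      contents.put(pageUrl, download(pageUrl));
      for (String resourceUrl : resourceUrls) {
        contents.put(resourceUrl, download(resourceUrl));
      }
      return contents;
    }

    private byte[] download(String url) throws IOException {
      InputStream in = new URL(url).openStream();
      try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
          out.write(buf, 0, n);
        }
        return out.toByteArray();
      } finally {
        in.close();
      }
    }
  }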


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com