Posted to dev@nutch.apache.org by Ferdy <fe...@kalooga.com> on 2010/07/20 14:30:06 UTC
Component fetching during parsing. (vertical crawling)
Hello,
We are currently using a heavily modified version of Nutch. The main
reason for this is that we fetch not only the urls that the QueueFeeder
submits, but also additional resources from urls that are constructed
during parsing. For example, say the QueueFeeder submits an html page
to the fetcher, and after the fetch the page gets parsed. Nothing
special so far. However, the parser decides it also needs some images
on the page. Perhaps these images link to other html pages, and we
might want to fetch those too. All of this is needed to parse
information about the particular url we started with. We like to call
these extra fetch urls Components, because they are additional
resources required to parse the initial html page that was selected
for fetching.
At first we tried to solve this "vertical crawling" problem by using
multiple crawl cycles, where each crawl simply selects the outlinks
that are needed to parse the initial html page. A single inspection
can span 2, 3 or 4 cycles (depending on the inspection's graph depth).
There are several problems with this approach: for one, the crawldb is
cluttered with all these component urls, and secondly, inspection
completion times can be very long.
As an alternative we decided to let the parser fetch the needed
components on the fly, so that additional urls are instantly added to
the fetcher lists. Every fetched url is either a non-component (the
QueueFeeder fed it; start parsing this resource) or a component (the
fetcher hands the resource over to the parser that requested it). In
order to keep parsers alive we always try to fetch components first,
while still respecting fetch politeness. A downside of this solution
is that the total running time of a fetch task becomes more difficult
to anticipate. For example, if you inject and generate 100 urls and
they are fetched in a single task, you might end up fetching a total
of 1100 urls (assuming each inspection needs 10 components). We found
this behaviour to be acceptable.
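The components-first scheduling described above could be sketched
roughly like this (a minimal, hypothetical sketch in plain Java;
ComponentAwareQueue and its methods are illustrative names only, not
the actual Nutch Fetcher API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of a fetch queue that always serves component urls
// before QueueFeeder urls, so that parsers blocked on a missing
// component make progress first. All names here are illustrative;
// this is not the actual Nutch Fetcher API.
class ComponentAwareQueue {
    // urls requested on-the-fly by parsers (components)
    private final Deque<String> components = new ArrayDeque<>();
    // urls fed by the QueueFeeder (non-components)
    private final Deque<String> regular = new ArrayDeque<>();

    synchronized void addRegular(String url)   { regular.addLast(url); }
    synchronized void addComponent(String url) { components.addLast(url); }

    // Components first. Per-host politeness (crawl delays) would be
    // enforced by the caller before actually fetching the returned url.
    synchronized String next() {
        if (!components.isEmpty()) return components.pollFirst();
        return regular.pollFirst(); // null when both queues are empty
    }
}
```

In a real fetcher this would sit behind the per-host queues, so that
politeness is decided per host and the components-first rule only
decides ordering within a host's queue.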
Because of our custom version of Nutch we cannot easily upgrade to
newer versions (we are still using modified fetcher classes from Nutch
0.9). We often end up fixing bugs that have already been fixed by the
community. Also, other users might benefit from our changes too.
Therefore we propose to redesign our vertical crawling system from
scratch for the newer Nutch versions, should there be any interest
from the community. Perhaps we are not the only ones to implement such
a system with Nutch. So, what are your thoughts about this?
Ferdy.
Re: Component fetching during parsing. (vertical crawling)
Posted by Ferdy <fe...@kalooga.com>.
That's right. The most modified part is the Fetcher code.
In our use case, components can either be direct elements (images, for
example) or linked-to pages that contain their own components, which
introduces yet another layer in the link graph. Also, we don't
necessarily need to fetch ALL elements, merely the ones the parser is
interested in. We implemented some business-specific logic that uses
an html parser to extract the urls that need to be fetched in order to
complete an inspection.
Just for comparison, for a single url, normally you would have a data
flow in the fetcher like this:

  QueueFeeder ---> Fetcher ---> Parser [perhaps, if enabled]
              ---> OutputCommitter

Whereas we introduced a flow that enables us to do:

  QueueFeeder ---> Fetcher ---> Parser (inspect)
              ---> Fetcher (components in service of the inspection)
              ---> Parser (handle components) ---> OutputCommitter

To sum it up, we introduced a hook in the fetcher that is callable
from within parsing code.
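A rough shape for such a hook might look like the following (a
hypothetical sketch; ComponentFetcher and InspectionParser are
illustrative names, not actual Nutch classes, and the url extraction
is a stand-in for the real html-parser logic):

```java
// Hypothetical sketch of the hook described above: the fetcher hands
// each parser a callback through which it can request components and
// receive their content before completing the inspection. All names
// are illustrative; this is not the actual Nutch API.
interface ComponentFetcher {
    byte[] fetchComponent(String url); // blocks until the component is fetched
}

class InspectionParser {
    private final ComponentFetcher fetcher;

    InspectionParser(ComponentFetcher fetcher) { this.fetcher = fetcher; }

    // Parse the initial page; for every component url the business
    // logic selects, call back into the fetcher. Returns the number of
    // components fetched for this inspection.
    int inspect(String html) {
        int fetched = 0;
        for (String url : extractComponentUrls(html)) {
            byte[] content = fetcher.fetchComponent(url);
            if (content != null) fetched++;
        }
        return fetched;
    }

    // Stand-in for the html-parser-based extraction logic; a real
    // implementation would walk the DOM and apply business rules.
    private String[] extractComponentUrls(String html) {
        return html.contains("<img")
            ? new String[] { "http://example.com/a.png" }
            : new String[0];
    }
}
```

The key point is the inversion: the parser calls back into the fetcher
mid-parse, instead of emitting outlinks for a later crawl cycle.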
Andrzej Bialecki wrote:
> On 2010-07-20 14:30, Ferdy wrote:
> [...]
>
> If I understand your use case properly, this is really a custom
> Fetcher that you are talking about - a strategy to fetch complete
> pages (together with the resources needed to display them) should be
> possible to implement in a custom fetcher without changing other
> Nutch areas.
Re: Component fetching during parsing. (vertical crawling)
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-07-20 14:30, Ferdy wrote:
> [...]
If I understand your use case properly, this is really a custom
Fetcher that you are talking about - a strategy to fetch complete
pages (together with the resources needed to display them) should be
possible to implement in a custom fetcher without changing other
Nutch areas.
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com