
Posted to user@nutch.apache.org by Nutch User - 1 <nu...@gmail.com> on 2011/06/22 13:43:10 UTC

Depth-first crawling

As far as I have understood Nutch can be used to do breadth-first
crawling, at least when topN is large enough (<=> every new page gets
selected in the list of fetch candidates?). What about depth-first? Is
there any way to make Nutch perform it?

Re: Depth-first crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

In the wiki there is an example of how to write your own parser plugin
(class). Essentially, by implementing the interface you receive the page
text from Nutch, and from there you can start counting the terms (reinvent
the wheel, copy-paste from Solr code, or find a library to do that).

On Thu, Jun 23, 2011 at 10:47 PM, Nutch User - 1 <nu...@gmail.com> wrote:

> Big thanks for answering in the first place! (This mailing list seems to be
> too passive in my opinion. Are there any other channels for Nutch related
> conversation? I'm relatively new to Nutch and often need help with it.) But
> could you elaborate on your idea? Unfortunately I don't have time at the
> moment to explain in detail what I didn't understand, but I'll get back to
> that later.
>
> What do you mean by partition?
>
By partition I mean split: put each seed in its own file and inject one
file at a time, one per crawl.
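
A minimal sketch of that split, in Python. The function name
partition_seeds and the layout (one directory per seed, each holding a
single seed.txt) are assumptions made for illustration; as far as I know the
Nutch 1.x inject step takes a directory of URL files, which is why each seed
gets its own directory here.

```python
import tempfile
from pathlib import Path


def partition_seeds(seed_urls, out_dir):
    """Write each seed URL to its own directory/file; return the directories."""
    out_dir = Path(out_dir)
    seed_dirs = []
    for i, url in enumerate(seed_urls):
        # Hypothetical layout: one directory per seed, e.g. seed_000/seed.txt
        seed_dir = out_dir / f"seed_{i:03d}"
        seed_dir.mkdir(parents=True, exist_ok=True)
        (seed_dir / "seed.txt").write_text(url.strip() + "\n", encoding="utf-8")
        seed_dirs.append(seed_dir)
    return seed_dirs


if __name__ == "__main__":
    # Placeholder seeds, not taken from the thread.
    seeds = ["http://example.com/", "http://example.org/", "http://example.net/"]
    with tempfile.TemporaryDirectory() as tmp:
        for d in partition_seeds(seeds, tmp):
            print(d.name, (d / "seed.txt").read_text(encoding="utf-8").strip())
```

Each returned directory would then be injected into its own crawl, one
after the other.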


> Let's say that I would have a graph like this (
> https://sites.google.com/site/testa12b3c/home/Graph.png) to crawl and I
> would start from node A. As far as I have understood DFS crawling would
> occur so that either B or C but not both will be visited/fetched before D.
>
> On the other hand when using BFS both B and C will be visited/fetched before
> D.
>
> I don't see how Nutch provides me tools to crawl in a way that it would
> bypass B or C before visiting/fetching D.
>

Okay, this is quite tricky. My strategy was BFS for the seeds level.
For 'discovered' tree levels you need a way to get the discovered links
(B and C) and to control the subsequent crawldb update and generate steps.
For example, after fetching A and updating the linkdb, you want to know which
URLs were found (by dumping the linkdb, for example). Then you crawl again,
injecting B or C (which you now know) as a seed, and you keep doing that
until you have exhausted the depth of A along one path.
At the end you merge the various crawldbs created in the process.
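
The loop described above can be simulated without Nutch. The following is
only a sketch under assumptions: the link graph is a plain dict (the shape
A -> B, C and B, C -> D is guessed from the discussion, not read from
Graph.png), each "crawl" fetches exactly one injected URL, and a "crawldb"
is just a set of known URLs. The name depth_first_crawl is hypothetical.

```python
def depth_first_crawl(outlinks, seed):
    """Simulate one-URL-at-a-time crawls that follow a single path first.

    outlinks: dict mapping a URL to the list of links found on that page.
    Returns (fetch_order, merged_crawldb).
    """
    fetch_order = []
    crawldbs = []     # one small "crawldb" (set of known URLs) per crawl
    fetched = set()
    path = [seed]     # current path from the seed; last item is injected next

    while path:
        url = path[-1]
        if url not in fetched:
            # "Inject" url, fetch it, and record what this crawl learned.
            fetched.add(url)
            fetch_order.append(url)
            crawldbs.append({url, *outlinks.get(url, [])})
        # "Dump" the discovered links and pick one that is not fetched yet.
        next_url = next(
            (u for u in outlinks.get(url, []) if u not in fetched), None
        )
        if next_url is None:
            path.pop()                # this path is exhausted: back up a level
        else:
            path.append(next_url)     # go one level deeper along this path

    merged = set().union(*crawldbs)   # final merge of the per-crawl crawldbs
    return fetch_order, merged


if __name__ == "__main__":
    # Assumed graph shape, for illustration only.
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    order, merged = depth_first_crawl(graph, "A")
    print(order)
    print(sorted(merged))
```

With the assumed graph, D is fetched after B but before C, which is the
depth-first behaviour asked about. The set of fetched URLs also keeps the
loop from running forever on cyclic links.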

>
> If there's something you don't understand in this message, please tell me
> and I'll try to explain my thoughts better.
>

> On Thu, Jun 23, 2011 at 7:35 AM, Gabriele Kahlout
> <ga...@mysimpatico.com> wrote:
>
> > Yes, partition your seed list. Consider 3 seeds. If I inject all of
> > them and then crawl for a tree depth of 2 then this is what (could)
> > happen:
> >
> > 0: 3 seeds fetched
> > 1: urls found in 0
> > 2: urls found in 1
> >
> > To do breadth-first, inject only 1 of the 3 seeds and crawl:
> >
> > 0: 1 seed fetched
> > 1: urls found in 0
> > 2: urls found in 1
> > (assuming your topN was large enough to exhaust the whole tree)
> >
> > inject seed 2 (of the 3) and so on
> >
> > On Wed, Jun 22, 2011 at 1:43 PM, Nutch User - 1 <nu...@gmail.com>
> > wrote:
> > > As far as I have understood Nutch can be used to do breadth-first
> > > crawling, at least when topN is large enough (<=> every new page gets
> > > selected in the list of fetch candidates?). What about depth-first? Is
> > > there any way to make Nutch perform it?



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: Depth-first crawling

Posted by Nutch User - 1 <nu...@gmail.com>.

Big thanks for answering in the first place! (This mailing list seems to be
too passive in my opinion. Are there any other channels for Nutch related
conversation? I'm relatively new to Nutch and often need help with it.) But
could you elaborate on your idea? Unfortunately I don't have time at the
moment to explain in detail what I didn't understand, but I'll get back to
that later.

What do you mean by partition?

Let's say that I would have a graph like this (
https://sites.google.com/site/testa12b3c/home/Graph.png) to crawl and I
would start from node A. As far as I have understood DFS crawling would
occur so that either B or C but not both will be visited/fetched before D.
On the other hand when using BFS both B and C will be visited/fetched before
D.

I don't see how Nutch provides me tools to crawl in a way that it would
bypass B or C before visiting/fetching D.

If there's something you don't understand in this message, please tell me
and I'll try to explain my thoughts better.

On Thu, Jun 23, 2011 at 7:35 AM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:

> Yes, partition your seed list. Consider 3 seeds. If I inject all of
> them and then crawl for a tree depth of 2 then this is what (could)
> happen:
>
> 0: 3 seeds fetched
> 1: urls found in 0
> 2: urls found in 1
>
> To do breadth-first, inject only 1 of the 3 seeds and crawl:
>
> 0: 1 seed fetched
> 1: urls found in 0
> 2: urls found in 1
> (assuming your topN was large enough to exhaust the whole tree)
>
> inject seed 2 (of the 3) and so on
>
> On Wed, Jun 22, 2011 at 1:43 PM, Nutch User - 1 <nu...@gmail.com>
> wrote:
> > As far as I have understood Nutch can be used to do breadth-first
> > crawling, at least when topN is large enough (<=> every new page gets
> > selected in the list of fetch candidates?). What about depth-first? Is
> > there any way to make Nutch perform it?
> >

Re: Depth-first crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

Yes, partition your seed list. Consider 3 seeds. If I inject all of
them and then crawl for a tree depth of 2 then this is what (could)
happen:

0: 3 seeds fetched
1: urls found in 0
2: urls found in 1

To do breadth-first, inject only 1 of the 3 seeds and crawl:

0: 1 seed fetched
1: urls found in 0
2: urls found in 1
(assuming your topN was large enough to exhaust the whole tree)

inject seed 2 (of the 3) and so on
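
The round-by-round listings above can be reproduced with a small
simulation. This is a sketch under assumptions, not how Nutch is
implemented: the crawldb is an ordered list, "generate" takes the first
top_n unfetched URLs in insertion order (real Nutch ranks candidates by
score), and the link graph and the name crawl_rounds are made up for
illustration.

```python
def crawl_rounds(outlinks, seeds, depth, top_n):
    """Return the URLs fetched in each round (round 0 = the injected seeds)."""
    known = list(dict.fromkeys(seeds))  # stand-in for the crawldb, ordered
    fetched = set()
    rounds = []
    for _ in range(depth + 1):
        # "generate": pick up to top_n known-but-unfetched URLs
        batch = [u for u in known if u not in fetched][:top_n]
        if not batch:
            break
        rounds.append(batch)
        fetched.update(batch)
        # "updatedb": add links discovered in this round
        for u in batch:
            for link in outlinks.get(u, []):
                if link not in known:
                    known.append(link)
    return rounds


if __name__ == "__main__":
    # Hypothetical link graph with three seeds s1, s2, s3.
    graph = {"s1": ["a", "b"], "s2": ["c"], "s3": [], "a": ["d"], "c": ["e"]}
    print(crawl_rounds(graph, ["s1", "s2", "s3"], depth=2, top_n=10))
    print(crawl_rounds(graph, ["s1"], depth=2, top_n=10))
    print(crawl_rounds(graph, ["s1"], depth=2, top_n=1))
```

The last call shows why topN matters: with top_n=1 a round no longer
covers a whole tree level, so the rounds stop matching the levels listed
above.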

On Wed, Jun 22, 2011 at 1:43 PM, Nutch User - 1 <nu...@gmail.com> wrote:
> As far as I have understood Nutch can be used to do breadth-first
> crawling, at least when topN is large enough (<=> every new page gets
> selected in the list of fetch candidates?). What about depth-first? Is
> there any way to make Nutch perform it?
>
