Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/06/03 15:24:50 UTC

Re: comparing nutch with and without hadoop

I'm also interested in calculating/approximating Hadoop's overhead while
crawling. The bottleneck is fetching (without parsing, the default behavior
since some Nutch 1.3 revision).
On a 3-machine cluster, fetching 500 URLs with topN 500 and depth 1 (so at
most 500 URLs) from the same host takes 47 minutes (parsing: 40 s, indexing:
55 s).

So roughly 50 minutes for 500 URLs (I checked that all 500 were indeed
fetched), i.e. about 10 URLs per minute (~3.3 per node). I'm not sure how
much this improves on Ibrahim's 13-minute run, since he doesn't mention how
many URLs were fetched.

As discussed in Re: Index while
crawling <http://www.mail-archive.com/user@nutch.apache.org/msg02304.html>,
I appreciate the ability to crawl and index incrementally, so I may not
scale to fetching all seeds at once (hundreds of thousands) and could
tolerate iterations perhaps up to 3 hours long.
I haven't run the experiment, but since the other iteration tasks run
blazingly fast (e.g. generating: 50 s), I suspect fetching is so slow
because of the politeness policy. I'm fetching from the same host, so I'm
inevitably the victim of fetcher.max.crawl.delay and fetcher.server.delay
in nutch-default.xml.
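If that's right, the politeness delay alone puts a hard floor under the
fetch time, whatever the cluster size. A rough sanity check (my arithmetic,
using the values from nutch-default.xml quoted below):

// Lower bound imposed by fetcher.server.delay when all URLs are on one
// host: 500 requests to the same server means 499 mandatory 5 s pauses.
public class PolitenessFloor {
    public static void main(String[] args) {
        int urls = 500;
        double serverDelaySecs = 5.0;  // fetcher.server.delay
        double floorMin = (urls - 1) * serverDelaySecs / 60.0;
        System.out.printf("floor: ~%.1f min%n", floorMin);  // ~41.6 min
    }
}

So ~42 of the 47 observed minutes would be politeness waiting, leaving only
~5 minutes for everything else, Hadoop overhead included.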

That gives us that crawling from the same host with and without Hadoop is
roughly equivalent. Agreed?
When crawling different hosts incrementally, it's therefore better to mix
URLs from different hosts in the same iteration than to group them by host
(contrary to the belief that establishing new connections costs more than
fetching from the same host). Agreed?
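Extending the same estimate (assuming, as I understand it, that the fetcher
keeps one queue per host and applies the delay per queue), spreading the
URLs evenly over k hosts should shrink the floor roughly k-fold:

// Hypothetical floor when 500 URLs are spread evenly over k hosts and the
// per-host politeness delays run in parallel (one fetch queue per host).
public class MixedHostsFloor {
    public static void main(String[] args) {
        int urls = 500;
        double delaySecs = 5.0;
        for (int hosts : new int[] {1, 5, 50}) {
            double floorMin = (urls / hosts - 1) * delaySecs / 60.0;
            System.out.printf("%2d hosts -> floor ~%.1f min%n",
                    hosts, floorMin);  // ~41.6, ~8.3, ~0.8 min
        }
    }
}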

Finally, this should make a case for using a host's dump rather than
fetching each URL, when one intends to fetch many pages from a host (e.g.
Wikipedia).

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

*User:* gkahlout
*Job Name:* fetch
hdfs://loocia-c1/user/gkahlout/gabriele/crawl/segments/1-segs/20110603131208
*Job File:*
hdfs://loocia-c1/hadoop/mapred/system/job_201103141146_0774/job.xml<http://192.168.34.51:50030/jobconf.jsp?jobid=job_201103141146_0774>
*Job Setup:* Successful<http://192.168.34.51:50030/jobtasks.jsp?jobid=job_201103141146_0774&type=setup&pagenum=1&state=completed>
*Status:* Succeeded
*Started at:* Fri Jun 03 13:12:39 CEST 2011
*Finished at:* Fri Jun 03 13:59:50 CEST 2011
*Finished in:* 47mins, 10sec
*Job Cleanup:* Successful<http://192.168.34.51:50030/jobtasks.jsp?jobid=job_201103141146_0774&type=cleanup&pagenum=1&state=completed>


On Wed, Mar 16, 2011 at 6:45 AM, Ibrahim Alkharashi
<kh...@kacst.edu.sa> wrote:

>
>   Yes, I'm using built-in hadoop.  Does hadoop take that much overhead?
>
>
>
>
> On Tue, 2011-03-15 at 13:00 +0100, Claudio Martella wrote:
> > Do you mean running nutch with hadoop in pseudo-distributed mode? Both
> > the situations with only one computer?
> >
> > That performance difference is predictable, you hit hadoop's overhead
> > without the benefits of a clustered system.
> >
> >
> > On 3/15/11 12:49 PM, Ibrahim Alkharashi wrote:
> > >   Hello everybody
> > >
> > >   I was running an experiment to compare performance of running nutch
> > > with and without hadoop (on a single machine) to crawl the same site
> > > (depth 5 and topN 5). It took nutch about 1:22 min and nutch with
> > > hadoop about 13:21 min to crawl.
> > >   Any explanation?
> > >
> > >    Ibrahim
> > >
> > >
> > >
> >
> >
>
>
>
>



-- 
Regards,
K. Gabriele
