Posted to mapreduce-user@hadoop.apache.org by ashish vyas <ma...@gmail.com> on 2012/07/12 10:43:59 UTC

Difference between Nutch crawl giving depth='N' and crawling in loop N times with depth='1'

Background of my problem: I am running Nutch 1.4 on Hadoop 0.20.203. There
is a series of MapReduce jobs that I run on the Nutch segments to produce
the final output. Waiting for the whole crawl to finish before running
these MapReduce jobs makes the solution take much longer, so I now trigger
the MapReduce jobs on each segment as soon as it is dumped. To do this I run
the crawl in a loop ('N = depth' times) with depth=1. However, some URLs get
lost when I crawl with depth 1 in a loop N times compared to a single crawl
with depth N.

Please find the pseudo code below:

*Case 1*: Nutch crawl on Hadoop giving depth=3.

// Create the list of arguments which we are going to pass to Nutch
List<String> nutchArgsList = new ArrayList<String>();
nutchArgsList.add("-depth");
nutchArgsList.add(Integer.toString(3));
<...other nutch args...>
ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));

*Case 2*: Crawling in a loop 3 times with depth='1'.

for (int depthRun = 0; depthRun < 3; depthRun++) {
    // Create the list of arguments which we are going to pass to Nutch
    List<String> nutchArgsList = new ArrayList<String>();
    nutchArgsList.add("-depth");
    nutchArgsList.add(Integer.toString(1)); // *NOTE* depth is given as 1 here
    <...other nutch args...>
    ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));
}

Some URLs get lost (db_unfetched) when I crawl in a loop as many times as the
depth.

I have tried this on standalone Nutch, running with depth 3 vs running 3
times over the same URLs with depth 1. Comparing the CrawlDbs, the difference
in URLs is only 12. But when I do the same on Hadoop using ToolRunner, I get
1000 URLs as db_unfetched.

As far as I understand, Nutch itself runs the crawl in a loop as many times
as the depth value. Please suggest.

Also, please let me know why the difference is so large when I do this on
Hadoop using ToolRunner vs doing the same on standalone Nutch.

Thanks in advance.


Regards:

Ashish V

Re: Difference between Nutch crawl giving depth='N' and crawling in loop N times with depth='1'

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Ashish,

> As far as I understand, Nutch itself runs the crawl in a loop as many times
> as the depth value. Please suggest.
Yes. For every step (until depth is reached), Nutch does the following (see the sketch after this list):
 - generate a list of URLs to be fetched
 - fetch this list
 - parse documents and extract outlinks
 - write these outlink URLs as new entries into CrawlDb
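
Roughly, one such step corresponds to running the individual Nutch 1.x tools
in sequence. The following is only a sketch to illustrate it, not your actual
setup: the crawl/crawldb and crawl/segments paths are placeholders, and the
newest segment is located by sorting the segment names (which are timestamps).

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class OneDepthIteration {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String crawlDb = "crawl/crawldb";      // placeholder path
    String segmentsDir = "crawl/segments"; // placeholder path

    // 1. generate a list of URLs to be fetched (creates a new timestamped segment)
    ToolRunner.run(conf, new Generator(), new String[] { crawlDb, segmentsDir });

    // locate the segment the Generator just created: segment names are
    // timestamps, so the last one in sorted order is the newest
    FileSystem fs = FileSystem.get(conf);
    FileStatus[] segments = fs.listStatus(new Path(segmentsDir));
    Arrays.sort(segments);
    String segment = segments[segments.length - 1].getPath().toString();

    // 2. fetch this list
    ToolRunner.run(conf, new Fetcher(), new String[] { segment });

    // 3. parse the fetched documents and extract outlinks
    //    (only needed if parsing is not already done during the fetch)
    ToolRunner.run(conf, new ParseSegment(), new String[] { segment });

    // 4. write the outlink URLs as new entries into the CrawlDb
    ToolRunner.run(conf, new CrawlDb(), new String[] { crawlDb, segment });
  }
}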

> I have tried this on standalone Nutch, running with depth 3 vs running 3
> times over the same URLs with depth 1. Comparing the CrawlDbs, the difference
> in URLs is only 12. But when I do the same on Hadoop using ToolRunner, I get
> 1000 URLs as db_unfetched.
For the difference of 12: that may be random. The web is not static, and
there may be just a few more failures by accident.
For the 1000: Have a more detailed look into your CrawlDb and log files:
 - number of retries,
 - type of transient errors (timeouts etc.)
Because a URL is blocked for one day after a transient error (to give the
requested server time to recover), it may happen that more URLs remain in
this state (tried once or twice) when your crawl finishes faster (which is
expected when it runs on multiple nodes).

That's only one explanation. Check the logs and crawldb to find out what happened
with the missing docs! In general, the counts should be roughly the same.
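
For the CrawlDb part, the readdb tool gives a quick overview, for example
(crawl/crawldb and the example URL are placeholders for your own paths):

bin/nutch readdb crawl/crawldb -stats
  (per-status and per-retry counts, including db_unfetched)
bin/nutch readdb crawl/crawldb -url http://example.com/page
  (the full CrawlDatum, i.e. status, fetch time and retries, for a single URL)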

Sebastian

