Posted to user@nutch.apache.org by "Pushpesh Kr. Rajwanshi" <pu...@gmail.com> on 2005/12/28 07:12:49 UTC

Is any one able to successfully run Distributed Crawl?

Hi,

I want to know if anyone has been able to successfully run a distributed crawl on
multiple machines, crawling millions of pages. How hard is it to do that? Do I
just have to do some configuration and setup, or is some implementation work
also required?

Also, can anyone tell me: if I want to crawl around 20,000 websites (say to
depth 5) in a day, is that possible, and if yes, roughly how many machines would
I require, and what configuration would I need? I would appreciate even very
approximate numbers, as I understand it may not be trivial to work out, or maybe
it is :-)

TIA
Pushpesh

Re: Is any one able to successfully run Distributed Crawl?

Posted by "Pushpesh Kr. Rajwanshi" <pu...@gmail.com>.
Sorry for the late reply, but thanks for your quick response, Doug. I really
appreciate it.

Any idea when nutch-0.8 will be released officially?

Thanks and Regards,
Pushpesh



On 1/10/06, Doug Cutting <cu...@nutch.org> wrote:
>
> Pushpesh Kr. Rajwanshi wrote:
> > Just wanted to confirm that this distributed crawl you
> > did using nutch version 0.7.1 or some other version? And was that a
> > successful distributed crawl using map reduce or some work around for
> > distributed crawl?
>
> No, this is 0.8-dev.  This was done in early December, using the version
> of Nutch then in the mapred branch.  That version has since been merged
> into the trunk and will eventually be released as 0.8.  I believe
> everything in my previous message is still relevant to the current trunk.
>
> Doug
>

Re: Is any one able to successfully run Distributed Crawl?

Posted by Doug Cutting <cu...@nutch.org>.
Pushpesh Kr. Rajwanshi wrote:
> Just wanted to confirm that this distributed crawl you
> did using nutch version 0.7.1 or some other version? And was that a
> successful distributed crawl using map reduce or some work around for
> distributed crawl?

No, this is 0.8-dev.  This was done in early December, using the version
of Nutch then in the mapred branch.  That version has since been merged
into the trunk and will eventually be released as 0.8.  I believe
everything in my previous message is still relevant to the current trunk.

Doug

Re: Is any one able to successfully run Distributed Crawl?

Posted by "Pushpesh Kr. Rajwanshi" <pu...@gmail.com>.
Hi Doug,

Thanks a lot for the time you spent writing such a detailed and informative
reply. Just wanted to confirm: did you do this distributed crawl using Nutch
version 0.7.1 or some other version? And was it a successful distributed crawl
using MapReduce, or some workaround for distributed crawling?

Thanks and Regards,
Pushpesh


On 1/5/06, Doug Cutting <cu...@nutch.org> wrote:
>
> Earl Cahill wrote:
> > Any chance you could walk through your implementation?
> >  Like how the twenty boxes were assigned?  Maybe
> > upload your confs somewhere, and outline what commands
> > you actually ran?
>
> All 20 boxes are configured identically, running a Debian 2.4 kernel.
> These are dual-processor boxes with 2GB of RAM each.  Each machine has
> four drives, mounted as a RAID on /export/crawlspace.  This cluster uses
> NFS to mount home directories, so I did not have to set NUTCH_MASTER in
> order to rsync copies of nutch to all machines.
>
> I installed JDK 1.5 in ~/local/java, Ant in ~/local/ant and subversion
> in ~/local/svn.
>
> My ~/.ssh/environment contains:
>
> JAVA_HOME=/home/dcutting/local/java
> NUTCH_OPTS=-server
> NUTCH_LOG_DIR=/export/crawlspace/tmp/dcutting/logs
> NUTCH_SLAVES=/home/dcutting/.slaves
>
> I added the following to ~/.bash_profile, then logged out & back in.
>
> export `cat ~/.ssh/environment`
>
> I added the following to /etc/ssh/sshd_config on all hosts:
>
> PermitUserEnvironment yes
>
> My ~/.slaves file contains a list of all 20 slave hosts, one per line.
>
> My ~/src/nutch/conf/mapred-default.xml contains:
>
> <nutch-conf>
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>1000</value>
> </property>
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>39</value>
> </property>
>
> </nutch-conf>
>
> My ~/src/nutch/conf/nutch-site.xml contains:
>
> <nutch-conf>
>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>100</value>
> </property>
>
> <property>
>   <name>generate.max.per.host</name>
>   <value>100</value>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)</value>
> </property>
>
> <property>
>   <name>parser.html.impl</name>
>   <value>tagsoup</value>
> </property>
>
> <!-- NDFS -->
>
> <property>
>   <name>fs.default.name</name>
>   <value>adminhost:8009</value>
> </property>
>
> <property>
>   <name>ndfs.name.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/ndfs/names</value>
> </property>
>
> <property>
>   <name>ndfs.data.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/ndfs</value>
> </property>
>
> <!-- MapReduce -->
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>adminhost:8010</value>
> </property>
>
> <property>
>   <name>mapred.system.dir</name>
>   <value>/mapred/system</value>
> </property>
>
> <property>
>   <name>mapred.local.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/local</value>
> </property>
>
> <property>
>   <name>mapred.child.heap.size</name>
>   <value>500m</value>
> </property>
>
> </nutch-conf>
>
> My ~/src/nutch/conf/crawl-urlfilter.txt contains:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with a slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept everything else
> +.
>
> To run the crawl I gave the following commands on the master host:
>
> # checkout nutch sources and build them
> mkdir ~/src
> cd ~/src
> ~/local/svn co https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
> cd nutch
> ~/local/ant/bin/ant
>
> # install config files named above in ~/src/nutch/conf
>
> # create dmoz/urls file
> wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
> gunzip content.rdf.u8.gz
> mkdir dmoz
> bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8.gz > dmoz/urls
>
> # create required directories on slaves
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/logs
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/local
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/ndfs/names
>
> # start nutch daemons
> bin/start-all.sh
>
> # copy dmoz/urls into ndfs
> bin/nutch ndfs -put dmoz dmoz
>
> # crawl
> nohup bin/nutch crawl dmoz -dir crawl -depth 4 -topN 16000000 <
> /dev/null >& crawl.log &
>
> Then I visited http://master:50030/ to monitor progress.
>
> I think that's it!
>
> Doug
>

Re: Is any one able to successfully run Distributed Crawl?

Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> Any chance you could walk through your implementation?
>  Like how the twenty boxes were assigned?  Maybe
> upload your confs somewhere, and outline what commands
> you actually ran?

All 20 boxes are configured identically, running a Debian 2.4 kernel. 
These are dual-processor boxes with 2GB of RAM each.  Each machine has 
four drives, mounted as a RAID on /export/crawlspace.  This cluster uses 
NFS to mount home directories, so I did not have to set NUTCH_MASTER in 
order to rsync copies of nutch to all machines.

I installed JDK 1.5 in ~/local/java, Ant in ~/local/ant and subversion 
in ~/local/svn.

My ~/.ssh/environment contains:

JAVA_HOME=/home/dcutting/local/java
NUTCH_OPTS=-server
NUTCH_LOG_DIR=/export/crawlspace/tmp/dcutting/logs
NUTCH_SLAVES=/home/dcutting/.slaves

I added the following to ~/.bash_profile, then logged out & back in.

export `cat ~/.ssh/environment`

I added the following to /etc/ssh/sshd_config on all hosts:

PermitUserEnvironment yes

My ~/.slaves file contains a list of all 20 slave hosts, one per line.
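
As an aside, bin/slaves.sh (used further down) simply reads this file and runs
the given command on every listed host over ssh. The Python sketch below only
illustrates that fan-out pattern; it is not the script Nutch ships. The
slaves-file path is the one from this setup, and passwordless ssh to each slave
is assumed.

# Illustration of the fan-out pattern behind a slaves.sh-style helper:
# read a list of hosts and run one command on each over ssh.
import subprocess
import sys

SLAVES_FILE = "/home/dcutting/.slaves"  # one slave hostname per line, as described above

def run_on_slaves(command):
    """Run `command` on every host listed in SLAVES_FILE via ssh."""
    with open(SLAVES_FILE) as f:
        hosts = [line.strip() for line in f if line.strip()]
    for host in hosts:
        print("=== " + host + " ===")
        # Assumes passwordless ssh to each slave, as the real helper script does.
        subprocess.run(["ssh", host, command], check=False)

if __name__ == "__main__":
    run_on_slaves(" ".join(sys.argv[1:]) or "hostname")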

My ~/src/nutch/conf/mapred-default.xml contains:

<nutch-conf>

<property>
   <name>mapred.map.tasks</name>
   <value>1000</value>
</property>

<property>
   <name>mapred.reduce.tasks</name>
   <value>39</value>
</property>

</nutch-conf>
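
A quick back-of-envelope reading of these numbers (an assumption, not something
stated in the thread): 39 reduces is just under two per dual-CPU box across the
20 boxes, while 1000 maps simply means many small map tasks per node over the
life of the job.

# Back-of-envelope check of the task counts above.  The "two reduces per
# dual-CPU box" reading is an assumption, not something stated in the thread.
boxes = 20
cpus_per_box = 2
configured_maps = 1000      # mapred.map.tasks
configured_reduces = 39     # mapred.reduce.tasks

print("CPUs in the cluster:      ", boxes * cpus_per_box)      # 40
print("configured reduce tasks:  ", configured_reduces)        # just under 2 per box
print("map tasks per box (total):", configured_maps / boxes)   # 50 small maps per box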

My ~/src/nutch/conf/nutch-site.xml contains:

<nutch-conf>

<property>
   <name>fetcher.threads.fetch</name>
   <value>100</value>
</property>

<property>
   <name>generate.max.per.host</name>
   <value>100</value>
</property>

<property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)</value>
</property>

<property>
   <name>parser.html.impl</name>
   <value>tagsoup</value>
</property>

<!-- NDFS -->

<property>
   <name>fs.default.name</name>
   <value>adminhost:8009</value>
</property>

<property>
   <name>ndfs.name.dir</name>
   <value>/export/crawlspace/tmp/dcutting/ndfs/names</value>
</property>

<property>
   <name>ndfs.data.dir</name>
   <value>/export/crawlspace/tmp/dcutting/ndfs</value>
</property>

<!-- MapReduce -->

<property>
   <name>mapred.job.tracker</name>
   <value>adminhost:8010</value>
</property>

<property>
   <name>mapred.system.dir</name>
   <value>/mapred/system</value>
</property>

<property>
   <name>mapred.local.dir</name>
   <value>/export/crawlspace/tmp/dcutting/local</value>
</property>

<property>
   <name>mapred.child.heap.size</name>
   <value>500m</value>
</property>

</nutch-conf>
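
To sanity-check a config file like this before starting the daemons, a few lines
of standard-library XML parsing are enough to dump the name/value pairs. This is
only an illustration, not how Nutch itself reads its configuration; the path
below assumes the checkout location used in this setup.

# Dump the property name/value pairs from a Nutch config file such as the
# nutch-site.xml shown above.  Illustration only; Nutch reads these files
# through its own configuration classes.
import xml.etree.ElementTree as ET

CONF_FILE = "/home/dcutting/src/nutch/conf/nutch-site.xml"  # path assumed from this setup

root = ET.parse(CONF_FILE).getroot()        # root element is <nutch-conf>
for prop in root.findall("property"):
    print(prop.findtext("name"), "=", prop.findtext("value"))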

My ~/src/nutch/conf/crawl-urlfilter.txt contains:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with a slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept everything else
+.
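
For illustration, here is a small stand-alone harness that applies the same
patterns, in the same first-match-wins order, to a few sample URLs. It is a
rough approximation of Nutch's regex URL filter, not the filter itself, and the
sample URLs are invented.

# Rough approximation of the regex URL filter: rules are tried in order and
# the first pattern that matches anywhere in the URL decides accept (+) or
# reject (-).  The sample URLs are invented.
import re

RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),   # slash-delimited segment repeated 3+ times
    ("+", r"."),                       # accept everything else
]

def accept(url):
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False

for url in [
    "http://example.com/index.html",          # accepted
    "ftp://example.com/pub/file.txt",         # rejected: protocol
    "http://example.com/logo.png",            # rejected: image suffix
    "http://example.com/search?q=nutch",      # rejected: probable query
    "http://example.com/a/b/a/b/a/b/c.html",  # rejected: repeating segments (crawler trap)
]:
    print(("+" if accept(url) else "-"), url)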

To run the crawl I gave the following commands on the master host:

# checkout nutch sources and build them
mkdir ~/src
cd ~/src
~/local/svn co https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
cd nutch
~/local/ant/bin/ant

# install config files named above in ~/src/nutch/conf

# create dmoz/urls file
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8.gz > dmoz/urls

# create required directories on slaves
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/logs
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/local
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/ndfs/names

# start nutch daemons
bin/start-all.sh

# copy dmoz/urls into ndfs
bin/nutch ndfs -put dmoz dmoz

# crawl
nohup bin/nutch crawl dmoz -dir crawl -depth 4 -topN 16000000 < 
/dev/null >& crawl.log &

Then I visited http://master:50030/ to monitor progress.

I think that's it!

Doug

Re: Is any one able to successfully run Distributed Crawl?

Posted by Gal Nitzan <gn...@usa.net>.
+1

On Mon, 2006-01-02 at 13:39 -0800, Earl Cahill wrote:
> Any chance you could walk through your implementation?
>  Like how the twenty boxes were assigned?  Maybe
> upload your confs somewhere, and outline what commands
> you actually ran?
> 
> Thanks,
> Earl
> 
> --- Doug Cutting <cu...@nutch.org> wrote:
> 
> > Pushpesh Kr. Rajwanshi wrote:
> > > I want to know if anyone is able to successfully
> > run distributed crawl on
> > > multiple machines involving crawling millions of
> > pages? and how hard is to
> > > do that? Do i just have to do some configuration
> > and set up or do some
> > > implementations also?
> > 
> > I recently performed a four-level deep crawl,
> > starting from urls in 
> > DMOZ, limiting each level to 16M urls.  This ran on
> > 20 machines taking 
> > around 24 hours using about 100Mbit and retrieved
> > around 50M pages.  I 
> > used Nutch unmodified, specifying only a few
> > configuration options.  So, 
> > yes, it is possible.
> > 
> > Doug
> > 
> 
> 
> 



Re: Is any one able to successfully run Distributed Crawl?

Posted by Earl Cahill <ca...@yahoo.com>.
Any chance you could walk through your implementation?
 Like how the twenty boxes were assigned?  Maybe
upload your confs somewhere, and outline what commands
you actually ran?

Thanks,
Earl

--- Doug Cutting <cu...@nutch.org> wrote:

> Pushpesh Kr. Rajwanshi wrote:
> > I want to know if anyone is able to successfully
> run distributed crawl on
> > multiple machines involving crawling millions of
> pages? and how hard is to
> > do that? Do i just have to do some configuration
> and set up or do some
> > implementations also?
> 
> I recently performed a four-level deep crawl,
> starting from urls in 
> DMOZ, limiting each level to 16M urls.  This ran on
> 20 machines taking 
> around 24 hours using about 100Mbit and retrieved
> around 50M pages.  I 
> used Nutch unmodified, specifying only a few
> configuration options.  So, 
> yes, it is possible.
> 
> Doug
> 



	
		

Re: Is any one able to successfully run Distributed Crawl?

Posted by Doug Cutting <cu...@nutch.org>.
Pushpesh Kr. Rajwanshi wrote:
> I want to know if anyone is able to successfully run distributed crawl on
> multiple machines involving crawling millions of pages? and how hard is to
> do that? Do i just have to do some configuration and set up or do some
> implementations also?

I recently performed a four-level-deep crawl, starting from URLs in
DMOZ, limiting each level to 16M URLs.  This ran on 20 machines, took
around 24 hours using about 100Mbit of bandwidth, and retrieved around
50M pages.  I used Nutch unmodified, specifying only a few configuration
options.  So, yes, it is possible.

Doug
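
Those figures support a rough back-of-envelope estimate for sizing questions
like the one that started this thread. In the sketch below, the pages-per-site
figure and the assumption of linear scaling are guesses, not numbers from the
thread, so treat the result as an order-of-magnitude indication only.

# Order-of-magnitude sizing from the figures above: ~50M pages in ~24 hours
# on 20 machines.  The pages-per-site figure and linear scaling are guesses.
pages_fetched = 50_000_000
hours = 24
machines = 20

pages_per_machine_per_day = pages_fetched / machines                  # 2.5M
pages_per_machine_per_sec = pages_per_machine_per_day / (hours * 3600)
print("per machine: %.1fM pages/day (~%.0f pages/sec)"
      % (pages_per_machine_per_day / 1e6, pages_per_machine_per_sec))

# Hypothetical 20,000-sites-in-a-day case (depth 5), assuming 1,000 pages/site:
sites = 20_000
assumed_pages_per_site = 1_000          # pure guess; real sites vary enormously
total_pages = sites * assumed_pages_per_site                          # 20M pages
print("estimated machines: ~%.0f" % (total_pages / pages_per_machine_per_day))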

Re: Is any one able to successfully run Distributed Crawl?

Posted by "Pushpesh Kr. Rajwanshi" <pu...@gmail.com>.
Hi there,
Thanks for the reply again. What volume of data are you crawling, and on how
many machines? Which version of Nutch are you using, 0.7.1 or another? Actually,
it is working more or less fine, but I want to know how many resources
(machines) I will need for crawling 20,000 websites in a day. If anyone can give
me any information in this regard, I would really appreciate it.

Thanks
Pushpesh


On 12/28/05, Nutch Newbie <nu...@gmail.com> wrote:
>
> Hi
>
> I have had no problem doing distributed crawl.
>
> On 12/28/05, Pushpesh Kr. Rajwanshi <pu...@gmail.com> wrote:
> > Hi NN,
> >
> > Thanks for replying me. Actually I wanted to know if distributed
> crawling in
> > nutch is working fine and to what success? Like i am successful in
> setting
> > up distributed crawl for 2 machines (1 master and 1 slave) but when i
> try
> > with more than two machines there seems problem specially while
> injecting
> > urls in crawlDB.
>
> Could you please post your log files please. For example jobtracker
> and tasktracker log file...
>
> > So was wondering if anyone is successful in doing a massive
> > crawl using nutch involving crawling of millions of pages successfully?
> >
> > My requirement is to crawl like 20,000 websites (for say depth 5) in a
> day
> > and i was wondering how many machines would it require to do that.
> >
> > Would truely appreciate any response on this.
> >
> > Thanks In Advance
> > Pushpesh
> >
> >
> > On 12/28/05, Nutch Newbie <nu...@gmail.com> wrote:
> > >
> > > Have you tried the following:
> > >
> > > http://wiki.apache.org/nutch/HardwareRequirements
> > >
> > > and
> > >
> > > http://wiki.apache.org/nutch/
> > >
> > > There are no quick answer if one is planning to crawl million
> > > pages..Read..Try.. Read..
> > >
> > >
> > > On 12/28/05, Pushpesh Kr. Rajwanshi <pu...@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > I want to know if anyone is able to successfully run distributed
> crawl
> > > on
> > > > multiple machines involving crawling millions of pages? and how hard
> is
> > > to
> > > > do that? Do i just have to do some configuration and set up or do
> some
> > > > implementations also?
> > > >
> > > > Also can anyone tell me if i want to crawl around 20,000 websites
> (say
> > > for
> > > > depth 5) in a day, is it possible and if yes then how many machines
> > > would i
> > > > roughly require? and what all configurations i will need? I would
> > > appreciate
> > > > even some very approximate numbers also as i can understand it might
> not
> > > be
> > > > trivial to find out or may be :-)
> > > >
> > > > TIA
> > > > Pushpesh
> > > >
> > > >
> > >
> >
> >
>

Re: Is any one able to successfully run Distributed Crawl?

Posted by Nutch Newbie <nu...@gmail.com>.
Hi

I have had no problems doing distributed crawls.

On 12/28/05, Pushpesh Kr. Rajwanshi <pu...@gmail.com> wrote:
> Hi NN,
>
> Thanks for replying me. Actually I wanted to know if distributed crawling in
> nutch is working fine and to what success? Like i am successful in setting
> up distributed crawl for 2 machines (1 master and 1 slave) but when i try
> with more than two machines there seems problem specially while injecting
> urls in crawlDB.

Could you please post your log files? For example, the jobtracker
and tasktracker log files.

> So was wondering if anyone is successful in doing a massive
> crawl using nutch involving crawling of millions of pages successfully?
>
> My requirement is to crawl like 20,000 websites (for say depth 5) in a day
> and i was wondering how many machines would it require to do that.
>
> Would truely appreciate any response on this.
>
> Thanks In Advance
> Pushpesh
>
>
> On 12/28/05, Nutch Newbie <nu...@gmail.com> wrote:
> >
> > Have you tried the following:
> >
> > http://wiki.apache.org/nutch/HardwareRequirements
> >
> > and
> >
> > http://wiki.apache.org/nutch/
> >
> > There are no quick answer if one is planning to crawl million
> > pages..Read..Try.. Read..
> >
> >
> > On 12/28/05, Pushpesh Kr. Rajwanshi <pu...@gmail.com> wrote:
> > > Hi,
> > >
> > > I want to know if anyone is able to successfully run distributed crawl
> > on
> > > multiple machines involving crawling millions of pages? and how hard is
> > to
> > > do that? Do i just have to do some configuration and set up or do some
> > > implementations also?
> > >
> > > Also can anyone tell me if i want to crawl around 20,000 websites (say
> > for
> > > depth 5) in a day, is it possible and if yes then how many machines
> > would i
> > > roughly require? and what all configurations i will need? I would
> > appreciate
> > > even some very approximate numbers also as i can understand it might not
> > be
> > > trivial to find out or may be :-)
> > >
> > > TIA
> > > Pushpesh
> > >
> > >
> >
>
>

Re: Is any one able to successfully run Distributed Crawl?

Posted by "Pushpesh Kr. Rajwanshi" <pu...@gmail.com>.
Hi NN,

Thanks for replying. Actually, I wanted to know whether distributed crawling in
Nutch is working well, and with what success. I have been successful in setting
up a distributed crawl with 2 machines (1 master and 1 slave), but when I try
with more than two machines there seems to be a problem, especially while
injecting URLs into the crawlDB. So I was wondering whether anyone has succeeded
in doing a massive crawl with Nutch, involving millions of pages.

My requirement is to crawl around 20,000 websites (say to depth 5) in a day, and
I was wondering how many machines it would require to do that.

I would truly appreciate any response on this.

Thanks In Advance
Pushpesh


On 12/28/05, Nutch Newbie <nu...@gmail.com> wrote:
>
> Have you tried the following:
>
> http://wiki.apache.org/nutch/HardwareRequirements
>
> and
>
> http://wiki.apache.org/nutch/
>
> There are no quick answer if one is planning to crawl million
> pages..Read..Try.. Read..
>
>
> On 12/28/05, Pushpesh Kr. Rajwanshi <pu...@gmail.com> wrote:
> > Hi,
> >
> > I want to know if anyone is able to successfully run distributed crawl
> on
> > multiple machines involving crawling millions of pages? and how hard is
> to
> > do that? Do i just have to do some configuration and set up or do some
> > implementations also?
> >
> > Also can anyone tell me if i want to crawl around 20,000 websites (say
> for
> > depth 5) in a day, is it possible and if yes then how many machines
> would i
> > roughly require? and what all configurations i will need? I would
> appreciate
> > even some very approximate numbers also as i can understand it might not
> be
> > trivial to find out or may be :-)
> >
> > TIA
> > Pushpesh
> >
> >
>

Re: Is any one able to successfully run Distributed Crawl?

Posted by Nutch Newbie <nu...@gmail.com>.
Have you tried the following:

http://wiki.apache.org/nutch/HardwareRequirements

and

http://wiki.apache.org/nutch/

There is no quick answer if one is planning to crawl millions of
pages... Read... Try... Read...


On 12/28/05, Pushpesh Kr. Rajwanshi <pu...@gmail.com> wrote:
> Hi,
>
> I want to know if anyone is able to successfully run distributed crawl on
> multiple machines involving crawling millions of pages? and how hard is to
> do that? Do i just have to do some configuration and set up or do some
> implementations also?
>
> Also can anyone tell me if i want to crawl around 20,000 websites (say for
> depth 5) in a day, is it possible and if yes then how many machines would i
> roughly require? and what all configurations i will need? I would appreciate
> even some very approximate numbers also as i can understand it might not be
> trivial to find out or may be :-)
>
> TIA
> Pushpesh
>
>