Posted to dev@nutch.apache.org by Steffen Viken Valvåg <st...@cs.uit.no> on 2005/09/15 11:47:06 UTC

Whole-web crawling with the mapreduce branch

Hi,

I'm playing around with the mapreduce branch, and got it working for a
simple intranet crawl by following the Nutch tutorial at
http://lucene.apache.org/nutch/tutorial.html.  The tutorial seems
inapplicable when it comes to whole-web crawling, though, as the "nutch
admin" command has been disabled, and the usage of the "nutch inject"
command seems to have changed.  I'm willing to read the source to get up to
speed, but if there is any other documentation on the mapreduce branch, that
would obviously be helpful.  I would also greatly appreciate it if someone
took the time to give me a short bullet list of commands to get me started
on a whole-web crawl.

Thanks,
Steffen


RE: Whole-web crawling with the mapreduce branch

Posted by Steffen Viken Valvåg <st...@cs.uit.no>.
Thanks,

That got me going.  Works like a charm :)

Steffen 

-----Original Message-----
From: Doug Cutting [mailto:cutting@nutch.org] 
Sent: 15. september 2005 23:48
To: nutch-dev@lucene.apache.org
Subject: Re: Whole-web crawling with the mapreduce branch

For now, look at the source for crawl/Crawl.java.

I'll try to add some documentation ASAP.

Doug



Re: Whole-web crawling with the mapreduce branch

Posted by Doug Cutting <cu...@nutch.org>.
For now, look at the source for crawl/Crawl.java.

I'll try to add some documentation ASAP.

Doug
