Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/03/19 16:19:59 UTC

NutchHadoopTutorial Updated

Hi Guys,

The NutchHadoopTutorial [0] on our wiki has finally been updated after
quite some time. It's a rather long beast, but covers (hopefully)
everything you require to get cracking with the latest versions of Nutch
and Hadoop on a distributed platform and make the best use of these great
technologies.

We would really appreciate feedback as there will undoubtedly be some
errors or data missing.

Thanks

Lewis

[0] http://wiki.apache.org/nutch/NutchHadoopTutorial

-- 
*Lewis*

Re: NutchHadoopTutorial Updated

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Done... thanks

On Tue, Mar 20, 2012 at 1:45 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> No, I'm taking it out right now. Thanks troops. :)


-- 
*Lewis*

Re: NutchHadoopTutorial Updated

Posted by Julien Nioche <li...@gmail.com>.
Note: the case variation in my previous email is purely accidental. I did
not intend to shout or make the first part more important than the second
:-)

On 20 March 2012 14:19, Julien Nioche <li...@gmail.com> wrote:

> The section "Deploy Nutch to Single Machine" is probably based on an old
> version of Nutch and is quite misleading. Whether you are in fully or
> pseudo-distributed mode, all you need to do is build the job file from the
> Nutch root, go to runtime/deploy and use the nutch command from the bin
> directory. There are no conf files or hadoop executable anymore. If you
> need to change something in the conf, e.g. the URL filter files, you need
> to rebuild the job file.
>
> This is definitely a good effort, but IMHO most of it is about Hadoop
> configuration, which is very well explained on the Hadoop pages
> (http://hadoop.apache.org/common/docs/stable/single_node_setup.html). I
> think we should refer to them systematically and focus on the
> Nutch-specific parts instead.
>
> Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: NutchHadoopTutorial Updated

Posted by Julien Nioche <li...@gmail.com>.
The section "Deploy Nutch to Single Machine" is probably based on an old
version of Nutch and is quite misleading. Whether you are in fully or
pseudo-distributed mode, all you need to do is build the job file from the
Nutch root, go to runtime/deploy and use the nutch command from the bin
directory. There are no conf files or hadoop executable anymore. If you
need to change something in the conf, e.g. the URL filter files, you need
to rebuild the job file.
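
A minimal shell sketch of these steps. Several things here are assumptions
rather than anything stated in this thread: NUTCH_HOME is a hypothetical
location of a Nutch 1.x source checkout, "ant runtime" is assumed to be the
build target that produces runtime/deploy, the crawldb and seed paths are
placeholders, and the nutch script is assumed to find the hadoop command on
the PATH. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the deploy-mode workflow (all paths below are placeholders).
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch}"   # assumed Nutch source checkout
DRY_RUN="${DRY_RUN:-1}"                   # 1 = print commands, 0 = execute

# Print the command in dry-run mode, otherwise execute it in this shell.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

deploy_nutch() {
    # 1. Build from the Nutch root so the current conf ends up in the job file.
    run cd "$NUTCH_HOME" || return 1
    run ant runtime || return 1
    # 2. Work from runtime/deploy, which holds bin/ and the job file.
    run cd runtime/deploy || return 1
    # 3. Run a Nutch command; inject is used here only as an example.
    run bin/nutch inject crawl/crawldb urls || return 1
}

deploy_nutch
```

Running it with DRY_RUN=0 would execute the steps for real. After any
change to the conf (for example the URL filter files), the build step has to
be repeated so that the new job file picks the change up.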

This is definitely a good effort, but IMHO most of it is about Hadoop
configuration, which is very well explained on the Hadoop pages
(http://hadoop.apache.org/common/docs/stable/single_node_setup.html). I think
we should refer to them systematically and focus on the Nutch-specific
parts instead.

Julien
On 20 March 2012 13:45, Lewis John Mcgibbney <le...@gmail.com> wrote:

> No, I'm taking it out right now. Thanks troops. :)



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: NutchHadoopTutorial Updated

Posted by Lewis John Mcgibbney <le...@gmail.com>.
No, I'm taking it out right now. Thanks troops. :)

On Tue, Mar 20, 2012 at 1:38 PM, Mathijs Homminga <
mathijs.homminga@kalooga.com> wrote:

>
> >> About the section "Deploy Nutch to Multiple Machines": this is not
> >> necessary, right? The job jar should be self-contained and ship with all
> >> the necessary configuration files. Nutch should be able to run on any
> >> vanilla Hadoop cluster.
> >
> > It does. All you need is a healthy cluster and a Hadoop environment
> > (cluster or local) that points to the jobtracker.
>
> Exactly ;)
> Lewis, any reason to keep this section in there?
>
> Mathijs




-- 
*Lewis*

Re: NutchHadoopTutorial Updated

Posted by Mathijs Homminga <ma...@kalooga.com>.
>> About the section "Deploy Nutch to Multiple Machines": this is not
>> necessary, right? The job jar should be self-contained and ship with all
>> the necessary configuration files. Nutch should be able to run on any
>> vanilla Hadoop cluster.
> 
> It does. All you need is a healthy cluster and a Hadoop environment (cluster 
> or local) that points to the jobtracker.

Exactly ;)
Lewis, any reason to keep this section in there?

Mathijs

Re: NutchHadoopTutorial Updated

Posted by Markus Jelsma <ma...@openindex.io>.

On Tuesday 20 March 2012 14:23:47 Mathijs Homminga wrote:
> This is great work!! Thanks Lewis!
> 
> I must say that when I read the tutorial it struck me how much of the
> effort goes into getting Hadoop up and running.
> 
> It would be great if we could start with:
> "First, make sure you have a healthy Hadoop cluster running, see here for
> the Hadoop tutorial" ;-)
> 
> About the section "Deploy Nutch to Multiple Machines": this is not
> necessary, right? The job jar should be self-contained and ship with all
> the necessary configuration files. Nutch should be able to run on any
> vanilla Hadoop cluster.

It does. All you need is a healthy cluster and a Hadoop environment (cluster 
or local) that points to the jobtracker.

> 
> Anyway, looking at the questions that arrive at nutch-user, this is really,
> really helpful.
> 
> Cheers,
> Mathijs

-- 
Markus Jelsma - CTO - Openindex

Re: NutchHadoopTutorial Updated

Posted by Mathijs Homminga <ma...@kalooga.com>.
This is great work!! Thanks Lewis!

I must say that when I read the tutorial it struck me how much of the effort goes into getting Hadoop up and running.

It would be great if we could start with:
"First, make sure you have a healthy Hadoop cluster running, see here for the Hadoop tutorial" ;-)

About the section "Deploy Nutch to Multiple Machines": this is not necessary, right? The job jar should be self-contained and ship with all the necessary configuration files. Nutch should be able to run on any vanilla Hadoop cluster.

Anyway, looking at the questions that arrive at nutch-user, this is really, really helpful.

Cheers,
Mathijs




On Mar 19, 2012, at 16:19 , Lewis John Mcgibbney wrote:

> Hi Guys,
> 
> The NutchHadoopTutorial [0] on our wiki has finally been updated after
> quite some time. It's a rather long beast, but covers (hopefully)
> everything you require to get cracking with the latest versions of Nutch
> and Hadoop on a distributed platform and make the best use of these great
> technologies.
> 
> We would really appreciate feedback as there will undoubtedly be some
> errors or data missing.
> 
> Thanks
> 
> Lewis
> 
> [0] http://wiki.apache.org/nutch/NutchHadoopTutorial
> 
> -- 
> *Lewis*


Re: NutchHadoopTutorial Updated

Posted by Chris A Mattmann <ch...@gmail.com>.
Thanks for the heads up, dude...you rock, per usual!

Cheers,
Chris

On Mar 19, 2012, at 4:19 PM, Lewis John Mcgibbney wrote:

> Hi Guys,
> 
> The NutchHadoopTutorial [0] on our wiki has finally been updated after
> quite some time. It's a rather long beast, but covers (hopefully)
> everything you require to get cracking with the latest versions of Nutch
> and Hadoop on a distributed platform and make the best use of these great
> technologies.
> 
> We would really appreciate feedback as there will undoubtedly be some
> errors or data missing.
> 
> Thanks
> 
> Lewis
> 
> [0] http://wiki.apache.org/nutch/NutchHadoopTutorial
> 
> -- 
> *Lewis*


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++