Posted to user@nutch.apache.org by "flo @" <xx...@gmail.com> on 2013/11/13 10:19:23 UTC

Nutch cluster

What is the best approach to setting up a Nutch cluster with multiple Nutch
instances running on different machines? Is there some kind of scheduler
for Nutch?

I have already configured a single Nutch instance with HBase as the backend
for storing the index.

Thanks

flo

Re: Nutch cluster

Posted by "flo @" <xx...@gmail.com>.
I have some additional questions about setting up a cluster:

If I want continuous crawling, do I create a Nutch script with an endless
loop?
Should I run the Nutch instances and the HBase database on different Hadoop
clusters?
If I want to run several Nutch jobs simultaneously, should I start the Nutch
script several times?
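
To make the endless-loop idea concrete, here is a rough sketch of what such a
driver could look like. It is only an illustration under assumptions: the step
names and arguments in ASSUMED_STEPS follow what I understand the Nutch 2.x
command line to be and should be checked against the usage text of your own
bin/nutch, and crawl_rounds and its parameters are hypothetical names, not
part of Nutch. The runner is passed in so the loop can be tried without a
cluster.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

# Assumed Nutch 2.x step order. The arguments are illustrative only; verify
# them against the usage text printed by your own bin/nutch.
ASSUMED_STEPS: Tuple[Tuple[str, ...], ...] = (
    ("generate", "-topN", "1000"),
    ("fetch", "-all"),
    ("parse", "-all"),
    ("updatedb",),
)


def crawl_rounds(
    run: Callable[[Sequence[str]], int],
    steps: Sequence[Sequence[str]] = ASSUMED_STEPS,
    rounds: Optional[int] = None,
    pause_seconds: float = 0.0,
    nutch: str = "bin/nutch",
) -> int:
    """Run crawl rounds until `rounds` is reached or a step fails.

    `run` receives the full command as a list and returns its exit code.
    `rounds=None` loops forever. Returns the number of completed rounds.
    """
    completed = 0
    while rounds is None or completed < rounds:
        for step in steps:
            # Stop as soon as one step reports a non-zero exit code.
            if run([nutch, *step]) != 0:
                return completed
        completed += 1
        time.sleep(pause_seconds)
    return completed


# Possible use on a Hadoop client (hypothetical, runs until a step fails):
#   import subprocess
#   crawl_rounds(subprocess.call, pause_seconds=60)
```

Stopping on the first failed step is a deliberate choice in this sketch, so
that a broken round is noticed instead of being repeated forever.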



Re: Nutch cluster

Posted by A Laxmi <a....@gmail.com>.
Hi Julien-

Regarding the link you provided for Nutch 1.x
(http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial):
how and where is the crawled data stored?

Thanks!



Re: Nutch cluster

Posted by Julien Nioche <li...@gmail.com>.
Just to add to what Markus said: see
http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial
The approach is the same for 2.x. Nutch is just a Hadoop application with a
few scripts to make your life easier.

Julien





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

RE: Nutch cluster

Posted by Markus Jelsma <ma...@openindex.io>.
You can install Hadoop on the cluster just as you otherwise would. Then run the Nutch job file via the bin/nutch script from any Hadoop client, for example the jobtracker.
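
As a small illustration of that second step, the helper below builds a bin/nutch command line and first checks that a job file sits next to the script. It is a sketch under assumptions: the layout (a directory holding bin/nutch plus a *.job file, as I understand a Nutch deploy build to look) and the name nutch_command are not taken from this thread, so adjust both to your installation.

```python
from pathlib import Path
from typing import List, Union


def nutch_command(deploy_dir: Union[str, Path], command: str, *args: str) -> List[str]:
    """Return the argument list for running one Nutch command.

    Assumes `deploy_dir` contains bin/nutch and at least one *.job file;
    raises FileNotFoundError if either is missing.
    """
    deploy = Path(deploy_dir)
    script = deploy / "bin" / "nutch"
    if not script.is_file():
        raise FileNotFoundError(f"no bin/nutch script under {deploy}")
    # The job file is what actually gets submitted to Hadoop.
    if not sorted(deploy.glob("*.job")):
        raise FileNotFoundError(f"no *.job file under {deploy}")
    return [str(script), command, *args]


# Possible use on a Hadoop client (the path is hypothetical):
#   import subprocess
#   subprocess.check_call(nutch_command("/opt/nutch/runtime/deploy", "inject", "urls"))
```

Failing early when the job file is missing avoids silently running in local mode on a machine that was meant to submit to the cluster.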
