
Posted to user@nutch.apache.org by A Laxmi <a....@gmail.com> on 2013/10/05 17:07:33 UTC

HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1?

Hi,

I have a single Linux (Ubuntu) server in my Development environment, and I
plan to use a single server for the Production environment as well.

I am using HBase 0.90.6 as the backend datastore for Nutch 2.2.1. I have
tried HBase in standalone mode, and it works fine only for smaller crawls.
Now I plan to crawl on a somewhat larger scale, aiming for about 300K URLs,
so I would like to migrate from standalone to a distributed mode. However,
I don't intend to use multiple machines. All I have is a single server, so
which mode of HBase is ideal for a Production environment in my case:
pseudo-distributed or fully-distributed?

Thanks for your help!

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by A Laxmi <a....@gmail.com>.

Julien - Thanks so much for your input! I will try the pseudo-distributed
mode.
By manipulating crawled data, I mean that I would like to work with the
parsed content of the crawl and clean out the "clutter" it picks up while
crawling (navigation, banner text, headers, etc.), so that I can have a
rich snippet (or text summary).
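
As a rough illustration of the kind of clean-up described above, here is a
minimal standalone Python sketch. It is not Nutch code and does not use any
Nutch API; the set of tags treated as clutter and the names
MainTextExtractor and clean_text are assumptions made for this example. It
only drops text nested inside elements such as nav, header and footer, which
real pages may or may not use for their boilerplate.

```python
from html.parser import HTMLParser

# Tags whose text is treated as "clutter" in this sketch (an assumption;
# adjust the set to match the sites being crawled).
CLUTTER_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any clutter tag."""

    def __init__(self):
        super().__init__()
        self._clutter_depth = 0  # > 0 while inside a clutter element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in CLUTTER_TAGS:
            self._clutter_depth += 1

    def handle_endtag(self, tag):
        if tag in CLUTTER_TAGS and self._clutter_depth > 0:
            self._clutter_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._clutter_depth == 0:
            self._chunks.append(text)

    def text(self):
        # Join the kept text fragments with single spaces.
        return " ".join(self._chunks)


def clean_text(html):
    """Return the text of an HTML page with clutter elements removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling clean_text on the parsed HTML of a page would give a starting point
for a text summary; pages that mark navigation with plain div elements would
need extra rules (for example, matching on class names).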


On Mon, Oct 7, 2013 at 4:22 AM, Julien Nioche <lists.digitalpebble@gmail.com> wrote:

> Hi,
>
> IMHO 2.68 times slower (with HBase -
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html) is
> not just 'a bit' slower, it's a lot slower :-)
>
> I usually recommend to use a pseudo-distributed setup instead of a local
> one. There might be a small overhead in using HDFS over the filesystem
> directly indeed, but this is more than compensated by the fact that you
> will have parallelism and all the non-fetch tasks will be a lot faster
> thanks to this. It is also a lot easier to monitor the crawl using the
> MapReduce UI e.g. look at the logs for a given task, check the counters,
> etc...
>
> The local mode can be used only for debugging and testing the configuration
> IMHO
>
> Re- " I need a way to manipulate crawled data" : you can do that in Nutch
> 1.x just as well even though not having to keep track of where things are
> in segments definitely simplifies things. It probably depends on what you
> want to do with the data.
>
> Just my 0.02£
>
> Julien

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by Julien Nioche <li...@gmail.com>.

Hi,

IMHO 2.68 times slower (with HBase -
http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html) is
not just 'a bit' slower, it's a lot slower :-)

I usually recommend using a pseudo-distributed setup instead of a local
one. There might indeed be a small overhead in using HDFS rather than the
filesystem directly, but this is more than compensated for by the
parallelism you gain: all the non-fetch tasks will be a lot faster
thanks to it. It is also a lot easier to monitor the crawl using the
MapReduce UI, e.g. to look at the logs for a given task, check the counters,
etc.
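
For readers setting this up, the following Python sketch writes out a
minimal hbase-site.xml that points HBase at a local HDFS instead of the
local filesystem. It is only a sketch under stated assumptions: the
NameNode address localhost:9000 and the replication factor of 1 are
placeholders that must match your own Hadoop configuration, the helper name
build_hbase_site is made up for this example, and other properties (such as
hbase.cluster.distributed) may be needed depending on the HBase version, so
check the documentation for the release you run.

```python
import xml.etree.ElementTree as ET


def build_hbase_site(properties):
    """Return the text of an hbase-site.xml holding the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed values: a NameNode listening on localhost:9000 and a single
# DataNode, hence a replication factor of 1.
PSEUDO_DISTRIBUTED = {
    "hbase.rootdir": "hdfs://localhost:9000/hbase",
    "dfs.replication": "1",
}

if __name__ == "__main__":
    print(build_hbase_site(PSEUDO_DISTRIBUTED))
```

The printed XML can be saved as conf/hbase-site.xml in the HBase
installation once the values have been adapted.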

IMHO, local mode is suitable only for debugging and testing the
configuration.

Re "I need a way to manipulate crawled data": you can do that in Nutch
1.x just as well, although not having to keep track of where things are
in segments (as in 2.x) definitely simplifies things. It probably depends
on what you want to do with the data.

Just my 0.02£

Julien



On 6 October 2013 12:32, Markus Jelsma <ma...@openindex.io> wrote:

> With Nutch 1.x you can easily keep up with three million records on a
> 512MB VPS running Nutch in local mode. Although 2.x is a bit slower, you
> really don't need a cluster for just 300k records.



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by A Laxmi <a....@gmail.com>.

Markus - Thanks for your input! I need a way to manipulate crawled data, so
I have chosen 2.x over 1.x. Yeah, I really don't have large volumes of
data; at most I am looking at anywhere from 300K to 500K records down
the line. To start off with, I will have 300K records.


On Sun, Oct 6, 2013 at 7:32 AM, Markus Jelsma <ma...@openindex.io> wrote:

> With Nutch 1.x you can easily keep up with three million records on a
> 512MB VPS running Nutch in local mode. Although 2.x is a bit slower, you
> really don't need a cluster for just 300k records.

RE: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by Markus Jelsma <ma...@openindex.io>.

With Nutch 1.x you can easily keep up with three million records on a 512MB VPS running Nutch in local mode. Although 2.x is a bit slower, you really don't need a cluster for just 300k records.

 
 
-----Original message-----
> From:Renato Marroquín Mogrovejo <re...@gmail.com>
> Sent: Sunday 6th October 2013 5:36
> To: Nutch Users <us...@nutch.apache.org>
> Subject: Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed
> 
> As Talat said, maybe 300k should work fairly OK.
> What is your hardware like? Are HBase and Nutch on the same server?
> 
> Renato M.

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by A Laxmi <a....@gmail.com>.

Renato - Yes, both HBase and Nutch are inside the same server.


On Sat, Oct 5, 2013 at 11:34 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> As Talat said, maybe 300k should work fairly OK.
> What is your hardware like? Are HBase and Nutch on the same server?
>
> Renato M.

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

As Talat said, maybe 300k should work fairly OK.
What is your hardware like? Are HBase and Nutch on the same server?

Renato M.
On Oct 5, 2013 10:08 PM, "A Laxmi" <a....@gmail.com> wrote:

> Thanks Talat!
>
> Renato - Thanks for your reply! I have tried standalone mode, but I have had
> a lot of issues once I started crawling more than 5 rounds of depth. Though I
> am running Nutch and HBase on a single node, I felt HBase standalone mode
> was not a good fit for larger crawls with a depth of 5 or higher and topN
> 50000.

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by A Laxmi <a....@gmail.com>.

Thanks Talat!

Renato - Thanks for your reply! I have tried standalone mode, but I have had
a lot of issues once I started crawling more than 5 rounds of depth. Though I
am running Nutch and HBase on a single node, I felt HBase standalone mode
was not a good fit for larger crawls with a depth of 5 or higher and topN
50000.
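
To put those numbers in context: topN caps how many URLs are selected in
each generate round, so the number of rounds times topN gives an upper bound
on the pages fetched. The small Python sketch below works this out for the
figures mentioned in this thread; the function names are made up for the
example, and real crawls usually fetch fewer pages than the bound because of
failed fetches, duplicates and filtered URLs.

```python
import math


def max_urls_fetched(rounds, top_n):
    """Upper bound on URLs fetched: each round selects at most top_n URLs."""
    return rounds * top_n


def rounds_needed(target_urls, top_n):
    """Smallest number of rounds whose upper bound reaches target_urls."""
    return math.ceil(target_urls / top_n)


print(max_urls_fetched(5, 50000))    # 250000
print(rounds_needed(300000, 50000))  # 6
```

So a crawl with depth 5 and topN 50000 tops out at 250K URLs, and reaching
300K takes at least six rounds at that topN.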



On Sat, Oct 5, 2013 at 4:45 PM, Talat UYARER <ta...@agmlab.com> wrote:

> Hi,
>
> HBase uses the HDFS file system, and HDFS is fully distributed. Nutch uses
> the MapReduce framework from Hadoop. Pseudo-distributed mode is meant only
> for development environments; I don't recommend it for large-scale
> crawlers. But if you are crawling only 300K URLs, that is not too big. You
> can use pseudo-distributed mode.

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi,

If you are using a single server for production, then you will be using a
single server for HBase as well, right? In that case you should use
standalone mode, as it uses the file system directly. Pseudo-distributed
mode could be another option, but it would probably have more overhead and
no advantage in the short term, as it will be checking for services on the
network while everything is on the same server.

Renato M.
On Oct 5, 2013 3:46 PM, "Talat UYARER" <ta...@agmlab.com> wrote:

> Hi,
>
> HBase uses the HDFS file system, and HDFS is fully distributed. Nutch uses
> the MapReduce framework from Hadoop. Pseudo-distributed mode is meant only
> for development environments; I don't recommend it for large-scale
> crawlers. But if you are crawling only 300K URLs, that is not too big. You
> can use pseudo-distributed mode.

Re: HBase Pseudo distributed or Fully distributed mode for Nutch 2.2.1? fully-distributed

Posted by Talat UYARER <ta...@agmlab.com>.

Hi,

HBase uses the HDFS file system, and HDFS is fully distributed. Nutch uses
the MapReduce framework from Hadoop. Pseudo-distributed mode is meant only
for development environments; I don't recommend it for large-scale
crawlers. But if you are crawling only 300K URLs, that is not too big. You
can use pseudo-distributed mode.

On 05-10-2013 18:07, A Laxmi wrote:
> Hi,
>
> I have a single Linux (Ubuntu) server in Development environment and I plan
> to use a single server for Production environment as well.
>
> I am using HBase 0.90.6 as the backend datastore for Nutch 2.2.1. I have
> tried HBase in standalone mode, and it works fine only for smaller crawls.
> Now I plan to crawl on a somewhat larger scale, aiming for about 300K URLs,
> so I would like to migrate from standalone to a distributed mode. However,
> I don't intend to use multiple machines. All I have is a single server, so
> which mode of HBase is ideal for a Production environment in my case:
> pseudo-distributed or fully-distributed?
>
> Thanks for your help!
>