Posted to user@nutch.apache.org by BlackIce <bl...@gmail.com> on 2014/03/18 13:00:42 UTC

Optimizing Nutch 2.2.1

Hi,

I'm using Nutch 2.2.1, HBase 0.90.6 in pseudo-distributed mode, Hadoop
1.2.1, Oracle Java 8, on a quad-core Intel i5 with 16 GB RAM.

Currently the Fetch cycle is limited by my Internet connection.

The parse cycle uses an average of 10% per CPU core.

The updatedb cycle uses an average of 3% per CPU core.

Currently I'm only running HBase in pseudo-distributed mode, not Nutch.

As the DB grows, everything slows down significantly, but as you can see, CPU
resources are not used very much. Heck, during updatedb my web browsing
creates higher utilization spikes than the updatedb process. I feel that my
hardware is very underutilized and that adding more physical machines would be a
waste.

What are the bottlenecks? How can I optimize them? Should I run a cluster
on 3 virtual machines?

Thank you for any help you can give!


Ralf R. Kotowski

Re: Optimizing Nutch 2.2.1

Posted by BlackIce <bl...@gmail.com>.

Thnx,

It seems that anything related to Hadoop is a MUST read!


On Wed, Mar 19, 2014 at 8:25 PM, Talat Uyarer <ta...@uyarer.com> wrote:

> IMHO you shouldn't expect good performance in pseudo-distributed mode. You
> should really learn how Hadoop runs. I read Hadoop: The Definitive Guide, and
> I recommend it as a starting point.

Re: Optimizing Nutch 2.2.1

Posted by Talat Uyarer <ta...@uyarer.com>.

IMHO you shouldn't expect good performance in pseudo-distributed mode. You
should really learn how Hadoop runs. I read Hadoop: The Definitive Guide, and
I recommend it as a starting point.

On 19 Mar 2014 20:48, "BlackIce" <bl...@gmail.com> wrote:

> Thank you,
>
> what are some good starting points to start tuning?
>
> thnx

Re: Optimizing Nutch 2.2.1

Posted by BlackIce <bl...@gmail.com>.

Thank you,

What are some good starting points for tuning?

thnx


On Tue, Mar 18, 2014 at 8:20 PM, Talat Uyarer <ta...@uyarer.com> wrote:

> Hi,
>
> When you use Hadoop in pseudo-distributed mode, it creates 2 map and 2 reduce
> tasks. If you want to speed up a job, you should decrease your map and reduce
> count. But optimization is a very general concept. You should tune the Nutch,
> HDFS, JobTracker and HBase settings.
>
> Good luck ;)

Re: Optimizing Nutch 2.2.1

Posted by Talat Uyarer <ta...@uyarer.com>.

Hi,

When you use Hadoop in pseudo-distributed mode, it creates 2 map and 2 reduce
tasks. If you want to speed up a job, you should decrease your map and reduce
count. But optimization is a very general concept. You should tune the Nutch,
HDFS, JobTracker and HBase settings.
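
As a rough illustration of the "map and reduce count" mentioned above: in
Hadoop 1.x the number of task slots each TaskTracker runs concurrently is set
in conf/mapred-site.xml, and both properties default to 2. The values below
are placeholders chosen for illustration, not recommendations, and they only
take effect when the Nutch jobs are actually submitted to a running
JobTracker/TaskTracker rather than run in local mode.

```xml
<?xml version="1.0"?>
<!-- conf/mapred-site.xml (Hadoop 1.x): hypothetical values for illustration only -->
<configuration>
  <!-- Maximum number of map tasks one TaskTracker runs at the same time (default: 2) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Maximum number of reduce tasks one TaskTracker runs at the same time (default: 2) -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```

The TaskTracker reads these values at startup, so it has to be restarted
after a change. Whether raising or lowering them helps depends on the
workload, so it is worth timing the same job before and after.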

Good luck ;)


2014-03-18 14:00 GMT+02:00 BlackIce <bl...@gmail.com>:

> Hi,
>
> I'm using Nutch 2.2.1, HBase 0.90.6 in pseudo-distributed mode, Hadoop
> 1.2.1, Oracle Java 8, on a quad-core Intel i5 with 16 GB RAM.
>
> Currently the Fetch cycle is limited by my Internet connection.
>
> The parse cycle uses an average of 10% per CPU core.
>
> The updatedb cycle uses an average of 3% per CPU core.
>
> Currently I'm only running HBase in pseudo-distributed mode, not Nutch.
>
> As the DB grows, everything slows down significantly, but as you can see, CPU
> resources are not used very much. Heck, during updatedb my web browsing
> creates higher utilization spikes than the updatedb process. I feel that my
> hardware is very underutilized and that adding more physical machines would be a
> waste.
>
> What are the bottlenecks? How can I optimize them? Should I run a cluster
> on 3 virtual machines?
>
> Thank you for any help you can give!
>
>
> Ralf R. Kotowski
>



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
LinkedIn: http://tr.linkedin.com/pub/talat-uyarer/10/142/304