Posted to user@nutch.apache.org by Deepa Jayaveer <de...@tcs.com> on 2014/02/12 10:01:37 UTC

sizing guide

Hi,
I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
Nutch 2.1? Are there any recommendations on sizing memory, CPU, and disk
space for crawling?

Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: deepa.jayaveer@tcs.com
Website: http://www.tcs.com



Re: sizing guide

Posted by Deepa Jayaveer <de...@tcs.com>.
Thanks a lot for your reply

Thanks and Regards
Deepa Devi Jayaveer
Tata Consultancy Services
Mailto: deepa.jayaveer@tcs.com
Website: http://www.tcs.com






Re: sizing guide

Posted by Tejas Patil <te...@gmail.com>.
On Wed, Feb 12, 2014 at 11:08 PM, Deepa Jayaveer <de...@tcs.com> wrote:

> Thanks for your reply.
> I started off with a PoC on Nutch + MySQL and planned to move to Nutch
> 2.1 with HBase once I get a fair idea about Nutch.
>
> For our use case, I need to crawl large documents from around 100 web
> sites weekly, and our functionality demands crawling on a daily or even
> hourly basis to extract specific information from around 20 different
> hosts. Say, we need to extract product details from a retailer's site;
> in that case we need to recrawl the pages to get the latest information.
>
> As you mentioned, I can do a batch delete of the crawled HTML data once
> I have extracted the information from it. I can expect the crawled data
> to be roughly around 1 TB (and it could be deleted on a scheduled basis).
>

If you process the data as soon as it is available, then you might not
need to have 1 TB, unless Nutch gets that much data in a single fetch
cycle.
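
(To put an assumed number on that: if one daily cycle fetches, say, 50,000
pages at roughly 200 KB each, that is only about 10 GB of raw content per
cycle, so the 1 TB figure mainly matters if unprocessed batches are allowed
to pile up before they are deleted.)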

>
> Will this sizing be fine for a Nutch installation in production?
> 4-node Hadoop cluster with 2 TB storage each
> 64 GB RAM each
> 10 GB heap
>

Looks fine. You need to monitor the crawl for the first week or two to
know whether you need to change this setup.

>
> Apart from that, I need to do HBase data sizing to store the product
> details (which would be around 400 GB of data).
> Can I use the same HBase cluster to store the extracted data where Nutch
> is running?
>

Yes, you can. HBase is a black box to me, but it has a bunch of its own
configs which you could tune.

>
> Can you please let me know your suggestions or recommendations.
>
>

Re: sizing guide

Posted by Deepa Jayaveer <de...@tcs.com>.
Thanks for your reply.
I started off with a PoC on Nutch + MySQL and planned to move to Nutch 2.1
with HBase once I get a fair idea about Nutch.

For our use case, I need to crawl large documents from around 100 web sites
weekly, and our functionality demands crawling on a daily or even hourly
basis to extract specific information from around 20 different hosts. Say,
we need to extract product details from a retailer's site; in that case we
need to recrawl the pages to get the latest information.

As you mentioned, I can do a batch delete of the crawled HTML data once I
have extracted the information from it. I can expect the crawled data to be
roughly around 1 TB (and it could be deleted on a scheduled basis).

Will this sizing be fine for a Nutch installation in production?
4-node Hadoop cluster with 2 TB storage each
64 GB RAM each
10 GB heap

Apart from that, I need to do HBase data sizing to store the product
details (which would be around 400 GB of data). Can I use the same HBase
cluster to store the extracted data where Nutch is running?

Can you please let me know your suggestions or recommendations.


Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: deepa.jayaveer@tcs.com
Website: http://www.tcs.com






Re: sizing guide

Posted by Tejas Patil <te...@gmail.com>.
If you are looking for the specific Nutch 2.1 + MySQL combination, I think
there won't be any sizing guide on the project wiki.

There is no perfect answer for this, as it depends on factors like these
(the list may go on):
- Nature of the data you are crawling: small HTML files or large documents.
- Is it a continuous crawl or only a few levels deep?
- Are you re-crawling URLs?
- How big is the crawl space?
- Is it an intranet crawl? How frequently do the pages change?

Nutch 1.x would be a perfect fit for prod-level crawls. If you still want
to use Nutch 2.x, it would be better to switch to some other datastore
(e.g. HBase).

Below are my experiences with two use cases where Nutch 1.x was used in
prod:

(A) Targeted crawl of a single host
In this case I wanted to get the data crawled quickly and didn't bother
about the updates that would happen to the pages. I started off with a
five-node Hadoop cluster but later did the math and realized it wouldn't
get my work done in a few days (remember that you need a delay between
successive requests, one the server agrees on, else your crawler gets
banned). Later I bumped the cluster to 15 nodes. The pages were HTML files
of roughly 200 KB. The crawled data needed roughly 200 GB and I had about
500 GB of storage.
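
As a rough illustration of that math (assuming the default Nutch politeness
delay of 5 seconds between requests to the same host, set by the
fetcher.server.delay property):

    86,400 seconds per day / 5 seconds per request  ~  17,000 pages per day
                                                       per host (fetch queue)

so the delay the server agrees to largely decides how long the fetch phase
of a single-host crawl takes, and the cluster has to be sized so that the
parse, updatedb and index jobs can keep up.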

(B) Open crawl of several hosts
The configs and memory settings were driven by the prod hardware. I had a
four-node Hadoop cluster with 64 GB RAM each. A 4 GB heap was configured
for every Hadoop job, with the exception of the generate job, which needed
more heap (8-10 GB). There was no need to store the crawled data, and every
batch was deleted as soon as it was processed. That said, the disks had a
capacity of 2 TB.
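
A minimal sketch of that delete-as-soon-as-processed step, assuming a
Nutch 1.x layout where each batch is a timestamped segment directory on
HDFS (the path below is only an example):

    # once a segment has been parsed, merged into the crawldb/linkdb and
    # indexed, its raw content is no longer needed
    hadoop fs -rm -r crawl/segments/20140212103000

(on older Hadoop releases the same command is spelled 'hadoop fs -rmr').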

Thanks,
Tejas


RE: sizing guide

Posted by Markus Jelsma <ma...@openindex.io>.
Increase the number of mappers and reducers per node, see mapred-site.xml. 
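
On a Hadoop 1.x (MRv1) cluster that means raising the per-node slot counts
and keeping the per-task heap modest, roughly along these lines (the values
below are only an illustration and should be matched to the cores and RAM
you actually have; on YARN the equivalent knobs are named differently):

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>          <!-- map slots per node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>          <!-- reduce slots per node -->
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1000m</value>  <!-- heap per task JVM -->
    </property>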
 
> 

RE: sizing guide

Posted by Deepa Jayaveer <de...@tcs.com>.
Hi,
How do we make smaller mapper/reducer units? Is it done by putting fewer
URLs in seed.txt?


Thanks and Regards
Deepa Devi Jayaveer







RE: sizing guide

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

10GB heap is a complete waste of memory and resources; a 500MB heap is in
most cases enough. It is better to have more small mappers/reducers than a
few large units. Also, 64GB of RAM per datanode/tasktracker is too much
(Nutch is not a long-running process and does not benefit from a large heap
or a lot of OS disk cache), unless you also have 64 CPU cores available. A
rule of thumb of mine is to allocate one CPU core and 500-1000MB of RAM per
slot.
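
As a rough worked example of that rule of thumb (the 16-core figure is an
assumption, not something from your mail): on a 16-core node with 64 GB of
RAM you would configure on the order of 16 map+reduce slots at ~1000 MB
heap each, i.e.

    16 slots x 1000 MB  ~  16 GB for task JVMs
    plus a few GB for the datanode/tasktracker daemons and the OS

which still leaves most of the 64 GB idle; hence either less RAM per node
or a correspondingly higher core and slot count.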

Cheers 

 
 