Crawling Pages from Single Domain

Posted to user@nutch.apache.org by Siddharth Shah <ia...@gmail.com> on 2015/03/10 13:14:18 UTC

Hello All,
              I have a question regarding running Nutch on Hadoop. The
current setup is as follows

   - Hadoop 1.0.3 cluster on AWS EMR (1 master on a medium instance + 3 slaves
   on small instances)
   - Nutch 1.7
   - Apart from the default Hadoop config, only mapred.map.tasks is set to 3
   - In Nutch, I've updated nutch-site.xml with a proper agent name

I have a seed list of about 700,000 pages from a single domain. So my
questions are

   - What setting do I need to update so that the fetcher runs on all 3 nodes
   instead of on a single node?
   - What would be appropriate settings for the depth and topN values? (I am
   assuming 1 and 700,000 respectively.)

Thank you,
Sidharth

Re: Crawling Pages from Single Domain

Posted by Siddharth Shah <ia...@gmail.com>.
Hi Jonathan,
                    Apologies for my delayed response. Thank you for the
pointer; the crawl worked as expected once I tweaked the regex filtering.

Thank you once again,
Sidharth

Re: Crawling Pages from Single Domain

Posted by Jonathan Cooper-Ellis <jc...@cloudera.com>.
Hi Siddharth,

Check out the bin/crawl script. There you can set the number of slave nodes
as well as the topN for each round of your crawl (the script computes the
fetch list size as a per-slave size times the number of slaves), which you
want to be 700,000+ in your case.
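
Roughly, the relevant knobs look like this. This is only a sketch from memory
of the 1.7 script - variable names and defaults may differ in your copy, and
the per-slave size, directories, and Solr URL below are placeholders:

    # inside bin/crawl:
    numSlaves=3                               # your 3 fetcher/slave nodes; also passed
                                              # to the generate step as -numFetchers,
                                              # which is what spreads fetching across nodes
    sizeFetchlist=`expr $numSlaves \* 250000` # used as -topN; 3 * 250,000 = 750,000 >= 700,000

    # then run a single round over the whole seed list:
    bin/crawl urls/ crawl/ http://localhost:8983/solr/ 1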

If you tell the bin/crawl script to execute 1 round of 700,000+ pages, you
will get your entire seed list. You'd really only want to do it like this if
you're planning on crawling the pages once and aren't interested in any of
the outlinks. If you run another crawl using the same crawldb, you will end
up following the outlinks collected in the initial crawl, unless you've
excluded everything but your desired pages in the regex URL filter
(conf/regex-urlfilter.txt).
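
If you do want later rounds restricted to that one domain, the usual place to
do it is conf/regex-urlfilter.txt. A minimal sketch (example.com stands in for
your domain; the stock file ends with a catch-all "+." line, which this would
replace):

    # accept only the target domain and its subdomains
    +^https?://([a-z0-9-]+\.)*example\.com/
    # reject everything else
    -.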

Hope that helps.

jce

-- 
Jonathan Cooper-Ellis
Field Enablement Engineer
<http://www.cloudera.com>