Posted to user@nutch.apache.org by "Meraj A. Khan" <me...@gmail.com> on 2014/09/19 07:00:45 UTC

Running multiple fetch map tasks on a Hadoop Cluster.

Hello Folks,

I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.

Based on Julien's suggestion, I am using the bin/crawl script and made the
following tweaks to trigger a fetch with multiple map tasks; however, I am
unable to do so.

1. Added maxNumSegments and numFetchers parameters to the generate phase.
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
-maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter

2. Removed the topN parameter, and removed the noParsing parameter because I
want parsing to happen at fetch time.
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
$CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
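As a quick sanity check on whether the first tweak took effect, one could count the segment directories after the generate step finishes. This is a hypothetical snippet, not part of bin/crawl; it assumes the standard timestamped layout under $CRAWL_PATH/segments, and the mkdir merely stands in for a real generate run:

```shell
#!/bin/sh
# Hypothetical check: how many segments did the generate phase produce?
# With -maxNumSegments N we would expect up to N timestamped directories.
CRAWL_PATH=${CRAWL_PATH:-./crawl}
mkdir -p "$CRAWL_PATH/segments/20140919070001"  # stand-in for a generated segment

# Count the segment directories (each generate run names them by timestamp).
count=$(ls -d "$CRAWL_PATH"/segments/*/ 2>/dev/null | wc -l | tr -d ' ')
echo "generate produced $count segment(s)"
```

If this keeps printing 1 despite -maxNumSegments being set, the generate phase is not splitting the fetch list into multiple segments.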

The generate phase is not generating more than one segment.

As a result, the fetch phase is not creating multiple map tasks. I also
believe that, the way the script is written, it does not allow the fetch to
fetch multiple segments in parallel even if the generate were to produce
multiple segments.

Can someone please let me know how they got the script to run in a
distributed Hadoop cluster? Or is there a different version of the script
that should be used?

Thanks.

Re: Running multiple fetch map tasks on a Hadoop Cluster.

Posted by "Meraj A. Khan" <me...@gmail.com>.
Jake,

I am not sure how to make that happen; every time I run the Nutch 1.7 job
on YARN, I see a single segment being generated and a single map task
being launched, underutilizing the capacity of the cluster and slowing the
crawl.

Are you suggesting I should be seeing multiple fetch map tasks for a
single segment? If so, I am not.

Thanks.
On Sep 19, 2014 5:13 PM, "Jake Dodd" <ja...@ontopic.io> wrote:

> Hi Meraj,
>
> Nutch and Hadoop abstract all of that for you, so you don’t need to worry
> about it. When you execute the fetch command for a segment, it will be
> parallelized across the nodes in your cluster.
>
> Cheers
>
> Jake
>
> On Sep 19, 2014, at 1:52 PM, Meraj A. Khan <me...@gmail.com> wrote:
>
> > Julien,
> >
> > How would you achieve parallelism then on a Hadoop cluster? Am I missing
> > something here? My understanding was that we could scale the crawl by
> > allowing fetch to happen in multiple map tasks on multiple nodes in a
> > Hadoop cluster; otherwise I am stuck sequentially crawling a large set
> > of urls spread across multiple domains.
> >
> > If that is indeed the way to scale the crawl, then we would need to
> > generate multiple segments at generate time so that these could be
> > fetched in parallel.
> >
> > So I guess I really need help in:
> >
> >
> >   1. Making the generate phase generate multiple segments
> >   2. Being able to fetch these segments in parallel.
> >
> >
> > Can you please let me know if my approach to scaling the crawl sounds right
> > to you?
> >
> >
> > Thanks; I much appreciate all the help I have gotten so far.
> >
> >
> >
> > On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche <
> > lists.digitalpebble@gmail.com> wrote:
> >
> >> The fetching operates segment by segment and won't fetch more than one at
> >> the same time. You can get the generation step to build multiple segments
> >> in one go but you'd need to modify the script so that the fetching step is
> >> called as many times as you have segments + you'd probably need to add some
> >> logic for detecting that they've all finished before you move on to the
> >> update step.
> >> Out of curiosity: why do you want to fetch multiple segments at the same
> >> time?
> >>
> >> On 19 September 2014 06:00, Meraj A. Khan <me...@gmail.com> wrote:
> >>
> >>> Hello Folks,
> >>>
> >>> I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
> >>>
> >>> Based on Julien's suggestion, I am using the bin/crawl script and made the
> >>> following tweaks to trigger a fetch with multiple map tasks; however, I am
> >>> unable to do so.
> >>>
> >>> 1. Added maxNumSegments and numFetchers parameters to the generate phase.
> >>> $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
> >>> -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
> >>>
> >>> 2. Removed the topN parameter, and removed the noParsing parameter because I
> >>> want parsing to happen at fetch time.
> >>> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
> >>> $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
> >>>
> >>> The generate phase is not generating more than one segment.
> >>>
> >>> As a result, the fetch phase is not creating multiple map tasks. I also
> >>> believe that, the way the script is written, it does not allow the fetch to
> >>> fetch multiple segments in parallel even if the generate were to produce
> >>> multiple segments.
> >>>
> >>> Can someone please let me know how they got the script to run in a
> >>> distributed Hadoop cluster? Or is there a different version of the script
> >>> that should be used?
> >>>
> >>> Thanks.
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble
> >>
>
>

Re: Running multiple fetch map tasks on a Hadoop Cluster.

Posted by Jake Dodd <ja...@ontopic.io>.
Hi Meraj,

Nutch and Hadoop abstract all of that for you, so you don’t need to worry about it. When you execute the fetch command for a segment, it will be parallelized across the nodes in your cluster.

Cheers

Jake

On Sep 19, 2014, at 1:52 PM, Meraj A. Khan <me...@gmail.com> wrote:

> Julien,
> 
> How would you achieve parallelism then on a Hadoop cluster? Am I missing
> something here? My understanding was that we could scale the crawl by
> allowing fetch to happen in multiple map tasks on multiple nodes in a
> Hadoop cluster; otherwise I am stuck sequentially crawling a large set
> of urls spread across multiple domains.
> 
> If that is indeed the way to scale the crawl, then we would need to
> generate multiple segments at generate time so that these could be
> fetched in parallel.
>
> So I guess I really need help in:
> 
> 
>   1. Making the generate phase generate multiple segments
>   2. Being able to fetch these segments in parallel.
> 
> 
> Can you please let me know if my approach to scaling the crawl sounds right
> to you?
> 
> 
> Thanks; I much appreciate all the help I have gotten so far.
> 
> 
> 
> On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
> 
>> The fetching operates segment by segment and won't fetch more than one at
>> the same time. You can get the generation step to build multiple segments
>> in one go but you'd need to modify the script so that the fetching step is
>> called as many times as you have segments + you'd probably need to add some
>> logic for detecting that they've all finished before you move on to the
>> update step.
>> Out of curiosity: why do you want to fetch multiple segments at the same
>> time?
>> 
>> On 19 September 2014 06:00, Meraj A. Khan <me...@gmail.com> wrote:
>> 
>>> Hello Folks,
>>> 
>>> I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
>>> 
>>> Based on Julien's suggestion, I am using the bin/crawl script and made the
>>> following tweaks to trigger a fetch with multiple map tasks; however, I am
>>> unable to do so.
>>> 
>>> 1. Added maxNumSegments and numFetchers parameters to the generate phase.
>>> $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
>>> -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
>>> 
>>> 2. Removed the topN parameter, and removed the noParsing parameter because I
>>> want parsing to happen at fetch time.
>>> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
>>> $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
>>> 
>>> The generate phase is not generating more than one segment.
>>> 
>>> As a result, the fetch phase is not creating multiple map tasks. I also
>>> believe that, the way the script is written, it does not allow the fetch to
>>> fetch multiple segments in parallel even if the generate were to produce
>>> multiple segments.
>>> 
>>> Can someone please let me know how they got the script to run in a
>>> distributed Hadoop cluster? Or is there a different version of the script
>>> that should be used?
>>> 
>>> Thanks.
>>> 
>> 
>> 
>> 
>> --
>> 
>> Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>> 


Re: Running multiple fetch map tasks on a Hadoop Cluster.

Posted by "Meraj A. Khan" <me...@gmail.com>.
Julien,

How would you achieve parallelism then on a Hadoop cluster? Am I missing
something here? My understanding was that we could scale the crawl by
allowing fetch to happen in multiple map tasks on multiple nodes in a
Hadoop cluster; otherwise I am stuck sequentially crawling a large set
of urls spread across multiple domains.

If that is indeed the way to scale the crawl, then we would need to
generate multiple segments at generate time so that these could be
fetched in parallel.

So I guess I really need help in:


   1. Making the generate phase generate multiple segments
   2. Being able to fetch these segments in parallel.


Can you please let me know if my approach to scaling the crawl sounds right
to you?


Thanks; I much appreciate all the help I have gotten so far.



On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> The fetching operates segment by segment and won't fetch more than one at
> the same time. You can get the generation step to build multiple segments
> in one go but you'd need to modify the script so that the fetching step is
> called as many times as you have segments + you'd probably need to add some
> logic for detecting that they've all finished before you move on to the
> update step.
> Out of curiosity: why do you want to fetch multiple segments at the same
> time?
>
> On 19 September 2014 06:00, Meraj A. Khan <me...@gmail.com> wrote:
>
> > Hello Folks,
> >
> > I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
> >
> > Based on Julien's suggestion, I am using the bin/crawl script and made the
> > following tweaks to trigger a fetch with multiple map tasks; however, I am
> > unable to do so.
> >
> > 1. Added maxNumSegments and numFetchers parameters to the generate phase.
> > $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
> > -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
> >
> > 2. Removed the topN parameter, and removed the noParsing parameter because I
> > want parsing to happen at fetch time.
> > $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
> > $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
> >
> > The generate phase is not generating more than one segment.
> >
> > As a result, the fetch phase is not creating multiple map tasks. I also
> > believe that, the way the script is written, it does not allow the fetch to
> > fetch multiple segments in parallel even if the generate were to produce
> > multiple segments.
> >
> > Can someone please let me know how they got the script to run in a
> > distributed Hadoop cluster? Or is there a different version of the script
> > that should be used?
> >
> > Thanks.
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

Re: Running multiple fetch map tasks on a Hadoop Cluster.

Posted by Julien Nioche <li...@gmail.com>.
The fetching operates segment by segment and won't fetch more than one at
the same time. You can get the generation step to build multiple segments
in one go but you'd need to modify the script so that the fetching step is
called as many times as you have segments + you'd probably need to add some
logic for detecting that they've all finished before you move on to the
update step.
Out of curiosity: why do you want to fetch multiple segments at the same
time?
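To make that concrete, here is a rough sketch of the loop the modified script would need: launch one fetch per segment in the background and wait for all of them before updatedb. This is not the actual bin/crawl code; `fetch_segment` is a placeholder for the real `bin/nutch fetch $commonOptions <segment> -threads $numThreads` call, and the mkdir fakes two generated segments so the control flow can be run in isolation:

```shell
#!/bin/sh
# Sketch of the script change described above: launch one fetch job per
# generated segment in the background, then wait for all of them to
# finish before moving on to the updatedb step.
CRAWL_PATH=${CRAWL_PATH:-./crawl-demo}
mkdir -p "$CRAWL_PATH/segments/20140919000001" \
         "$CRAWL_PATH/segments/20140919000002"

fetch_segment() {
  # placeholder for: bin/nutch fetch $commonOptions "$1" -threads $numThreads
  echo "fetched $1" >> "$CRAWL_PATH/fetched.log"
}

for SEGMENT in "$CRAWL_PATH"/segments/*; do
  fetch_segment "$SEGMENT" &   # one background fetch job per segment
done
wait  # all fetch jobs must finish before the update step
echo "all segments fetched; safe to run updatedb"
```

The `wait` is the "detect that they've all finished" logic; a real version would also need to check each job's exit status so a failed fetch doesn't feed a bad segment into updatedb.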

On 19 September 2014 06:00, Meraj A. Khan <me...@gmail.com> wrote:

> Hello Folks,
>
> I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
>
> Based on Julien's suggestion, I am using the bin/crawl script and made the
> following tweaks to trigger a fetch with multiple map tasks; however, I am
> unable to do so.
>
> 1. Added maxNumSegments and numFetchers parameters to the generate phase.
> $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
> -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
>
> 2. Removed the topN parameter, and removed the noParsing parameter because I
> want parsing to happen at fetch time.
> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
> $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
>
> The generate phase is not generating more than one segment.
>
> As a result, the fetch phase is not creating multiple map tasks. I also
> believe that, the way the script is written, it does not allow the fetch to
> fetch multiple segments in parallel even if the generate were to produce
> multiple segments.
>
> Can someone please let me know how they got the script to run in a
> distributed Hadoop cluster? Or is there a different version of the script
> that should be used?
>
> Thanks.
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble