Posted to user@nutch.apache.org by "Meraj A. Khan" <me...@gmail.com> on 2014/08/28 07:47:33 UTC

Nutch 1.7 fetch happening in a single map task.

Hi All,

I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there
is only a single reducer in the generate-partition job. As a result, the
subsequent fetch runs in only a single map task (I believe as a consequence
of the single reducer in the earlier phase). How can I force Nutch to fetch
in multiple map tasks? Is there a setting that forces more than one reducer
in the generate-partition job so that the fetch gets more map tasks?

Please also note that I have commented out the code in Crawl.java that runs
the link inversion phase, as I don't need the scoring of the URLs that Nutch
crawls; every URL is equally important to me.

Thanks.

Re: Nutch 1.7 fetch happening in a single map task.

Posted by Julien Nioche <li...@gmail.com>.
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script

just go to runtime/deploy/bin and run the script from there.
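
For example, from the deploy directory (a sketch: the seed dir, crawl dir,
Solr URL and number of rounds below are placeholders, and the exact
arguments of the 1.7 script may differ slightly):

  cd runtime/deploy
  # submits each phase (inject, generate, fetch, parse, updatedb, ...)
  # as a MapReduce job on the cluster
  bin/crawl urls crawl http://localhost:8983/solr/ 2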

Julien


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch 1.7 fetch happening in a single map task.

Posted by "Meraj A. Khan" <me...@gmail.com>.
Hi Julien,

I have 15 domains and they are all being fetched in a single map task,
which does not fetch all the URLs no matter what depth or topN I give.

I am submitting the Nutch job jar, which seems to use the Crawl.java
class. How do I use the crawl script on a Hadoop cluster? Are there any
pointers you can share?

Thanks.

Re: Nutch 1.7 fetch happening in a single map task.

Posted by Julien Nioche <li...@gmail.com>.
Hi Meraj,

The generator will place all the URLs in a single segment if they all
belong to the same host, for politeness reasons. Otherwise it will use
whichever value is passed with the -numFetchers parameter in the generate
step.
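
For example, to ask for four fetch lists (and hence four fetch map tasks)
in the generate step (a sketch; the crawldb and segments paths are just
placeholders):

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 4

With 15 distinct hosts the URLs can then be spread across the four lists;
by default the partitioning is per host (see the partition.url.mode
property, which can also be set to byDomain or byIP).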

Why don't you use the crawl script in /bin instead of tinkering with the
(now deprecated) Crawl class? It comes with a good default configuration
and should make your life easier.
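
For reference, what the script (and the old Crawl class) chains together is
roughly the following sequence of jobs per round; the paths and segment
name are placeholders, and this is a sketch rather than the script
verbatim:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -numFetchers 4
  bin/nutch fetch crawl/segments/20140829000000
  bin/nutch parse crawl/segments/20140829000000
  bin/nutch updatedb crawl/crawldb crawl/segments/20140829000000
  # link inversion, the step you commented out, would normally follow:
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments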

Julien


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble