Posted to user@nutch.apache.org by Karol Rybak <ka...@gmail.com> on 2007/10/16 12:28:01 UTC

Hadoop fetch jobs

Hello, I've successfully set up a cluster of 3 machines under Hadoop. However, I
have a problem. While fetching, Hadoop generates 6 jobs, but the number of
pages in each of those jobs is not spread equally: I get 5 jobs with ~3,500
pages and one with ~50,000. That's not good, as the 5 small jobs finish very
quickly and afterwards only one machine is working while the others are
waiting. Could this be a problem with my configuration? I've set the number of
map jobs to 30, the number of reduce jobs to 6, and fetcher threads to 150,
yet during the fetch I still get only 6 map jobs. Any help would be
appreciated, thanks.

-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology
and Management
+48(17)8661277

Re: Hadoop fetch jobs

Posted by Karol Rybak <ka...@gmail.com>.
Actually, setting -noParsing helped, but only a bit: I now get about 6,000 pages
fetched per job (1,000 earlier). I'll try using fetch instead of fetch2 and hope
that helps. Another question: how do I control the number of fetch jobs? They
do not behave like typical map jobs.
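For reference, the number of fetch map tasks is fixed by how many fetch lists the generate step writes out, not by mapred.map.tasks. A minimal sketch of the 0.9-era command line (the -numFetchers flag and the paths are assumptions and may differ by Nutch version):

```shell
# Generate a segment with 6 fetch lists, one per fetch map task.
# -topN caps the total URLs selected for this cycle; -numFetchers sets
# how many partitions (and thus fetch map tasks) are created.
# crawl/crawldb and crawl/segments are hypothetical paths.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 6
```

Since URLs are partitioned by host, a single very large host can still dominate one of those lists, which is the skew described above.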

On 10/18/07, Karol Rybak <ka...@gmail.com> wrote:
>
> Well, that's not the case. I have found out that those jobs have the proper
> number of pages; however, they end prematurely as the fetcher fails with an
> out-of-memory exception. Now I'm trying to fetch without parsing; we'll see
> what happens...
>
> On 10/16/07, Dennis Kubes <kubes@apache.org> wrote:
> >
> > This is because some of the websites you are fetching have an unusually
> > large number of pages.  Since Nutch partitions by hostname, all of these
> > pages get assigned to a single fetcher.  The way to avoid this is to set
> > a maximum number of pages per site through the generate.max.per.host
> > configuration variable.  In production we have this set to 10.
> >
> > The downside is that for some very large sites whose content you may want
> > to fetch in full (e.g., Wikipedia), only the top 10 pages per site will
> > be fetched each cycle.
> >
> > Dennis
> >



-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology
and Management
+48(17)8661277

Re: Hadoop fetch jobs

Posted by Karol Rybak <ka...@gmail.com>.
Well, that's not the case. I have found out that those jobs have the proper
number of pages; however, they end prematurely as the fetcher fails with an
out-of-memory exception. Now I'm trying to fetch without parsing; we'll see
what happens...

On 10/16/07, Dennis Kubes <ku...@apache.org> wrote:
>
> This is because some of the websites you are fetching have an unusually
> large number of pages.  Since Nutch partitions by hostname, all of these
> pages get assigned to a single fetcher.  The way to avoid this is to set
> a maximum number of pages per site through the generate.max.per.host
> configuration variable.  In production we have this set to 10.
>
> The downside is that for some very large sites whose content you may want
> to fetch in full (e.g., Wikipedia), only the top 10 pages per site will
> be fetched each cycle.
>
> Dennis
>



-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology
and Management
+48(17)8661277

Re: Hadoop fetch jobs

Posted by Dennis Kubes <ku...@apache.org>.
This is because some of the websites you are fetching have an unusually
large number of pages.  Since Nutch partitions by hostname, all of these
pages get assigned to a single fetcher.  The way to avoid this is to set
a maximum number of pages per site through the generate.max.per.host
configuration variable.  In production we have this set to 10.

The downside is that for some very large sites whose content you may want
to fetch in full (e.g., Wikipedia), only the top 10 pages per site will
be fetched each cycle.
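This setting would live in conf/nutch-site.xml; a minimal sketch of the property (the name and the value 10 come from the message above; the description text is mine):

```xml
<!-- Place inside the <configuration> element of conf/nutch-site.xml -->
<property>
  <name>generate.max.per.host</name>
  <value>10</value>
  <description>Maximum number of URLs per host allowed into a single
  generated fetch list; -1 disables the limit.</description>
</property>
```

Raising the value trades better per-cycle coverage of large sites for longer-running fetch map tasks on the hosts that hit the cap.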

Dennis
