Posted to user@pig.apache.org by hyoung jun kim <ha...@gmail.com> on 2008/05/07 09:57:32 UTC

how many map tasks?

Hi all,
I read "pig_hadoopsummit.pdf" and tried it.
I made a 320MB file (visit) in dir1 and a 20MB file (page) in dir2.
And I ran this script:

Visits = load '/dir1/visit' as (user, url, time);
Visits = foreach Visits generate user, url, time;
Pages = load '/dir2/page' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
Results = foreach UserVisits generate group, AVG(VP.pagerank) as avgpr;
store Results into '/data/users';

I expected 6 maps (320MB/64MB) + 1 map (20MB).
But Hadoop made 2 map tasks and 1 reduce task.
Why did Hadoop make only 2 map tasks?

Test environment:
 - 5-node hadoop cluster
 - hadoop 0.16.3
 - pig updated from svn repository on May 7

Re: how many map tasks?

Posted by hyoung jun kim <ha...@gmail.com>.
Thanks, but I checked the hadoop configuration.
The "dfs.block.size" value is 67108864, and I also checked the input file's
number of blocks.
The input file has 6 blocks.
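(One way to double-check this, assuming the path from the script above, is HDFS's fsck tool, which lists the blocks behind a file; it needs to run against the live cluster:

```shell
# Show the file, its blocks, and their lengths under the input path.
# "/dir1/visit" is the path from the script above; adjust for your cluster.
hadoop fsck /dir1/visit -files -blocks
```

The length of each reported block shows the effective block size the file was written with.)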


2008/5/8 Vitthal Gogate <go...@yahoo-inc.com>:

> Sorry, I meant check the "dfs.block.size" parameter in the hadoop-site.xml
> file in the $HADOOP_HOME/conf directory; it may be configured as 128MB.
>
> Sorry, in the following reply, I kind of assumed the default size was 128MB :)
>
> regards

Re: how many map tasks?

Posted by Vitthal Gogate <go...@yahoo-inc.com>.
Sorry, I meant check the "dfs.block.size" parameter in the hadoop-site.xml
file in the $HADOOP_HOME/conf directory; it may be configured as 128MB.

Sorry, in the following reply, I kind of assumed the default size was 128MB :)

regards
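(For reference, an overriding entry in hadoop-site.xml would look like the fragment below, with 134217728 bytes = 128MB. Note that dfs.block.size only affects files written after the change:

```xml
<!-- hadoop-site.xml fragment: raises the block size from the 64MB
     default (67108864) to 128MB. Existing files keep the block size
     they were created with. -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
```
)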



Re: how many map tasks?

Posted by Vitthal Gogate <go...@yahoo-inc.com>.
I assume the block size is 128MB. I guess it is specified in the hadoop site
configuration file. Also, for a join, the mapred program creates the number of
map tasks based on the combined size of both tables/files being joined.

-regards, Suhas
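(The block-size dependence above can be sketched with the split-size rule that Hadoop's old-API FileInputFormat uses: splitSize = max(minSize, min(goalSize, blockSize)), where goalSize = totalSize / requested maps. The sketch below is illustrative only; the real splitter also allows some slop on the last split, so counts are approximate:

```java
// Sketch of the split-size rule in Hadoop's (old mapred API)
// FileInputFormat. The number of map tasks is roughly
// totalSize / splitSize, so a larger dfs.block.size means fewer,
// larger splits and hence fewer maps.
public class SplitEstimate {
    static long splitSize(long totalSize, int requestedMaps,
                          long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(requestedMaps, 1);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    static long numSplits(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        // 320MB input with 64MB blocks -> 64MB splits -> 5 maps
        long s64 = splitSize(320 * MB, 2, 1, 64 * MB);
        System.out.println(numSplits(320 * MB, s64));   // 5
        // Same input with 128MB blocks -> 128MB splits -> 3 maps
        long s128 = splitSize(320 * MB, 2, 1, 128 * MB);
        System.out.println(numSplits(320 * MB, s128));  // 3
    }
}
```
)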

