Posted to user@hive.apache.org by Josh Ferguson <jo...@besquared.net> on 2009/01/27 08:27:41 UTC

Job Speed

So I have a table with roughly 145,000 records spread across 300
files. The total size is about 7MB. Right now I'm running one job
tracker and one task tracker on a high-CPU Amazon box (1.7 GB of RAM,
~4 cores). I run the following query:

SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;

And it takes about 35 minutes to finish. One of my problems is that I
can't get my task tracker to process more than one map at a time, even
though its maximum number of map tasks is set higher. But even that is
relatively fast compared to the reduce, which takes about 30 minutes
by itself. The status of the task is:

reduce > copy (225 of 344 at 0.01 MB/s) >

I really don't understand what is going on during this copy step or
why it is taking so long. The files are small and they're all inside
Amazon's network. Can you guys help me out?

Josh F.
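
A note for anyone debugging the same symptom: in Hadoop of this era, the
number of map tasks a tasktracker runs concurrently comes from
mapred.tasktracker.map.tasks.maximum, which the TaskTracker daemon reads
from hadoop-site.xml once at startup. Setting it in a job or Hive session
has no effect; it has to be changed on the tasktracker machine and the
daemon restarted. A minimal excerpt, assuming a stock 0.18/0.19-era
configuration:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- map tasks this tasktracker may run at once; read only at daemon startup -->
  <value>4</value>
</property>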

Re: Job Speed

Posted by jason hadoop <ja...@gmail.com>.
I just realized this was a Hive question. I have no experience with
Hive, so my advice may not apply.

Re: Job Speed

Posted by jason hadoop <ja...@gmail.com>.
It is not clear to me from your email whether you have the number of map
tasks per machine set to > 1, or whether you are attempting to use a
multi-threaded mapper.

How many tasks does the system split your job into, and how many execute
at once? My first guess is that you are getting 300 map tasks, each
running for only a few seconds, with most of that time going to task
setup.

As a first try, you could pack your 300 small files into as many files
as you have simultaneous task execution slots, and adjust the input
split size (probably not necessary) to ensure there is no further
splitting.

The reduces all essentially stall until all of the map tasks are done, so
the reduce copy speed is a misleading value.
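
One way to do that packing without leaving Hive is to bounce the data
through a forced reduce stage, since the number of reducers then bounds
the number of output files. A rough sketch, with activities_packed as a
hypothetical table with the same schema (and assuming your Hive build
supports DISTRIBUTE BY):

set mapred.reduce.tasks=4;                -- one output file per reducer
INSERT OVERWRITE TABLE activities_packed
SELECT * FROM activities
DISTRIBUTE BY actor_id;                   -- forces a shuffle instead of a map-only copy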

Re: Job Speed

Posted by Josh Ferguson <jo...@besquared.net>.
Yeah, so I am loading 344 files, each one taking just under 1 second
according to the log, which adds up to approximately 5 minutes. The
other 30 minutes are spent doing a "reduce > copy". I'm not sure why
it's so slow, because it's copying about 144,000 small records; the
total size is about 16MB after it's mapped. I think with this
particular query the slowness could be caused by the reduce task
itself. It's a distinct count, so perhaps the reducer code is running
extremely slowly? I will try to write my own tonight and see if it
goes any faster.
Josh F.
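
Before writing a custom reducer, one rewrite worth trying: a top-level
COUNT(DISTINCT ...) funnels every row through a single reducer to do the
de-duplication, while moving the de-dup into a subquery lets the group-by
stage spread across reducers. A sketch, untested against this schema:

SELECT COUNT(1)
FROM (
  SELECT actor_id
  FROM activities
  GROUP BY actor_id    -- de-duplication, spread across reducers
) uniques;             -- Hive requires an alias on a FROM subquery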

RE: Job Speed

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Hi Josh,

Copying a large number of small map outputs can take a while. I can't say why the tasktracker is not running more than one mapper.

We are working on this. HADOOP-4565 tracks a JIRA to create splits that cross files while preserving locality, and HIVE-74 will use 4565 on the Hive side to better control the number of maps.

Joydeep
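
For readers arriving later: HADOOP-4565 became CombineFileInputFormat,
and HIVE-74 wired it into Hive as CombineHiveInputFormat. On a release
that includes HIVE-74, enabling it might look like the following sketch
(the split-size value is an illustrative choice, not a recommendation):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;   -- max bytes of input combined into one split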
