Posted to user@hive.apache.org by Marc Limotte <ms...@gmail.com> on 2013/01/31 00:16:02 UTC

delay before query starts processing

Hi,

I'm running on an Amazon EMR cluster with Hive 0.8.1.  We have a lot of
other Hadoop jobs, but only started experimenting with Hive recently.

I've been seeing a long pause between submitting a Hive query and the
actual start of the Hadoop job... 10 minutes or more in some cases.  I'm
wondering what's happening during this time.  Either a high-level answer
or a pointer to some logging I can turn on would help.

Here's some more detail.  I submit the query on the master using the Hive
CLI, and start to see some output right away...

Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>


*[then a long delay here: 10 minutes or more... no activity in the Hadoop
JobTracker UI]*


… and then it continues normally ...
Starting Job = job_201301160029_0082, Tracking URL =
http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
Kill Command = /home/hadoop/bin/hadoop job
 -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
Hadoop job information for Stage-1: number of mappers: 2; number of
reducers: 1
2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%
…

This query is processing in the neighborhood of 500 GB of data from S3.  A
couple of possibilities I thought of… perhaps someone can confirm or deny:
a) Is the data copied from S3 to HDFS during this time?
b) I have a fairly large set of libs in HIVE_AUX_JAR_PATH (~175 MB) -- does
it have to copy these around to the tasks at this time?

Any insights appreciated.

Marc

Re: delay before query starts processing

Posted by Ariel Marcus <ar...@openbi.com>.
From the archives:
http://mail-archives.apache.org/mod_mbox/hive-user/201110.mbox/%3CCAC9SPjuQtxOK1KtEmReD6OanNTgNM_uLkGQD+=n7KRcJcaLmGw@mail.gmail.com%3E

TL;DR set hive.optimize.s3.query=true;
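
A quick sketch of how to apply it (the query file name below is just a
placeholder, and as far as I know this flag is specific to Amazon's EMR
build of Hive):

-- in an interactive Hive CLI session, before running the query:
set hive.optimize.s3.query=true;

# or for a single invocation from the shell:
hive -hiveconf hive.optimize.s3.query=true -f my_query.sql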

---------------------------------
Ariel Marcus, Consultant
www.openbi.com | ariel.marcus@openbi.com
150 N Michigan Avenue, Suite 2800, Chicago, IL 60601
Cell: 314-827-4356


On Wed, Jan 30, 2013 at 6:16 PM, Marc Limotte <ms...@gmail.com> wrote:

> Hi,
>
> I'm running on an Amazon EMR cluster with Hive 0.8.1.  We have a lot of
> other Hadoop jobs, but only started experimenting with Hive recently.
>
> I've been seeing a long pause between submitting a Hive query and the
> actual start of the Hadoop job... 10 minutes or more in some cases.  I'm
> wondering what's happening during this time.  Either a high-level answer
> or a pointer to some logging I can turn on would help.
>
> Here's some more detail.  I submit the query on the master using the Hive
> CLI, and start to see some output right away...
>
> Total MapReduce jobs = 2
> Launching Job 1 out of 2
> Number of reduce tasks not specified. Estimated from input data size: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
>
>
> *[then a long delay here: 10 minutes or more... no activity in the Hadoop
> JobTracker UI]*
>
>
> … and then it continues normally ...
> Starting Job = job_201301160029_0082, Tracking URL =
> http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
> Kill Command = /home/hadoop/bin/hadoop job
>  -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
> Hadoop job information for Stage-1: number of mappers: 2; number of
> reducers: 1
> 2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%
> …
>
> This query is processing in the neighborhood of 500 GB of data from S3.  A
> couple of possibilities I thought of… perhaps someone can confirm or deny:
> a) Is the data copied from S3 to HDFS during this time?
> b) I have a fairly large set of libs in HIVE_AUX_JAR_PATH (~175 MB) -- does
> it have to copy these around to the tasks at this time?
>
> Any insights appreciated.
>
> Marc
>
>
>
>


Re: delay before query starts processing

Posted by Marc Limotte <ms...@gmail.com>.
Ariel,
Setting hive.optimize.s3.query=true seems to help. I'm surprised, though,
because the information I can find online about that config suggests it is
related to tables with a large number of partitions. I have a lot of files
but only one partition. Still, it seems to help.
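
In case it's useful context, a rough way to count the files behind a table
would be something like this (the table name and S3 path are placeholders
for my real ones):

-- in the Hive CLI: find the table's storage location
DESCRIBE FORMATTED my_table;

# then count the files under that location from the shell
# (the leading "Found N items" line inflates the count by one)
hadoop fs -ls s3n://my-bucket/path/to/table/ | wc -l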

Abdelrahman,
Thanks for the logging tip. I do want to know what it is doing, so this
should be helpful.

Marc

On Wed, Jan 30, 2013 at 3:23 PM, Abdelrahman Shettia <
ashettia@hortonworks.com> wrote:

> Hi Marc,
>
> You can try running the Hive client with debug mode on and see what it is
> trying to do at the JobTracker (JT) level:
> hive -hiveconf hive.root.logger=ALL,console -e "DDL;"
> hive -hiveconf hive.root.logger=ALL,console -f ddl.sql
>
> Hope this helps.
>
> Thanks
> -Abdelrahman
>
>
> On Wed, Jan 30, 2013 at 3:16 PM, Marc Limotte <ms...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running on an Amazon EMR cluster with Hive 0.8.1.  We have a lot
>> of other Hadoop jobs, but only started experimenting with Hive recently.
>>
>> I've been seeing a long pause between submitting a Hive query and the
>> actual start of the Hadoop job... 10 minutes or more in some cases.  I'm
>> wondering what's happening during this time.  Either a high-level answer
>> or a pointer to some logging I can turn on would help.
>>
>> Here's some more detail.  I submit the query on the master using the Hive
>> CLI, and start to see some output right away...
>>
>> Total MapReduce jobs = 2
>> Launching Job 1 out of 2
>> Number of reduce tasks not specified. Estimated from input data size: 1
>> In order to change the average load for a reducer (in bytes):
>>   set hive.exec.reducers.bytes.per.reducer=<number>
>> In order to limit the maximum number of reducers:
>>   set hive.exec.reducers.max=<number>
>> In order to set a constant number of reducers:
>>   set mapred.reduce.tasks=<number>
>>
>>
>> *[then a long delay here: 10 minutes or more... no activity in the
>> Hadoop JobTracker UI]*
>>
>>
>> … and then it continues normally ...
>> Starting Job = job_201301160029_0082, Tracking URL =
>> http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
>> Kill Command = /home/hadoop/bin/hadoop job
>>  -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
>> Hadoop job information for Stage-1: number of mappers: 2; number of
>> reducers: 1
>> 2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%
>> …
>>
>> This query is processing in the neighborhood of 500 GB of data from S3.  A
>> couple of possibilities I thought of… perhaps someone can confirm or deny:
>> a) Is the data copied from S3 to HDFS during this time?
>> b) I have a fairly large set of libs in HIVE_AUX_JAR_PATH (~175 MB) -- does
>> it have to copy these around to the tasks at this time?
>>
>> Any insights appreciated.
>>
>> Marc
>>
>>
>>
>>
>

Re: delay before query starts processing

Posted by Abdelrahman Shettia <as...@hortonworks.com>.
Hi Marc,

You can try running the Hive client with debug mode on and see what it is
trying to do at the JobTracker (JT) level:
hive -hiveconf hive.root.logger=ALL,console -e "DDL;"
hive -hiveconf hive.root.logger=ALL,console -f ddl.sql
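
For a long-running query it can also help to capture that console logging to
a file; in the default Hive log4j setup the console appender typically writes
to stderr, so a redirect along these lines should work (the log file path is
just an example):
hive -hiveconf hive.root.logger=ALL,console -f ddl.sql 2> /tmp/hive-console.log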

Hope this helps.

Thanks
-Abdelrahman


On Wed, Jan 30, 2013 at 3:16 PM, Marc Limotte <ms...@gmail.com> wrote:

> Hi,
>
> I'm running on an Amazon EMR cluster with Hive 0.8.1.  We have a lot of
> other Hadoop jobs, but only started experimenting with Hive recently.
>
> I've been seeing a long pause between submitting a Hive query and the
> actual start of the Hadoop job... 10 minutes or more in some cases.  I'm
> wondering what's happening during this time.  Either a high-level answer
> or a pointer to some logging I can turn on would help.
>
> Here's some more detail.  I submit the query on the master using the Hive
> CLI, and start to see some output right away...
>
> Total MapReduce jobs = 2
> Launching Job 1 out of 2
> Number of reduce tasks not specified. Estimated from input data size: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
>
>
> *[then a long delay here: 10 minutes or more... no activity in the Hadoop
> JobTracker UI]*
>
>
> … and then it continues normally ...
> Starting Job = job_201301160029_0082, Tracking URL =
> http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
> Kill Command = /home/hadoop/bin/hadoop job
>  -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
> Hadoop job information for Stage-1: number of mappers: 2; number of
> reducers: 1
> 2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%
> …
>
> This query is processing in the neighborhood of 500 GB of data from S3.  A
> couple of possibilities I thought of… perhaps someone can confirm or deny:
> a) Is the data copied from S3 to HDFS during this time?
> b) I have a fairly large set of libs in HIVE_AUX_JAR_PATH (~175 MB) -- does
> it have to copy these around to the tasks at this time?
>
> Any insights appreciated.
>
> Marc
>
>
>
>