Posted to user@hive.apache.org by sh...@accenture.com on 2014/05/31 00:05:35 UTC

Need urgent help on hive query performance

Hi,

Can anybody urgently help with optimizing Hive query performance? I am looking at this more from a Hadoop tuning point of view. Currently, even a small table takes a long time to query.

We are running an EMR cluster with 1 master node, 2 core nodes, and task nodes.

Quick help is much appreciated.

Thanks,
Shouvanik

________________________________

This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy.
______________________________________________________________________________________

www.accenture.com

Re: Need urgent help on hive query performance

Posted by Ashish Garg <ga...@gmail.com>.
For example, create a table partitioned by country and state:

hive> CREATE EXTERNAL TABLE Emp(
    > id INT,
    > name STRING,
    > Salary INT)
    > PARTITIONED BY (Country STRING, State STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LOCATION '/user/data/';

Next, load the data for a specific partition. For example:

hive> LOAD DATA LOCAL INPATH '---'
    > OVERWRITE INTO TABLE Emp
    > PARTITION (Country='US', State='NJ');

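Now run queries that filter on the partition columns, such as:

hive> SELECT COUNT(*), MAX(Salary) FROM Emp WHERE Country='US' AND State='NJ';

Because the WHERE clause filters on the partition columns, Hive reads only the Country=US/State=NJ partition instead of scanning the whole table, which should improve query performance.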
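
To check that a query really is limited to one partition, something like the following sketch may help. It reuses the Emp table from the example above. The strict-mode setting is an optional safeguard that I am assuming is acceptable on your cluster; it is not something the original example requires.

```sql
-- List the partitions Hive knows about for the table.
SHOW PARTITIONS Emp;

-- Inspect the plan; the input paths should mention only the
-- Country=US/State=NJ partition if pruning is taking effect.
EXPLAIN EXTENDED
SELECT COUNT(*), MAX(Salary)
FROM Emp
WHERE Country = 'US' AND State = 'NJ';

-- Optional safeguard (assumption: acceptable for your workload).
-- In strict mode Hive rejects queries on partitioned tables that
-- have no filter on a partition column.
SET hive.mapred.mode=strict;
```

With strict mode on, a query such as SELECT COUNT(*) FROM Emp with no partition filter should fail fast rather than scan every partition.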


On Fri, May 30, 2014 at 6:32 PM, <sh...@accenture.com> wrote:

>  Can you please give a specific example or blog to refer to. I did not
> understand

RE: Need urgent help on hive query performance

Posted by sh...@accenture.com.
Thanks to all for your suggestions. I really appreciate them.

But we have a constraint on Amazon EMR. It would be great if I could get any pointers on how to tune the Hadoop configuration (e.g. core-site.xml, mapred-site.xml, etc.) so that Hive queries execute faster.

Please help ASAP. Sorry for the urgency.

Thanks,
Shouvanik


Re: Need urgent help on hive query performance

Posted by Bala Krishna Gangisetty <ba...@altiscale.com>.
Another dimension:

Try storing the Hive table in ORC format. In my experience, it significantly
improves performance compared to other formats.
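
A minimal sketch of what an ORC conversion could look like, assuming the partitioned Emp table from Ashish's example in this thread. The table name emp_orc and the SNAPPY compression choice are illustrative assumptions, not something tested against the original poster's data.

```sql
-- Hypothetical ORC copy of the Emp table (name emp_orc is an assumption).
CREATE TABLE emp_orc (
  id INT,
  name STRING,
  salary INT)
PARTITIONED BY (country STRING, state STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Let Hive create the partitions from the data being inserted.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Rewrite the delimited-text rows as ORC.
-- The partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE emp_orc PARTITION (country, state)
SELECT id, name, salary, country, state
FROM Emp;
```

Queries would then be pointed at emp_orc instead of Emp. ORC is available from Hive 0.11 onward, so it should apply to Hive 0.12.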

Since you mentioned join queries, on a side note, as a long-term goal you
probably want to explore Hive with Tez.
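
As a rough sketch, switching engines is a per-session setting. This assumes Hive 0.13 or later with Tez installed on the cluster, so it would not apply as-is to Hive 0.12.

```sql
-- Assumption: Hive 0.13+ with Tez available on the cluster.
-- Run subsequent queries in this session on Tez.
SET hive.execution.engine=tez;

-- Switch back to classic MapReduce if needed.
SET hive.execution.engine=mr;
```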

--Bala G.



Re: Need urgent help on hive query performance

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
> It has innumerable no of joins. Since its client specific query, u
> understand I cannot share. Sorry about that

As I said, joins are slow and, if not done correctly, can have terrible
performance. Which techniques help depends on exactly how you are
performing the join. For instance, if you are joining a smaller table to a
larger one, a map join could work well: the smaller table is kept in memory
while the join is performed. Also, if you are able to break your tables down
into smaller buckets, you might be able to use a bucketed map join. The
following links should be helpful [1][2].
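
A minimal sketch of the map join idea, assuming two hypothetical tables: a large fact table sales and a small lookup table stores. The table names, column names, and size threshold are illustrative and not taken from the actual query.

```sql
-- Let Hive convert a common join into a map join automatically
-- when one side is small enough to fit in memory.
SET hive.auto.convert.join=true;

-- Size threshold in bytes below which a table counts as "small"
-- (25000000 is the usual default; tune it to the task memory).
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical tables: sales is large, stores is small.
-- stores is loaded into memory on each mapper, so this join
-- needs no reduce phase.
SELECT s.store_name, f.amount
FROM sales f
JOIN stores s ON f.store_id = s.store_id;
```

Running EXPLAIN on the query should show a Map Join Operator if the conversion happened.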

Hope this helps.

[1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
[2] http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables
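
The bucketed map join mentioned above could be sketched roughly as follows. The tables sales, stores, and their bucketed copies are hypothetical, and the bucket counts (32 and 8) are arbitrary. The general requirements are that both tables are bucketed on the join key and that one bucket count is a multiple of the other. Whether the MAPJOIN hint is needed depends on the Hive version and settings, so treat this as a starting point rather than a tested recipe.

```sql
-- Make INSERTs honor the declared bucketing (needed on older Hive).
SET hive.enforce.bucketing=true;

-- Hypothetical bucketed copies, both clustered on the join key.
CREATE TABLE sales_bucketed (sale_id INT, store_id INT, amount DOUBLE)
CLUSTERED BY (store_id) INTO 32 BUCKETS;

CREATE TABLE stores_bucketed (store_id INT, store_name STRING)
CLUSTERED BY (store_id) INTO 8 BUCKETS;

INSERT OVERWRITE TABLE sales_bucketed
SELECT sale_id, store_id, amount FROM sales;

INSERT OVERWRITE TABLE stores_bucketed
SELECT store_id, store_name FROM stores;

-- Enable the bucket map join optimization and honor the hint.
SET hive.optimize.bucketmapjoin=true;
SET hive.ignore.mapjoin.hint=false;

-- Each mapper loads only the matching buckets of stores_bucketed.
SELECT /*+ MAPJOIN(s) */ s.store_name, SUM(b.amount)
FROM sales_bucketed b
JOIN stores_bucketed s ON b.store_id = s.store_id
GROUP BY s.store_name;
```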





-- 
Swarnim

RE: Need urgent help on hive query performance

Posted by sh...@accenture.com.
Please find the answers below.



From: kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
Sent: Friday, May 30, 2014 3:34 PM
To: user@hive.apache.org
Subject: Re: Need urgent help on hive query performance

I feel it's pretty hard to answer this without understanding the following:


1. What exactly are you trying to query? CSV? Avro? ....
A Hive table.

2. Where is your data? HDFS? HBase? Local filesystem?
The data is in S3.

3. What version of hive are you using?
Hive 0.12

4. What is an example of a query that is slow? Some queries, like joins, are inherently slower than simpler ones (though they can be optimized).
It has a very large number of joins. Since it is a client-specific query, I cannot share it. Sorry about that.


Re: Need urgent help on hive query performance

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
I feel it's pretty hard to answer this without understanding the following:

1. What exactly are you trying to query? CSV? Avro? ....
2. Where is your data? HDFS? HBase? Local filesystem?
3. What version of hive are you using?
4. What is an example of a query that is slow? Some queries, like joins,
are inherently slower than simpler ones (though they can be optimized).

Thanks,

-- 
Swarnim



RE: Need urgent help on hive query performance

Posted by sh...@accenture.com.
Can you please give a specific example or point me to a blog post? I did not understand.



Re: Need urgent help on hive query performance

Posted by Ashish Garg <ga...@gmail.com>.
Try partitioning the table and running queries that filter on the
partition columns. Hope this helps.
Thanks and Regards,
Ashish Garg.

