Posted to user@hive.apache.org by Nishant Aggarwal <ni...@gmail.com> on 2015/08/20 09:55:57 UTC

HIVE:1.2, Query taking huge time

Dear Hive Users,

I am running a PoC for one of my customers to demonstrate the huge
performance benefits of Hadoop/Big Data using Hive.

Following is the problem statement I am stuck with.

I have generated a large table with 28 columns (all of type double). The
table size on disk is 70GB, with more than 450 million records (I ultimately
created a compressed table using the ORC format to save disk space, bringing
the table size down to < 1GB).

To demonstrate a complex use case, I joined this table with itself. Below
are the queries I used to create the table and the join query I am running.

*Create Table and Load Data, Hive parameter settings:*
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set mapred.max.split.size=100000000;
set mapred.min.split.size=1000000;
set hive.auto.convert.join=false;
set hive.enforce.sorting=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set mapreduce.reduce.input.limit=-1;
set hive.exec.parallel = true;

CREATE TABLE huge_numeric_table_orc2 (
  col1 double, col2 double, col3 double, col4 double, col5 double, col6 double, col7 double,
  col8 double, col9 double, col10 double, col11 double, col12 double, col13 double, col14 double,
  col15 double, col16 double, col17 double, col18 double, col19 double, col20 double, col21 double,
  col22 double, col23 double, col24 double, col25 double, col26 double, col27 double, col28 double)
CLUSTERED BY (col1) SORTED BY (col1) INTO 240 BUCKETS
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");

from huge_numeric_table insert overwrite table huge_numeric_table_orc2
select * sort by col1;


*JOIN QUERY:*

select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
from huge_numeric_table_orc2 t1
left outer join huge_numeric_table_orc2 t2 on t1.col1 = t2.col1
where t1.col1 > 34.11 and t2.col1 > 10.12;


*The problem is that this query gets stuck at the reducers at 80-85%,
appears to loop, and never finishes.*

Version of Hive is 1.2.

Please help.


Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305

Re: HIVE:1.2, Query taking huge time

Posted by Nishant Aggarwal <ni...@gmail.com>.
Thanks All.
I will implement the suggested points and share the output.

Thanks again for all the help.

Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305

On Fri, Aug 21, 2015 at 10:33 AM, Jörn Franke <jo...@gmail.com> wrote:


Re: HIVE:1.2, Query taking huge time

Posted by Jörn Franke <jo...@gmail.com>.
Additionally, although it is a PoC, you should use a realistic data model,
and good data modeling practices should be taken into account. Joining on a
double is not one of them; the join key should be an int. Furthermore,
double is a type that is rarely used in most scenarios. In the business
world you usually have something like decimal(precision,scale), where you
define precision and scale exactly (usually (x,2) for anything related to
money), because rounding needs to be consistent across all analytics
applications.
You may then, of course, also partition the table.
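
A minimal sketch of what such a revised schema could look like, assuming a
hypothetical integer surrogate key (id), decimal measure columns, and a date
partition column; none of these names come from the original post:

CREATE TABLE huge_numeric_table_v2 (
  id BIGINT,                          -- hypothetical integer surrogate key used for joins
  amount1 DECIMAL(18,2),              -- fixed-scale measures instead of double
  amount2 DECIMAL(18,2)
  -- ... remaining measure columns ...
)
PARTITIONED BY (load_date STRING)     -- partition column so queries can prune data
CLUSTERED BY (id) SORTED BY (id) INTO 240 BUCKETS
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");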

On Thu, Aug 20, 2015 at 3:46 PM, Xuefu Zhang <xz...@cloudera.com> wrote:



Re: HIVE:1.2, Query taking huge time

Posted by Xuefu Zhang <xz...@cloudera.com>.
Please check out HIVE-11502. For your PoC, you can simply work around it by
using other data types instead of double.
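
One possible reading of that workaround, purely as an illustration (the cast
target and the choice to cast the join key are assumptions, not something
prescribed in this thread), is to join on a non-double representation of col1:

-- hypothetical sketch: join on a fixed-point representation of the key
select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
from huge_numeric_table_orc2 t1
left outer join huge_numeric_table_orc2 t2
  on cast(t1.col1 as decimal(20,4)) = cast(t2.col1 as decimal(20,4))
where t1.col1 > 34.11 and t2.col1 > 10.12;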

On Thu, Aug 20, 2015 at 2:08 AM, Nishant Aggarwal <ni...@gmail.com>
wrote:



Re: HIVE:1.2, Query taking huge time

Posted by Nishant Aggarwal <ni...@gmail.com>.
Thanks for the reply, Noam. I have already tried the latter point of
dividing the query, but the challenge comes during the join of the table.


Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305


On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson <no...@kenshoo.com>
wrote:



Re: HIVE:1.2, Query taking huge time

Posted by Noam Hasson <no...@kenshoo.com>.
Hi,

Have you looked at the counters on the Hadoop side? It's possible you are
dealing with a bad join that causes a multiplication of rows; if you see a
huge number of input/output records in the map/reduce phases that keeps
increasing, that's probably the case.
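
As a quick way to check for that kind of row multiplication from the Hive
side (a hypothetical diagnostic query, not something taken from this
thread), you could count how many rows share each join-key value:

-- hypothetical check: how duplicated is the join key?
select col1, count(*) as cnt
from huge_numeric_table_orc2
where col1 > 10.12
group by col1
having count(*) > 1
order by cnt desc
limit 20;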

Another thing I would try is to divide the job into several smaller
queries, for example starting with the filter only, after that the join,
and so on.
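
A rough sketch of that staged approach, reusing the table and column names
from the original query (the intermediate tables themselves are just an
illustration):

-- stage 1: apply the filters first and materialize the smaller inputs
create table filtered_t1 stored as orc as
select col1, col6, col11, col16
from huge_numeric_table_orc2
where col1 > 34.11;

create table filtered_t2 stored as orc as
select col1
from huge_numeric_table_orc2
where col1 > 10.12;

-- stage 2: join and aggregate over the pre-filtered tables
select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
from filtered_t1 t1
left outer join filtered_t2 t2 on t1.col1 = t2.col1;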

Noam.

On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal <ni...@gmail.com>
wrote:

