You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Gourav Sengupta <go...@gmail.com> on 2016/06/09 14:14:04 UTC

HIVE Query 25x faster than SPARK Query

Hi,

Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
and is there a way we can optimize the queries in SPARK without the obvious
hack in Query2.


-----------------------
ENVIRONMENT:
-----------------------

> Table A 533 columns x 24 million rows and Table B has 2 columns x 3
million rows. Both the files are single gzipped csv file.
> Both table A and B are external tables in AWS S3 and created in HIVE
accessed through SPARK using HiveContext
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
allowMaximumResource allocation and node types are c3.4xlarge).

--------------
QUERY1:
--------------
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null;



This query takes 4 mins in HIVE and 1.1 hours in SPARK


--------------
QUERY 2:
--------------

select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4.5 mins in SPARK



Regards,
Gourav Sengupta

Re: HIVE Query 25x faster than SPARK Query

Posted by Mich Talebzadeh <mi...@gmail.com>.

I still fail to see how Hive can do orders of magnitude faster compared to
Spark.

Assuming that Hive is using map-reduce, I cannot see a real case for Hive
to do faster than at least under normal operations

Don't take me wrong. I am a fan of Hive. The performance of Hive comes from
deploying the execution engine (mr, spark, tez) to do the execution of the
work.

If we leave that aside for now the other influencing factor would be Hive
Optimizer compared to Spark Optimizer.

If I go back to thread owner point and quote:

"Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
and is there a way we can optimize the queries in SPARK without the obvious
hack in Query2.

Table A 533 columns x 24 million rows and Table B has 2 columns x 3 million
rows. Both the files are single gzipped csv file.
> Both table A and B are external tables in AWS S3 and created in HIVE
accessed through SPARK using HiveContext
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
allowMaximumResource allocation and node types are c3.4xlarge).

To take it further and make some reasonable deduction:

With gzipped files:

   1. Hive will not be able to split the csv files into chunks/blocks and
   run multiple maps in parallel
   2.  Spark will give you an RDD with only 1 partition (as of 0.9.0). This
   is because gzipped files are not splittable If you do not repartition the
   RDD, any operations on that RDD will be limited to a single core.

So with zipped files both Hive and Spark have issues. both tables have not
a very large number of rows. With Spark were temporary tables deployed that
IMO does help performance. It is possible that Spark has been spilling to
disk.

We really need the output from GUI jobs, stages and spillage like below to
deduce if there was indeed spillage to disk by Spark see (TungstenAggregate)





HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 21:40, Gourav Sengupta <go...@gmail.com> wrote:

> Hi Stephen,
>
>
> How can a single gzipped CSV file be partitioned and who partitions tables
> based on Primary Key in Hive?  If you read the environments section you
> will be able to see that all the required details are mentioned.
>
> As far as I understand that Hive does work 25x faster (in these particular
> cases) and around 100x faster (when we are using TEZ) when compared to
> SPARK.
>
> It will be interesting to see if Ted includes these findings while they
> are benchmarking SPARK. This is a very typical and a general used case.
>
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 5:11 PM, Stephen Boesch <ja...@gmail.com> wrote:
>
>> ooc are the tables partitioned on a.pk and b.fk?  Hive might be using
>> copartitioning in that case: it is one of hive's strengths.
>>
>> 2016-06-09 7:28 GMT-07:00 Gourav Sengupta <go...@gmail.com>:
>>
>>> Hi Mich,
>>>
>>> does not Hive use map-reduce? I thought it to be so. And since I am
>>> running the queries in EMR 4.6 therefore HIVE is not using TEZ.
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> are you using map-reduce with Hive?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 9 June 2016 at 15:14, Gourav Sengupta <go...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>>>> here and is there a way we can optimize the queries in SPARK without the
>>>>> obvious hack in Query2.
>>>>>
>>>>>
>>>>> -----------------------
>>>>> ENVIRONMENT:
>>>>> -----------------------
>>>>>
>>>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>>>> million rows. Both the files are single gzipped csv file.
>>>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>>>> accessed through SPARK using HiveContext
>>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>>>
>>>>> --------------
>>>>> QUERY1:
>>>>> --------------
>>>>> select A.PK, B.FK
>>>>> from A
>>>>> left outer join B on (A.PK = B.FK)
>>>>> where B.FK is not null;
>>>>>
>>>>>
>>>>>
>>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>>>
>>>>>
>>>>> --------------
>>>>> QUERY 2:
>>>>> --------------
>>>>>
>>>>> select A.PK, B.FK
>>>>> from (select PK from A) A
>>>>> left outer join B on (A.PK = B.FK)
>>>>> where B.FK is not null;
>>>>>
>>>>> This query takes 4.5 mins in SPARK
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Gourav Sengupta
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Gourav Sengupta <go...@gmail.com>.

Hi Stephen,


How can a single gzipped CSV file be partitioned and who partitions tables
based on Primary Key in Hive?  If you read the environments section you
will be able to see that all the required details are mentioned.

As far as I understand that Hive does work 25x faster (in these particular
cases) and around 100x faster (when we are using TEZ) when compared to
SPARK.

It will be interesting to see if Ted includes these findings while they are
benchmarking SPARK. This is a very typical and a general used case.


Regards,
Gourav

On Thu, Jun 9, 2016 at 5:11 PM, Stephen Boesch <ja...@gmail.com> wrote:

> ooc are the tables partitioned on a.pk and b.fk?  Hive might be using
> copartitioning in that case: it is one of hive's strengths.
>
> 2016-06-09 7:28 GMT-07:00 Gourav Sengupta <go...@gmail.com>:
>
>> Hi Mich,
>>
>> does not Hive use map-reduce? I thought it to be so. And since I am
>> running the queries in EMR 4.6 therefore HIVE is not using TEZ.
>>
>>
>> Regards,
>> Gourav
>>
>> On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> are you using map-reduce with Hive?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 9 June 2016 at 15:14, Gourav Sengupta <go...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>>> here and is there a way we can optimize the queries in SPARK without the
>>>> obvious hack in Query2.
>>>>
>>>>
>>>> -----------------------
>>>> ENVIRONMENT:
>>>> -----------------------
>>>>
>>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>>> million rows. Both the files are single gzipped csv file.
>>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>>> accessed through SPARK using HiveContext
>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>>
>>>> --------------
>>>> QUERY1:
>>>> --------------
>>>> select A.PK, B.FK
>>>> from A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>>
>>>>
>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>>
>>>>
>>>> --------------
>>>> QUERY 2:
>>>> --------------
>>>>
>>>> select A.PK, B.FK
>>>> from (select PK from A) A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4.5 mins in SPARK
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Stephen Boesch <ja...@gmail.com>.

ooc are the tables partitioned on a.pk and b.fk?  Hive might be using
copartitioning in that case: it is one of hive's strengths.

2016-06-09 7:28 GMT-07:00 Gourav Sengupta <go...@gmail.com>:

> Hi Mich,
>
> does not Hive use map-reduce? I thought it to be so. And since I am
> running the queries in EMR 4.6 therefore HIVE is not using TEZ.
>
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com
> > wrote:
>
>> are you using map-reduce with Hive?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 9 June 2016 at 15:14, Gourav Sengupta <go...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>> here and is there a way we can optimize the queries in SPARK without the
>>> obvious hack in Query2.
>>>
>>>
>>> -----------------------
>>> ENVIRONMENT:
>>> -----------------------
>>>
>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>> million rows. Both the files are single gzipped csv file.
>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>> accessed through SPARK using HiveContext
>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>
>>> --------------
>>> QUERY1:
>>> --------------
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>>
>>>
>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>
>>>
>>> --------------
>>> QUERY 2:
>>> --------------
>>>
>>> select A.PK, B.FK
>>> from (select PK from A) A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>> This query takes 4.5 mins in SPARK
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Gourav Sengupta <go...@gmail.com>.

Hi Mich,

does not Hive use map-reduce? I thought it to be so. And since I am running
the queries in EMR 4.6 therefore HIVE is not using TEZ.


Regards,
Gourav

On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> are you using map-reduce with Hive?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 9 June 2016 at 15:14, Gourav Sengupta <go...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
>> and is there a way we can optimize the queries in SPARK without the obvious
>> hack in Query2.
>>
>>
>> -----------------------
>> ENVIRONMENT:
>> -----------------------
>>
>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>> million rows. Both the files are single gzipped csv file.
>> > Both table A and B are external tables in AWS S3 and created in HIVE
>> accessed through SPARK using HiveContext
>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>> allowMaximumResource allocation and node types are c3.4xlarge).
>>
>> --------------
>> QUERY1:
>> --------------
>> select A.PK, B.FK
>> from A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>>
>>
>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>
>>
>> --------------
>> QUERY 2:
>> --------------
>>
>> select A.PK, B.FK
>> from (select PK from A) A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>> This query takes 4.5 mins in SPARK
>>
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Mich Talebzadeh <mi...@gmail.com>.

are you using map-reduce with Hive?

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 15:14, Gourav Sengupta <go...@gmail.com> wrote:

> Hi,
>
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
> and is there a way we can optimize the queries in SPARK without the obvious
> hack in Query2.
>
>
> -----------------------
> ENVIRONMENT:
> -----------------------
>
> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
> million rows. Both the files are single gzipped csv file.
> > Both table A and B are external tables in AWS S3 and created in HIVE
> accessed through SPARK using HiveContext
> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
> allowMaximumResource allocation and node types are c3.4xlarge).
>
> --------------
> QUERY1:
> --------------
> select A.PK, B.FK
> from A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
>
>
> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>
>
> --------------
> QUERY 2:
> --------------
>
> select A.PK, B.FK
> from (select PK from A) A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
> This query takes 4.5 mins in SPARK
>
>
>
> Regards,
> Gourav Sengupta
>
>
>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi,

Your statement

"I have a system with 64 GB RAM and SSD and its performance on local
cluster SPARK is way better"

Is this a host with 64GB of RAM and you data is stored on local Solid State
Disks?

Can you kindly provide the parameters you pass to spark-submit:

${SPARK_HOME}/bin/spark-submit \

                --master local[?] \

                --driver-memory ?G \

                --num-executors 1 \

                --executor-memory ?G \


Thanks



Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 16 June 2016 at 11:40, Gourav Sengupta <go...@gmail.com> wrote:

> Hi,
>
> We do have a dimension table with around few hundred columns from which we
> need only a few columns to join with the main fact table which has a few
> million rows. I do not know how one off this case sounds like but  since I
> have been working in data warehousing it sounds like a fairly general used
> case.
>
> Spark in local mode will be way faster compared to SPARK running on
> HADOOP. I have a system with 64 GB RAM and SSD and its performance on local
> cluster SPARK is way better.
>
> Did your join include the same number of columns and rows for the
> dimension table?
>
>
> Regards,
> Gourav Sengupta
>
> On Thu, Jun 16, 2016 at 9:35 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> sounds like this is a one off case.
>>
>> Do you have any other use case where you have Hive on MR outperforms
>> Spark?
>>
>> I did some tests on 1 billion row table getting the selectivity of a
>> column using Hive on MR, Hive on Spark engine and Spark running on local
>> mode (to keep it simple)
>>
>>
>> Hive 2, Spark 1.6.1
>>
>> Results:
>>
>> Hive with map-reduce --> 18  minutes
>> Hive on Spark engine -->  6 minutes
>> Spark                -->  2 minutes
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 16 June 2016 at 08:43, Jörn Franke <jo...@gmail.com> wrote:
>>
>>> I agree here.
>>>
>>> However it depends always on your use case !
>>>
>>> Best regards
>>>
>>> On 16 Jun 2016, at 04:58, Gourav Sengupta <go...@gmail.com>
>>> wrote:
>>>
>>> Hi Mahender,
>>>
>>> please ensure that for dimension tables you are enabling the broadcast
>>> method. You must be able to see surprising gains @12x.
>>>
>>> Overall I think that SPARK cannot figure out whether to scan all the
>>> columns in a table or just the ones which are being used causing this
>>> issue.
>>>
>>> When you start using HIVE with ORC and TEZ  (*) you will see some
>>> amazing results, and leaves SPARK way way behind. So pretty much you need
>>> to have your data in memory for matching the performance claims of SPARK
>>> and the advantage in that case you are getting is not because of SPARK
>>> algorithms but just fast I/O from RAM. The advantage of SPARK is that it
>>> makes accessible analytics, querying, and streaming frameworks together.
>>>
>>>
>>> In case you are following the optimisations mentioned in the link you
>>> hardly have any reasons for using SPARK SQL:
>>> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
>>> imagine being able to do all of that without having machines which requires
>>> huge RAM, or in short you are achieving those performance gains using
>>> commodity low cost systems around which HADOOP was designed.
>>>
>>> I think that Hortonworks is giving a stiff competition here :)
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
>>> Mahender.BigData@outlook.com> wrote:
>>>
>>>> +1,
>>>>
>>>> Even see performance degradation while comparing SPark SQL with Hive.
>>>> We have table of 260 columns. We have executed in hive and SPARK. In
>>>> Hive, it is taking 66 sec for 1 gb of data whereas in Spark, it is taking 4
>>>> mins of time.
>>>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>>>
>>>> Could you print out the sql execution plan? My guess is about broadcast
>>>> join.
>>>>
>>>>
>>>>
>>>> On Jun 9, 2016, at 07:14, Gourav Sengupta < <go...@gmail.com>
>>>> gourav.sengupta@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>>> here and is there a way we can optimize the queries in SPARK without the
>>>> obvious hack in Query2.
>>>>
>>>>
>>>> -----------------------
>>>> ENVIRONMENT:
>>>> -----------------------
>>>>
>>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>>> million rows. Both the files are single gzipped csv file.
>>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>>> accessed through SPARK using HiveContext
>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>>
>>>> --------------
>>>> QUERY1:
>>>> --------------
>>>> select A.PK, B.FK
>>>> from A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>>
>>>>
>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>>
>>>>
>>>> --------------
>>>> QUERY 2:
>>>> --------------
>>>>
>>>> select A.PK, B.FK
>>>> from (select PK from A) A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4.5 mins in SPARK
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Gourav Sengupta <go...@gmail.com>.

Hi,

We do have a dimension table with around few hundred columns from which we
need only a few columns to join with the main fact table which has a few
million rows. I do not know how one off this case sounds like but  since I
have been working in data warehousing it sounds like a fairly general used
case.

Spark in local mode will be way faster compared to SPARK running on HADOOP.
I have a system with 64 GB RAM and SSD and its performance on local cluster
SPARK is way better.

Did your join include the same number of columns and rows for the dimension
table?


Regards,
Gourav Sengupta

On Thu, Jun 16, 2016 at 9:35 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> sounds like this is a one off case.
>
> Do you have any other use case where you have Hive on MR outperforms Spark?
>
> I did some tests on 1 billion row table getting the selectivity of a
> column using Hive on MR, Hive on Spark engine and Spark running on local
> mode (to keep it simple)
>
>
> Hive 2, Spark 1.6.1
>
> Results:
>
> Hive with map-reduce --> 18  minutes
> Hive on Spark engine -->  6 minutes
> Spark                -->  2 minutes
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 June 2016 at 08:43, Jörn Franke <jo...@gmail.com> wrote:
>
>> I agree here.
>>
>> However it depends always on your use case !
>>
>> Best regards
>>
>> On 16 Jun 2016, at 04:58, Gourav Sengupta <go...@gmail.com>
>> wrote:
>>
>> Hi Mahender,
>>
>> please ensure that for dimension tables you are enabling the broadcast
>> method. You must be able to see surprising gains @12x.
>>
>> Overall I think that SPARK cannot figure out whether to scan all the
>> columns in a table or just the ones which are being used causing this
>> issue.
>>
>> When you start using HIVE with ORC and TEZ  (*) you will see some amazing
>> results, and leaves SPARK way way behind. So pretty much you need to have
>> your data in memory for matching the performance claims of SPARK and the
>> advantage in that case you are getting is not because of SPARK algorithms
>> but just fast I/O from RAM. The advantage of SPARK is that it makes
>> accessible analytics, querying, and streaming frameworks together.
>>
>>
>> In case you are following the optimisations mentioned in the link you
>> hardly have any reasons for using SPARK SQL:
>> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
>> imagine being able to do all of that without having machines which requires
>> huge RAM, or in short you are achieving those performance gains using
>> commodity low cost systems around which HADOOP was designed.
>>
>> I think that Hortonworks is giving a stiff competition here :)
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
>> Mahender.BigData@outlook.com> wrote:
>>
>>> +1,
>>>
>>> Even see performance degradation while comparing SPark SQL with Hive.
>>> We have table of 260 columns. We have executed in hive and SPARK. In
>>> Hive, it is taking 66 sec for 1 gb of data whereas in Spark, it is taking 4
>>> mins of time.
>>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>>
>>> Could you print out the sql execution plan? My guess is about broadcast
>>> join.
>>>
>>>
>>>
>>> On Jun 9, 2016, at 07:14, Gourav Sengupta < <go...@gmail.com>
>>> gourav.sengupta@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>> here and is there a way we can optimize the queries in SPARK without the
>>> obvious hack in Query2.
>>>
>>>
>>> -----------------------
>>> ENVIRONMENT:
>>> -----------------------
>>>
>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>> million rows. Both the files are single gzipped csv file.
>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>> accessed through SPARK using HiveContext
>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>
>>> --------------
>>> QUERY1:
>>> --------------
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>>
>>>
>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>
>>>
>>> --------------
>>> QUERY 2:
>>> --------------
>>>
>>> select A.PK, B.FK
>>> from (select PK from A) A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>> This query takes 4.5 mins in SPARK
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Mich Talebzadeh <mi...@gmail.com>.

sounds like this is a one off case.

Do you have any other use case where you have Hive on MR outperforms Spark?

I did some tests on 1 billion row table getting the selectivity of a column
using Hive on MR, Hive on Spark engine and Spark running on local mode (to
keep it simple)


Hive 2, Spark 1.6.1

Results:

Hive with map-reduce --> 18  minutes
Hive on Spark engine -->  6 minutes
Spark                -->  2 minutes


HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 16 June 2016 at 08:43, Jörn Franke <jo...@gmail.com> wrote:

> I agree here.
>
> However it depends always on your use case !
>
> Best regards
>
> On 16 Jun 2016, at 04:58, Gourav Sengupta <go...@gmail.com>
> wrote:
>
> Hi Mahender,
>
> please ensure that for dimension tables you are enabling the broadcast
> method. You must be able to see surprising gains @12x.
>
> Overall I think that SPARK cannot figure out whether to scan all the
> columns in a table or just the ones which are being used causing this
> issue.
>
> When you start using HIVE with ORC and TEZ  (*) you will see some amazing
> results, and leaves SPARK way way behind. So pretty much you need to have
> your data in memory for matching the performance claims of SPARK and the
> advantage in that case you are getting is not because of SPARK algorithms
> but just fast I/O from RAM. The advantage of SPARK is that it makes
> accessible analytics, querying, and streaming frameworks together.
>
>
> In case you are following the optimisations mentioned in the link you
> hardly have any reasons for using SPARK SQL:
> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
> imagine being able to do all of that without having machines which requires
> huge RAM, or in short you are achieving those performance gains using
> commodity low cost systems around which HADOOP was designed.
>
> I think that Hortonworks is giving a stiff competition here :)
>
> Regards,
> Gourav Sengupta
>
> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
> Mahender.BigData@outlook.com> wrote:
>
>> +1,
>>
>> Even see performance degradation while comparing SPark SQL with Hive.
>> We have table of 260 columns. We have executed in hive and SPARK. In
>> Hive, it is taking 66 sec for 1 gb of data whereas in Spark, it is taking 4
>> mins of time.
>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>
>> Could you print out the sql execution plan? My guess is about broadcast
>> join.
>>
>>
>>
>> On Jun 9, 2016, at 07:14, Gourav Sengupta < <go...@gmail.com>
>> gourav.sengupta@gmail.com> wrote:
>>
>> Hi,
>>
>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
>> and is there a way we can optimize the queries in SPARK without the obvious
>> hack in Query2.
>>
>>
>> -----------------------
>> ENVIRONMENT:
>> -----------------------
>>
>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>> million rows. Both the files are single gzipped csv file.
>> > Both table A and B are external tables in AWS S3 and created in HIVE
>> accessed through SPARK using HiveContext
>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>> allowMaximumResource allocation and node types are c3.4xlarge).
>>
>> --------------
>> QUERY1:
>> --------------
>> select A.PK, B.FK
>> from A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>>
>>
>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>
>>
>> --------------
>> QUERY 2:
>> --------------
>>
>> select A.PK, B.FK
>> from (select PK from A) A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>> This query takes 4.5 mins in SPARK
>>
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Jörn Franke <jo...@gmail.com>.

I agree here.

However it depends always on your use case ! 

Best regards

> On 16 Jun 2016, at 04:58, Gourav Sengupta <go...@gmail.com> wrote:
> 
> Hi Mahender, 
> 
> please ensure that for dimension tables you are enabling the broadcast method. You must be able to see surprising gains @12x. 
> 
> Overall I think that SPARK cannot figure out whether to scan all the columns in a table or just the ones which are being used causing this issue. 
> 
> When you start using HIVE with ORC and TEZ  (*) you will see some amazing results, and leaves SPARK way way behind. So pretty much you need to have your data in memory for matching the performance claims of SPARK and the advantage in that case you are getting is not because of SPARK algorithms but just fast I/O from RAM. The advantage of SPARK is that it makes accessible analytics, querying, and streaming frameworks together.
> 
> 
> In case you are following the optimisations mentioned in the link you hardly have any reasons for using SPARK SQL: http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And imagine being able to do all of that without having machines which requires huge RAM, or in short you are achieving those performance gains using commodity low cost systems around which HADOOP was designed. 
> 
> I think that Hortonworks is giving a stiff competition here :)
> 
> Regards,
> Gourav Sengupta
> 
>> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <Ma...@outlook.com> wrote:
>> +1,
>> 
>> Even see performance degradation while comparing SPark SQL with Hive. 
>> We have table of 260 columns. We have executed in hive and SPARK. In Hive, it is taking 66 sec for 1 gb of data whereas in Spark, it is taking 4 mins of time. 
>>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>> Could you print out the sql execution plan? My guess is about broadcast join. 
>>> 
>>> 
>>> 
>>> On Jun 9, 2016, at 07:14, Gourav Sengupta <go...@gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and is there a way we can optimize the queries in SPARK without the obvious hack in Query2.
>>>> 
>>>> 
>>>> -----------------------
>>>> ENVIRONMENT:
>>>> -----------------------
>>>> 
>>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 million rows. Both the files are single gzipped csv file.
>>>> > Both table A and B are external tables in AWS S3 and created in HIVE accessed through SPARK using HiveContext
>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using allowMaximumResource allocation and node types are c3.4xlarge).
>>>> 
>>>> --------------
>>>> QUERY1: 
>>>> --------------
>>>> select A.PK, B.FK
>>>> from A 
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>> 
>>>> 
>>>> 
>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK 
>>>> 
>>>> 
>>>> --------------
>>>> QUERY 2:
>>>> --------------
>>>> 
>>>> select A.PK, B.FK
>>>> from (select PK from A) A 
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>> 
>>>> This query takes 4.5 mins in SPARK 
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Gourav Sengupta
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Gourav Sengupta <go...@gmail.com>.

Hi Mahender,

please ensure that for dimension tables you are enabling the broadcast
method. You must be able to see surprising gains @12x.

Overall I think that SPARK cannot figure out whether to scan all the
columns in a table or just the ones which are being used causing this
issue.

When you start using HIVE with ORC and TEZ  (*) you will see some amazing
results, and leaves SPARK way way behind. So pretty much you need to have
your data in memory for matching the performance claims of SPARK and the
advantage in that case you are getting is not because of SPARK algorithms
but just fast I/O from RAM. The advantage of SPARK is that it makes
accessible analytics, querying, and streaming frameworks together.

In case you are following the optimisations mentioned in the link you
hardly have any reasons for using SPARK SQL:
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
imagine being able to do all of that without having machines which requires
huge RAM, or in short you are achieving those performance gains using
commodity low cost systems around which HADOOP was designed.

I think that Hortonworks is giving a stiff competition here :)

Regards,
Gourav Sengupta

On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
Mahender.BigData@outlook.com> wrote:

> +1,
>
> Even see performance degradation while comparing SPark SQL with Hive.
> We have table of 260 columns. We have executed in hive and SPARK. In Hive,
> it is taking 66 sec for 1 gb of data whereas in Spark, it is taking 4 mins
> of time.
> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>
> Could you print out the sql execution plan? My guess is about broadcast
> join.
>
>
>
> On Jun 9, 2016, at 07:14, Gourav Sengupta < <go...@gmail.com>
> gourav.sengupta@gmail.com> wrote:
>
> Hi,
>
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
> and is there a way we can optimize the queries in SPARK without the obvious
> hack in Query2.
>
>
> -----------------------
> ENVIRONMENT:
> -----------------------
>
> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
> million rows. Both the files are single gzipped csv file.
> > Both table A and B are external tables in AWS S3 and created in HIVE
> accessed through SPARK using HiveContext
> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
> allowMaximumResource allocation and node types are c3.4xlarge).
>
> --------------
> QUERY1:
> --------------
> select A.PK, B.FK
> from A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
>
>
> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>
>
> --------------
> QUERY 2:
> --------------
>
> select A.PK, B.FK
> from (select PK from A) A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
> This query takes 4.5 mins in SPARK
>
>
>
> Regards,
> Gourav Sengupta
>
>
>
>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Mahender Sarangam <Ma...@outlook.com>.

+1,

Even see performance degradation while comparing SPark SQL with Hive.
We have table of 260 columns. We have executed in hive and SPARK. In Hive, it is taking 66 sec for 1 gb of data whereas in Spark, it is taking 4 mins of time.

On 6/9/2016 3:19 PM, Gavin Yue wrote:
Could you print out the sql execution plan? My guess is about broadcast join.

On Jun 9, 2016, at 07:14, Gourav Sengupta <<m...@gmail.com>> wrote:

Hi,

Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and is there a way we can optimize the queries in SPARK without the obvious hack in Query2.

-----------------------
ENVIRONMENT:
-----------------------

> Table A 533 columns x 24 million rows and Table B has 2 columns x 3 million rows. Both the files are single gzipped csv file.
> Both table A and B are external tables in AWS S3 and created in HIVE accessed through SPARK using HiveContext
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using allowMaximumResource allocation and node types are c3.4xlarge).

--------------
QUERY1:
--------------
select A.PK<http://A.PK>, B.FK<http://B.FK>
from A
left outer join B on (A.PK<http://A.PK> = B.FK<http://B.FK>)
where B.FK<http://B.FK> is not null;

This query takes 4 mins in HIVE and 1.1 hours in SPARK

--------------
QUERY 2:
--------------

select A.PK<http://A.PK>, B.FK<http://B.FK>
from (select PK from A) A
left outer join B on (A.PK<http://A.PK> = B.FK<http://B.FK>)
where B.FK<http://B.FK> is not null;

This query takes 4.5 mins in SPARK

Regards,
Gourav Sengupta

Re: HIVE Query 25x faster than SPARK Query

Posted by Gourav Sengupta <go...@gmail.com>.

Hi Gavin,

for the first time someone is responding to this thread with a meaningful
conversation - thanks for that.

Okay, I did not tweak the spark.sql.autoBroadcastJoinThreshold parameter
and since the cached field was around 75 MB therefore I do not think that
broadcast join was used.

But I will surely be excited to see if I am going wrong here and post the
results of sql.describe(). Thanks a ton once again.


Hi Ted,

Is there anyway you can throw some light on this before I post this in a
blog?


Regards,
Gourav Sengupta


On Fri, Jun 10, 2016 at 7:22 PM, Gavin Yue <yu...@gmail.com> wrote:

> Yes.  because in the second query, you did a  (select PK from A) A .  I
>  guess it could the the subquery makes the results much smaller and make
> the broadcastJoin, so it is much faster.
>
> you could use sql.describe() to check the execution plan.
>
>
> On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
>
>> Hi,
>>
>> I think if we try to see why is Query 2 faster than Query 1 then all the
>> answers will be given without beating around the bush. That is the right
>> way to find out what is happening and why.
>>
>>
>> Regards,
>> Gourav
>>
>> On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue <yu...@gmail.com>
>> wrote:
>>
>>> Could you print out the sql execution plan? My guess is about broadcast
>>> join.
>>>
>>>
>>>
>>> On Jun 9, 2016, at 07:14, Gourav Sengupta <go...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>> here and is there a way we can optimize the queries in SPARK without the
>>> obvious hack in Query2.
>>>
>>>
>>> -----------------------
>>> ENVIRONMENT:
>>> -----------------------
>>>
>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>> million rows. Both the files are single gzipped csv file.
>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>> accessed through SPARK using HiveContext
>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>
>>> --------------
>>> QUERY1:
>>> --------------
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>>
>>>
>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>
>>>
>>> --------------
>>> QUERY 2:
>>> --------------
>>>
>>> select A.PK, B.FK
>>> from (select PK from A) A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>> This query takes 4.5 mins in SPARK
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Gavin Yue <yu...@gmail.com>.

Yes.  because in the second query, you did a  (select PK from A) A .  I
 guess it could the the subquery makes the results much smaller and make
the broadcastJoin, so it is much faster.

you could use sql.describe() to check the execution plan.


On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi,
>
> I think if we try to see why is Query 2 faster than Query 1 then all the
> answers will be given without beating around the bush. That is the right
> way to find out what is happening and why.
>
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue <yu...@gmail.com> wrote:
>
>> Could you print out the sql execution plan? My guess is about broadcast
>> join.
>>
>>
>>
>> On Jun 9, 2016, at 07:14, Gourav Sengupta <go...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
>> and is there a way we can optimize the queries in SPARK without the obvious
>> hack in Query2.
>>
>>
>> -----------------------
>> ENVIRONMENT:
>> -----------------------
>>
>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>> million rows. Both the files are single gzipped csv file.
>> > Both table A and B are external tables in AWS S3 and created in HIVE
>> accessed through SPARK using HiveContext
>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>> allowMaximumResource allocation and node types are c3.4xlarge).
>>
>> --------------
>> QUERY1:
>> --------------
>> select A.PK, B.FK
>> from A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>>
>>
>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>
>>
>> --------------
>> QUERY 2:
>> --------------
>>
>> select A.PK, B.FK
>> from (select PK from A) A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>> This query takes 4.5 mins in SPARK
>>
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Gourav Sengupta <go...@gmail.com>.

Hi,

I think if we try to see why is Query 2 faster than Query 1 then all the
answers will be given without beating around the bush. That is the right
way to find out what is happening and why.


Regards,
Gourav

On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue <yu...@gmail.com> wrote:

> Could you print out the sql execution plan? My guess is about broadcast
> join.
>
>
>
> On Jun 9, 2016, at 07:14, Gourav Sengupta <go...@gmail.com>
> wrote:
>
> Hi,
>
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
> and is there a way we can optimize the queries in SPARK without the obvious
> hack in Query2.
>
>
> -----------------------
> ENVIRONMENT:
> -----------------------
>
> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
> million rows. Both the files are single gzipped csv file.
> > Both table A and B are external tables in AWS S3 and created in HIVE
> accessed through SPARK using HiveContext
> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
> allowMaximumResource allocation and node types are c3.4xlarge).
>
> --------------
> QUERY1:
> --------------
> select A.PK, B.FK
> from A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
>
>
> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>
>
> --------------
> QUERY 2:
> --------------
>
> select A.PK, B.FK
> from (select PK from A) A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
> This query takes 4.5 mins in SPARK
>
>
>
> Regards,
> Gourav Sengupta
>
>
>
>

Re: HIVE Query 25x faster than SPARK Query

Posted by Gavin Yue <yu...@gmail.com>.

Could you print out the sql execution plan? My guess is about broadcast join. 



> On Jun 9, 2016, at 07:14, Gourav Sengupta <go...@gmail.com> wrote:
> 
> Hi,
> 
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and is there a way we can optimize the queries in SPARK without the obvious hack in Query2.
> 
> 
> -----------------------
> ENVIRONMENT:
> -----------------------
> 
> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 million rows. Both the files are single gzipped csv file.
> > Both table A and B are external tables in AWS S3 and created in HIVE accessed through SPARK using HiveContext
> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using allowMaximumResource allocation and node types are c3.4xlarge).
> 
> --------------
> QUERY1: 
> --------------
> select A.PK, B.FK
> from A 
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
> 
> 
> 
> This query takes 4 mins in HIVE and 1.1 hours in SPARK 
> 
> 
> --------------
> QUERY 2:
> --------------
> 
> select A.PK, B.FK
> from (select PK from A) A 
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
> 
> This query takes 4.5 mins in SPARK 
> 
> 
> 
> Regards,
> Gourav Sengupta
> 
> 
>