Posted to user@spark.apache.org by Bigdata techguy <bi...@gmail.com> on 2015/07/29 18:49:15 UTC

HiveQL to SparkSQL

Hi All,

I have a fairly complex HiveQL data processing job which I am trying to
convert to SparkSQL to improve performance. Below is what it does.

Select around 100 columns including Aggregates
From a FACT_TABLE
Joined to the summary of the same FACT_TABLE
Joined to 2 smaller DIMENSION tables.

The data processing currently takes around an hour to complete.

This is what I have tried so far (a rough code sketch follows the steps).

1. Use hiveContext to query the DIMENSION tables, store them as DataFrames
and "registerTempTable".

2. Use hiveContext to query the summary of FACT_TABLE, store it as a
DataFrame and "registerTempTable".

3. Use the temp tables from the above 2 steps to get the final record set
into another DataFrame.

4. Save the DataFrame from step 3 to Hive with "InsertOverwrite" using
"saveAsTable".

Below are my questions. Any response would be much appreciated. Thanks.

A. Is there a better approach?
B. Is breaking down the big Hive query into multiple steps with multiple
DataFrames expected to give better performance?
C. Is there an opportunity to intermix RDD with SparkSQL in this case?
D. Can caching a DataFrame improve performance? (A minimal sketch of what I
mean follows these questions.)
E. Are there other suggestions to improve performance?
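
Regarding D, this is the kind of caching I mean. A minimal sketch with
made-up names, reusing the hiveContext from the sketch above; cache() is
lazy and only pays off when the same DataFrame is scanned more than once:

    // Summary reused twice: joined back to the fact table and written out
    val factSummary = hiveContext.sql(
      """SELECT customer_id, SUM(amount) AS total_amount
         FROM fact_table
         GROUP BY customer_id""")
    factSummary.cache()                        // lazy; nothing happens yet
    factSummary.registerTempTable("fact_summary_tmp")
    println(factSummary.count())               // first action populates the cache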

Thank You for your time.

Re: HiveQL to SparkSQL

Posted by Bigdata techguy <bi...@gmail.com>.
Did anybody try to convert HiveQL queries to SparkSQL? If so, would you
please share your experience, pros & cons? Thank you.


Re: HiveQL to SparkSQL

Posted by Bigdata techguy <bi...@gmail.com>.
Thanks Jorn for the response and for the pointer questions to Hive
optimization tips.

I believe I have done the possible and applicable things to improve Hive
query performance, including but not limited to: running on TEZ, using
partitioning and bucketing, using explain to make sure partition pruning is
happening, using compression, using the best data types for join columns,
denormalizing, etc. I am using Hive version 0.13.

The idea behind this POC is to find the strengths of SparkSQL over HiveQL
and to identify the use cases where SparkSQL can perform better than
HiveQL, other than the "iterative" use cases. In general, what would be
good SparkSQL use scenarios?

I am pretty sure someone has tried this before and compared
performance... Any responses would be much appreciated. Thank you.



Re: HiveQL to SparkSQL

Posted by Jörn Franke <jo...@gmail.com>.
What Hive version are you using? Do you run it on TEZ? Are you using the
ORC format? Do you use compression? Snappy? Do you use Bloom filters? Do
you insert the data sorted on the right columns? Do you use partitioning?
Did you increase the replication factor for often-used tables or
partitions? Do you use bucketing? Is your data model appropriate (join
columns as int, numeric data types where appropriate, dates as
int...)? Did you calculate statistics? Did you use indexes (compressed, ORC
format)? Do you provide mapjoin hints? Did you do any other Hive
optimization? Did you use explain to verify that only selected partitions,
indexes, and Bloom filters had been used?
Did you verify that no other application has taken resources? What is the
CPU level on the namenode and hiveserver2? If it is high, then you need
more memory there!

The first rule is to get Hive right before you think about in-memory.
Caching will only help for iterative stuff. You may think about
denormalizing the model even more, to avoid joins as much as possible.
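
For concreteness, a rough sketch of the kind of layout and checks above,
with hypothetical table and column names. The embedded statements are plain
HiveQL; they can be run in Hive directly or through a HiveContext as shown:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(new SparkContext(new SparkConf().setAppName("HiveLayout")))

    // ORC + Snappy compression, partitioned, bucketed and sorted on the join key.
    // Dates kept as int (yyyymmdd) per the advice above. ORC Bloom filters
    // ("orc.bloom.filter.columns") need a newer Hive (1.2+), so they are omitted here.
    hc.sql("""
      CREATE TABLE fact_table_orc (
        customer_id INT,
        product_id INT,
        sale_date INT,
        amount DOUBLE
      )
      PARTITIONED BY (sale_month INT)
      CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS
      STORED AS ORC
      TBLPROPERTIES ("orc.compress" = "SNAPPY")
    """)

    // Statistics, so the optimizer can choose map joins and prune work
    hc.sql("ANALYZE TABLE fact_table_orc PARTITION (sale_month=201507) COMPUTE STATISTICS")

    // Check the plan: only the selected partition should be read
    hc.sql("EXPLAIN SELECT SUM(amount) FROM fact_table_orc WHERE sale_month = 201507")
      .collect().foreach(println)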
