Posted to user@spark.apache.org by Bigdata techguy <bi...@gmail.com> on 2015/08/03 18:12:04 UTC

Re: HiveQL to SparkSQL

Has anybody tried converting HiveQL queries to SparkSQL? If so, would you
share your experience, pros & cons, please? Thank you.

On Thu, Jul 30, 2015 at 10:37 AM, Bigdata techguy <bi...@gmail.com>
wrote:

> Thanks, Jörn, for the response and for the pointers to Hive
> optimization tips.
>
> I believe I have done everything possible and applicable to improve Hive
> query performance, including but not limited to: running on TEZ, using
> partitioning and bucketing, using EXPLAIN to make sure partition pruning is
> happening, using compression, using the best data types for join columns,
> denormalizing, etc. I am using Hive version 0.13.
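>
> Incidentally, the same pruning check is easy to repeat from the Spark side.
> A minimal sketch, assuming a spark-shell session (so sc exists) and a
> made-up, date-partitioned fact table:
>
>   import org.apache.spark.sql.hive.HiveContext
>
>   val hiveContext = new HiveContext(sc)
>   // Print the plan; only partitions matching the predicate should be scanned.
>   hiveContext.sql(
>     "EXPLAIN EXTENDED SELECT count(*) FROM fact_table WHERE load_date = '2015-07-01'"
>   ).collect().foreach(println)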
>
> The idea behind this POC is to find the strengths of SparkSQL over HiveQL
> and identify the use cases where SparkSQL can perform better than HiveQL
> other than the "iterative use cases". In general, what would be the
> SparkSQL use scenarios?
>
> I am pretty sure someone has tried this before and compared
> performance... Any responses would be much appreciated. Thank you.
>
>
> On Wed, Jul 29, 2015 at 1:57 PM, Jörn Franke <jo...@gmail.com> wrote:
>
>> What Hive version are you using? Do you run it on TEZ? Are you using
>> the ORC format? Do you use compression? Snappy? Do you use Bloom filters?
>> Do you insert the data sorted on the right columns? Do you use
>> partitioning? Did you increase the replication factor for often-used tables
>> or partitions? Do you use bucketing? Is your data model appropriate (join
>> columns as int, numeric data types where appropriate, dates as int...)?
>> Did you calculate statistics? Did you use indexes (compressed, ORC
>> format)? Do you provide mapjoin hints? Did you do any other Hive
>> optimization? Did you use EXPLAIN to verify that only the selected
>> partitions, indexes, and Bloom filters had been used?
>> Did you verify that no other application has taken resources? What is the
>> CPU level on the namenode and hiveserver2? If it is high, then you need more
>> memory there!
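>>
>> For illustration, the kind of table layout I mean might look like this (a
>> sketch only; the names, partition column, and bucket count are invented; the
>> DDL is plain HiveQL, issued here through a HiveContext, though the Hive
>> shell works just as well):
>>
>>   import org.apache.spark.sql.hive.HiveContext
>>
>>   val hc = new HiveContext(sc)
>>   hc.sql("""CREATE TABLE fact_orc (cust_id INT, amount DECIMAL(18,2))
>>             PARTITIONED BY (load_date STRING)
>>             CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 32 BUCKETS
>>             STORED AS ORC
>>             TBLPROPERTIES ('orc.compress' = 'SNAPPY')""")
>>   // Statistics help the optimizer choose join strategies.
>>   hc.sql("ANALYZE TABLE fact_orc PARTITION (load_date) COMPUTE STATISTICS")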
>>
>> The first rule is to get Hive right before you think about in-memory.
>> Caching will only help for iterative workloads. You may think about
>> denormalizing the model even further to avoid joins as much as possible.
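>>
>> For example, a cached DataFrame only pays off once the same data is read
>> more than once (names invented, continuing the sketch above):
>>
>>   val fact = hc.table("fact_orc")
>>   fact.cache()                              // materialized on the first action
>>   fact.groupBy("load_date").count().show()  // pays the full scan cost
>>   fact.groupBy("cust_id").count().show()    // reuses the in-memory copy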
>>
>> Bigdata techguy <bi...@gmail.com> wrote on Wed, 29.07.2015,
>> 18:49:
>>
>>> Hi All,
>>>
>>> I have a fairly complex HiveQL data processing job which I am trying to
>>> convert to SparkSQL to improve performance. Below is what it does.
>>>
>>> Select around 100 columns including Aggregates
>>> From a FACT_TABLE
>>> Joined to the summary of the same FACT_TABLE
>>> Joined to 2 smaller DIMENSION tables.
>>>
>>> The data processing currently takes around an hour to complete.
>>>
>>> This is what I have tried so far (a combined sketch follows the steps).
>>>
>>> 1. Use hiveContext to query the DIMENSION tables, store them as DataFrames,
>>> and "registerTempTable".
>>>
>>> 2. Use hiveContext to query the summary of FACT_TABLE, store it as a
>>> DataFrame, and "registerTempTable".
>>>
>>> 3. Use the temp tables from the 2 steps above to get the final record set
>>> into another DataFrame.
>>>
>>> 4. Save the DataFrame from step 3 to Hive, overwriting the target table,
>>> using "saveAsTable".
>>>
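>>> Putting the four steps together, the skeleton is roughly this (a sketch
>>> only; table, column, and temp-table names are placeholders, and this
>>> assumes the Spark 1.4-style writer API):
>>>
>>>   import org.apache.spark.sql.SaveMode
>>>   import org.apache.spark.sql.hive.HiveContext
>>>
>>>   val hc = new HiveContext(sc)
>>>
>>>   // Steps 1 and 2: dimensions and the fact summary as temp tables.
>>>   hc.table("dim_product").registerTempTable("dim_product_t")
>>>   hc.table("dim_region").registerTempTable("dim_region_t")
>>>   hc.sql("""SELECT cust_id, SUM(amount) AS total_amount
>>>             FROM fact_table GROUP BY cust_id""")
>>>     .registerTempTable("fact_summary_t")
>>>
>>>   // Step 3: the final record set joining fact, summary, and dimensions.
>>>   val result = hc.sql("""
>>>     SELECT f.*, s.total_amount, p.product_name, r.region_name
>>>     FROM fact_table f
>>>     JOIN fact_summary_t s ON f.cust_id = s.cust_id
>>>     JOIN dim_product_t p ON f.product_id = p.product_id
>>>     JOIN dim_region_t r ON f.region_id = r.region_id""")
>>>
>>>   // Step 4: overwrite the target Hive table.
>>>   result.write.mode(SaveMode.Overwrite).saveAsTable("final_table")
>>>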
>>> Below are my questions. Any response would be much appreciated. Thanks.
>>>
>>> A. Is there a better approach?
>>> B. Is breaking down the big Hive query into multiple steps with
>>> multiple DataFrames expected to give better performance?
>>> C. Is there an opportunity to intermix RDDs with SparkSQL in this case
>>> (see the sketch below the questions)?
>>> D. Can "caching" a DataFrame improve performance?
>>> E. Are there other suggestions to improve performance?
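>>>
>>> For C, what I have in mind is dropping to the RDD API for a record-level
>>> step that is awkward in SQL and then coming back to a DataFrame, e.g. (the
>>> schema and logic here are made up):
>>>
>>>   import org.apache.spark.sql.Row
>>>   import org.apache.spark.sql.types.{StructType, StructField, StringType}
>>>
>>>   val names = hc.sql("SELECT product_name FROM dim_product")
>>>   // Down to RDD[Row] for arbitrary record-level logic...
>>>   val cleaned = names.rdd.map(r => Row(r.getString(0).trim.toUpperCase))
>>>   // ...and back to a DataFrame with an explicit schema.
>>>   val schema = StructType(Seq(StructField("product_name", StringType)))
>>>   hc.createDataFrame(cleaned, schema).registerTempTable("dim_product_clean")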
>>>
>>> Thank You for your time.
>>>
>>>
>