Posted to user@hive.apache.org by Sarfraz Ramay <sa...@gmail.com> on 2014/06/29 13:59:21 UTC

Suggestions for different coding techniques in Hive

Hi,

I am doing my MSc thesis on different coding techniques in Hive. I am looking
for suggestions for 9 techniques: 3 easy, 3 medium and 3 hard. I will have to
code these techniques up, then compare and evaluate them. I have extracted the
list of candidate techniques below from the book Programming Hive by Jason
Rutherglen et al., which I think would make for an interesting comparison.
Please share your thoughts and suggestions.

1.       Using a single pass to calculate multiple aggregate functions vs.
multiple passes.

2.       Using different bucket sizes to determine the effect on
performance, i.e. the cut-off point where performance starts to suffer because
the bucket size is either too small or too big. Automatic bucketing vs.
manual bucketing.

3.       Position of tables in a join: the largest table should be placed
last in the FROM clause, or the /*+ STREAMTABLE(alias) */ hint used to mark
the largest table as the one to stream.

4.       Comparison of User-Defined Table-Generating Functions (UDTFs) such
as PARSE_URL_TUPLE & PARSE_PATH when reading data such as URLs from a
table column.

5.       Comparison of the ORDER BY, DISTRIBUTE BY and SORT BY clauses.

6.       Map-side joins with and without Bucketing.

7.       Semi-join instead of sub-queries.

8.       Nested SELECT queries vs. GROUP BY with a HAVING clause.

9.       Floating point comparisons with IMPLICIT and EXPLICIT casting

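10.   Common join keys between tables should give better results, since more
shared join columns mean more optimization opportunities (e.g. Hive can join
three or more tables in a single MapReduce job when every ON clause uses the
same key).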

11.   Outer and inner JOINs with WHERE clauses (ref. p. 103 of Programming
Hive) compared with nested SELECT statements that apply the WHERE predicate
first and are then joined.

12.   Sort-Merge-Bucket (SMB) join vs. map-join.

13.   Cartesian products in queries such as SELECT * FROM stocks JOIN
dividends WHERE stocks.symbol = dividends.symbol AND stocks.symbol = 'AAPL';
According to Programming Hive, other SQL engines automatically translate this
into an appropriate join, but Hive does not, because the join column is not
given in the ON clause.
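
To make items 3 and 13 a bit more concrete, here is a rough sketch of the query
forms I would compare. The stocks and dividends tables are the ones from the
book's example; the aliases s and d, and the assumption that stocks is the
larger table, are mine. The plans Hive actually produces will depend on the
version and settings, so this is illustrative only, not a definitive benchmark
setup.

```sql
-- Item 13, form A: join condition only in WHERE (the form the book warns about).
-- Hive may plan this as a Cartesian product followed by a filter; with
-- hive.mapred.mode=strict I would expect it to be rejected.
SELECT *
FROM stocks JOIN dividends
WHERE stocks.symbol = dividends.symbol
  AND stocks.symbol = 'AAPL';

-- Item 13, form B: join condition moved to the ON clause, filter kept in WHERE.
SELECT *
FROM stocks JOIN dividends
  ON stocks.symbol = dividends.symbol
WHERE stocks.symbol = 'AAPL';

-- Item 3: explicitly mark the table to stream instead of relying on its
-- position in the FROM clause. The hint keyword in Hive is STREAMTABLE;
-- s is assumed here to be the larger table.
SELECT /*+ STREAMTABLE(s) */ s.symbol
FROM stocks s JOIN dividends d
  ON s.symbol = d.symbol;

-- Prefixing any of the queries with EXPLAIN shows the plan Hive generates,
-- which is what I would compare alongside the run times.
EXPLAIN
SELECT *
FROM stocks JOIN dividends
  ON stocks.symbol = dividends.symbol
WHERE stocks.symbol = 'AAPL';
```
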

Regards,
Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.