Posted to user@hive.apache.org by Raihan Jamal <ja...@gmail.com> on 2012/07/10 00:37:23 UTC

Custom Mapper and Reducer vs HiveQL in terms of Performance

*Problem Statement:-*

I need to compare two tables, Table1 and Table2, which both store the same
kind of data. Table1 is the main table, so Table2 has to be compared against
it. After the comparison I need to produce a report of any discrepancies in
Table2. Both tables hold a lot of data, around a TB. Currently I have written
HiveQL to do the comparison and get the data back.

So my question is: which is better in terms of PERFORMANCE, writing a CUSTOM
MAPPER and REDUCER to do this kind of job, or will the HiveQL that I wrote be
fine, given that I will be joining these two tables on millions of records?
As far as I know, HiveQL internally (behind the scenes) generates optimized
map-reduce jobs, submits them for execution, and returns the results.
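
For reference, the hand-written alternative would essentially be a
reduce-side join. The following is a minimal Python sketch of that logic, not
real Hadoop code. It assumes both tables have a unique `id` column and fit in
memory; the column name, the function names, and the three report labels are
made up for illustration.

```python
from collections import defaultdict


def map_phase(table_name, rows):
    """Mapper: emit (join_key, (source_table, row)) for every input row."""
    for row in rows:
        yield row["id"], (table_name, row)


def shuffle(pairs):
    """Shuffle: group mapper output by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, tagged_rows):
    """Reducer: compare the Table2 row with the Table1 row for one key.

    Assumes at most one row per table for a given key.
    """
    by_table = {table: row for table, row in tagged_rows}
    main, other = by_table.get("table1"), by_table.get("table2")
    if main is None:
        return (key, "only_in_table2")
    if other is None:
        return (key, "missing_in_table2")
    if main != other:
        return (key, "mismatch")
    return None  # rows agree, nothing to report


def find_discrepancies(table1, table2):
    """Run map, shuffle, and reduce in-process and collect the report."""
    pairs = list(map_phase("table1", table1)) + list(map_phase("table2", table2))
    report = []
    for key, tagged in sorted(shuffle(pairs).items()):
        result = reduce_phase(key, tagged)
        if result is not None:
            report.append(result)
    return report
```

A join written in HiveQL is compiled into roughly this same tag, shuffle, and
compare shape, which is why the question is about performance rather than
about what can be expressed.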


*Raihan Jamal*

Re: Custom Mapper and Reducer vs HiveQL in terms of Performance

Posted by Esteban Gutierrez <es...@cloudera.com>.
Raihan,

There is no need to implement a custom mapper or reducer. If you are
experiencing performance issues, you might consider using bucketed tables
and doing a bucketed map join or a sort-merge bucket map join. A good example
of join performance can be found in these slides from Facebook:
https://cwiki.apache.org/Hive/presentations.data/Hive%20Summit%202011-join.pdf
but basically you need to choose a good strategy depending on your data.
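
To illustrate why bucketing helps, here is a rough Python sketch of the idea
behind a sort-merge join over bucketed tables. It is not Hive's
implementation. It assumes an integer join key named `id` (a hypothetical
column), unique keys on each side, and the same bucket count for both tables,
so bucket N of one table only ever needs to be compared with bucket N of the
other.

```python
def bucketize(rows, num_buckets):
    """Assign rows to buckets by key modulo num_buckets, sorted within each bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row["id"] % num_buckets].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda r: r["id"])
    return buckets


def merge_join(left, right):
    """Join two key-sorted buckets in a single pass (unique keys per side)."""
    i = j = 0
    matches = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i]["id"], right[j]["id"]
        if left_key == right_key:
            matches.append((left[i], right[j]))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # left row has no partner in this bucket
        else:
            j += 1  # right row has no partner in this bucket
    return matches


def bucketed_join(table1, table2, num_buckets=4):
    """Join bucket by bucket; matching keys always land in the same bucket index."""
    joined = []
    for left, right in zip(bucketize(table1, num_buckets),
                           bucketize(table2, num_buckets)):
        joined.extend(merge_join(left, right))
    return joined
```

In this sketch the sorting happens at join time. With tables that are
already stored bucketed and sorted on the join key, that work is paid once
when the data is written, and each pair of buckets can be merged
independently without a full shuffle.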

Regards,
Esteban.





--
Cloudera, Inc.




On Thu, Jul 12, 2012 at 2:18 PM, Raihan Jamal <ja...@gmail.com> wrote:

> Sending it again. As I haven't got any reply on this. Any personal
> experience will be appreciated.
>
>
>
> *Raihan Jamal*

Re: Custom Mapper and Reducer vs HiveQL in terms of Performance

Posted by Raihan Jamal <ja...@gmail.com>.
Sending it again, as I haven't got any reply on this. Any personal
experience will be appreciated.



*Raihan Jamal*


