You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by "Felix.徐" <yg...@gmail.com> on 2013/06/28 17:40:40 UTC

Performance difference between tuning reducer num and partition table

Hi all,

Here is the scenario, suppose I have 2 tables A and B, I would like to
perform a simple join on them,

We can do it like this:

INSERT OVERWRITE TABLE C
SELECT .... FROM A JOIN B on A.id=B.id

In order to speed up this query since table A and B have lots of data,
another approach is :

Say I partition table A and B into 10 partitions respectively, and write
the query like this

INSERT OVERWRITE TABLE C PARTITION(pid=1)
SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

then I run this query 10 times concurrently (pid ranges from 1 to 10)

And my question is that , in my observation of some more complex queries,
the second solution is about 15% faster than the first solution,
is it simply because the setting of reducer num is not optimal?
If the resource is not a limit and it is possible to set the proper reducer
nums in the first solution , can they achieve the same performance? Is
there any other fact that can cause performance difference between
them(non-partition VS partition+concurrent) besides the job parameter
issues?

Thanks!

Re: Performance difference between tuning reducer num and partition table

Posted by "Felix.徐" <yg...@gmail.com>.
Hi Dean,

Thanks for your reply. If I don't set the number of reducers in the 1st run
, the number of reducers will be much smaller and the performance will be
worse. The total output file size is about 200MB, I see that many reduce
output files are empty, only 10 of them have data.

Another question is that , is there any documentation about the job
specific parameters of MapReduce and Hive?




2013/6/29 Dean Wampler <de...@gmail.com>

> What happens if you don't set the number of reducers in the 1st run? How
> many reducers are executed. If it's a much smaller number, the extra
> overhead could matter. Another clue is the size of the files the first run
> produced, i.e., do you have 30 small (much less than a block size) files?
>
> On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 <yg...@gmail.com> wrote:
>
>> Hi Stephen,
>>
>> My query is actually more complex , hive will generate 2 mapreduces,
>> in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
>> 30 reducers (reducer num is set manually)
>> in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
>> reducers for each partition
>>
>> I do not know whether they could achieve the same performance if the
>> reducers num is set properly.
>>
>>
>> 2013/6/29 Stephen Sprague <sp...@gmail.com>
>>
>>> great question.  your parallelization seems to trump hadoop's.    I
>>> guess i'd ask what are the _total_ number of Mappers and Reducers that run
>>> on your cluster for these two scenarios?   I'd be curious if there are the
>>> same.
>>>
>>>
>>>
>>>
>>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <yg...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Here is the scenario, suppose I have 2 tables A and B, I would like to
>>>> perform a simple join on them,
>>>>
>>>> We can do it like this:
>>>>
>>>> INSERT OVERWRITE TABLE C
>>>> SELECT .... FROM A JOIN B on A.id=B.id
>>>>
>>>> In order to speed up this query since table A and B have lots of data,
>>>> another approach is :
>>>>
>>>> Say I partition table A and B into 10 partitions respectively, and
>>>> write the query like this
>>>>
>>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>>> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>>>
>>>> then I run this query 10 times concurrently (pid ranges from 1 to 10)
>>>>
>>>> And my question is that , in my observation of some more complex
>>>> queries, the second solution is about 15% faster than the first solution,
>>>> is it simply because the setting of reducer num is not optimal?
>>>> If the resource is not a limit and it is possible to set the proper
>>>> reducer nums in the first solution , can they achieve the same performance?
>>>> Is there any other fact that can cause performance difference between
>>>> them(non-partition VS partition+concurrent) besides the job parameter
>>>> issues?
>>>>
>>>> Thanks!
>>>>
>>>
>>>
>>
>
>
> --
> Dean Wampler, Ph.D.
> @deanwampler
> http://polyglotprogramming.com

Re: Performance difference between tuning reducer num and partition table

Posted by Dean Wampler <de...@gmail.com>.
What happens if you don't set the number of reducers in the 1st run? How
many reducers are executed. If it's a much smaller number, the extra
overhead could matter. Another clue is the size of the files the first run
produced, i.e., do you have 30 small (much less than a block size) files?

On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 <yg...@gmail.com> wrote:

> Hi Stephen,
>
> My query is actually more complex , hive will generate 2 mapreduces,
> in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
> 30 reducers (reducer num is set manually)
> in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
> reducers for each partition
>
> I do not know whether they could achieve the same performance if the
> reducers num is set properly.
>
>
> 2013/6/29 Stephen Sprague <sp...@gmail.com>
>
>> great question.  your parallelization seems to trump hadoop's.    I guess
>> i'd ask what are the _total_ number of Mappers and Reducers that run on
>> your cluster for these two scenarios?   I'd be curious if there are the
>> same.
>>
>>
>>
>>
>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <yg...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Here is the scenario, suppose I have 2 tables A and B, I would like to
>>> perform a simple join on them,
>>>
>>> We can do it like this:
>>>
>>> INSERT OVERWRITE TABLE C
>>> SELECT .... FROM A JOIN B on A.id=B.id
>>>
>>> In order to speed up this query since table A and B have lots of data,
>>> another approach is :
>>>
>>> Say I partition table A and B into 10 partitions respectively, and write
>>> the query like this
>>>
>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>>
>>> then I run this query 10 times concurrently (pid ranges from 1 to 10)
>>>
>>> And my question is that , in my observation of some more complex
>>> queries, the second solution is about 15% faster than the first solution,
>>> is it simply because the setting of reducer num is not optimal?
>>> If the resource is not a limit and it is possible to set the proper
>>> reducer nums in the first solution , can they achieve the same performance?
>>> Is there any other fact that can cause performance difference between
>>> them(non-partition VS partition+concurrent) besides the job parameter
>>> issues?
>>>
>>> Thanks!
>>>
>>
>>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com

Re: Performance difference between tuning reducer num and partition table

Posted by "Felix.徐" <yg...@gmail.com>.
Hi Stephen,

My query is actually more complex , hive will generate 2 mapreduces,
in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
30 reducers (reducer num is set manually)
in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
reducers for each partition

I do not know whether they could achieve the same performance if the
reducers num is set properly.


2013/6/29 Stephen Sprague <sp...@gmail.com>

> great question.  your parallelization seems to trump hadoop's.    I guess
> i'd ask what are the _total_ number of Mappers and Reducers that run on
> your cluster for these two scenarios?   I'd be curious if there are the
> same.
>
>
>
>
> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <yg...@gmail.com> wrote:
>
>> Hi all,
>>
>> Here is the scenario, suppose I have 2 tables A and B, I would like to
>> perform a simple join on them,
>>
>> We can do it like this:
>>
>> INSERT OVERWRITE TABLE C
>> SELECT .... FROM A JOIN B on A.id=B.id
>>
>> In order to speed up this query since table A and B have lots of data,
>> another approach is :
>>
>> Say I partition table A and B into 10 partitions respectively, and write
>> the query like this
>>
>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>
>> then I run this query 10 times concurrently (pid ranges from 1 to 10)
>>
>> And my question is that , in my observation of some more complex queries,
>> the second solution is about 15% faster than the first solution,
>> is it simply because the setting of reducer num is not optimal?
>> If the resource is not a limit and it is possible to set the proper
>> reducer nums in the first solution , can they achieve the same performance?
>> Is there any other fact that can cause performance difference between
>> them(non-partition VS partition+concurrent) besides the job parameter
>> issues?
>>
>> Thanks!
>>
>
>

Re: Performance difference between tuning reducer num and partition table

Posted by Stephen Sprague <sp...@gmail.com>.
great question.  your parallelization seems to trump hadoop's.    I guess
i'd ask what are the _total_ number of Mappers and Reducers that run on
your cluster for these two scenarios?   I'd be curious if there are the
same.




On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <yg...@gmail.com> wrote:

> Hi all,
>
> Here is the scenario, suppose I have 2 tables A and B, I would like to
> perform a simple join on them,
>
> We can do it like this:
>
> INSERT OVERWRITE TABLE C
> SELECT .... FROM A JOIN B on A.id=B.id
>
> In order to speed up this query since table A and B have lots of data,
> another approach is :
>
> Say I partition table A and B into 10 partitions respectively, and write
> the query like this
>
> INSERT OVERWRITE TABLE C PARTITION(pid=1)
> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>
> then I run this query 10 times concurrently (pid ranges from 1 to 10)
>
> And my question is that , in my observation of some more complex queries,
> the second solution is about 15% faster than the first solution,
> is it simply because the setting of reducer num is not optimal?
> If the resource is not a limit and it is possible to set the proper
> reducer nums in the first solution , can they achieve the same performance?
> Is there any other fact that can cause performance difference between
> them(non-partition VS partition+concurrent) besides the job parameter
> issues?
>
> Thanks!
>