Posted to user@tez.apache.org by David Ginzburg <da...@gmail.com> on 2015/05/25 11:11:17 UTC

Re: hive.tez.auto.reducer.parallelism can solve skew join problem?

Thank you,
Already tried this, with no effect on the number of reducers.

On Mon, May 25, 2015 at 3:51 AM, r7raul1984@163.com <r7...@163.com>
wrote:

>
> When one reducer processes too much data (skew join), can setting
> hive.tez.auto.reducer.parallelism=true solve the problem?
>
> ------------------------------
> r7raul1984@163.com
>

Re: hive.tez.auto.reducer.parallelism can solve skew join problem?

Posted by Rajesh Balamohan <ra...@gmail.com>.
For Hive, tune "hive.exec.reducers.bytes.per.reducer" (the default should be
around 256000000, i.e. roughly 256 MB); lowering it makes Hive estimate more
reducers for the vertex.
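As a sketch, this is what adjusting the setting looks like in a Hive session; the values below are illustrative assumptions, not tuned recommendations, and whether a forced reducer count takes effect can depend on the Hive/Tez versions in use:

```sql
-- Sketch only: the default is ~256000000 bytes per reducer.
-- Halving it roughly doubles the estimated reducer count for the vertex.
SET hive.exec.reducers.bytes.per.reducer=128000000;

-- Alternatively, force a reducer count outright (overrides the estimate):
SET mapred.reduce.tasks=100;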

~Rajesh.B

On Mon, May 25, 2015 at 5:19 PM, David Ginzburg <da...@gmail.com>
wrote:

> Thank you again !
>
> The distribution over the partitions is quite uniform.
>
> Regarding option #1, how can I increase the number of reducers for the
> vertex?


-- 
~Rajesh.B

Re: hive.tez.auto.reducer.parallelism can solve skew join problem?

Posted by David Ginzburg <da...@gmail.com>.
Thank you again !

The distribution over the partitions is quite uniform.

Regarding option #1, how can I increase the number of reducers for the
vertex?

On Mon, May 25, 2015 at 2:11 PM, Rajesh Balamohan <
rajesh.balamohan@gmail.com> wrote:

>
> Forgot to mention another scenario #3 in earlier mail.
>
> 1. If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is
> approximately 1.0, you can possibly increase the number of reducers for the
> vertex.
>
> 2. If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is much less
> than 0.2 (~20%) and almost all the records are processed by this
> reducer, it could mean data skew.  In this case, you might want to consider
> increasing the amount of memory allocated (try increasing the container
> size to see if it helps the situation).
>
> 3. In some cases, the REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS ratio might be
> in between (i.e., 0.3-0.8). In such cases, if most of the records are
> processed by this reducer, you might want to check the partition logic.
>
>
> To answer your question: yes, if the counters show that #2 is the
> case, you might want to increase the memory and try it out.

Re: hive.tez.auto.reducer.parallelism can solve skew join problem?

Posted by Rajesh Balamohan <ra...@gmail.com>.
Forgot to mention another scenario #3 in earlier mail.

1. If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is
approximately 1.0, you can possibly increase the number of reducers for the
vertex.

2. If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is much less
than 0.2 (~20%) and almost all the records are processed by this
reducer, it could mean data skew.  In this case, you might want to consider
increasing the amount of memory allocated (try increasing the container
size to see if it helps the situation).

3. In some cases, the REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS ratio might be
in between (i.e., 0.3-0.8). In such cases, if most of the records are
processed by this reducer, you might want to check the partition logic.


To answer your question: yes, if the counters show that #2 is the
case, you might want to increase the memory and try it out.
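The three scenarios above can be sketched as a small helper that classifies a reducer vertex from its two Tez counters. The thresholds are the rough ones from this thread, and the function name and return strings are made up for illustration:

```python
def diagnose_reducer(reduce_input_groups, reduce_input_records):
    """Rough classification of a reducer vertex from its Tez counters."""
    ratio = reduce_input_groups / float(reduce_input_records)
    if ratio >= 0.9:
        # ~1.0: many distinct keys, so more reducers could share the work
        return "increase reducer parallelism for the vertex"
    if ratio < 0.2:
        # few keys dominate the input: likely data skew
        return "likely data skew; try a larger container / more memory"
    # in between (~0.3-0.8): the partitioning itself may be uneven
    return "check the partition logic"

print(diagnose_reducer(95, 100))  # -> increase reducer parallelism for the vertex
print(diagnose_reducer(10, 100))  # -> likely data skew; try a larger container / more memory
```

In practice the two counter values come from the Tez UI or job counters for the vertex in question.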



On Mon, May 25, 2015 at 3:25 PM, David Ginzburg <da...@gmail.com>
wrote:

> Thank you,
> It is my understanding that you suspect a skew in the data, and suggest an
> increase of heap for that single reducer?


-- 
~Rajesh.B

Re: hive.tez.auto.reducer.parallelism can solve skew join problem?

Posted by David Ginzburg <da...@gmail.com>.
Thank you,
It is my understanding that you suspect a skew in the data, and suggest
increasing the heap for that single reducer?

On Mon, May 25, 2015 at 12:45 PM, Rajesh Balamohan <
rajesh.balamohan@gmail.com> wrote:

>
> As of today, Tez auto-parallelism can only decrease the number of reducers
> allocated. It cannot increase the number of tasks at runtime (that may
> come in future releases).
>
> - If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is
> approximately 1.0, you can possibly increase the number of reducers for the
> vertex.
> - If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is much less
> than 0.2 (~20%), this could mean a single reducer is taking up most
> of the records.  In this case, you might want to consider increasing the
> amount of memory allocated (try increasing the container size to see if
> it helps the situation).
>
> ~Rajesh.B

Re: hive.tez.auto.reducer.parallelism can solve skew join problem?

Posted by Rajesh Balamohan <ra...@gmail.com>.
As of today, Tez auto-parallelism can only decrease the number of reducers
allocated. It cannot increase the number of tasks at runtime (that may
come in future releases).

- If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is
approximately 1.0, you can possibly increase the number of reducers for the
vertex.
- If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is much less
than 0.2 (~20%), this could mean a single reducer is taking up most
of the records.  In this case, you might want to consider increasing the
amount of memory allocated (try increasing the container size to see if
it helps the situation).
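For the second case, a minimal sketch of what bumping the container and heap sizes can look like in a Hive-on-Tez session; the sizes below are illustrative assumptions, not recommendations, and must stay within the cluster's YARN limits:

```sql
-- Illustrative sizes only; pick values your cluster's YARN limits allow.
SET hive.tez.container.size=4096;    -- container size in MB
SET hive.tez.java.opts=-Xmx3276m;    -- JVM heap, roughly 80% of the container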

~Rajesh.B

On Mon, May 25, 2015 at 2:41 PM, David Ginzburg <da...@gmail.com>
wrote:

> Thank you,
> Already tried this, with no effect on the number of reducers.


-- 
~Rajesh.B