You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Jay <ja...@gmail.com> on 2017/09/13 09:24:57 UTC

Out of Memory while generating ORC Splits

Hi All,

I am running a simple select query as below

select distinct vehicle_no from
rmd.gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3 where incident_dt =
'2999-01-01';

The table is a 2 level partitioned table as shown below

drwx------   - gpadmin hdfs          0 2017-09-12 14:36
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2010-01-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:36
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2011-01-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:35
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2012-01-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:36
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2013-01-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:36
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-01-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:36
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-02-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:36
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-03-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:36
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-04-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:34
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-05-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:33
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-06-01
drwx------   - gpadmin hdfs          0 2017-09-12 14:33
/apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-07-01


The ORC files have been created with a rough size of 2 GB and have ZLIB
compression.

When the hive.exec.orc.split.strategy is set to HYBRID in our HDP 2.6.1
cluster the MAP phase is stuck in the INITIALIZATION phases and after about
20 minutes it fails with OOM.

When I change hive.exec.orc.split.strategy to BI the SQL runs fine without
any issues.

My question is what parameter controls the memory assigned while Hive/Tez
generates the splits?

the hive container size is set to 8GB

Thanks,
Jayadeep

Re: Out of Memory while generating ORC Splits

Posted by Rajesh Balamohan <rb...@apache.org>.
>> Also, this parameter is set to false by default should this be set to
true ?
This is set to false for predictable query performance with multiple runs.
If set to true, there is a chance of footer getting evicted during GC and
get into unpredictable performance.

>> I see that the number of mappers generated by setting this parameter to
true is less than the number of mappers generated by setting the
split.strategy=BI. Therefore, I am hoping that using this parameter along
with HYBRID is better than using BI split strategy. Can you please comment
on this?
BI does not carry out any optimization at split gen phase. It would be
equal to the number of files being included in the split gen. ETL can do
additional optimizations based on the information present in footer. Hybrid
would combine BI and ETL strategy on need basis. Hence, the number of
splits generated by HYBRID/ETL can be different than that of BI.
"hive.orc.cache.use.soft.references=true" needs to be used along with
HYBRID/ETL strategy. HYBRID is the default strategy.

On Wed, Sep 13, 2017 at 3:23 PM, Jay <ja...@gmail.com> wrote:

> Thanks Rajesh, this helped solve the OOM issue. I was going through the
> wiki documentation for this parameter and was not able to understand it
> clearly, can you please explain the significance of this ?
>
> I have 2 questions:-
> 1. Also, this parameter is set to false by default should this be set to
> true ?
> 2. I see that the number of mappers generated by setting this parameter to
> true is less than the number of mappers generated by setting the
> split.strategy=BI. Therefore, I am hoping that using this parameter along
> with HYBRID is better than using BI split strategy. Can you please comment
> on this?
>
> Thanks,
> Jayadeep
>
> On Wed, Sep 13, 2017 at 3:14 PM, Rajesh Balamohan <rb...@apache.org>
> wrote:
>
>> With "HYBRID" can you try with "hive.orc.cache.use.soft.references=true"?
>> That should help in preventing OOM with Hybrid strategy.
>>
>> ~Rajesh.B
>>
>> On Wed, Sep 13, 2017 at 2:54 PM, Jay <ja...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am running a simple select query as below
>>>
>>> select distinct vehicle_no from rmd.gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3
>>> where incident_dt = '2999-01-01';
>>>
>>> The table is a 2 level partitioned table as shown below
>>>
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2010-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2011-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:35
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2012-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2013-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-02-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-03-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-04-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:34
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-05-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:33
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-06-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:33
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-07-01
>>>
>>>
>>> The ORC files have been created with a rough size of 2 GB and have ZLIB
>>> compression.
>>>
>>> When the hive.exec.orc.split.strategy is set to HYBRID in our HDP 2.6.1
>>> cluster the MAP phase is stuck in the INITIALIZATION phases and after about
>>> 20 minutes it fails with OOM.
>>>
>>> When I change hive.exec.orc.split.strategy to BI the SQL runs fine
>>> without any issues.
>>>
>>> My question is what parameter controls the memory assigned while
>>> Hive/Tez generates the splits?
>>>
>>> the hive container size is set to 8GB
>>>
>>> Thanks,
>>> Jayadeep
>>>
>>
>>
>

Re: Out of Memory while generating ORC Splits

Posted by Jay <ja...@gmail.com>.
Thanks Rajesh, this helped solve the OOM issue. I was going through the
wiki documentation for this parameter and was not able to understand it
clearly, can you please explain the significance of this ?

I have 2 questions:-
1. Also, this parameter is set to false by default should this be set to
true ?
2. I see that the number of mappers generated by setting this parameter to
true is less than the number of mappers generated by setting the
split.strategy=BI. Therefore, I am hoping that using this parameter along
with HYBRID is better than using BI split strategy. Can you please comment
on this?

Thanks,
Jayadeep

On Wed, Sep 13, 2017 at 3:14 PM, Rajesh Balamohan <rb...@apache.org>
wrote:

> With "HYBRID" can you try with "hive.orc.cache.use.soft.references=true"?
> That should help in preventing OOM with Hybrid strategy.
>
> ~Rajesh.B
>
> On Wed, Sep 13, 2017 at 2:54 PM, Jay <ja...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am running a simple select query as below
>>
>> select distinct vehicle_no from rmd.gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3
>> where incident_dt = '2999-01-01';
>>
>> The table is a 2 level partitioned table as shown below
>>
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2010-01-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2011-01-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:35
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2012-01-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2013-01-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-01-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-02-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-03-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-04-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:34
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-05-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:33
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-06-01
>> drwx------   - gpadmin hdfs          0 2017-09-12 14:33
>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-07-01
>>
>>
>> The ORC files have been created with a rough size of 2 GB and have ZLIB
>> compression.
>>
>> When the hive.exec.orc.split.strategy is set to HYBRID in our HDP 2.6.1
>> cluster the MAP phase is stuck in the INITIALIZATION phases and after about
>> 20 minutes it fails with OOM.
>>
>> When I change hive.exec.orc.split.strategy to BI the SQL runs fine
>> without any issues.
>>
>> My question is what parameter controls the memory assigned while Hive/Tez
>> generates the splits?
>>
>> the hive container size is set to 8GB
>>
>> Thanks,
>> Jayadeep
>>
>
>

Re: Out of Memory while generating ORC Splits

Posted by Rajesh Balamohan <rb...@apache.org>.
With "HYBRID" can you try with "hive.orc.cache.use.soft.references=true"?
That should help in preventing OOM with Hybrid strategy.

~Rajesh.B

On Wed, Sep 13, 2017 at 2:54 PM, Jay <ja...@gmail.com> wrote:

> Hi All,
>
> I am running a simple select query as below
>
> select distinct vehicle_no from rmd.gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3
> where incident_dt = '2999-01-01';
>
> The table is a 2 level partitioned table as shown below
>
> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2010-01-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2011-01-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:35
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2012-01-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2013-01-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-01-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-02-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-03-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-04-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:34
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-05-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:33
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-06-01
> drwx------   - gpadmin hdfs          0 2017-09-12 14:33
> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_
> ext_concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-07-01
>
>
> The ORC files have been created with a rough size of 2 GB and have ZLIB
> compression.
>
> When the hive.exec.orc.split.strategy is set to HYBRID in our HDP 2.6.1
> cluster the MAP phase is stuck in the INITIALIZATION phases and after about
> 20 minutes it fails with OOM.
>
> When I change hive.exec.orc.split.strategy to BI the SQL runs fine without
> any issues.
>
> My question is what parameter controls the memory assigned while Hive/Tez
> generates the splits?
>
> the hive container size is set to 8GB
>
> Thanks,
> Jayadeep
>