Posted to user@spark.apache.org by Bijay Kumar Pathak <bk...@mtu.edu> on 2016/05/04 17:02:38 UTC

Performance with Insert overwrite into Hive Table.

Hello,

I am writing a DataFrame of 60+ GB into a partitioned Hive table using
hiveContext in Parquet format. The Spark insert overwrite job completes in
a reasonable amount of time, around 20 minutes.

But the job then takes a huge amount of time, more than 2 hours, to copy data
from the .hive-staging directory in HDFS to the final partition directory. What
could be the potential problem here?

hive_c.sql("""
                INSERT OVERWRITE TABLE {0} PARTITION
(row_eff_end_dt='{1}', ccd_dt)
                SELECT * from temp_table
                """.format(table_name, eff_end_dt)
                )
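
For context, a dynamic-partition insert like this needs dynamic partitioning
enabled on the HiveContext before it will run. A minimal sketch of the
surrounding setup, assuming hive_c is a HiveContext and df is the source
DataFrame (the variable names here are illustrative):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="insert-overwrite-hive")
    hive_c = HiveContext(sc)

    # ccd_dt is resolved at runtime, so Hive must allow dynamic
    # partitioning in non-strict mode for this INSERT OVERWRITE.
    hive_c.setConf("hive.exec.dynamic.partition", "true")
    hive_c.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    # Expose the DataFrame to SQL as the temp_table referenced above.
    df.registerTempTable("temp_table")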

And the process below, from the log, is what is taking more than 2 hours:

16/05/04 06:41:28 INFO Hive: Replacing src:hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-10000/ccd_dt=2012-09-02/part-00306, dest: hdfs://internal:8020/user/hadoop/so_core_us/row_eff_end_dt=9999-12-31/ccd_dt=2012-09-02/part-00306, Status:true
16/05/04 06:41:28 INFO Hive: New loading path = hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-10000/ccd_dt=2012-09-02 with partSpec {row_eff_end_dt=9999-12-31, ccd_dt=2012-09-02}
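
Each "Replacing src ... dest ..." line above is a single part file being moved
out of the staging directory, so this load step scales with the number of part
files. One way to shrink it is to cut the file count before the insert; a
rough sketch (200 is an illustrative target, tuned for roughly 128 MB-1 GB
output files):

    # Fewer, larger part files mean fewer per-file moves during the
    # Hive load step. coalesce reduces partitions without a full shuffle.
    df.coalesce(200).registerTempTable("temp_table")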


Thanks,
Bijay

Re: Performance with Insert overwrite into Hive Table.

Posted by Bijay Kumar Pathak <bk...@mtu.edu>.
Thanks, Ted. This looks like the issue, since I am running it on EMR and the
Hive version there is 1.0.0.
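
If that JIRA is indeed the cause, one possible workaround is to bypass Hive's
staging-to-final move entirely: write the Parquet files straight to the
partition path with the DataFrame writer and register the partitions
afterwards. A sketch only, assuming the target is an external table named
so_core_us whose location is the base path seen in the log (the table name is
inferred from the HDFS paths, not confirmed in the thread):

    # Illustrative workaround, not verified on this cluster. Overwrite
    # replaces everything under the row_eff_end_dt=... directory.
    base = "hdfs://internal:8020/user/hadoop/so_core_us"

    (df.write
       .mode("overwrite")
       .partitionBy("ccd_dt")
       .parquet("{0}/row_eff_end_dt={1}".format(base, eff_end_dt)))

    # Let the metastore discover the new partition directories.
    hive_c.sql("MSCK REPAIR TABLE so_core_us")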


Thanks,
Bijay

On Wed, May 4, 2016 at 10:29 AM, Ted Yu <yu...@gmail.com> wrote:

> Looks like you were hitting HIVE-11940

Re: Performance with Insert overwrite into Hive Table.

Posted by Ted Yu <yu...@gmail.com>.
Looks like you were hitting HIVE-11940
