Posted to issues@spark.apache.org by "J.P Feng (JIRA)" <ji...@apache.org> on 2016/10/26 01:25:58 UTC

[jira] [Closed] (SPARK-18077) Run insert overwrite statements in spark to overwrite a partitioned table is very slow

     [ https://issues.apache.org/jira/browse/SPARK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J.P Feng closed SPARK-18077.
----------------------------
    Resolution: Won't Fix

I will open another issue, since there are some mistakes in this one.

> Run insert overwrite statements in spark  to overwrite a partitioned table is very slow
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-18077
>                 URL: https://issues.apache.org/jira/browse/SPARK-18077
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: spark 2.0
> hive 2.0.1
> driver memory: 4g
> total executors: 4
> executor memory: 10g
> total cores: 13
>            Reporter: J.P Feng
>              Labels: hive, insert, sparkSQL
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Hello all. I have run into a strange issue in my project.
> There is a table:
> CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` string, `recdate` string)
> PARTITIONED BY (`pt` string, `dt` string) stored as orc;
> another table:
> CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` string, `happened_time` int)
> PARTITIONED BY (`pt` string, `dt` string)
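> (To reproduce, a few sample rows can be loaded into tbllog_login as below; the values here are made up, and in my case the partition holds a few hundred thousand rows, which is needed to see the slowdown.)
> INSERT INTO TABLE tbllog_login PARTITION (pt='mix_en', dt='2016-10-21')
> VALUES ('s001', 1001, 'acct_001', 1476979200), ('s001', 1002, 'acct_002', 1476979260);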
> ----------------------------------------------------------------------------------------------
> Test-1:
> I executed the following SQL in spark-shell or spark-sql (before running it, partition (pt='mix_en', dt='2016-10-21') of table login4game already contained a lot of data):
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
> It takes a lot of time; below is part of the logs:
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> [Stage 5:=======================================>               (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] 
> [Stage 5:=========================================>             (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] 
> [Stage 5:============================================>          (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] 
> [Stage 5:================================================>      (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
> [Stage 5:================================================>      (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
> [Stage 5:====================================================>  (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] 
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> .......
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> I can see that the original data is moved to .Trash.
> After that there is no log output for about 10 minutes, and then the logging resumes:
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00000, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00001, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00002, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00003, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00004, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004, Status:true
> ....
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00199, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199, Status:true
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> After that the SQL finishes, but it takes over 10 minutes to handle 257,128 rows of data.
> ---------------------------------------------------------------------------------------------
> Test-2:
> If I first remove all data from partition (pt='mix_en', dt='2016-10-21') of table login4game and then run the following SQL:
> ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21');
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
> The logs look the same:
> ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> [Stage 5:=======================================>               (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] 
> [Stage 5:=========================================>             (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] 
> [Stage 5:============================================>          (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] 
> [Stage 5:================================================>      (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
> [Stage 5:================================================>      (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
> [Stage 5:====================================================>  (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] 
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> .......
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> But this time, after the move to trash, there is no 10-minute wait; the next log lines appear immediately.
> The SQL finishes in about 20 seconds and I get the same result as in Test-1 (the DROP PARTITION statement takes about 5 seconds).
> ----------------------------------------------------------------------------------------------
> Test-3:
> Now I run the same SQL against Hive directly (not through spark-sql or spark-shell, but in beeline or the Hive client).
> Before running, I load the same data into partition (pt='mix_en', dt='2016-10-21') of login4game.
> Then I run the following SQL:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
> It finishes in about 20 seconds with the same result.
> ==========================================================
> So the question is: why is it so slow when I use spark-sql to execute an INSERT OVERWRITE statement on a partitioned table that already contains data? (If the selected partition holds no data, it runs fast.)
> Important points:
> 1. The move-to-trash step takes a lot of time when the partition already contains much data.
> 2. Running the same SQL in beeline is fast.
> 3. Dropping the selected partition before running the INSERT OVERWRITE statement speeds up the execution (see the combined statements below).
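> As a workaround for now I use the drop-then-insert pattern from Test-2, combined below; note that this is not atomic, so the partition is briefly empty between the two statements:
> ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21');
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login where pt='mix_en' and dt='2016-10-21';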
> Could anyone help me with this? Thanks a lot!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
