You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "J.P Feng (JIRA)" <ji...@apache.org> on 2016/10/24 15:09:58 UTC

[jira] [Updated] (SPARK-18077) Run insert overwrite statements in spark to overwrite a partitioned table is very slow

     [ https://issues.apache.org/jira/browse/SPARK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J.P Feng updated SPARK-18077:
-----------------------------
    Description: 
Hello,all. I face a strange thing in my project.

there is a table:

CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` string, `recdate` string)
PARTITIONED BY (`pt` string, `dt` string) stored as orc;

another table:
CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` string, `happened_time` int)
PARTITIONED BY (`pt` string, `dt` string)


----------------------------------------------------------------------------------------------
Test-1:

executed sql in spark-shell or spark-sql( before i run this sql, there is much data in partition(pt='mix_en', dt='2016-10-21') of table login4game ):

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 

it will cost a lot of time, below is a part of the logs:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////
[Stage 5:=======================================>               (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] 
[Stage 5:=========================================>             (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] 
[Stage 5:============================================>          (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] 
[Stage 5:================================================>      (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
[Stage 5:================================================>      (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
[Stage 5:====================================================>  (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] 
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
.......
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////


i can see, the origin data is moved to .trash

and then, there is no log printing, and after about 10 min, the log print again:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00000, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000, Status:true
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00001, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001, Status:true
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00002, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002, Status:true
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00003, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003, Status:true
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00004, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004, Status:true
....
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00199, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199, Status:true
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////


after which, the sql executing is end. but it costs over 10 min to handle 257128 lines of data

---------------------------------------------------------------------------------------------
Test-2:

if i remove all data of partition(pt='mix_en',dt='2016-10-21') in table login4game, and run such sql:

ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21');

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 


the logs is the same:

////////////////////////////////////////////////////////////////////////////////////////////////////////////////
[Stage 5:=======================================>               (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] 
[Stage 5:=========================================>             (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] 
[Stage 5:============================================>          (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] 
[Stage 5:================================================>      (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
[Stage 5:================================================>      (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
[Stage 5:====================================================>  (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] 
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
.......
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////

but after moving to trash, i don't need to wait for 10min, and the next log appears immediately.

for about 20 seconds, this sql is finished, and i get the same result as the above executing(the drop partitions statement costs about 5 second).



----------------------------------------------------------------------------------------------
Test-3:

now, if i run the same sql in hive server(not using spark-sql or spark-shell to run, just run it in beeline or hive client):

before running, i also load the same data in to login4game in partition(pt='mix_en',dt='2016-10-21')

then i run such sql:

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 

after running, i get the same result  for about 20 seconds.

==========================================================

now, it's the question, why its so slow when i use spark-sql to execute insert overwrite statement to handle a partitioned table with data(if there is no data in selected partitions, it runs fast) ?

important point:

1. it costs a lot of time in running move to trash when there is much data
2. running the same sql in beeline is fast
3. drop data from selected partitions before running executing insert overwrite statement will speeding the execution

Is there anyone could help me? Thanks a lot!






  was:
Hello,all. I face a strange thing in my project.

there is a table:

CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` string, `recdate` string)
PARTITIONED BY (`pt` string, `dt` string) stored as orc;

another table:
CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` string, `happened_time` int)
PARTITIONED BY (`pt` string, `dt` string)



Test-1:

executed sql in spark-shell or spark-sql( before i run this sql, there is much data in partition(pt='mix_en', dt='2016-10-21') of table login4game ):

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 

it will cost a lot of time, below is a part of the logs:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////
[Stage 5:=======================================>               (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] 
[Stage 5:=========================================>             (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] 
[Stage 5:============================================>          (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] 
[Stage 5:================================================>      (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
[Stage 5:================================================>      (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
[Stage 5:====================================================>  (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] 
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
.......
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////


i can see, the origin data is moved to .trash

and then, there is no log printing, and after about 10 min, the log print again:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00000, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000, Status:true
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00001, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001, Status:true
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00002, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002, Status:true
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00003, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003, Status:true
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00004, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004, Status:true
....
16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00199, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199, Status:true
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////


after which, the sql executing is end. but it costs over 10 min to handle 257128 lines of data


Test-2:

if i remove all data of partition(pt='mix_en',dt='2016-10-21') in table login4game, and run such sql:

ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21');

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 


the logs is the same:

////////////////////////////////////////////////////////////////////////////////////////////////////////////////
[Stage 5:=======================================>               (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] 
[Stage 5:=========================================>             (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] 
[Stage 5:============================================>          (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] 
[Stage 5:================================================>      (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
[Stage 5:================================================>      (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
[Stage 5:====================================================>  (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] 
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
.......
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////

but after moving to trash, i don't need to wait for 10min, and the next log appears immediately.

for about 20 seconds, this sql is finished, and i get the same result as the above executing(the drop partitions statement costs about 5 second).

Test-3:

now, if i run the same sql in hive server(not using spark-sql or spark-shell to run, just run it in beeline or hive client):

before running, i also load the same data in to login4game in partition(pt='mix_en',dt='2016-10-21')

then i run such sql:

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 

after running, i get the same result  for about 20 seconds.

==========================================================

now, it's the question, why its so slow when i use spark-sql to execute insert overwrite statement to handle a partitioned table with data(if there is no data in selected partitions, it runs fast) ?

important point:

1. it costs a lot of time in running move to trash when there is much data
2. running the same sql in beeline is fast
3. drop data from selected partitions before running executing insert overwrite statement will speeding the execution

Is there anyone could help me? Thanks a lot!







> Run insert overwrite statements in spark  to overwrite a partitioned table is very slow
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-18077
>                 URL: https://issues.apache.org/jira/browse/SPARK-18077
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: spark 2.0
> hive 2.0.1
> driver memory: 4g
> total executors: 4
> executor memory: 10g
> total cores: 13
>            Reporter: J.P Feng
>              Labels: hive, insert, sparkSQL
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Hello,all. I face a strange thing in my project.
> there is a table:
> CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` string, `recdate` string)
> PARTITIONED BY (`pt` string, `dt` string) stored as orc;
> another table:
> CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` string, `happened_time` int)
> PARTITIONED BY (`pt` string, `dt` string)
> ----------------------------------------------------------------------------------------------
> Test-1:
> executed sql in spark-shell or spark-sql( before i run this sql, there is much data in partition(pt='mix_en', dt='2016-10-21') of table login4game ):
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
> it will cost a lot of time, below is a part of the logs:
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> [Stage 5:=======================================>               (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] 
> [Stage 5:=========================================>             (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] 
> [Stage 5:============================================>          (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] 
> [Stage 5:================================================>      (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
> [Stage 5:================================================>      (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
> [Stage 5:====================================================>  (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] 
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> .......
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> i can see, the origin data is moved to .trash
> and then, there is no log printing, and after about 10 min, the log print again:
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00000, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00001, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00002, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00003, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00004, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004, Status:true
> ....
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00199, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199, Status:true
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> after which, the sql executing is end. but it costs over 10 min to handle 257128 lines of data
> ---------------------------------------------------------------------------------------------
> Test-2:
> if i remove all data of partition(pt='mix_en',dt='2016-10-21') in table login4game, and run such sql:
> ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21');
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
> the logs is the same:
> ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> [Stage 5:=======================================>               (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] 
> [Stage 5:=========================================>             (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] 
> [Stage 5:============================================>          (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] 
> [Stage 5:================================================>      (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
> [Stage 5:================================================>      (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
> [Stage 5:====================================================>  (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] 
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> .......
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> but after moving to trash, i don't need to wait for 10min, and the next log appears immediately.
> for about 20 seconds, this sql is finished, and i get the same result as the above executing(the drop partitions statement costs about 5 second).
> ----------------------------------------------------------------------------------------------
> Test-3:
> now, if i run the same sql in hive server(not using spark-sql or spark-shell to run, just run it in beeline or hive client):
> before running, i also load the same data in to login4game in partition(pt='mix_en',dt='2016-10-21')
> then i run such sql:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')    select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
> after running, i get the same result  for about 20 seconds.
> ==========================================================
> now, it's the question, why its so slow when i use spark-sql to execute insert overwrite statement to handle a partitioned table with data(if there is no data in selected partitions, it runs fast) ?
> important point:
> 1. it costs a lot of time in running move to trash when there is much data
> 2. running the same sql in beeline is fast
> 3. drop data from selected partitions before running executing insert overwrite statement will speeding the execution
> Is there anyone could help me? Thanks a lot!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org