Posted to issues@spark.apache.org by "jiaan.geng (JIRA)" <ji...@apache.org> on 2019/03/13 02:36:00 UTC

[jira] [Updated] (SPARK-27140) The 'insert overwrite local directory' feature behaves inconsistently across environments

     [ https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng updated SPARK-27140:
-------------------------------
    Description: 
In local[*] mode, maropu gave the following test case:
{code:java}
$ls /tmp/noexistdir
ls: /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
scala> spark.table("t").explain
== Physical Plan ==
Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")

$ls /tmp/noexistdir/t/
_SUCCESS  part-00000-bbea4213-071a-49b4-aac8-8510e7263d45-c000
{code}
This test case shows that Spark creates the nonexistent path and moves the intermediate results from the local temporary path into the newly created directory. The test was run against the latest master.
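As an illustration only, the expected semantics (create the missing directory, then move the staged results into it) can be sketched with plain java.nio.file. This is a hypothetical sketch, not Spark's actual implementation; the staging layout and names are invented for the example.

```scala
import java.nio.file.{Files, StandardCopyOption}

// Hypothetical sketch of "create the missing path, then move staged results".
// Not Spark code; file names and layout are invented for illustration.

// Simulate Spark's local staging directory with one part file plus _SUCCESS.
val staging = Files.createTempDirectory("spark-staging-")
Files.write(staging.resolve("part-00000"), "1,1\n".getBytes("UTF-8"))
Files.createFile(staging.resolve("_SUCCESS"))

// A target path that does not exist yet, standing in for /tmp/noexistdir/t.
val target = Files.createTempDirectory("demo-").resolve("noexistdir").resolve("t")
Files.createDirectories(target) // step 1: create the nonexistent path

// Step 2: move every staged file into the freshly created directory.
Files.list(staging).forEach { f =>
  Files.move(f, target.resolve(f.getFileName), StandardCopyOption.REPLACE_EXISTING)
}

assert(Files.isDirectory(target), "target must be a directory, not a file")
println(Files.list(target).count()) // two entries: part-00000 and _SUCCESS
```

Under these semantics the target is always a directory, which matches what the master-branch test above observes.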

I followed the test case provided by maropu, but observed different behavior.
 I ran the same SQL in local[*] deploy mode on Spark 2.3.0.
 The inconsistent behavior appears as follows:
{code:java}
ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.table("t").explain
== Physical Plan ==
HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+                                                                       
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")
res1: org.apache.spark.sql.DataFrame = [] 

ls /tmp/noexistdir/t/
/tmp/noexistdir/t

vi /tmp/noexistdir/t
  1 
{code}
Then I pulled the master branch, compiled it, and deployed it on my Hadoop cluster. The inconsistent behavior appeared again. The Spark version under test was 3.0.0-SNAPSHOT.
{code:java}
ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Spark context Web UI available at http://10.198.66.204:55326
Spark context available as 'sc' (master = local[*], app id = local-1551259036573).
Spark session available as 'spark'.
Welcome to spark version 3.0.0-SNAPSHOT
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("""select * from t""").show
+---+---+                                                                       
| c0| c1|
+---+---+
|  1|  1|
+---+---+


scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")
res1: org.apache.spark.sql.DataFrame = []                                       

scala> 
ll /tmp/noexistdir/t
-rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t
vi /tmp/noexistdir/t
  1
{code}
So /tmp/noexistdir/t is again a plain file, not a directory.
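To make the two outcomes easy to compare, a small diagnostic helper can report whether a target path ended up as a directory of part files or as a single regular file. This is a hypothetical helper using plain java.io, not Spark code; the demo paths are stand-ins for /tmp/noexistdir/t.

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical diagnostic: classify how the output path ended up.
// On master the target is a directory with part files; on my 2.3.0
// runs it is a plain file. For comparison only, not Spark code.
def describeTarget(path: String): String = {
  val f = new File(path)
  if (!f.exists()) "missing"
  else if (f.isDirectory) s"directory with ${f.listFiles().length} entries"
  else s"regular file of ${f.length()} bytes"
}

// Demo against freshly created paths (stand-ins for /tmp/noexistdir/t):
val asDir = Files.createTempDirectory("t-as-dir-").toFile.getPath
val asFile = File.createTempFile("t-as-file-", "").getPath
println(describeTarget(asDir))  // directory with 0 entries
println(describeTarget(asFile)) // regular file of 0 bytes
```

Running this check after each `insert overwrite local directory` attempt would show "directory with N entries" on master but "regular file of 0 bytes" in the 2.3.0 runs above.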

I created a PR (https://github.com/apache/spark/pull/23950) to exercise this behavior in a unit test.

The UT results match those of maropu's test, but differ from mine.

 

> The 'insert overwrite local directory' feature behaves inconsistently across environments
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-27140
>                 URL: https://issues.apache.org/jira/browse/SPARK-27140
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.4.0, 3.0.0
>            Reporter: jiaan.geng
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org