Posted to commits@spark.apache.org by do...@apache.org on 2021/08/27 19:59:24 UTC

[spark] branch master updated: [SPARK-36327][SQL] Spark sql creates staging dir inside database directory rather than creating inside table directory

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new fe7bf5f  [SPARK-36327][SQL] Spark sql creates staging dir inside database directory rather than creating inside table directory
fe7bf5f is described below

commit fe7bf5f96fca8b6b4ea8d7af4a4267d9e18f232e
Author: senthilkumarb <se...@cloudera.com>
AuthorDate: Fri Aug 27 12:58:28 2021 -0700

    [SPARK-36327][SQL] Spark sql creates staging dir inside database directory rather than creating inside table directory
    
    ### What changes were proposed in this pull request?
    
    This PR makes a minor change in the file SaveAsHiveFile.scala.
    
    It contains the following change:
    
    1. dropping getParent from the code below (see the sketch just after this snippet)
    ```
    if (extURI.getScheme == "viewfs") {
      getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
    ```
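    
    To make the effect of dropping getParent concrete, here is a minimal runnable sketch using Hadoop's Path API; the table location is hypothetical and the staging-dir name is abbreviated for illustration:
    
    ```
    import org.apache.hadoop.fs.Path
    
    object StagingPathDemo {
      def main(args: Array[String]): Unit = {
        // Hypothetical viewfs table location, for illustration only.
        val tablePath = new Path("viewfs://cloudera/user/daisuke/dicedb/part_test")
    
        // Before the fix: the staging dir was resolved against
        // path.getParent, i.e. the database directory.
        println(new Path(tablePath.getParent, ".hive-staging"))
        // viewfs://cloudera/user/daisuke/dicedb/.hive-staging
    
        // After the fix: the staging dir is resolved against the
        // table path itself.
        println(new Path(tablePath, ".hive-staging"))
        // viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging
      }
    }
    ```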
    
    ### Why are the changes needed?
    
    Hive creates .staging directories inside the "/db/table/" location, but Spark-sql creates .staging directories inside the "/db/" location when Hadoop federation (viewfs) is used. For other filesystems such as hdfs, Spark-sql works as expected, creating .staging inside the "/db/table/" location.
    
    In Hive:
    ```
     beeline
    > use dicedb;
    > insert into table part_test partition (j=1) values (1);
    ...
    INFO  : Loading data to table dicedb.part_test partition (j=1) from **viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-10000**
    ```
    
    But Spark's behaviour is:
    
    ```
    spark-sql> use dicedb;
    spark-sql> insert into table part_test partition (j=2) values (2);
    21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
    ...
    ```
    
    The reason we require this change: if we allow spark-sql to create .staging directories inside the "/db/" location, we end up with security issues, because we would have to grant permissions on the "viewfs:///db/" location to every user who submits Spark jobs.
    
    After this change is applied, spark-sql creates .staging inside "/db/table/", similar to Hive, as below:
    
    ```
    spark-sql> use dicedb;
    21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
    spark-sql> insert into table part_test partition (j=8) values (8);
    21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, current connections: 1
    21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
    ```
    
    The reason this issue occurs only in Spark-sql and not in Hive:
    
    In Hive, a "/db/table/tmp" directory structure is passed as the path, so path.getParent returns "/db/table/". But Spark passes just "/db/table", so path.getParent must not be used for Hadoop federation (viewfs).
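    
    A minimal sketch of that difference, again with Hadoop's Path API (directory names are hypothetical):
    
    ```
    import org.apache.hadoop.fs.Path
    
    // Hive passes a path with a trailing tmp component, so getParent
    // still lands on the table directory.
    val hivePath = new Path("viewfs://cloudera/user/daisuke/dicedb/part_test/tmp")
    println(hivePath.getParent)  // .../dicedb/part_test
    
    // Spark passes the table directory itself, so getParent climbs
    // to the database directory -- one level too high for viewfs.
    val sparkPath = new Path("viewfs://cloudera/user/daisuke/dicedb/part_test")
    println(sparkPath.getParent) // .../dicedb
    ```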
    
    ### Does this PR introduce _any_ user-facing change?
     No
    
    ### How was this patch tested?
    
    Tested manually by creating hive-sql.jar
    
    Closes #33577 from senthh/viewfs-792392.
    
    Authored-by: senthilkumarb <se...@cloudera.com>
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
 .../main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
index ec18934..b30de0c 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
@@ -188,7 +188,7 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       stagingDir: String): Path = {
     val extURI: URI = path.toUri
     if (extURI.getScheme == "viewfs") {
-      getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
+      getExtTmpPathRelTo(path, hadoopConf, stagingDir)
     } else {
       new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-10000")
     }
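
For reference, the end-to-end staging path that the viewfs branch now produces can be sketched with Hadoop's Path API alone; the names below are illustrative stand-ins, not the real output of the getExtTmpPathRelTo helper (whose body is not part of this diff):

```
import org.apache.hadoop.fs.Path

object ExtTmpPathDemo {
  def main(args: Array[String]): Unit = {
    // Staging dir under the table path, with Hive's conventional
    // "-ext-10000" output subdirectory beneath it (see the diff
    // context and the Hive log line above).
    val tablePath  = new Path("viewfs://cloudera/user/daisuke/dicedb/part_test")
    val stagingDir = new Path(tablePath, ".hive-staging_hive_demo-1")
    println(new Path(stagingDir, "-ext-10000"))
    // viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_demo-1/-ext-10000
  }
}
```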

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org