Posted to reviews@spark.apache.org by kevinyu98 <gi...@git.apache.org> on 2016/06/04 00:51:11 UTC

[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...

GitHub user kevinyu98 opened a pull request:

    https://github.com/apache/spark/pull/13506

    [SPARK-15763][SQL] Support DELETE FILE command natively

    ## What changes were proposed in this pull request?
    Hive supports these CLI commands to manage resources ([Hive Doc](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)):
    `ADD/DELETE (FILE(S) <filepath ...> | JAR(S) <jarpath ...>)`
    `LIST (FILE(S) [filepath ...] | JAR(S) [jarpath ...])`
    
    but Spark currently supports only two of them:
    `ADD (FILE <filepath> | JAR <jarpath>)`
    `LIST (FILE(S) [filepath ...] | JAR(S) [jarpath ...])`
    
    This PR adds the DELETE FILE command to Spark SQL; I will submit a follow-up PR for DELETE JAR(s).
    
    `DELETE FILE <filepath>`
    
    ## Example:
    **DELETE FILE**
    ```
    scala> spark.sql("add file /Users/qianyangyu/myfile.txt")
    res0: org.apache.spark.sql.DataFrame = []
    
    scala> spark.sql("add file /Users/qianyangyu/myfile2.txt")
    res1: org.apache.spark.sql.DataFrame = []
    
    scala> spark.sql("list file")
    res2: org.apache.spark.sql.DataFrame = [Results: string]
    
    scala> spark.sql("list file").show(false)
    +----------------------------------+
    |Results                           |
    +----------------------------------+
    |file:/Users/qianyangyu/myfile2.txt|
    |file:/Users/qianyangyu/myfile.txt |
    +----------------------------------+
    scala> spark.sql("delete file /Users/qianyangyu/myfile.txt")
    res4: org.apache.spark.sql.DataFrame = []
    
    scala> spark.sql("list file").show(false)
    +----------------------------------+
    |Results                           |
    +----------------------------------+
    |file:/Users/qianyangyu/myfile2.txt|
    +----------------------------------+
    
    
    scala> spark.sql("delete file /Users/qianyangyu/myfile2.txt")
    res6: org.apache.spark.sql.DataFrame = []
    
    scala> spark.sql("list file").show(false)
    +-------+
    |Results|
    +-------+
    +-------+
    ```
    
    ## How was this patch tested?
    
    Added test cases to the Spark SQL, Spark shell, and SparkContext test suites.
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kevinyu98/spark spark-15763

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13506.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13506
    
----
commit 3b44c5978bd44db986621d3e8511e9165b66926b
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-04-20T18:06:30Z

    adding testcase

commit 18b4a31c687b264b50aa5f5a74455956911f738a
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-04-22T21:48:00Z

    Merge remote-tracking branch 'upstream/master'

commit 4f4d1c8f2801b1e662304ab2b33351173e71b427
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-04-23T16:50:19Z

    Merge remote-tracking branch 'upstream/master'
    get latest code from upstream

commit f5f0cbed1eb5754c04c36933b374c3b3d2ae4f4e
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-04-23T22:20:53Z

    Merge remote-tracking branch 'upstream/master'
    adding trim characters support

commit d8b2edbd13ee9a4f057bca7dcb0c0940e8e867b8
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-04-25T20:24:33Z

    Merge remote-tracking branch 'upstream/master'
    get latest code for pr12646

commit 196b6c66b0d55232f427c860c0e7c6876c216a67
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-04-25T23:45:57Z

    Merge remote-tracking branch 'upstream/master'
    merge latest code

commit f37a01e005f3e27ae2be056462d6eb6730933ba5
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-04-27T14:15:06Z

    Merge remote-tracking branch 'upstream/master'
    merge upstream/master

commit bb5b01fd3abeea1b03315eccf26762fcc23f80c0
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-04-30T23:49:31Z

    Merge remote-tracking branch 'upstream/master'

commit bde5820a181cf84e0879038ad8c4cebac63c1e24
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-04T03:52:31Z

    Merge remote-tracking branch 'upstream/master'

commit 5f7cd96d495f065cd04e8e4cc58461843e45bc8d
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-10T21:14:50Z

    Merge remote-tracking branch 'upstream/master'

commit 893a49af0bfd153ccb59ba50b63a232660e0eada
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-13T18:20:39Z

    Merge remote-tracking branch 'upstream/master'

commit 4bbe1fd4a3ebd50338ccbe07dc5887fe289cd53d
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-17T21:58:14Z

    Merge remote-tracking branch 'upstream/master'

commit b2dd795e23c36cbbd022f07a10c0cf21c85eb421
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-18T06:37:13Z

    Merge remote-tracking branch 'upstream/master'

commit 8c3e5da458dbff397ed60fcb68f2a46d87ab7ba4
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-18T16:18:16Z

    Merge remote-tracking branch 'upstream/master'

commit a0eaa408e847fbdc3ac5b26348588ee0a1e276c7
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-19T04:28:20Z

    Merge remote-tracking branch 'upstream/master'

commit d03c940ed89795fa7fe1d1e9f511363b22cdf19d
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-19T21:24:33Z

    Merge remote-tracking branch 'upstream/master'

commit d728d5e002082e571ac47292226eb8b2614f479f
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-24T20:32:57Z

    Merge remote-tracking branch 'upstream/master'

commit ea104ddfbf7d180ed1bc53dd9a1005010264aa1f
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-25T22:52:57Z

    Merge remote-tracking branch 'upstream/master'

commit 6ab1215b781ad0cccf1752f3a625b4e4e371c38e
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-05-27T17:18:46Z

    Merge remote-tracking branch 'upstream/master'

commit 0c566533705331697eb1b287b30c8b16111f6fa2
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-06-01T06:48:57Z

    Merge remote-tracking branch 'upstream/master'

commit d7a187490b31185d0a803cbbdeda67cb26c40056
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-06-01T22:55:17Z

    Merge remote-tracking branch 'upstream/master'

commit 85d35002ce864d5ce6fd3be7215a868a8867caf9
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-06-02T14:08:30Z

    Merge remote-tracking branch 'upstream/master'

commit c056f91036ec75d1e2c93f6f47ad842eb28a3e0b
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-06-03T06:06:51Z

    Merge remote-tracking branch 'upstream/master'

commit 6dd6ca9aedcad9b024cbe092b2ee7540c90c0136
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-06-03T23:33:12Z

    fix7

commit 0b8189dd454897ae73bb3a5ffc245b2c65f6b226
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-06-03T23:33:39Z

    Merge remote-tracking branch 'upstream/master'

commit 527749d014af672fd3942dea4823552bedb1749a
Author: Kevin Yu <qy...@us.ibm.com>
Date:   2016-06-03T23:34:16Z

    Merge branch 'spark-deletefile' into spark-15763

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13506: [SPARK-15763][SQL] Support DELETE FILE command natively

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/13506
  
    @kevinyu98 Can you please close it? It seems like there is not a lot of interest in adding this functionality natively in Spark. If anybody wants this feature, we can reopen it later?




[GitHub] spark issue #13506: [SPARK-15763][SQL] Support DELETE FILE command natively

Posted by kevinyu98 <gi...@git.apache.org>.
Github user kevinyu98 commented on the issue:

    https://github.com/apache/spark/pull/13506
  
    sure




[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13506




[GitHub] spark issue #13506: [SPARK-15763][SQL] Support DELETE FILE command natively

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/13506
  
    @kevinyu98 Could you update the PR and fix merge conflicts? Thanks




[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...

Posted by kevinyu98 <gi...@git.apache.org>.
Github user kevinyu98 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13506#discussion_r65805306
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1441,6 +1441,32 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       }
     
       /**
    +   * Delete a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   */
    +  def deleteFile(path: String): Unit = {
    --- End diff --
    
    Hello Reynold: Sorry, I am afraid I misunderstood your previous comments. Do you mean the user should take the path from the LIST FILE command output and then use that path as the DELETE FILE command's path? If that is the case, the delete code will be much simpler. Thanks for your advice.




[GitHub] spark issue #13506: [SPARK-15763][SQL] Support DELETE FILE command natively

Posted by kevinyu98 <gi...@git.apache.org>.
Github user kevinyu98 commented on the issue:

    https://github.com/apache/spark/pull/13506
  
    @vanzin Hello Marcelo: I am sorry that I didn't notice your update. I have fixed the merge conflicts; could you help review it? Thanks.




[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...

Posted by kevinyu98 <gi...@git.apache.org>.
Github user kevinyu98 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13506#discussion_r65981961
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1441,6 +1441,32 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       }
     
       /**
    +   * Delete a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   */
    +  def deleteFile(path: String): Unit = {
    --- End diff --
    
    I have updated the deleteFile comments to make them clearer. Thanks for reviewing.




[GitHub] spark issue #13506: [SPARK-15763][SQL] Support DELETE FILE command natively

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/13506
  
    We are closing it due to inactivity. Please do reopen it if you want to push it forward. Thanks!




[GitHub] spark issue #13506: [SPARK-15763][SQL] Support DELETE FILE command natively

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13506
  
    Can one of the admins verify this patch?




[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...

Posted by kevinyu98 <gi...@git.apache.org>.
Github user kevinyu98 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13506#discussion_r65799162
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1441,6 +1441,32 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       }
     
       /**
    +   * Delete a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   */
    +  def deleteFile(path: String): Unit = {
    --- End diff --
    
    Hi Reynold: Thanks very much for reviewing the code.
    Yes, it deletes the path from the addedFiles hashmap; the path is used to generate the key that is stored in the map.
    addFile uses this logic to generate the key, so in order to find the same key, I have to use the same logic to generate it.
    For example, for a local file, addFile prepends a 'file:' scheme to the path:
    
    spark.sql("add file /Users/qianyangyu/myfile.txt")
    
    scala> spark.sql("list file").show(false)
    +----------------------------------+
    |Results                           |
    +----------------------------------+
    |file:/Users/qianyangyu/myfile2.txt|
    |file:/Users/qianyangyu/myfile.txt |
    +----------------------------------+
    
    but for a file at a remote location, it just takes the path as-is:
    
    scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
    res17: org.apache.spark.sql.DataFrame = []
    
    scala> spark.sql("list file").show(false)
    +---------------------------------------------+
    |Results                                      |
    +---------------------------------------------+
    |file:/Users/qianyangyu/myfile.txt            |
    |hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt|
    +---------------------------------------------+
    
    If the command is issued from a worker node and adds a local file, the path is added into the NettyStreamManager's hashmap, using that environment's path as the key stored in addedFiles.
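    The normalization described above can be sketched as follows. This is an illustrative sketch only, not the actual Spark implementation; `normalizeKey` is a hypothetical helper name, and the real addFile path handling involves more cases (e.g. Windows paths, fragments):
    
    ```scala
    import java.net.URI
    
    // Sketch of the key-generation rule discussed above: a path with no URI
    // scheme is assumed to be a local file and gets a "file:" scheme prepended;
    // a path that already carries a scheme (hdfs://, http://, ...) is kept as-is.
    def normalizeKey(path: String): String = {
      val uri = new URI(path)
      if (uri.getScheme == null) "file:" + path else path
    }
    ```
    
    Under this rule, a DELETE FILE implementation would have to apply the same normalization before looking up the key, which is why taking the path directly from LIST FILE output sidesteps the problem.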




[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13506#discussion_r65797550
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1441,6 +1441,32 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       }
     
       /**
    +   * Delete a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   */
    +  def deleteFile(path: String): Unit = {
    --- End diff --
    
    this is fairly confusing -- i'd assume this is actually deleting the path given.


