Posted to reviews@spark.apache.org by sujith71955 <gi...@git.apache.org> on 2018/09/11 18:45:29 UTC

[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

GitHub user sujith71955 opened a pull request:

    https://github.com/apache/spark/pull/22396

    [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS path for loadtable command.

    
    What changes were proposed in this pull request?
    Updated the Migration guide for the behavior changes done in the JIRA issue SPARK-23425.
    
    How was this patch tested?
    Manually verified.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujith71955/spark master_newtest

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22396.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22396
    
----
commit c875b16e9ebb0ac1702227ca6d24afa9f9f2d1af
Author: s71955 <su...@...>
Date:   2018-09-11T14:11:55Z

    [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS path for load table command.
    
    What changes were proposed in this pull request
    Updated the Migration guide for the behavior changes done in the JIRA issue SPARK-23425.
    
    How was this patch tested?
    Manually verified.

----


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217897114
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
    --- End diff --
    
    Why do we only mention the space character?
    
    ```
    import java.net.URI
    import org.apache.hadoop.fs.Path
    
    // Built from a URI, %30 is interpreted as percent-encoding, so this likely prints a0b
    val p2 = new Path(new URI("a%30b"))
    print(p2)
    
    // Built from a plain String, the %30 is kept literally, so this likely prints a%30b
    val p = new Path("a%30b")
    print(p)
    ```


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217920696
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
    --- End diff --
    
    @gatorsmile I just used a common encoding (%20) in our example.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96110/
    Test PASSed.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217008531
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv).
    --- End diff --
    
    We should also mention that the old way of escaping special characters will not work in 2.4.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96121 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96121/testReport)** for PR 22396 at commit [`3514e37`](https://github.com/apache/spark/commit/3514e37a6f588dc5ac4b1ecd257f0f19bb9f5c62).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217209344
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version.
    --- End diff --
    
    Is it specific to the local file system?
    Can this text add a quick example of using `?` too? This wildcard syntax is not regex syntax.
    From your previous analysis, spaces in paths didn't work before even if escaped with `%20`. Shouldn't we just say that, additionally, special characters in paths like spaces should work now, and give the example?


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217151516
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv).
    --- End diff --
    
    Taken care of; please let me know if you have any further suggestions. Thanks for looking into this.


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217758371
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
    --- End diff --
    
    Out of curiosity, do we now have the same path syntax for the local fs and HDFS?


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96080 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96080/testReport)** for PR 22396 at commit [`8cbb4cc`](https://github.com/apache/spark/commit/8cbb4cc583913a18990be33b6c31c95c20a7bb82).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96124/
    Test PASSed.


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217920417
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
    --- End diff --
    
    @srowen Sorry Sean, I missed your suggested text; I have updated the message based on your suggestions. I got a bit confused because this PR is a combination of a bug fix and an improvement. :)


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22396


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    thanks, merging to master/2.4!


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217673140
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version.
    --- End diff --
    
    Is it specific to the local file system? << Yes, it is specific to the local file system: in HDFS the user could already provide wildcard characters at the folder level, but for the local file system folder-level wildcards were not supported and an error was thrown. >>
    Can this text add a quick example of using `?` too? << Yes, I added the same. >>



---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217897208
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
    --- End diff --
    
    Agree, the text should be clear that this is only an example of a character that could work in a path now. It might be the most common one. @sujith71955 see my suggested text above.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    @gatorsmile @srowen 


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217926918
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1897,7 +1897,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
    -  - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.
    +  - Since Spark 2.4, The LOAD DATA command supports wildcard characters ? and *, which match any one character, and zero or more characters, respectively. Example: LOAD DATA INPATH '/tmp/folder*/ or LOAD DATA INPATH /tmp/part-?. Special Characters like spaces also now work in paths. Example: LOAD DATA INPATH /tmp/folder name/.
    --- End diff --
    
    The commands and paths should be back-tick-quoted for readability. I think they may be interpreted as markdown otherwise.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    That's fine, and worth adding to the "Docs Text" field in SPARK-23425 as it will then also go in release notes. What about a quick test case for this too?


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96124/testReport)** for PR 22396 at commit [`caeeb0d`](https://github.com/apache/spark/commit/caeeb0d3cf9f0f898a8fa730723005e8c4ef77b5).


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96080 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96080/testReport)** for PR 22396 at commit [`8cbb4cc`](https://github.com/apache/spark/commit/8cbb4cc583913a18990be33b6c31c95c20a7bb82).


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96124/testReport)** for PR 22396 at commit [`caeeb0d`](https://github.com/apache/spark/commit/caeeb0d3cf9f0f898a8fa730723005e8c4ef77b5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217722539
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version.
    --- End diff --
    
    I think this text doesn't describe the change then. The new functionality is that wildcards work at all levels in both local and remote file systems, right? This also says that `%20` escaping used to work, but your results show it didn't. That doesn't seem like a change in behavior. This text also has typos and spacing problems. To be clear, here is the text I suggest:
    
    Since Spark 2.4, the `LOAD DATA` command supports wildcard characters `?` and `*`, which match any one character, and zero or more characters, respectively. Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH '/tmp/part-?'`. Characters like spaces also now work in paths: `LOAD DATA INPATH '/tmp/folder name/'`.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96110 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96110/testReport)** for PR 22396 at commit [`b34b962`](https://github.com/apache/spark/commit/b34b96208dc86e9642dbc65e33a643df7b7ee406).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    ok to test


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Any idea why some parts of the text are highlighted in blue?


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96121 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96121/testReport)** for PR 22396 at commit [`3514e37`](https://github.com/apache/spark/commit/3514e37a6f588dc5ac4b1ecd257f0f19bb9f5c62).


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96123 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96123/testReport)** for PR 22396 at commit [`95e6831`](https://github.com/apache/spark/commit/95e68318b30fb94af170a1150ac5db2f61e96f08).


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217672824
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/). Now onwards normal space convention can be used in folder/file names (e.g. LOAD DATA INPATH 'tmp/folderName/file Name.csv), Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version.
    --- End diff --
    
    << From your previous analysis, spaces in paths didn't work before even if escaped with %20. Shouldn't we just say that? >>
    @srowen I modified the statement; please recheck and let me know if you have any suggestions.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96110 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96110/testReport)** for PR 22396 at commit [`b34b962`](https://github.com/apache/spark/commit/b34b96208dc86e9642dbc65e33a643df7b7ee406).


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96123/
    Test PASSed.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217802972
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
    --- End diff --
    
    @cloud-fan We follow the same path syntax as older versions for the LOAD command. The difference is that in older versions the user could not use wildcard characters at the folder level of the local file system; the new implementation supports that, and HDFS accepts the same syntax, so the behavior is now consistent. All of the usages I mentioned apply to both the local and HDFS file systems, making the usage more consistent than in older versions (see the sketch below).
    
    For more details please refer to the PR below, and let me know if you need any clarification. Thanks.
    https://github.com/apache/spark/pull/20611
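    
    A minimal, illustrative sketch of the now-consistent behavior (the table name, paths, and HDFS host below are hypothetical, not taken from this PR):
    
    ```
    // Assumes a Hive-enabled SparkSession named `spark` and an existing table `t`.
    
    // Folder-level wildcards now work for the local file system as well.
    spark.sql("LOAD DATA LOCAL INPATH '/tmp/folder*/' INTO TABLE t")
    
    // The same wildcard syntax applies to HDFS paths.
    spark.sql("LOAD DATA INPATH 'hdfs://namenode:9000/tmp/folder*/' INTO TABLE t")
    
    // The '?' wildcard matches any single character in a file name.
    spark.sql("LOAD DATA INPATH 'hdfs://namenode:9000/tmp/folderName/fileName?.csv' INTO TABLE t")
    ```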


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96121/
    Test FAILed.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96080/
    Test PASSed.


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by sujith71955 <gi...@git.apache.org>.
Github user sujith71955 commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    > That's fine, and worth adding to the "Docs Text" field in SPARK-23425 as it will then also go in release notes. What about a quick test case for this too?
    
    Added a UT verifying the use case of a file name containing a space in a LOAD command. Please let me know if you have any suggestions.
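    
    For reference, a rough, self-contained sketch of what such a test could look like (the object, table, and file names here are illustrative assumptions, not the actual UT added in this PR):
    
    ```
    import java.io.{File, PrintWriter}
    
    import org.apache.spark.sql.SparkSession
    
    // Illustrative sketch only; the real UT lives in Spark's own test suites.
    object LoadDataSpaceInNameExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-data-space-in-name")
          .master("local[1]")
          .enableHiveSupport() // LOAD DATA requires Hive support
          .getOrCreate()
    
        // Create a CSV file whose name contains a space.
        val dir = new File(System.getProperty("java.io.tmpdir"), "load_data_example")
        dir.mkdirs()
        val dataFile = new File(dir, "file name.csv")
        val writer = new PrintWriter(dataFile)
        try {
          writer.println("1,one")
          writer.println("2,two")
        } finally {
          writer.close()
        }
    
        spark.sql("CREATE TABLE IF NOT EXISTS load_space_test (id INT, name STRING) " +
          "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
    
        // Since Spark 2.4 the literal space works; no %20 escaping is needed.
        spark.sql(s"LOAD DATA LOCAL INPATH '${dataFile.getAbsolutePath}' INTO TABLE load_space_test")
    
        assert(spark.table("load_space_test").count() == 2)
        spark.stop()
      }
    }
    ```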


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22396
  
    **[Test build #96123 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96123/testReport)** for PR 22396 at commit [`95e6831`](https://github.com/apache/spark/commit/95e68318b30fb94af170a1150ac5db2f61e96f08).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org