You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mmolimar <gi...@git.apache.org> on 2018/08/25 17:59:40 UTC

[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

GitHub user mmolimar opened a pull request:

    https://github.com/apache/spark/pull/22234

    [SPARK-25241][SQL] Configurable empty values when reading/writing CSV files

    ## What changes were proposed in this pull request?
    There is an option in the CSV parser to set values when we have empty values in the CSV files or in our dataframes.
    Currently, this option cannot be configured and always sets a default value (empty string for reading and `""` for writing).
    This PR is about enabling a new CSV option in the reader/writer to set custom empty values when reading/writing CSV files.
    
    ## How was this patch tested?
    The changes were tested by CSVSuite adding two unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mmolimar/spark SPARK-25241

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22234
    
----
commit 8b5180021d246ab2fdf0824c01b9f180136837ce
Author: Mario Molina <mm...@...>
Date:   2018-08-25T17:42:03Z

    Configurable empty values when reading/writing CSV files

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    **[Test build #95259 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95259/testReport)** for PR 22234 at commit [`8b51800`](https://github.com/apache/spark/commit/8b5180021d246ab2fdf0824c01b9f180136837ce).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by mmolimar <gi...@git.apache.org>.
Github user mmolimar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212850822
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -117,6 +117,9 @@ class CSVOptions(
     
       val nullValue = parameters.getOrElse("nullValue", "")
     
    +  val emptyValueInRead = parameters.getOrElse("emptyValue", "")
    --- End diff --
    
    I though that as well. Just for the shake of providing backwards compatibility as we already have in `ignoreLeadingWhiteSpaceInRead` and `ignoreLeadingWhiteSpaceFlagInWrite` I implemented that in that way.
    What do you say?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212825429
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -345,11 +345,11 @@ def text(self, paths, wholetext=False, lineSep=None):
         @since(2.0)
         def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None,
                 comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None,
    -            ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None,
    -            negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None,
    -            maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None,
    -            columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None,
    -            samplingRatio=None, enforceSchema=None):
    +            ignoreTrailingWhiteSpace=None, nullValue=None, emptyValue=None, nanValue=None,
    --- End diff --
    
    We should add new parameter at the end. +1


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    From my understanding, yea. The problem here is sounds like ambiguity in empty strings since they can be interpreted as empty strings and also `null`. To me, this is actually rather a bug since we can't distinguish, and don't respect the empty value. If empty strings are written, they should be read as empty strings.
    
    This PR proposes an ability explicitly set the empty value to work around the behaviour change.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95274/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212820281
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -345,11 +345,11 @@ def text(self, paths, wholetext=False, lineSep=None):
         @since(2.0)
         def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None,
                 comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None,
    -            ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None,
    -            negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None,
    -            maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None,
    -            columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None,
    -            samplingRatio=None, enforceSchema=None):
    +            ignoreTrailingWhiteSpace=None, nullValue=None, emptyValue=None, nanValue=None,
    --- End diff --
    
    It should be put at the last; otherwise, it's going to break existing Python app when the arguments are given positionally.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22234


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by mmolimar <gi...@git.apache.org>.
Github user mmolimar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r216337792
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
    @@ -91,9 +91,10 @@ abstract class CSVDataSource extends Serializable {
           }
     
           row.zipWithIndex.map { case (value, index) =>
    -        if (value == null || value.isEmpty || value == options.nullValue) {
    -          // When there are empty strings or the values set in `nullValue`, put the
    -          // index as the suffix.
    +        if (value == null || value.isEmpty || value == options.nullValue ||
    +          value == options.emptyValueInRead) {
    --- End diff --
    
    Do I revert these both changes @HyukjinKwon then?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    @MaxGekk Could you take this PR over? I think we need to merge this to Spark 2.4. Users can set the behaviors to the previous one by this new conf `emptyValue`, if needed. Also update the migration guide about the behavior change and explain how to set `emptyValue`. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95271/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Have we documented the behavior changes in the migration guide? If not, can we do it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Did we introduce any behavior change in https://github.com/apache/spark/pull/21273? Does this PR resolve it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    **[Test build #95270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95270/testReport)** for PR 22234 at commit [`3d3f178`](https://github.com/apache/spark/commit/3d3f178a55a8fdb4630916252866a68a98ae17cd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by mmolimar <gi...@git.apache.org>.
Github user mmolimar commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    @MaxGekk I added what you suggested as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212850498
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -117,6 +117,9 @@ class CSVOptions(
     
       val nullValue = parameters.getOrElse("nullValue", "")
     
    +  val emptyValueInRead = parameters.getOrElse("emptyValue", "")
    --- End diff --
    
    I would just call it `emptyValue` for consistency with other options here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212851032
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -117,6 +117,9 @@ class CSVOptions(
     
       val nullValue = parameters.getOrElse("nullValue", "")
     
    +  val emptyValueInRead = parameters.getOrElse("emptyValue", "")
    --- End diff --
    
    I had to name them differently names because the default values are different. We can change the variable later if the same thing should happen


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    **[Test build #95274 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95274/testReport)** for PR 22234 at commit [`0bcdb2a`](https://github.com/apache/spark/commit/0bcdb2a7f2299add11fd78a551027572f80f1ae7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95270/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    **[Test build #95271 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95271/testReport)** for PR 22234 at commit [`bb28db9`](https://github.com/apache/spark/commit/bb28db976fad9316f68a74da4955d08c5b7abaf2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Should the new option be taken into account there: https://github.com/apache/spark/blob/b461acb2d90b734393c27fe7b359e2f2d297b8d4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L94 ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    **[Test build #95259 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95259/testReport)** for PR 22234 at commit [`8b51800`](https://github.com/apache/spark/commit/8b5180021d246ab2fdf0824c01b9f180136837ce).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    **[Test build #95271 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95271/testReport)** for PR 22234 at commit [`bb28db9`](https://github.com/apache/spark/commit/bb28db976fad9316f68a74da4955d08c5b7abaf2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    @gatorsmile @HyukjinKwon Please, take a look at #22367 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by mmolimar <gi...@git.apache.org>.
Github user mmolimar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212842706
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -345,11 +345,11 @@ def text(self, paths, wholetext=False, lineSep=None):
         @since(2.0)
         def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None,
                 comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None,
    -            ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None,
    -            negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None,
    -            maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None,
    -            columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None,
    -            samplingRatio=None, enforceSchema=None):
    +            ignoreTrailingWhiteSpace=None, nullValue=None, emptyValue=None, nanValue=None,
    --- End diff --
    
    Done!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    **[Test build #95274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95274/testReport)** for PR 22234 at commit [`0bcdb2a`](https://github.com/apache/spark/commit/0bcdb2a7f2299add11fd78a551027572f80f1ae7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    **[Test build #95270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95270/testReport)** for PR 22234 at commit [`3d3f178`](https://github.com/apache/spark/commit/3d3f178a55a8fdb4630916252866a68a98ae17cd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Seems okay but I or someone else should take a closer look before getting this in.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    > cc @MaxGekk for a followup
    
    @HyukjinKwon Do you mean to update migration guide in master and probably in Spark 2.4? I don't think this should be considered as a bug because current version and previous versions of Spark can read saved CSV files correctly. Yes, for now empty strings are saved as `""` and `null`s as nothing but this is supposed to be to distinguish empty string and null in read. And produced CSV files are valid, and they can be read by any mature CSV libs.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212850446
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -457,9 +459,9 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
                 schema=schema, sep=sep, encoding=encoding, quote=quote, escape=escape, comment=comment,
                 header=header, inferSchema=inferSchema, ignoreLeadingWhiteSpace=ignoreLeadingWhiteSpace,
                 ignoreTrailingWhiteSpace=ignoreTrailingWhiteSpace, nullValue=nullValue,
    -            nanValue=nanValue, positiveInf=positiveInf, negativeInf=negativeInf,
    -            dateFormat=dateFormat, timestampFormat=timestampFormat, maxColumns=maxColumns,
    -            maxCharsPerColumn=maxCharsPerColumn,
    +            emptyValue=emptyValue, nanValue=nanValue, positiveInf=positiveInf,
    --- End diff --
    
    I would put this at the end as well for readability.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Oh no I mean we fixed a bug..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95259/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212851116
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -117,6 +117,9 @@ class CSVOptions(
     
       val nullValue = parameters.getOrElse("nullValue", "")
     
    +  val emptyValueInRead = parameters.getOrElse("emptyValue", "")
    --- End diff --
    
    I had to name them differently names because the default values are different. Ah, yea then it makes sense here. I rushed to read.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r214572617
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -79,7 +79,8 @@ private[csv] object CSVInferSchema {
        * point checking if it is an Int, as the final type must be Double or higher.
        */
       def inferField(typeSoFar: DataType, field: String, options: CSVOptions): DataType = {
    -    if (field == null || field.isEmpty || field == options.nullValue) {
    +    if (field == null || field.isEmpty || field == options.nullValue ||
    +      field == options.emptyValueInRead) {
    --- End diff --
    
    I wouldn't do this for now. It needs another review iteration. Let's revert this back.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by mmolimar <gi...@git.apache.org>.
Github user mmolimar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r212851409
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -457,9 +459,9 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
                 schema=schema, sep=sep, encoding=encoding, quote=quote, escape=escape, comment=comment,
                 header=header, inferSchema=inferSchema, ignoreLeadingWhiteSpace=ignoreLeadingWhiteSpace,
                 ignoreTrailingWhiteSpace=ignoreTrailingWhiteSpace, nullValue=nullValue,
    -            nanValue=nanValue, positiveInf=positiveInf, negativeInf=negativeInf,
    -            dateFormat=dateFormat, timestampFormat=timestampFormat, maxColumns=maxColumns,
    -            maxCharsPerColumn=maxCharsPerColumn,
    +            emptyValue=emptyValue, nanValue=nanValue, positiveInf=positiveInf,
    --- End diff --
    
    Done!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22234#discussion_r214572625
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
    @@ -91,9 +91,10 @@ abstract class CSVDataSource extends Serializable {
           }
     
           row.zipWithIndex.map { case (value, index) =>
    -        if (value == null || value.isEmpty || value == options.nullValue) {
    -          // When there are empty strings or the values set in `nullValue`, put the
    -          // index as the suffix.
    +        if (value == null || value.isEmpty || value == options.nullValue ||
    +          value == options.emptyValueInRead) {
    --- End diff --
    
    ditto for excluding.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    This is rather a quite corner case (see the elaborated cases in the JIRA [SPARK-17916](https://issues.apache.org/jira/browse/SPARK-17916)) and there's ambiguity to treat this as a bug or a proper behaviour change; however, I don't object if this can be worth enough as something that should be mentioned.
    
    cc @MaxGekk for a followup


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22234
  
    @mmolimar, let's leave this closed since the newer one is open BTW. You will be credited as a primary author of #22367 anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org