You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by dhunziker <gi...@git.apache.org> on 2017/02/05 20:29:05 UTC

[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

GitHub user dhunziker opened a pull request:

    https://github.com/apache/spark/pull/16812

    [SPARK-19465][SQL] Added options for custom boolean values in CSV

    ## What changes were proposed in this pull request?
    
    It adds trueValue and falseValue options for customising the values used for Boolean types in CSV files. The default behaviour remains unchanged but the two new options allow for more flexibility. As it is currently the case, the values are matched ignoring their case.  
    
    ## How was this patch tested?
    
    Extended existing unit tests for custom Boolean values and added a new test for asserting the IllegalArgumentException thrown by Scala's StringLike.toBoolean function.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhunziker/spark feature/SPARK-19465

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16812.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16812
    
----
commit 0926626f3ab6411e405fb31c4a291d49092a19a4
Author: Dennis Hunziker <de...@gmail.com>
Date:   2017-02-05T20:06:49Z

    [SPARK-19465][SQL] Added options for custom boolean values in CSV

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16812#discussion_r99935530
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -139,6 +140,20 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         assert(result.schema === expectedSchema)
       }
     
    +  test("test inferring booleans with custom values") {
    +    val result = spark.read
    +      .format("csv")
    +      .option("header", "true")
    --- End diff --
    
    Could you use `true` (boolean literal) instead to promote the more type-safe approach?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    Could you give some systems that have such a feature? Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16812#discussion_r99934599
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -110,7 +110,11 @@ private[csv] class UnivocityParser(
           }
     
         case _: BooleanType => (d: String) =>
    -      nullSafeDatum(d, name, nullable, options)(_.toBoolean)
    +      nullSafeDatum(d, name, nullable, options) { datum =>
    +        if (datum.equalsIgnoreCase(options.trueValue)) true
    --- End diff --
    
    Is there any reason why you compare `datum` to `options.trueValue` and not vice versa as it's in `CSVInferSchema.scala` above?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

Posted by dhunziker <gi...@git.apache.org>.
Github user dhunziker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16812#discussion_r99953924
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -139,6 +140,20 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         assert(result.schema === expectedSchema)
       }
     
    +  test("test inferring booleans with custom values") {
    +    val result = spark.read
    +      .format("csv")
    +      .option("header", "true")
    --- End diff --
    
    Sure, will make the change. If preferred, I can change the whole suite to the typesafe options as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    I still don't think need this since the workaround is easy. If other committers find it worth, I won't object.
    If there are no interests fro this PR afterwards, I would just close this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

Posted by dhunziker <gi...@git.apache.org>.
Github user dhunziker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16812#discussion_r99953321
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala ---
    @@ -73,6 +73,12 @@ class CSVInferSchemaSuite extends SparkFunSuite {
         assert(CSVInferSchema.inferField(LongType, "2015-08 14:49:00", options) == StringType)
       }
     
    +  test("Boolean field types are inferred correctly from custom value") {
    +    val options = new CSVOptions(Map("trueValue" -> "yes"))
    +    assert(CSVInferSchema.inferField(BooleanType, "yES", options) == BooleanType)
    --- End diff --
    
    No assert in this suite uses `===`. Should I make this consistent and replace all occurrences of `==` in that suite with `===` or just the one in the lines I've changed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by dhunziker <gi...@git.apache.org>.
Github user dhunziker commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    That would remain a workaround though. The uniVocity parser for boolean supports this as well: https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/conversions/BooleanConversion.java. It's also been raised as a ticket against the original spark-csv so there seems to be at least some interest. Personally, I think this can help a lot when integrating Spark with other systems (i.e. some DB exports where modeling booleans as flags is common) similar to having ways of specifying the format of a date or null value.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by dhunziker <gi...@git.apache.org>.
Github user dhunziker commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    > This can be easily worked around, no?
    
    It can be worked around, losing the boolean type by converting to string in the process though. With any workaround it'll be better to get eventually fixed/enhanced, especially as it was already raised against the original CSV datasource years ago. I did comment on this before and still think it's a valid improvement given this already exists for null, nan, inf and date values and the other reasons mentioned above.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by dhunziker <gi...@git.apache.org>.
Github user dhunziker commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    Oracle doesn't have boolean so it's usually modelled as char(1) with Y/N, Sybase doesn't have boolean either but bit which is 1/0. PosgreSQL supports a range of values (https://www.postgresql.org/docs/10/static/datatype-boolean.html). In regards to CSV parsers I found the same feature in SuperCSV and uniVocity (linked above). I think it's a very useful and common feature similar to custom date formats or null strings, but as mentioned before, not something that can't be worked around.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    Hm, do we really need this? I think we could simply work around this by `case` & `when`, `if` or other expressions with a projection after loading. Or using `csv(ds: Dataset[String])` API after preprocessing in the source dataset.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16812#discussion_r99935349
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala ---
    @@ -73,6 +73,12 @@ class CSVInferSchemaSuite extends SparkFunSuite {
         assert(CSVInferSchema.inferField(LongType, "2015-08 14:49:00", options) == StringType)
       }
     
    +  test("Boolean field types are inferred correctly from custom value") {
    +    val options = new CSVOptions(Map("trueValue" -> "yes"))
    +    assert(CSVInferSchema.inferField(BooleanType, "yES", options) == BooleanType)
    --- End diff --
    
    What's the reason for `==` not `===` as it's in the other asserts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

Posted by dhunziker <gi...@git.apache.org>.
Github user dhunziker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16812#discussion_r99951876
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -110,7 +110,11 @@ private[csv] class UnivocityParser(
           }
     
         case _: BooleanType => (d: String) =>
    -      nullSafeDatum(d, name, nullable, options)(_.toBoolean)
    +      nullSafeDatum(d, name, nullable, options) { datum =>
    +        if (datum.equalsIgnoreCase(options.trueValue)) true
    --- End diff --
    
    Just seems more obvious that `datum` is guaranteed to be not null given the name of `nullSafeDatum`. Probably should switch around the values in `CSVInferSchema.scala`, since there, `field` is null checked at the beginning of the parsing chain. Do you prefer one over the other?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16812
  
    This can be easily worked around, no?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org