You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by maropu <gi...@git.apache.org> on 2018/06/25 04:38:45 UTC

[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...

GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/21631

    [SPARK-24645][SQL] Skip parsing when csvColumnPruning enabled and partitions scanned only

    ## What changes were proposed in this pull request?
    In the master, when `csvColumnPruning` enabled and partitions scanned only, it throws an exception below;
    
    ```
    scala> val dir = "/tmp/spark-csv/csv"
    scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
    scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
    18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
    java.lang.NullPointerException
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197)  
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
            at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
            ...
    ```
    This pr modified code to skip CSV parsing in the case.
    
    ## How was this patch tested?
    Added tests in `CSVSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-24645

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21631.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21631
    
----
commit 59a7f142ae9c83c76c2bfbff2962c071fc586122
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-06-25T04:18:37Z

    fix

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    **[Test build #92284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92284/testReport)** for PR 21631 at commit [`c7362dc`](https://github.com/apache/spark/commit/c7362dce248726598b63ea4fd1b45df19d5e9861).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/451/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    So you mean it's a bug in Univocity? that's another fix for a bug existing in Univocity then. We could work around this bug if it's clear that's a bug. I would suggest to open a bug there if we are not sure in any event. FWIW, their support is quite responsive when the issue is reported with a clear reproducer and descriptions. this issue sounds slightly orthogonal to the JIRA itself though.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Here is the test for uniVocity parser: https://github.com/MaxGekk/univocity_tests . For the first line, `parseLine` outputs empty array but `null`s for the next calls. What do you think should I create an issue for uniVocity? I believe output should be consistent and all calls should return the same at least. And I would expect empty array for all calls.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    cc: @HyukjinKwon @MaxGekk 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    v2.5.9 also have the same behaviour? Anyway, it'd be better to ask the author ;)  I asked before and I got quick response.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92282/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    **[Test build #92293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92293/testReport)** for PR 21631 at commit [`c7362dc`](https://github.com/apache/spark/commit/c7362dce248726598b63ea4fd1b45df19d5e9861).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    @HyukjinKwon BTW, can you check this?
    @MaxGekk Probably, I feel you'd be better to file a new jira for the point you're looking into.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    LGTM.
    
    @MaxGekk please take a following action. Will help and check if it's needed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92284/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    @HyukjinKwon sure, I'll do 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/449/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21631#discussion_r197679873
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1602,4 +1602,14 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
         assert(testAppender2.events.asScala
           .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
       }
    +
    +  test("SPARK-24645 skip parsing when columnPruning enabled and partitions scanned only") {
    +    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
    +      withTempPath { path =>
    +        val dir = path.getAbsolutePath
    +        spark.range(10).selectExpr("id % 2 AS p", "id").write.partitionBy("p").csv(dir)
    +        spark.read.csv(dir).selectExpr("sum(p)").collect()
    --- End diff --
    
    oh, I forgot to add `assert` here. I'll update soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21631#discussion_r197973962
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -183,11 +183,19 @@ class UnivocityParser(
         }
       }
     
    +  private lazy val doParse = if (schema.nonEmpty) {
    --- End diff --
    
    recheck: it seems `parse` is called in executor sides only, so we can add `lazy` here to avoid unnecessary instantiation?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21631#discussion_r197961932
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -183,11 +183,19 @@ class UnivocityParser(
         }
       }
     
    +  private lazy val doParse = if (schema.nonEmpty) {
    --- End diff --
    
    yea, I missed. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    **[Test build #92284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92284/testReport)** for PR 21631 at commit [`c7362dc`](https://github.com/apache/spark/commit/c7362dce248726598b63ea4fd1b45df19d5e9861).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92317/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    yea, I think this is regressions because I checked that the query above passed before[ this commit](https://github.com/apache/spark/commit/64fad0b519cf35b8c0a0dec18dd3df9488a5ed25#diff-d19881aceddcaa5c60620fdcda99b4c4)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    **[Test build #92317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92317/testReport)** for PR 21631 at commit [`39811a3`](https://github.com/apache/spark/commit/39811a34a5f0fb0f2c97bc096e01129b593315fe).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21631#discussion_r197887650
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -183,11 +183,19 @@ class UnivocityParser(
         }
       }
     
    +  private lazy val doParse = if (schema.nonEmpty) {
    --- End diff --
    
    Do you really need lazy here. In most cases, time interval between calling of the constructor and the parse() method is pretty short. I don't think we win something here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    I have found the places inside of UnivocityParser from where the `null` comes. It is interesting that `null` is returned for valid input string `"8"`. See the screenshot: 
    <img width="1398" alt="npe_parseline" src="https://user-images.githubusercontent.com/1580697/41853682-986d2ec6-788e-11e8-8573-89e00d1e945c.png">
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    @MaxGekk, thanks. mind opening a PR to  upgrade?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    > do we still hit the bug when parsing csv data?
    
    I have checked uniVocity 2.7.2, there is no problem on this version.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    As I described in https://github.com/apache/spark/pull/21625#discussion_r197679077, I found another bug? (the case where `spark.sql.csv.parser.columnPruning.enabled=false`) when working on this pr;
    ```
    ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
    scala> val dir = "/tmp/spark-csv/csv"
    scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
    scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
    18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
    java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
            at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
            at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
            ...
    ```
    I worked on this fix and made a patch to fix this; https://github.com/apache/spark/compare/master...maropu:SPARK-24645-2


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    > please take a following action. Will help and check if it's needed.
    
    I have opened the issue for uniVocity parser: https://github.com/uniVocity/univocity-parsers/issues/250


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/469/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    > So you mean it's a bug in Univocity?
    
    No, I mean we don't handle `null` from Univocity's `parseLine` at all (in another situations), and we just propagate `NullPointerException` to an user instead of producing of `BadRecordException`. The `BadRecordException` can be handled differently there https://github.com/apache/spark/blob/1a527bde49753535e6b86c18751f50c19a55f0d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L59-L73 depend on current mode. In `DropMalformedMode` mode, for example, we could just drop the line instead of propagating NPE to user space.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Wouldn't it be better to check schema instead of value for per record?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Looking at the `NullPointerException`, it comes from the line:
    ```scala
    if (tokens.length != schema.length) {
    ```
    where `tokens` is null returned by `parseLine` of `UnivocityParser`. The `parseLine` method is implemented in the way that it can return `null` in some cases:
    https://github.com/uniVocity/univocity-parsers/blob/v2.6.4/src/main/java/com/univocity/parsers/common/AbstractParser.java#L636-L679
    
    I think the right fix would be to check result of `parseLine` on `null` explicitly.
    
    Just in case, I have checked the commit before my changes - it also doesn't contain checking returned value of `parseLine` on `null`. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    **[Test build #92293 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92293/testReport)** for PR 21631 at commit [`c7362dc`](https://github.com/apache/spark/commit/c7362dce248726598b63ea4fd1b45df19d5e9861).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    **[Test build #92282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92282/testReport)** for PR 21631 at commit [`59a7f14`](https://github.com/apache/spark/commit/59a7f142ae9c83c76c2bfbff2962c071fc586122).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    @MaxGekk yea, I noticed that behaviour. Probably, in case we set an empty array in `CommonSettings.selectIndexes`, it seems `UnivocityParser` returns null for valid input? I'm not sure setting an empty array there is valid. Anyway, to fix this bug, I think we don't need to check values record-by-record because `reuiqredSchema` is empty as @HyukjinKwon said.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/459/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    The bug has been already fixed in uniVocity `2.6.5-SNAPSHOT`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    @maropu Could you confirm whether these two bugs are regressions in the master branch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92293/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    **[Test build #92317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92317/testReport)** for PR 21631 at commit [`39811a3`](https://github.com/apache/spark/commit/39811a34a5f0fb0f2c97bc096e01129b593315fe).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    **[Test build #92282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92282/testReport)** for PR 21631 at commit [`59a7f14`](https://github.com/apache/spark/commit/59a7f142ae9c83c76c2bfbff2962c071fc586122).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    oh, super quick fix ;) Thanks, @MaxGekk 
    In the master, do we still hit the bug when parsing csv data?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    yea, I checked the two queries with/without column pruning in the master;
    ```
    ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=true (default)
    scala> val dir = "/tmp/spark-csv/csv"
    
    // NG
    scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
    scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
    18/06/25 16:16:30 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
    java.lang.NullPointerException
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scal
    a:193)  
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:305)
    
    // OK
    scala> spark.range(10).selectExpr("id % 2 AS p", "id AS c0", "id AS c1").write.mode("overwrite").partitionBy("p").csv(dir)
    scala> spark.read.csv(dir).selectExpr("sum(p)", "avg(_c0)").collect()
    res11: Array[org.apache.spark.sql.Row] = Array([5,4.5])
    
    
    
    
    ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
    // NG
    scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
    scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
    Caused by: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
      at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
      at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
      at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:195)
      at org.apache.spark.sql.catalyst.expressions.JoinedRow.getInt(JoinedRow.scala:82)
    
    // NG
    scala> spark.range(10).selectExpr("id % 2 AS p", "id AS c0", "id AS c1").write.mode("overwrite").partitionBy("p").csv(dir)
    scala> spark.read.csv(dir).selectExpr("sum(p)", "avg(_c0)").collect()
    18/06/25 16:23:55 ERROR TaskSetManager: Task 0 in stage 14.0 failed 1 times; aborting job
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 32, loca
    lhost, executor driver): java.lang.ClassCastException
    Caused by: java.lang.ClassCastException
    ```
    The two queries is ok in v2.3.1 and in the master before the commit.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    It seems `null` for Univocity's `parserLine` is normal way to indicate about an error. Should we handle `null`s and throw `BadRecordException` instead of propagating NPE to user's app?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Both?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    > v2.5.9 also have the same behaviour?
    
    yes, it is the same.
    
    > Anyway, it'd be better to ask the author ;) I asked before and I got quick response.
    
    ok. I will create an issue for uniVocity.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    > But first we should be sure if it's a bug or not for this anyway.
    
    I will try to reproduce it on small example without Spark. I am not sure what the expected behavior should be if set of selected columns is empty. Should the `parseLine` return `null`, empty `array` of `String`s or something else. Will see...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21631
  
    I mean `null is returned for valid input string "8"`. I thought this is a bug. If there's valid case returning `null`, yea we should handle `null` of course but the case you mentioned sounds like a bug.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21631


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org