You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by maropu <gi...@git.apache.org> on 2018/06/25 04:38:45 UTC
[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...
GitHub user maropu opened a pull request:
https://github.com/apache/spark/pull/21631
[SPARK-24645][SQL] Skip parsing when csvColumnPruning enabled and partitions scanned only
## What changes were proposed in this pull request?
In the master, when `csvColumnPruning` enabled and partitions scanned only, it throws an exception below;
```
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
...
```
This pr modified code to skip CSV parsing in the case.
## How was this patch tested?
Added tests in `CSVSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark SPARK-24645
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21631.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21631
----
commit 59a7f142ae9c83c76c2bfbff2962c071fc586122
Author: Takeshi Yamamuro <ya...@...>
Date: 2018-06-25T04:18:37Z
fix
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21631
**[Test build #92284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92284/testReport)** for PR 21631 at commit [`c7362dc`](https://github.com/apache/spark/commit/c7362dce248726598b63ea4fd1b45df19d5e9861).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/451/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21631
So you mean it's a bug in Univocity? that's another fix for a bug existing in Univocity then. We could work around this bug if it's clear that's a bug. I would suggest to open a bug there if we are not sure in any event. FWIW, their support is quite responsive when the issue is reported with a clear reproducer and descriptions. this issue sounds slightly orthogonal to the JIRA itself though.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
Here is the test for uniVocity parser: https://github.com/MaxGekk/univocity_tests . For the first line, `parseLine` outputs empty array but `null`s for the next calls. What do you think should I create an issue for uniVocity? I believe output should be consistent and all calls should return the same at least. And I would expect empty array for all calls.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
cc: @HyukjinKwon @MaxGekk
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
v2.5.9 also have the same behaviour? Anyway, it'd be better to ask the author ;) I asked before and I got quick response.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92282/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21631
**[Test build #92293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92293/testReport)** for PR 21631 at commit [`c7362dc`](https://github.com/apache/spark/commit/c7362dce248726598b63ea4fd1b45df19d5e9861).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
@HyukjinKwon BTW, can you check this?
@MaxGekk Probably, I feel you'd be better to file a new jira for the point you're looking into.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21631
LGTM.
@MaxGekk please take a following action. Will help and check if it's needed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92284/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
@HyukjinKwon sure, I'll do
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/449/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21631#discussion_r197679873
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1602,4 +1602,14 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(testAppender2.events.asScala
.exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
}
+
+ test("SPARK-24645 skip parsing when columnPruning enabled and partitions scanned only") {
+ withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+ withTempPath { path =>
+ val dir = path.getAbsolutePath
+ spark.range(10).selectExpr("id % 2 AS p", "id").write.partitionBy("p").csv(dir)
+ spark.read.csv(dir).selectExpr("sum(p)").collect()
--- End diff --
oh, I forgot to add `assert` here. I'll update soon.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21631#discussion_r197973962
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -183,11 +183,19 @@ class UnivocityParser(
}
}
+ private lazy val doParse = if (schema.nonEmpty) {
--- End diff --
recheck: it seems `parse` is called in executor sides only, so we can add `lazy` here to avoid unnecessary instantiation?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21631#discussion_r197961932
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -183,11 +183,19 @@ class UnivocityParser(
}
}
+ private lazy val doParse = if (schema.nonEmpty) {
--- End diff --
yea, I missed. Thanks!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21631
**[Test build #92284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92284/testReport)** for PR 21631 at commit [`c7362dc`](https://github.com/apache/spark/commit/c7362dce248726598b63ea4fd1b45df19d5e9861).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92317/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
yea, I think this is regressions because I checked that the query above passed before[ this commit](https://github.com/apache/spark/commit/64fad0b519cf35b8c0a0dec18dd3df9488a5ed25#diff-d19881aceddcaa5c60620fdcda99b4c4)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21631
**[Test build #92317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92317/testReport)** for PR 21631 at commit [`39811a3`](https://github.com/apache/spark/commit/39811a34a5f0fb0f2c97bc096e01129b593315fe).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21631#discussion_r197887650
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -183,11 +183,19 @@ class UnivocityParser(
}
}
+ private lazy val doParse = if (schema.nonEmpty) {
--- End diff --
Do you really need lazy here. In most cases, time interval between calling of the constructor and the parse() method is pretty short. I don't think we win something here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
I have found the places inside of UnivocityParser from where the `null` comes. It is interesting that `null` is returned for valid input string `"8"`. See the screenshot:
<img width="1398" alt="npe_parseline" src="https://user-images.githubusercontent.com/1580697/41853682-986d2ec6-788e-11e8-8573-89e00d1e945c.png">
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21631
@MaxGekk, thanks. mind opening a PR to upgrade?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
> do we still hit the bug when parsing csv data?
I have checked uniVocity 2.7.2, there is no problem on this version.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
As I described in https://github.com/apache/spark/pull/21625#discussion_r197679077, I found another bug? (the case where `spark.sql.csv.parser.columnPruning.enabled=false`) when working on this pr;
```
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
...
```
I worked on this fix and made a patch to fix this; https://github.com/apache/spark/compare/master...maropu:SPARK-24645-2
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
> please take a following action. Will help and check if it's needed.
I have opened the issue for uniVocity parser: https://github.com/uniVocity/univocity-parsers/issues/250
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/469/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
> So you mean it's a bug in Univocity?
No, I mean we don't handle `null` from Univocity's `parseLine` at all (in another situations), and we just propagate `NullPointerException` to an user instead of producing of `BadRecordException`. The `BadRecordException` can be handled differently there https://github.com/apache/spark/blob/1a527bde49753535e6b86c18751f50c19a55f0d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L59-L73 depend on current mode. In `DropMalformedMode` mode, for example, we could just drop the line instead of propagating NPE to user space.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21631
Wouldn't it be better to check schema instead of value for per record?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
Looking at the `NullPointerException`, it comes from the line:
```scala
if (tokens.length != schema.length) {
```
where `tokens` is null returned by `parseLine` of `UnivocityParser`. The `parseLine` method is implemented in the way that it can return `null` in some cases:
https://github.com/uniVocity/univocity-parsers/blob/v2.6.4/src/main/java/com/univocity/parsers/common/AbstractParser.java#L636-L679
I think the right fix would be to check result of `parseLine` on `null` explicitly.
Just in case, I have checked the commit before my changes - it also doesn't contain checking returned value of `parseLine` on `null`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21631
**[Test build #92293 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92293/testReport)** for PR 21631 at commit [`c7362dc`](https://github.com/apache/spark/commit/c7362dce248726598b63ea4fd1b45df19d5e9861).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21631
**[Test build #92282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92282/testReport)** for PR 21631 at commit [`59a7f14`](https://github.com/apache/spark/commit/59a7f142ae9c83c76c2bfbff2962c071fc586122).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
@MaxGekk yea, I noticed that behaviour. Probably, in case we set an empty array in `CommonSettings.selectIndexes`, it seems `UnivocityParser` returns null for valid input? I'm not sure setting an empty array there is valid. Anyway, to fix this bug, I think we don't need to check values record-by-record because `reuiqredSchema` is empty as @HyukjinKwon said.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/459/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
The bug has been already fixed in uniVocity `2.6.5-SNAPSHOT`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21631
Merged to master.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21631
@maropu Could you confirm whether these two bugs are regressions in the master branch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92293/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21631
**[Test build #92317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92317/testReport)** for PR 21631 at commit [`39811a3`](https://github.com/apache/spark/commit/39811a34a5f0fb0f2c97bc096e01129b593315fe).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21631
**[Test build #92282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92282/testReport)** for PR 21631 at commit [`59a7f14`](https://github.com/apache/spark/commit/59a7f142ae9c83c76c2bfbff2962c071fc586122).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
oh, super quick fix ;) Thanks, @MaxGekk
In the master, do we still hit the bug when parsing csv data?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
yea, I checked the two queries with/without column pruning in the master;
```
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=true (default)
scala> val dir = "/tmp/spark-csv/csv"
// NG
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 16:16:30 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scal
a:193)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:305)
// OK
scala> spark.range(10).selectExpr("id % 2 AS p", "id AS c0", "id AS c1").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)", "avg(_c0)").collect()
res11: Array[org.apache.spark.sql.Row] = Array([5,4.5])
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
// NG
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
Caused by: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:195)
at org.apache.spark.sql.catalyst.expressions.JoinedRow.getInt(JoinedRow.scala:82)
// NG
scala> spark.range(10).selectExpr("id % 2 AS p", "id AS c0", "id AS c1").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)", "avg(_c0)").collect()
18/06/25 16:23:55 ERROR TaskSetManager: Task 0 in stage 14.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 32, loca
lhost, executor driver): java.lang.ClassCastException
Caused by: java.lang.ClassCastException
```
The two queries is ok in v2.3.1 and in the master before the commit.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21631
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
It seems `null` for Univocity's `parserLine` is normal way to indicate about an error. Should we handle `null`s and throw `BadRecordException` instead of propagating NPE to user's app?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21631
Both?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
> v2.5.9 also have the same behaviour?
yes, it is the same.
> Anyway, it'd be better to ask the author ;) I asked before and I got quick response.
ok. I will create an issue for uniVocity.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/21631
> But first we should be sure if it's a bug or not for this anyway.
I will try to reproduce it on small example without Spark. I am not sure what the expected behavior should be if set of selected columns is empty. Should the `parseLine` return `null`, empty `array` of `String`s or something else. Will see...
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21631
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPruning en...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21631
I mean `null is returned for valid input string "8"`. I thought this is a bug. If there's valid case returning `null`, yea we should handle `null` of course but the case you mentioned sounds like a bug.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21631: [SPARK-24645][SQL] Skip parsing when csvColumnPru...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/21631
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org