Posted to issues@spark.apache.org by "Jakub Nowacki (JIRA)" <ji...@apache.org> on 2016/12/04 22:26:59 UTC

[jira] [Comment Edited] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

    [ https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720710#comment-15720710 ] 

Jakub Nowacki edited comment on SPARK-18699 at 12/4/16 10:26 PM:
-----------------------------------------------------------------

While I don't argue that some other packages have similar behaviour, I think the PERMISSIVE mode should be, well, as permissive as possible, since CSVs have very little standardization and no types. In my case I had just one odd value in an almost 1 TB dataset, and the job crashed at the very end, after about an hour. To work around the issue one needs to parse each line manually, which is not the end of the world, but I wanted to use the CSV reader precisely for the confidence of not having to write that extra code. IMO the mode meant for error detection should be FAILFAST. Moreover, if I really need to check the data, I read it differently anyway.
BTW thanks for looking into this.
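
For reference, this is roughly how the parsing mode is selected on the reader (a minimal sketch; the schema and path below are made up for illustration):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-modes").getOrCreate()

// Hypothetical schema and path, just to show where the option goes.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("created", TimestampType)))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  // PERMISSIVE (the default) should null out unparsable values,
  // DROPMALFORMED should drop the whole line, FAILFAST should abort.
  .option("mode", "PERMISSIVE")
  .csv("/data/events/*.csv")
{code}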


> Spark CSV parsing types other than String throws exception when malformed
> -------------------------------------------------------------------------
>
>                 Key: SPARK-18699
>                 URL: https://issues.apache.org/jira/browse/SPARK-18699
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Jakub Nowacki
>
> If a CSV file is read with a schema that contains any type other than String, an exception is thrown when the string value in the CSV is malformed; e.g. if a timestamp does not match the defined format:
> {code}
> Caused by: java.lang.IllegalArgumentException
> 	at java.sql.Date.valueOf(Date.java:143)
> 	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
> 	at scala.util.Try.getOrElse(Try.scala:79)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
> 	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> 	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
> 	... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding, the PERMISSIVE mode should just null the value and DROPMALFORMED should drop the line, but instead both kill the job.
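> A minimal sketch of a reproduction (hypothetical file contents and path, assuming a SparkSession named {{spark}} as in spark-shell):
> {code}
> import org.apache.spark.sql.types._
>
> // events.csv contains one well-formed and one malformed timestamp, e.g.:
> //   2016-01-01 10:00:00
> //   not-a-timestamp
> val schema = StructType(Seq(StructField("ts", TimestampType)))
> val df = spark.read.schema(schema).csv("events.csv")
> df.show() // the default PERMISSIVE mode still throws java.lang.IllegalArgumentException
> {code}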


