Posted to issues@spark.apache.org by "Takeshi Yamamuro (JIRA)" <ji...@apache.org> on 2017/06/08 15:59:18 UTC

[jira] [Created] (SPARK-21024) CSV parse mode handles Univocity parser exceptions

Takeshi Yamamuro created SPARK-21024:
----------------------------------------

             Summary: CSV parse mode handles Univocity parser exceptions
                 Key: SPARK-21024
                 URL: https://issues.apache.org/jira/browse/SPARK-21024
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.1
            Reporter: Takeshi Yamamuro
            Priority: Minor


The current master cannot skip illegal records that make the Univocity parser throw exceptions (e.g., TextParsingException).
This was reported on the spark-user mailing list:
https://www.mail-archive.com/user@spark.apache.org/msg63985.html

{code}
scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data")
scala> val df = spark.read.format("csv").schema("a int, b int").option("maxColumns", "3").load("/Users/maropu/Desktop/data")
scala> df.show

com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 3
Hint: Number of columns processed may have exceeded limit of 3 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
        Auto configuration enabled=true
        Autodetect column delimiter=false
        Autodetect quotes=false
        Column reordering enabled=true
        Empty value=null
        Escape unquoted values=false
        ...

at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195)
at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
...
{code}
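Note that setting a parse mode does not help here: FailureSafeParser only handles BadRecordException, so Univocity's TextParsingException propagates through it uncaught. As a minimal illustration (using the same data as above), the following still fails instead of dropping the bad record:

{code}
scala> val df = spark.read.format("csv")
         .schema("a int, b int")
         .option("maxColumns", "3")
         .option("mode", "DROPMALFORMED")  // expected: skip the malformed record
         .load("/Users/maropu/Desktop/data")
scala> df.show  // actual: throws TextParsingException as above
{code}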

We could easily fix this along these lines: https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser
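The gist of such a fix (a minimal sketch, not the exact diff in the branch above; the BadRecordException wiring is an assumption based on how FailureSafeParser dispatches on the parse mode) is to catch the Univocity exception inside UnivocityParser.parse and rethrow it as a BadRecordException, so the configured mode can take effect:

{code}
// Sketch, in org.apache.spark.sql.execution.datasources.csv.UnivocityParser:
import com.univocity.parsers.common.TextParsingException
import org.apache.spark.sql.catalyst.util.BadRecordException
import org.apache.spark.unsafe.types.UTF8String

def parse(input: String): InternalRow = {
  try {
    convert(tokenizer.parseLine(input))
  } catch {
    case e: TextParsingException =>
      // Wrap the raw parser failure so FailureSafeParser can apply
      // PERMISSIVE / DROPMALFORMED / FAILFAST semantics to this record.
      throw BadRecordException(() => UTF8String.fromString(input), () => None, e)
  }
}
{code}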


