Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/04/24 02:42:38 UTC

[jira] [Updated] (SPARK-6082) SparkSQL should fail gracefully when input data format doesn't match expectations

     [ https://issues.apache.org/jira/browse/SPARK-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6082:
-----------------------------
    Assignee: Cheng Lian

> SparkSQL should fail gracefully when input data format doesn't match expectations
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-6082
>                 URL: https://issues.apache.org/jira/browse/SPARK-6082
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1
>            Reporter: Kay Ousterhout
>            Assignee: Cheng Lian
>             Fix For: 1.3.0
>
>
> I have a UDF that creates a tab-delimited table. If any of the column values contains a tab, SQL fails with an ArrayIndexOutOfBoundsException (pasted below). It would be great if SQL failed gracefully here, with a helpful exception (something like "One row contained too many values").
> It looks like this can be done quite easily, by checking whether i >= columnBuilders.size and, if so, throwing a nicer exception here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala#L124.
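> A minimal sketch of that guard (hedged: the loop below paraphrases the batching code around that line, and names like rowIterator, rowCount, and batchSize are assumptions, not the exact source):
> {code:scala}
> // (assumes org.apache.spark.SparkException is imported)
> while (rowIterator.hasNext && rowCount < batchSize) {
>   val row = rowIterator.next()
>   // Proposed check: fail with a descriptive message instead of letting
>   // columnBuilders(i) throw a bare ArrayIndexOutOfBoundsException.
>   if (row.length != columnBuilders.length) {
>     throw new SparkException(
>       s"Row column number mismatch: expected ${columnBuilders.length} " +
>         s"columns, but got ${row.length}. Row content: $row")
>   }
>   var i = 0
>   while (i < row.length) {
>     columnBuilders(i).appendFrom(row, i)
>     i += 1
>   }
>   rowCount += 1
> }
> {code}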
> One thing that makes this problem especially annoying to debug is that if you do "CREATE table foo as select transform(..." and then "CACHE table foo", it works fine. It only fails if you do "CACHE table foo as select transform(...". Because of this, it would be great if the problem were more transparent to users.
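> For concreteness, a hypothetical reproduction of that difference (hedged: the table names, the splitter.py script, and the src table are illustrative assumptions, not details from the report; TRANSFORM requires a HiveContext, here named hiveCtx):
> {code:scala}
> // Works: the table is materialized first, then cached as a whole.
> hiveCtx.sql("CREATE TABLE foo AS SELECT TRANSFORM(line) USING 'splitter.py' AS (a, b, c) FROM src")
> hiveCtx.sql("CACHE TABLE foo")
>
> // Fails with ArrayIndexOutOfBoundsException while building the cached
> // batch, if splitter.py ever emits a value that itself contains a tab.
> hiveCtx.sql("CACHE TABLE bar AS SELECT TRANSFORM(line) USING 'splitter.py' AS (a, b, c) FROM src")
> {code}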
> Stack trace:
> java.lang.ArrayIndexOutOfBoundsException: 3
>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:125)
>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:112)
>   at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:220)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



