Posted to dev@hive.apache.org by "Rui Li (JIRA)" <ji...@apache.org> on 2014/08/19 14:06:18 UTC

[jira] [Updated] (HIVE-7773) Union all query finished with errors [Spark Branch]

     [ https://issues.apache.org/jira/browse/HIVE-7773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Li updated HIVE-7773:
-------------------------

    Attachment: HIVE-7773.spark.patch

I found that the problem is that IOContext is used to store and retrieve the input path for the operators. IOContext is a singleton when the query is submitted via the Hive CLI. Since Spark tasks run as threads within a single JVM, the input path held in IOContext gets clobbered when concurrent tasks have different input paths. In my test case, two map works (MapWork instances) run concurrently for two different tables.
This patch makes sure we always use a thread-local IOContext.
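For illustration only, here is a minimal sketch of the thread-local pattern the patch description refers to. This is not the actual patch: the class name IOContextHolder and its single inputPath field are made up for this example, and the real IOContext carries more state than one path.
{noformat}
// Hypothetical sketch of a thread-local IOContext-style holder.
// Each Spark task thread gets its own instance, so concurrent map
// tasks reading different tables no longer overwrite each other's
// input path the way a single JVM-wide instance would.
public class IOContextHolder {

    private static final ThreadLocal<IOContextHolder> CONTEXT =
            ThreadLocal.withInitial(IOContextHolder::new);

    private String inputPath;

    public static IOContextHolder get() {
        return CONTEXT.get();
    }

    public String getInputPath() {
        return inputPath;
    }

    public void setInputPath(String inputPath) {
        this.inputPath = inputPath;
    }
}
{noformat}
With a holder like this, the record-reader setup would set the path via IOContextHolder.get().setInputPath(...) on the task thread, and the map operators would read it back on that same thread, so each task's path stays isolated even though all tasks share one executor JVM.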

> Union all query finished with errors [Spark Branch]
> ---------------------------------------------------
>
>                 Key: HIVE-7773
>                 URL: https://issues.apache.org/jira/browse/HIVE-7773
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Rui Li
>            Priority: Critical
>         Attachments: HIVE-7773.spark.patch
>
>
> When I run a union all query, I see the following error in the Spark log (though the query finishes with correct results):
> {noformat}
> java.lang.RuntimeException: Map operator initialization failed
>         at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.init(SparkMapRecordHandler.java:127)
>         at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunction.call(HiveMapFunction.java:52)
>         at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunction.call(HiveMapFunction.java:30)
>         at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
>         at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
>         at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>         at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Configuration and input path are inconsistent
>         at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:404)
>         at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.init(SparkMapRecordHandler.java:93)
>         ... 16 more
> {noformat}
> Judging from the log, I think we don't properly handle the input paths when cloning the job conf, so this may also affect other queries with multiple map or reduce works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)