Posted to user@hive.apache.org by - - <me...@paulerrington.co.uk> on 2022/09/16 08:11:04 UTC

hive and druid

Hi All!  
  
I've set up Hive 3.1.3 with Spark 2.3.0 and an HDFS 3.2.1 cluster.
  
Druid version: apache-druid-0.18.1-1  
  
I have two external tables, one in MySQL and one in Druid. Each table works
fine when queried on its own, but joining them I get this:
  
java.lang.NullPointerException
    at org.apache.hadoop.hive.druid.serde.DruidSelectQueryRecordReader.nextKeyValue(DruidSelectQueryRecordReader.java:62)
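For reference, the tables are declared roughly like this (a simplified
sketch; the datasource, connection details and all names below are
placeholders, not my real schema):

    -- Druid-backed external table (sketch)
    CREATE EXTERNAL TABLE druid_table
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES ('druid.datasource' = 'my_datasource');

    -- MySQL-backed external table via the JDBC storage handler (sketch)
    CREATE EXTERNAL TABLE mysql_table (id INT, name STRING)
    STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
    TBLPROPERTIES (
      'hive.sql.database.type' = 'MYSQL',
      'hive.sql.jdbc.driver'   = 'com.mysql.jdbc.Driver',
      'hive.sql.jdbc.url'      = 'jdbc:mysql://dbhost/mydb',
      'hive.sql.table'         = 'my_table'
    );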
  
I tried a few more selects against the Druid table and realised the problem
is on the Spark executor side: a GROUP BY on the Druid table is enough to
trigger it.
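Even without the join, something along these lines fails the same way
(hypothetical names, my real dimension/table names differ), while a bare
SELECT * on the same table succeeds:

    -- hypothetical repro; real names differ
    SELECT some_dim, COUNT(*)
    FROM druid_table
    GROUP BY some_dim;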
  
Is this down to using Spark 2.3.0 with HDFS 3? CSV external tables work
fine. What versions of Druid are supported? I don't see a compatibility
matrix, but I did have to upgrade Hive from 2.3.7 to 3.1.3 because my Druid
deployment does not support native SELECT queries, only SCAN (as far as I
can tell the select query type was removed from Druid in 0.17, and the
older Hive Druid handler still issues it).

  
I've already had to set this to get around the LZ4 problems:
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
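(In case it helps anyone else hitting the LZ4 issue: assuming Hive-on-Spark
forwards spark.* properties from the session as usual, the equivalent
per-session setting from Beeline would be:)

    set spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec;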
  
I've tried disabling vectorization (I think this was two or three params in
total):
hive.vectorized.execution.enabled=false
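From memory, the full set I toggled was roughly this (both are standard
Hive settings):

    set hive.vectorized.execution.enabled=false;
    set hive.vectorized.execution.reduce.enabled=false;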
  
Full stack trace below. Any help much appreciated, especially on the
version compatibility!
  
  
ERROR : Job failed with java.lang.NullPointerException
java.util.concurrent.ExecutionException: Exception thrown by job
    at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:337)
    at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:342)
    at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:382)
    at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:343)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 33, 10.0.0.25, executor 1): java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:277)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:214)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
    at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:83)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
    at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2178)
    at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2178)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.druid.serde.DruidSelectQueryRecordReader.nextKeyValue(DruidSelectQueryRecordReader.java:62)
    at org.apache.hadoop.hive.druid.serde.DruidSelectQueryRecordReader.next(DruidSelectQueryRecordReader.java:85)
    at org.apache.hadoop.hive.druid.serde.DruidSelectQueryRecordReader.next(DruidSelectQueryRecordReader.java:38)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
    ... 22 more
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:277)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:214)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
    at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:83)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
    at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2178)
    at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2178)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.druid.serde.DruidSelectQueryRecordReader.nextKeyValue(DruidSelectQueryRecordReader.java:62)
    at org.apache.hadoop.hive.druid.serde.DruidSelectQueryRecordReader.next(DruidSelectQueryRecordReader.java:85)
    at org.apache.hadoop.hive.druid.serde.DruidSelectQueryRecordReader.next(DruidSelectQueryRecordReader.java:38)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
    ... 22 more

ERROR : FAILED: Execution Error, return code 3 from
org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed during
runtime. Please check stacktrace for the root cause.


Re: hive and druid

Posted by Aaron Grubb <aa...@kaden.ai>.
Double-check that your Druid servers aren't having issues. I had this same error but with LLAP and it turned out that my historicals were running out of heap space when the query was executed by Hive.
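If that's the cause, the historical logs should show OutOfMemoryErrors
around the time of the query, and bumping the historical heap fixes it. The
heap is set in the historical jvm.config (exact path depends on your
layout, e.g. conf/druid/cluster/data/historical/jvm.config; the 8g below is
just an example, size it to your segments):

    -Xms8g
    -Xmx8g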
