Posted to issues@spark.apache.org by "Dhruve Ashar (JIRA)" <ji...@apache.org> on 2019/03/08 16:38:00 UTC

[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

    [ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16788021#comment-16788021 ] 

Dhruve Ashar commented on SPARK-27107:
--------------------------------------

[~dongjoon], can you review the ORC PR that fixes this issue? [https://github.com/apache/orc/pull/372]

Once it is merged, we can fix the issue in Spark as well. Until then, the only workaround is to use the Hive-based implementation, as sketched below.
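
A minimal sketch of the workaround (the app name and path are placeholders; {{spark.sql.orc.impl}} selects the ORC reader implementation, and setting it to "hive" falls back to the Hive 1.2 based reader):
{code:scala}
import org.apache.spark.sql.SparkSession

// Fall back to the Hive 1.2 based ORC reader, which serializes the
// SearchArgument with a growable Kryo buffer instead of the fixed 100K one.
val spark = SparkSession.builder()
  .appName("orc-sarg-workaround")        // placeholder app name
  .config("spark.sql.orc.impl", "hive")  // the "native" reader is the one that overflows
  .getOrCreate()

val df = spark.read.orc("/path/to/data.orc")  // placeholder path
{code}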

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --------------------------------------------------------------
>
>                 Key: SPARK-27107
>                 URL: https://issues.apache.org/jira/browse/SPARK-27107
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Dhruve Ashar
>            Priority: Major
>
> The issue occurs while trying to read ORC data: serializing the SearchArgument overflows Kryo's buffer.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 9
> Serialization trace:
> literalList (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
> 	at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
> 	at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
> 	at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
> 	at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
> 	at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
> 	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> 	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> 	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> 	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> 	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> 	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> 	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> 	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> 	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> 	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> 	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> 	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> 	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
> 	at org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
> 	at org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
> 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
> 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
> 	at scala.Option.foreach(Option.scala:257)
> 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
> 	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
> 	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
> 	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
> 	at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
> 	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
> 	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
> 	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
> 	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
> 	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
> 	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> 	... 52 more
> {code}
> This happens only with the new Apache ORC based implementation and does not happen with the Hive based implementation. A minimal repro sketch with placeholders follows.
>  
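> A sketch that can reproduce this, assuming a {{spark}} session in scope (e.g. spark-shell); the path and column name are placeholders, and any pushed-down filter whose serialized SearchArgument exceeds the 100K buffer, such as a very large IN list, should trigger it:
> {code:scala}
> import org.apache.spark.sql.functions.col
>
> // Filter pushdown must be enabled for the SearchArgument to be built at all.
> spark.conf.set("spark.sql.orc.filterPushdown", "true")
>
> // A huge IN list becomes a large literalList in the pushed-down predicate leaf.
> val ids = (0L until 300000L)
> spark.read.orc("/path/to/data.orc")   // placeholder path
>   .where(col("id").isin(ids: _*))     // serialized into the SearchArgument via Kryo
>   .count()
> {code}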
> Reason:
> The Hive implementation (1.2) sets the default buffer size to 4M and the max buffer size to 10M:
> [https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java#L998]
> The ORC implementation, on the other hand, sets the size to 100K:
> [https://github.com/apache/orc/blob/master/java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java#L93]
> Since Kryo cannot grow a buffer past its maximum size, any SearchArgument that serializes to more than 100K overflows it, as the sketch below shows.
>  
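> The difference can be seen with Kryo's {{Output}} directly (a standalone sketch, not Spark or ORC code; the loop counts are only illustrative):
> {code:scala}
> import com.esotericsoftware.kryo.KryoException
> import com.esotericsoftware.kryo.io.Output
>
> // new Output(n) fixes maxBufferSize = n, so the buffer can never grow;
> // this mirrors ORC's 100K buffer in OrcInputFormat.setSearchArgument.
> val fixed = new Output(100 * 1024)
> try {
>   (0 until 20000).foreach(i => fixed.writeLong(i.toLong)) // ~160K > 100K
> } catch {
>   case e: KryoException => println(e.getMessage) // "Buffer overflow. ..."
> }
>
> // new Output(n, max) starts small but grows up to max; this mirrors Hive's
> // 4M default / 10M max, which is why the Hive path survives the same data.
> val growable = new Output(4 * 1024, 10 * 1024 * 1024)
> (0 until 20000).foreach(i => growable.writeLong(i.toLong)) // succeeds
> growable.close()
> {code}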
> We need to fix this in the ORC library and then update the ORC version in Spark to resolve the issue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org