Posted to user@spark.apache.org by Eirik Thorsnes <ei...@uni.no> on 2018/03/23 15:03:29 UTC

ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

Hi all,

I'm trying the new native ORC implementation in Spark 2.3
(org.apache.spark.sql.execution.datasources.orc).

I've compiled Spark 2.3 from the git branch-2.3 as of March 20th.
I also get the same error with the Spark 2.2 that ships with Hortonworks HDP 2.6.4.

*NOTE*: the error only occurs with zlib compression. With Snappy I get an
extra log line saying "OrcCodecPool: Got brand-new codec SNAPPY", so
perhaps the zlib codec is never loaded/triggered in the new code?

I can write using the new native codepath without errors, but when *reading*
zlib-compressed ORC, either the newly written ORC files *or* older
ORC files written with Spark 2.2/1.6, I get the following exception.

======= cut =========
2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path:
hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc,
range: 0-134217728, partition values: [1999]
2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from
hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc
with {include: [true, true, true, true, true, true, true, true, true],
offset: 0, length: 134217728}
2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not
provided -- using file schema
struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>

2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage
1.0 (TID 1)
java.nio.BufferUnderflowException
        at java.nio.Buffer.nextGetIndex(Buffer.java:500)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
        at org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
        at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
======= cut =========

I have the following set in spark-defaults.conf:

spark.sql.hive.convertMetastoreOrc true
spark.sql.orc.char.enabled true
spark.sql.orc.enabled true
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.orc.enableVectorizedReader true
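
(For quick experiments I can also flip the standard settings at runtime in the
shell instead of in spark-defaults.conf; a minimal sketch, leaving out
spark.sql.orc.enabled and spark.sql.orc.char.enabled, which I believe are
HDP-specific:)

```
// Session-level equivalents of the relevant spark-defaults.conf entries (sketch)
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
```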


If I set these to false and use the old Hive reader (or specify the full
classname of the old Hive reader in the spark-shell), I get correct results
with both new and old ORC files.
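
Concretely, the fallback that works for me is roughly the following (a
sketch; the impl value and the datasource classname are what I pass in the
shell, as far as I recall, and the path is elided as above):

```
// Fall back to the old Hive-based ORC reader via the impl setting (sketch)
spark.conf.set("spark.sql.orc.impl", "hive")
spark.read.orc("hdfs://.../year=1999").show()

// ...or name the old Hive ORC datasource explicitly instead of the default "orc"
spark.read.format("org.apache.spark.sql.hive.orc").load("hdfs://.../year=1999").show()
```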

If I use Snappy compression, the new reader works without error.

NOTE: I'm running on Hortonworks HDP 2.6.4 (Hadoop 2.7.3), and I also get
the same error with the Spark 2.2 shipped there, which I understand includes
many of the patches from the Spark 2.3 branch.

Should this be reported in the JIRA system?

Regards,
Eirik

-- 
Eirik Thorsnes


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

Posted by Xiao Li <ga...@gmail.com>.
Hi, Eirik,

Yes, please open a JIRA.

Thanks,

Xiao


Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

Posted by Eirik Thorsnes <ei...@uni.no>.
On 28 March 2018 03:26, Dongjoon Hyun wrote:
> You may hit SPARK-23355 (convertMetastore should not ignore table properties).
> 
> Since it's a known Spark issue for all Hive tables (Parquet/ORC), could you check that too?
> 
> Bests,
> Dongjoon.
> 

Hi,

I think you might be right. I can run your example from the other email OK
(spark.range(10).write.orc("/tmp/zlib_test") followed by
spark.read.orc("/tmp/zlib_test").show).

I can also do:

spark.range(10).write.format("orc").option("compression","zlib").saveAsTable("zlib_test3")

with a corresponding read. However, trying to read a more complicated,
partitioned table fails; perhaps it is because of the partitioning?
I'm looking more into it now (roughly what I'm trying is sketched below).
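
(A minimal sketch of the kind of partitioned zlib table I'm trying; the table
and column names are made up for the test:)

```
// Write a small partitioned zlib ORC table via the native path, then read it back (sketch)
spark.range(100)
  .selectExpr("id", "cast(id % 10 as int) as year")
  .write
  .format("orc")
  .option("compression", "zlib")
  .partitionBy("year")
  .saveAsTable("zlib_test_partitioned")

spark.table("zlib_test_partitioned").show()   // the partitioned read is where I expect the failure
```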

Regards,
Eirik

-- 
Eirik Thorsnes




Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

Posted by Dongjoon Hyun <do...@apache.org>.
You may hit SPARK-23355 (convertMetastore should not ignore table properties).

Since it's a known Spark issue for all Hive tables (Parquet/ORC), could you check that too?
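
For example, to see what the metastore has recorded for the table (including
any compression property that the convertMetastore path could be picking up
or ignoring), something like this can be run in the shell (a sketch; the
table name is just a placeholder):

```
// Inspect table-level properties and storage info as recorded in the metastore (sketch)
sql("SHOW TBLPROPERTIES your_orc_table").show(false)
sql("DESCRIBE FORMATTED your_orc_table").show(100, false)
```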

Bests,
Dongjoon.




Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

Posted by Dongjoon Hyun <do...@apache.org>.
Hi, Eirik.

For me, Spark 2.3 works correctly, as shown below. Could you give us a reproducible example?

```
scala> sql("set spark.sql.orc.impl=native")

scala> sql("set spark.sql.orc.compression.codec=zlib")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.range(10).write.orc("/tmp/zlib_test")

scala> spark.read.orc("/tmp/zlib_test").show
+---+
| id|
+---+
|  8|
|  9|
|  5|
|  0|
|  3|
|  4|
|  6|
|  7|
|  1|
|  2|
+---+

scala> sc.version
res4: String = 2.3.0
```

Bests,
Dongjoon.

