Posted to issues@orc.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2022/06/18 10:43:00 UTC

[jira] [Comment Edited] (ORC-1205) Size of batches in some ConvertTreeReaders should be ensured before using

    [ https://issues.apache.org/jira/browse/ORC-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555887#comment-17555887 ] 

László Bodor edited comment on ORC-1205 at 6/18/22 10:42 AM:
-------------------------------------------------------------

Same issue with 1.7.5, with a slightly different stack trace:
{code}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 27
	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
	at org.apache.orc.impl.TreeReaderFactory$FloatTreeReader.nextVector(TreeReaderFactory.java:894)
	at org.apache.orc.impl.ConvertTreeReaderFactory$DecimalFromDoubleTreeReader.nextVector(ConvertTreeReaderFactory.java:897)
	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:88)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:104)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:265)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:241)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:589)
	... 54 more
{code}

This is the same Hive repro; I'll add a unit test for ORC.
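For reference, a self-contained sketch of the failure pattern (class and method names here are hypothetical, not actual ORC code): an intermediate buffer allocated for the first batch overflows on a larger later batch unless its size is ensured again before writing.

{code}
// Hypothetical sketch (not ORC code): an intermediate buffer sized for the
// first batch overflows when a later batch is larger, unless it is re-ensured.
class IntermediateBuffer {
  double[] vector;

  IntermediateBuffer(int size) {
    vector = new double[size];
  }

  // analogous to ColumnVector.ensureSize(int size, boolean preserveData)
  void ensureSize(int size, boolean preserveData) {
    if (vector.length < size) {
      double[] grown = new double[size];
      if (preserveData) {
        System.arraycopy(vector, 0, grown, 0, vector.length);
      }
      vector = grown;
    }
  }
}

public class BatchResizeDemo {
  static IntermediateBuffer buffer;

  static void readBatch(int batchSize) {
    if (buffer == null) {
      buffer = new IntermediateBuffer(batchSize); // sized to the FIRST batch only
    }
    // the fix: without this call, writing row 27 of the larger second batch
    // throws ArrayIndexOutOfBoundsException: 27, as in the stack trace above
    buffer.ensureSize(batchSize, false);
    for (int i = 0; i < batchSize; i++) {
      buffer.vector[i] = i;
    }
  }

  public static void main(String[] args) {
    readBatch(27); // stripe 1: 27 rows
    readBatch(30); // stripe 2: 30 rows
    System.out.println("ok: " + buffer.vector.length); // prints "ok: 30"
  }
}
{code}

In the real reader the intermediate vector corresponds to doubleColVector in DecimalFromDoubleTreeReader, so ensuring its size before the inner nextVector call mirrors the one-line fix described below.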




> Size of batches in some ConvertTreeReaders should be ensured before using
> -------------------------------------------------------------------------
>
>                 Key: ORC-1205
>                 URL: https://issues.apache.org/jira/browse/ORC-1205
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.6.14, 1.7.5
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>
> Given this ORC file:
> {code}
> Rows: 57
> Compression: ZLIB
> Compression size: 262144
> Calendar: Julian/Gregorian
> Type: struct<_col0:timestamp,_col1:string,_col2:int,_col3:int,_col4:string,_col5:float,_col6:timestamp>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 27 hasNull: false
>     Column 1: count: 27 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
>     Column 2: count: 27 hasNull: false min: I max: I sum: 27
>     Column 3: count: 27 hasNull: false min: 19752356 max: 20524679 sum: 551077013
>     Column 4: count: 27 hasNull: false min: 34 max: 154 sum: 2568
>     Column 5: count: 27 hasNull: false min:  max: 692 sum: 29
>     Column 6: count: 27 hasNull: false min: -99988.0 max: 0.0 sum: -2299724.0
>     Column 7: count: 27 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
>   Stripe 2:
>     Column 0: count: 30 hasNull: false
>     Column 1: count: 30 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
>     Column 2: count: 30 hasNull: false min: I max: I sum: 30
>     Column 3: count: 30 hasNull: false min: 19752356 max: 20524679 sum: 611106400
>     Column 4: count: 30 hasNull: false min: 34 max: 154 sum: 2923
>     Column 5: count: 30 hasNull: false min:  max: 692 sum: 21
>     Column 6: count: 30 hasNull: false min: -99988.0 max: 0.0 sum: -2699676.0
>     Column 7: count: 30 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
> ...
> {code}
> This leads to reading a batch of size 27 and then another of size 30.
> On the second batch we get:
> {code}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 27
> 	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:306)
> 	at org.apache.orc.impl.TreeReaderFactory$FloatTreeReader.nextVector(TreeReaderFactory.java:690)
> 	at org.apache.orc.impl.ConvertTreeReaderFactory$DecimalFromDoubleTreeReader.nextVector(ConvertTreeReaderFactory.java:867)
> 	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2047)
> 	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1219)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:88)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:104)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:265)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:241)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:589)
> {code}
> This is thrown from here (ignore the line numbers in the stack above; they belong to another distro):
> https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L388
> I fixed this problem by adding another ensure call here:
> https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L901
> {code}
> doubleColVector.ensureSize(batchSize, false);
> {code}
> In general, ConvertTreeReader instances use multiple vector variables (because of the conversion), but while reading we only ensure the size of one of them:
> https://github.com/apache/orc/blob/b5945001f670a5a44250e76aea1ea704bfd0e29d/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L2046
> I've set 1.6.9 as an affected version, as I can reproduce this on hive/master, which currently depends on ORC 1.6.9.
> On the main branch I haven't seen the corresponding ensure call; I need to check what changed since branch-1.6.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)