Posted to issues@orc.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2022/06/18 06:23:00 UTC
[jira] [Updated] (ORC-1205) Size of batches in some ConvertTreeReaders should be ensured before using
[ https://issues.apache.org/jira/browse/ORC-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated ORC-1205:
------------------------------
Description:
Given this ORC file:
{code}
Rows: 57
Compression: ZLIB
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<_col0:timestamp,_col1:string,_col2:int,_col3:int,_col4:string,_col5:float,_col6:timestamp>
Stripe Statistics:
Stripe 1:
Column 0: count: 27 hasNull: false
Column 1: count: 27 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
Column 2: count: 27 hasNull: false min: I max: I sum: 27
Column 3: count: 27 hasNull: false min: 19752356 max: 20524679 sum: 551077013
Column 4: count: 27 hasNull: false min: 34 max: 154 sum: 2568
Column 5: count: 27 hasNull: false min: max: 692 sum: 29
Column 6: count: 27 hasNull: false min: -99988.0 max: 0.0 sum: -2299724.0
Column 7: count: 27 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
Stripe 2:
Column 0: count: 30 hasNull: false
Column 1: count: 30 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
Column 2: count: 30 hasNull: false min: I max: I sum: 30
Column 3: count: 30 hasNull: false min: 19752356 max: 20524679 sum: 611106400
Column 4: count: 30 hasNull: false min: 34 max: 154 sum: 2923
Column 5: count: 30 hasNull: false min: max: 692 sum: 21
Column 6: count: 30 hasNull: false min: -99988.0 max: 0.0 sum: -2699676.0
Column 7: count: 30 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
...
{code}
This leads to a read of a batch of size 27 and then another of size 30. On the second batch we get:
{code}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 27
at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:293)
at org.apache.orc.impl.TreeReaderFactory$FloatTreeReader.nextVector(TreeReaderFactory.java:690)
at org.apache.orc.impl.ConvertTreeReaderFactory$DecimalFromDoubleTreeReader.nextVector(ConvertTreeReaderFactory.java:951)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2060)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:88)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:104)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:255)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:230)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:605)
... 51 more
{code}
This is thrown from here (ignore the line numbers in the trace above; they belong to another distribution):
https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L388
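The failure mode can be reproduced in miniature: a buffer sized for the first batch is reused for the next, larger batch without being grown first. A self-contained sketch with a plain array standing in for ORC's column vector (the class and field names here are illustrative, not ORC's real ones):
{code}
public class BatchSizeSketch {
    // Stand-in for a column vector whose backing array is reused across batches.
    static double[] vector = new double[27]; // sized by the first batch

    // Mimics ColumnVector.ensureSize(size, preserveData): grow the array if needed.
    static void ensureSize(int size, boolean preserveData) {
        if (vector.length < size) {
            double[] grown = new double[size];
            if (preserveData) {
                System.arraycopy(vector, 0, grown, 0, vector.length);
            }
            vector = grown;
        }
    }

    // Stand-in for TreeReader.nextVector filling the vector up to batchSize.
    static void readBatch(int batchSize) {
        for (int i = 0; i < batchSize; i++) {
            vector[i] = i; // without ensureSize: ArrayIndexOutOfBoundsException: 27
        }
    }

    public static void main(String[] args) {
        readBatch(27);         // first stripe: fits the initial size
        ensureSize(30, false); // the missing call; without it the next line fails
        readBatch(30);         // second stripe: fine once the vector is grown
        System.out.println("vector length: " + vector.length);
    }
}
{code}
Commenting out the ensureSize call reproduces the exception above with index 27, matching the first stripe's row count.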
I fixed this problem by adding another ensureSize call here:
https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L901
{code}
doubleColVector.ensureSize(batchSize, false);
{code}
In general, ConvertTreeReader instances use multiple vector variables (because of the conversion), but while reading we only ensure the size of one of them.
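The shape of the fix can be sketched as follows: a converting reader fills an intermediate vector (here the double vector backing the float reader) and then converts into the batch's result vector, so both must be grown to the batch size. This is a hedged, self-contained illustration with a hypothetical Vector stand-in, not ORC's actual classes:
{code}
import java.util.Arrays;

public class ConvertReaderSketch {
    // Hypothetical stand-in for a ColumnVector with ensureSize semantics.
    static class Vector {
        double[] data;
        Vector(int size) { data = new double[size]; }
        void ensureSize(int size, boolean preserveData) {
            if (data.length < size) {
                data = preserveData ? Arrays.copyOf(data, size) : new double[size];
            }
        }
    }

    // Intermediate vector the underlying reader fills, reused across batches.
    Vector doubleColVector = new Vector(27);

    // Mimics the nextVector of a converting reader: read doubles, then convert.
    void nextVector(Vector result, int batchSize) {
        result.ensureSize(batchSize, false);          // the vector ORC already ensures
        doubleColVector.ensureSize(batchSize, false); // the added call from the fix
        for (int i = 0; i < batchSize; i++) {
            doubleColVector.data[i] = i;              // underlying reader's read
            result.data[i] = doubleColVector.data[i]; // conversion step
        }
    }

    public static void main(String[] args) {
        ConvertReaderSketch reader = new ConvertReaderSketch();
        Vector result = new Vector(27);
        reader.nextVector(result, 27); // first stripe
        reader.nextVector(result, 30); // second stripe: no longer overflows
        System.out.println("result length: " + result.data.length);
    }
}
{code}
Without the second ensureSize call, the intermediate vector stays at the first stripe's size and overflows on the larger batch, which is exactly the bug described above.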
> Size of batches in some ConvertTreeReaders should be ensured before using
> -------------------------------------------------------------------------
>
> Key: ORC-1205
> URL: https://issues.apache.org/jira/browse/ORC-1205
> Project: ORC
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.7#820007)