Posted to issues@orc.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2022/06/18 06:23:00 UTC

[jira] [Updated] (ORC-1205) Size of batches in some ConvertTreeReaders should be ensured before using

     [ https://issues.apache.org/jira/browse/ORC-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor updated ORC-1205:
------------------------------
    Description: 
Given this ORC file:
{code}
Rows: 57
Compression: ZLIB
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<_col0:timestamp,_col1:string,_col2:int,_col3:int,_col4:string,_col5:float,_col6:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 27 hasNull: false
    Column 1: count: 27 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
    Column 2: count: 27 hasNull: false min: I max: I sum: 27
    Column 3: count: 27 hasNull: false min: 19752356 max: 20524679 sum: 551077013
    Column 4: count: 27 hasNull: false min: 34 max: 154 sum: 2568
    Column 5: count: 27 hasNull: false min:  max: 692 sum: 29
    Column 6: count: 27 hasNull: false min: -99988.0 max: 0.0 sum: -2299724.0
    Column 7: count: 27 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
  Stripe 2:
    Column 0: count: 30 hasNull: false
    Column 1: count: 30 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
    Column 2: count: 30 hasNull: false min: I max: I sum: 30
    Column 3: count: 30 hasNull: false min: 19752356 max: 20524679 sum: 611106400
    Column 4: count: 30 hasNull: false min: 34 max: 154 sum: 2923
    Column 5: count: 30 hasNull: false min:  max: 692 sum: 21
    Column 6: count: 30 hasNull: false min: -99988.0 max: 0.0 sum: -2699676.0
    Column 7: count: 30 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
...
{code}

This leads to reading a batch of size 27 and then another of size 30.
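
For reference, a minimal reader loop reproduces the two batch sizes; the file path argument and the decimal precision in the reading schema below are illustrative, and reading _col5 as decimal instead of float is what exercises the failing conversion path:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class BatchSizes {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path(args[0]),
        OrcFile.readerOptions(new Configuration()));
    // Reading _col5 as decimal instead of float forces the
    // DecimalFromDoubleTreeReader conversion path; decimal(12,2) is an
    // arbitrary choice that fits the column's value range.
    TypeDescription readSchema = TypeDescription.fromString(
        "struct<_col0:timestamp,_col1:string,_col2:int,_col3:int,"
            + "_col4:string,_col5:decimal(12,2),_col6:timestamp>");
    RecordReader rows = reader.rows(reader.options().schema(readSchema));
    VectorizedRowBatch batch = readSchema.createRowBatch();
    while (rows.nextBatch(batch)) {
      // Batches do not cross stripe boundaries, so this prints 27 and
      // then (before the fix) fails while reading the 30-row batch.
      System.out.println("batch size = " + batch.size);
    }
    rows.close();
  }
}
{code}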

On the second batch we get:
{code}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 27
	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:293)
	at org.apache.orc.impl.TreeReaderFactory$FloatTreeReader.nextVector(TreeReaderFactory.java:690)
	at org.apache.orc.impl.ConvertTreeReaderFactory$DecimalFromDoubleTreeReader.nextVector(ConvertTreeReaderFactory.java:951)
	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2060)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:88)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:104)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:255)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:230)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:605)
	... 51 more
{code}

The exception is thrown from here (ignore the line numbers in the stack trace above; they belong to another distro):
https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L388
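
The failure mode is a reused intermediate vector: the conversion reader keeps a DoubleColumnVector across batches, and the delegate FloatTreeReader writes batchSize entries into it without checking its capacity. A standalone sketch of that pattern (the vector here stands in for the converter's internal one):
{code}
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;

public class ReusedVectorSketch {
  public static void main(String[] args) {
    // Intermediate vector allocated while reading the first batch
    // (27 rows in the file above).
    DoubleColumnVector doubleColVector = new DoubleColumnVector(27);

    // Second batch: 30 rows. The backing array is still 27 long, so
    // writing row 27 throws ArrayIndexOutOfBoundsException: 27.
    int batchSize = 30;
    // doubleColVector.ensureSize(batchSize, false);  // the missing call
    for (int r = 0; r < batchSize; r++) {
      doubleColVector.vector[r] = 0.0;
    }
  }
}
{code}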

I fixed this problem by adding another ensureSize call here:
https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L901
{code}
doubleColVector.ensureSize(batchSize, false);
{code}
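
In context, the call sits right before the delegate read. A sketch of the converter's shape (field and method names are simplified placeholders, not the exact upstream code):
{code}
import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;

// Simplified shape of DecimalFromDoubleTreeReader; readDoubles/convert
// stand in for the real delegate reader and conversion logic.
abstract class DecimalFromDoubleSketch {
  private DoubleColumnVector doubleColVector;
  private ColumnVector decimalColVector;

  void nextVector(ColumnVector previousVector, boolean[] isNull,
                  int batchSize) throws IOException {
    if (doubleColVector == null) {
      // First call: the intermediate vector is sized to the first
      // batch (27 rows here).
      doubleColVector = new DoubleColumnVector(batchSize);
      decimalColVector = previousVector;
    }
    // The fix: a later batch can be larger (30 rows here), so grow the
    // intermediate vector before the delegate fills batchSize slots;
    // ensureSize is a no-op when the vector is already large enough.
    doubleColVector.ensureSize(batchSize, false);
    readDoubles(doubleColVector, isNull, batchSize);
    convert(doubleColVector, decimalColVector, batchSize);
  }

  abstract void readDoubles(ColumnVector v, boolean[] isNull, int n)
      throws IOException;

  abstract void convert(ColumnVector from, ColumnVector to, int n);
}
{code}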

In general, ConvertTreeReader instances use multiple vector variables (because of the conversion), but we only ensure the size of one of them while reading, so the other, reused intermediate vectors can stay sized for an earlier, smaller batch.


> Size of batches in some ConvertTreeReaders should be ensured before using
> -------------------------------------------------------------------------
>
>                 Key: ORC-1205
>                 URL: https://issues.apache.org/jira/browse/ORC-1205
>             Project: ORC
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>


