Posted to issues@orc.apache.org by "Ivan Dyptan (Jira)" <ji...@apache.org> on 2020/04/08 16:16:00 UTC

[jira] [Commented] (ORC-435) Ability to read stripes that are greater than 2GB

    [ https://issues.apache.org/jira/browse/ORC-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078442#comment-17078442 ] 

Ivan Dyptan commented on ORC-435:
---------------------------------

[~prasanth_j] [~omalley]
Please consider re-testing the case, as we are hitting a similar error on ORC 1.5.5:
{code:java}
 java -Xmx8g -jar /tmp/orc-tools-1.5.10-SNAPSHOT-uber.jar data /tmp/largestripe.orc

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file /tmp/largestripe.orc [length: 2415919549]
Unable to dump data for file: /tmp/largestripe.orc
java.lang.NegativeArraySizeException
	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1553)
	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:1575)
	at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:1673)
	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
	at org.apache.orc.impl.TreeReaderFactory$ListTreeReader.nextVector(TreeReaderFactory.java:2245)
	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1315)
	at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:203)
	at org.apache.orc.tools.PrintData.main(PrintData.java:241)
	at org.apache.orc.tools.Driver.main(Driver.java:110){code}
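
For completeness, the same code path can be hit without orc-tools. Below is a minimal read sketch, assuming the same /tmp/largestripe.orc file; the class name and batch loop are illustrative and not taken from the report:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class LargeStripeRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed path: the file with the >2GB stripe from the report above.
    Reader reader = OrcFile.createReader(new Path("/tmp/largestripe.orc"),
        OrcFile.readerOptions(conf));
    try (RecordReader rows = reader.rows()) {
      VectorizedRowBatch batch = reader.getSchema().createRowBatch();
      long count = 0;
      // The NegativeArraySizeException from the stack trace above surfaces
      // here, while the string tree reader materializes the byte arrays.
      while (rows.nextBatch(batch)) {
        count += batch.size;
      }
      System.out.println("rows read: " + count);
    }
  }
}
{code}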

> Ability to read stripes that are greater than 2GB
> -------------------------------------------------
>
>                 Key: ORC-435
>                 URL: https://issues.apache.org/jira/browse/ORC-435
>             Project: ORC
>          Issue Type: Bug
>          Components: Reader
>    Affects Versions: 1.3.4, 1.4.4, 1.5.3, 1.6.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Major
>             Fix For: 1.5.4, 1.6.0
>
>
> ORC reader fails with NegativeArraySizeException if the stripe size is >2GB. Even though the default stripe size is 64MB, there are cases where the stripe size reaches >2GB before the memory manager can kick in to check memory size. For example, if we are inserting 500KB strings (mostly unique), by the time we reach 5,000 rows the stripe size is already over 2GB (a write-path sketch of this scenario follows the quoted description below). The reader will have to chunk the disk range reads for such cases instead of reading the stripe as one whole blob.
> Exception thrown when reading such files
> {code:java}
> 2018-10-12 21:43:58,833 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NegativeArraySizeException
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1007)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:835)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1029)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1062)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1085){code}
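
The arithmetic in the quoted description is straightforward to reproduce: once a length derived from the >2GB stripe is cast to int it wraps negative, which is what produces the NegativeArraySizeException. A minimal write-path sketch, assuming a single string column, ~500KB random values, and 5,000 rows (none of these details come from the issue itself):
{code:java}
// Sketch only: writes ~5,000 mostly-unique ~500KB values, so the open
// stripe exceeds 2GB before the memory manager's size check runs.
// Output path, schema, and row count are assumptions for illustration.
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class LargeStripeWrite {
  public static void main(String[] args) throws Exception {
    TypeDescription schema = TypeDescription.fromString("struct<s:string>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/largestripe.orc"),
        OrcFile.writerOptions(new Configuration()).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector col = (BytesColumnVector) batch.cols[0];
    Random rnd = new Random(42);
    for (int row = 0; row < 5000; row++) {
      byte[] value = new byte[500 * 1024];   // ~500KB per value
      rnd.nextBytes(value);                  // mostly unique, incompressible
      col.setRef(batch.size++, value, 0, value.length);
      if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size > 0) {
      writer.addRowBatch(batch);
    }
    writer.close();                          // 5,000 x 500KB ~ 2.4GB of raw data
  }
}
{code}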



--
This message was sent by Atlassian Jira
(v8.3.4#803005)