Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2023/03/08 16:31:00 UTC

[jira] [Commented] (SPARK-42715) NegativeArraySizeException when too much data is read from ORC file

    [ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697999#comment-17697999 ] 

Apache Spark commented on SPARK-42715:
--------------------------------------

User 'chong0929' has created a pull request for this issue:
https://github.com/apache/spark/pull/40341

> NegativeArraySizeException when too much data is read from ORC file
> -------------------------------------------------------------------
>
>                 Key: SPARK-42715
>                 URL: https://issues.apache.org/jira/browse/SPARK-42715
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: XiaoLong Wu
>            Priority: Minor
>
> Could we provide a friendlier exception message that explains how to avoid this error? For example, when this exception is caught, we could tell the user to reduce the value of spark.sql.orc.columnarReaderBatchSize.
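> As a minimal sketch of that workaround (the config key and its 4096-row default are real; the job setup and path are hypothetical):
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> // Sketch: shrink the ORC columnar batch size so the per-batch byte total
> // stays well below Integer.MAX_VALUE.
> SparkSession spark = SparkSession.builder()
>     .appName("orc-read-workaround")  // hypothetical app name
>     // Default is 4096 rows per batch; smaller batches copy fewer bytes at once.
>     .config("spark.sql.orc.columnarReaderBatchSize", "1024")
>     .getOrCreate();
>
> Dataset<Row> df = spark.read().orc("/path/to/file.orc");  // hypothetical path
> {code}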
> In the current version, batch reads of ORC files go through OrcColumnarBatchReader.nextBatch(), which depends on [ORC|https://github.com/apache/orc] (version 1.8.2) to perform the data copy. The relevant ORC code is as follows:
> {code:java}
> private static byte[] commonReadByteArrays(InStream stream, IntegerReader lengths,
>     LongColumnVector scratchlcv,
>     BytesColumnVector result, final int batchSize) throws IOException {
>   // Read lengths
>   scratchlcv.isRepeating = result.isRepeating;
>   scratchlcv.noNulls = result.noNulls;
>   scratchlcv.isNull = result.isNull;  // Notice we are replacing the isNull vector here...
>   lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize);
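>   // NOTE: each length in scratchlcv.vector is a long, but the lengths are
>   // summed into an int; a batch whose total byte length exceeds
>   // Integer.MAX_VALUE overflows the accumulator.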
>   int totalLength = 0;
>   if (!scratchlcv.isRepeating) {
>     for (int i = 0; i < batchSize; i++) {
>       if (!scratchlcv.isNull[i]) {
>         totalLength += (int) scratchlcv.vector[i];
>       }
>     }
>   } else {
>     if (!scratchlcv.isNull[0]) {
>       totalLength = (int) (batchSize * scratchlcv.vector[0]);
>     }
>   }
>   // Read all the strings for this batch
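>   // If the accumulated length overflowed to a negative value, this
>   // allocation throws NegativeArraySizeException.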
>   byte[] allBytes = new byte[totalLength];
>   int offset = 0;
>   int len = totalLength;
>   while (len > 0) {
>     int bytesRead = stream.read(allBytes, offset, len);
>     if (bytesRead < 0) {
>       throw new EOFException("Can't finish byte read from " + stream);
>     }
>     len -= bytesRead;
>     offset += bytesRead;
>   }
>   return allBytes;
> } {code}
> As shown above, totalLength is an int that accumulates the per-row lengths, which are stored as longs in scratchlcv.vector. If the total data size exceeds Integer.MAX_VALUE, the narrowing conversion to int overflows, totalLength becomes negative, and new byte[totalLength] throws the following exception:
> {code:java}
> Caused by: java.lang.NegativeArraySizeException
>     at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998)
>     at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021)
>     at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119)
>     at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
>     at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
>     at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
>     at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
>     at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
>     at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
>     ... 20 more {code}
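> To make the overflow concrete, a standalone sketch (hypothetical numbers; 4096 is the default spark.sql.orc.columnarReaderBatchSize):
> {code:java}
> int batchSize = 4096;                    // default columnarReaderBatchSize
> long bytesPerValue = 524_288L;           // ~512 KiB per string value (assumed)
> long total = batchSize * bytesPerValue;  // 2_147_483_648, one past Integer.MAX_VALUE
> int totalLength = (int) total;           // narrows to -2_147_483_648
> byte[] allBytes = new byte[totalLength]; // throws NegativeArraySizeException
> {code}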



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org