Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2023/03/08 16:31:00 UTC
[jira] [Commented] (SPARK-42715) NegativeArraySizeException by too many datas read from ORC file
[ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697999#comment-17697999 ]
Apache Spark commented on SPARK-42715:
--------------------------------------
User 'chong0929' has created a pull request for this issue:
https://github.com/apache/spark/pull/40341
> NegativeArraySizeException by too many datas read from ORC file
> ---------------------------------------------------------------
>
> Key: SPARK-42715
> URL: https://issues.apache.org/jira/browse/SPARK-42715
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.2
> Reporter: XiaoLong Wu
> Priority: Minor
>
> Should we provide a friendlier exception message that explains how to avoid this exception? For example, when we catch this exception, we could tell the user to reduce the value of spark.sql.orc.columnarReaderBatchSize.
> In the current version, batch reading of ORC files goes through OrcColumnarBatchReader.nextBatch(), which depends on [ORC|https://github.com/apache/orc] (version 1.8.2) to complete the data copy. The relevant ORC code is as follows:
> {code:java}
> private static byte[] commonReadByteArrays(InStream stream, IntegerReader lengths,
>     LongColumnVector scratchlcv,
>     BytesColumnVector result, final int batchSize) throws IOException {
>   // Read lengths
>   scratchlcv.isRepeating = result.isRepeating;
>   scratchlcv.noNulls = result.noNulls;
>   scratchlcv.isNull = result.isNull; // Notice we are replacing the isNull vector here...
>   lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize);
>   int totalLength = 0;
>   if (!scratchlcv.isRepeating) {
>     for (int i = 0; i < batchSize; i++) {
>       if (!scratchlcv.isNull[i]) {
>         totalLength += (int) scratchlcv.vector[i];
>       }
>     }
>   } else {
>     if (!scratchlcv.isNull[0]) {
>       totalLength = (int) (batchSize * scratchlcv.vector[0]);
>     }
>   }
>   // Read all the strings for this batch
>   byte[] allBytes = new byte[totalLength];
>   int offset = 0;
>   int len = totalLength;
>   while (len > 0) {
>     int bytesRead = stream.read(allBytes, offset, len);
>     if (bytesRead < 0) {
>       throw new EOFException("Can't finish byte read from " + stream);
>     }
>     len -= bytesRead;
>     offset += bytesRead;
>   }
>   return allBytes;
> } {code}
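> As shown above, totalLength is an int that accumulates the total data size of the batch, while the per-row lengths in scratchlcv.vector are long values that are cast to int. If the total size exceeds Integer.MAX_VALUE, the value overflows and becomes negative, so allocating new byte[totalLength] throws the following exception: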
> {code:java}
> Caused by: java.lang.NegativeArraySizeException
> at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998)
> at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021)
> at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119)
> at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
> at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
> at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
> at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
> at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
> at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
> at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
> at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
> ... 20 more {code}
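
The overflow described in the issue can be reproduced without ORC or Spark. The following is a minimal, self-contained sketch, not the ORC implementation and not the change proposed in the pull request: the class name TotalLengthOverflowDemo, the helper methods sumAsInt and sumChecked, and the wording of the error message are all hypothetical. It contrasts the int accumulation used in the quoted code with a long accumulation that fails early and points at spark.sql.orc.columnarReaderBatchSize, which is the kind of friendlier message the issue asks for.

```java
import java.util.Arrays;

public class TotalLengthOverflowDemo {

    // Mirrors the accumulation in the quoted ORC code: each long length is
    // cast to int and added to an int, so the sum can silently wrap around.
    static int sumAsInt(long[] lengths) {
        int total = 0;
        for (long len : lengths) {
            total += (int) len;
        }
        return total;
    }

    // Hypothetical guarded variant (an assumption, not ORC or Spark code):
    // accumulate in a long and fail with a descriptive message before any
    // array is allocated.
    static int sumChecked(long[] lengths) {
        long total = 0L;
        for (long len : lengths) {
            total += len;
        }
        if (total > Integer.MAX_VALUE) {
            throw new IllegalStateException(
                "Batch needs " + total + " bytes, which exceeds the maximum array size; "
                + "consider lowering spark.sql.orc.columnarReaderBatchSize");
        }
        return (int) total;
    }

    public static void main(String[] args) {
        // A batch of 4096 rows, each holding a 600,000-byte string:
        // 2,457,600,000 bytes in total, which is more than Integer.MAX_VALUE.
        long[] lengths = new long[4096];
        Arrays.fill(lengths, 600_000L);

        int wrapped = sumAsInt(lengths);
        System.out.println("int sum: " + wrapped); // a negative number

        try {
            byte[] allBytes = new byte[wrapped];
            System.out.println("allocated " + allBytes.length + " bytes");
        } catch (NegativeArraySizeException e) {
            System.out.println("allocation failed: " + e);
        }

        try {
            sumChecked(lengths);
        } catch (IllegalStateException e) {
            System.out.println("guarded sum failed: " + e.getMessage());
        }
    }
}
```

With a smaller batch the same rows would fit: for 600,000-byte values, any batch of up to 3,579 rows stays below Integer.MAX_VALUE, which is why lowering the batch size avoids the exception.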