Posted to issues@spark.apache.org by "Ala Luszczak (JIRA)" <ji...@apache.org> on 2018/05/01 08:24:00 UTC
[jira] [Created] (SPARK-24133) Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
Ala Luszczak created SPARK-24133:
------------------------------------
Summary: Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
Key: SPARK-24133
URL: https://issues.apache.org/jira/browse/SPARK-24133
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Ala Luszczak
ColumnVectors store string data in one big byte array. Since the array size is capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store more than 2GB of string data.
However, because Parquet files commonly contain large blobs stored as strings, and a ColumnVector by default holds 4096 values, it is entirely possible to exceed that limit.
In such cases a negative capacity is requested from WritableColumnVector.reserve(). The call succeeds (the requested capacity, having wrapped around to a negative value, is smaller than the capacity already allocated), and consequently a java.lang.ArrayIndexOutOfBoundsException is thrown when the reader actually attempts to put the data into the array.
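The overflow itself is plain 32-bit int arithmetic. A minimal standalone illustration (not Spark code; the ~600 KB per-value size is a made-up figure chosen so that 4096 values exceed 2GB):

```java
public class CapacityOverflowDemo {
    public static void main(String[] args) {
        int batchSize = 4096;          // default ColumnarBatch row count
        int bytesPerValue = 600_000;   // hypothetical ~600 KB of string data per value

        // 4096 * 600_000 = 2_457_600_000, which exceeds Integer.MAX_VALUE
        // (2_147_483_647) and wraps around to a negative int.
        int requested = batchSize * bytesPerValue;
        long actual = (long) batchSize * bytesPerValue;

        System.out.println("requested (int): " + requested); // negative after wraparound
        System.out.println("actual (long):   " + actual);    // 2457600000
    }
}
```

A reserve() call receiving such a negative value compares it against the (positive) allocated size and concludes no growth is needed.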
This behavior is hard for users to troubleshoot. Spark should instead check for a negative requested capacity in WritableColumnVector.reserve() and throw a more informative error, instructing the user to reduce the ColumnarBatch size.
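The proposed guard could be sketched roughly as follows. This is an illustrative sketch only, not the actual Spark patch: the class name, field, growth policy, and error wording are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of the check proposed in this report: fail fast in
// reserve() when upstream int overflow produced a negative requested
// capacity, instead of letting a later array write fail with an opaque
// ArrayIndexOutOfBoundsException.
public class ColumnVectorSketch {
    private byte[] byteData = new byte[64];

    public void reserve(int requiredCapacity) {
        if (requiredCapacity < 0) {
            // A negative capacity can only arise from int overflow upstream;
            // surface it with actionable advice rather than succeeding silently.
            throw new RuntimeException(
                "Cannot reserve " + requiredCapacity + " bytes of string data: "
                + "the requested capacity overflowed. As a workaround, reduce "
                + "the ColumnarBatch size.");
        }
        if (requiredCapacity > byteData.length) {
            // Grow geometrically, as typical growable-buffer implementations do.
            byteData = Arrays.copyOf(byteData,
                Math.max(requiredCapacity, byteData.length * 2));
        }
    }

    public int capacity() {
        return byteData.length;
    }
}
```

With this guard in place the reader fails at the reserve() call with a message that names the cause, rather than later at an arbitrary array index.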
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org