Posted to jira@arrow.apache.org by "Georeth Zhou (Jira)" <ji...@apache.org> on 2022/10/31 12:44:00 UTC

[jira] [Updated] (ARROW-18198) IndexOutOfBoundsException when loading compressed IPC format

     [ https://issues.apache.org/jira/browse/ARROW-18198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Georeth Zhou updated ARROW-18198:
---------------------------------
    Environment: 
Linux and Windows.
Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)

  was:
Linux and Windows.
Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
Pandas 1.4.2. (anaconda3-2022.05)


> IndexOutOfBoundsException when loading compressed IPC format
> ------------------------------------------------------------
>
>                 Key: ARROW-18198
>                 URL: https://issues.apache.org/jira/browse/ARROW-18198
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 4.0.1, 9.0.0, 10.0.0
>         Environment: Linux and Windows.
> Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
> Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)
>            Reporter: Georeth Zhou
>            Priority: Major
>
> I encountered this bug when I loaded a dataframe stored in the Arrow IPC format.
>  
> {code:java}
> // Java code from the "Apache Arrow Java Cookbook"
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import org.apache.arrow.memory.BufferAllocator;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.ipc.ArrowFileReader;
> import org.apache.arrow.vector.ipc.message.ArrowBlock;
>
> File file = new File("example.arrow");
> try (
>         BufferAllocator rootAllocator = new RootAllocator();
>         FileInputStream fileInputStream = new FileInputStream(file);
>         ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
> ) {
>     System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
>     for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
>         reader.loadRecordBatch(arrowBlock);
>         VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
>         System.out.print(vectorSchemaRootRecover.contentToTSVString());
>     }
> } catch (IOException e) {
>     e.printStackTrace();
> } {code}
> Call stack:
> {noformat}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
>     at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
>     at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
>     at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
>     at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
>     at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
>     at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
>     at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197){noformat}
> This bug can be reproduced by a simple dataframe created by pandas:
>  
> {code:java}
> pd.DataFrame({'a': range(10000)}).to_feather('example.arrow') {code}
> Pandas compresses the dataframe by default. If compression is turned off, Java can load the dataframe, so I suspect the bounds-checking code is buggy when loading compressed files.
>  
> That dataframe can be loaded by polars, pandas, and pyarrow, so it is unlikely to be a pandas bug.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)