Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/17 11:59:44 UTC

[GitHub] [iceberg] lirui-apache opened a new issue #3139: Read ORC table with nested partition column can lead to ArrayIndexOutOfBoundsException

lirui-apache opened a new issue #3139:
URL: https://github.com/apache/iceberg/issues/3139


   This can be reproduced by modifying the test `TestPartitionValues::testPartitionedByNestedString` so that it writes in ORC format instead of Parquet, e.g. with the following change:
   ```java
       // write into iceberg
       sourceDF.write()
           .format("iceberg")
           .option(WRITE_FORMAT, format) // add this line
           .mode(SaveMode.Append)
           .save(baseLocation);
   ```
   
   The test then fails with:
   ```text
   java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
   	at java.util.ArrayList.rangeCheck(ArrayList.java:659)
   	at java.util.ArrayList.get(ArrayList.java:435)
   	at org.apache.iceberg.orc.OrcValueReaders$StructReader.<init>(OrcValueReaders.java:161)
   	at org.apache.iceberg.spark.data.SparkOrcValueReaders$StructReader.<init>(SparkOrcValueReaders.java:143)
   	at org.apache.iceberg.spark.data.SparkOrcValueReaders.struct(SparkOrcValueReaders.java:70)
   	at org.apache.iceberg.spark.data.SparkOrcReader$ReadBuilder.record(SparkOrcReader.java:75)
   	at org.apache.iceberg.spark.data.SparkOrcReader$ReadBuilder.record(SparkOrcReader.java:65)
   	at org.apache.iceberg.orc.OrcSchemaWithTypeVisitor.visitRecord(OrcSchemaWithTypeVisitor.java:71)
   	at org.apache.iceberg.orc.OrcSchemaWithTypeVisitor.visit(OrcSchemaWithTypeVisitor.java:38)
   	at org.apache.iceberg.orc.OrcSchemaWithTypeVisitor.visit(OrcSchemaWithTypeVisitor.java:32)
   	at org.apache.iceberg.spark.data.SparkOrcReader.<init>(SparkOrcReader.java:52)
   	at org.apache.iceberg.spark.source.RowDataReader.lambda$newOrcIterable$2(RowDataReader.java:164)
   	at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:108)
   	at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:45)
   	at org.apache.iceberg.util.Filter.lambda$filter$0(Filter.java:35)
   	at org.apache.iceberg.io.CloseableIterable$2.iterator(CloseableIterable.java:73)
   	at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:78)
   ```
   In `RowDataReader::newOrcIterable`, we exclude the struct field and create an `OrcIterable` with an empty schema, because the inner string is a constant (the partition value). The exception is then thrown later, when the readers are constructed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on issue #3139: Read ORC table with nested partition column can lead to ArrayIndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #3139:
URL: https://github.com/apache/iceberg/issues/3139#issuecomment-924525134


   Interesting, thanks for the report @lirui-apache. Would you like to propose a PR for this bug?




[GitHub] [iceberg] lirui-apache commented on issue #3139: Read ORC table with nested partition column can lead to ArrayIndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
lirui-apache commented on issue #3139:
URL: https://github.com/apache/iceberg/issues/3139#issuecomment-925485989


   Hey @openinx, I haven't decided on the right way to fix this yet. A possible solution is to improve how we generate `idToConstant` in `RowDataReader::open`. Currently we only put metadata columns and partition values into that map. Perhaps we can also do a DFS walk of the required schema, and if all nested fields of a compound type are constants, then the compound column itself can be considered a constant too. Does that make sense to you?
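   
   A minimal sketch of the DFS idea, to make the proposal concrete. It does not use Iceberg's classes: `Field`, `collectConstants`, and `constantIds` are hypothetical stand-ins invented for illustration, and the set of known constant ids stands in for the keys of `idToConstant`. A struct is marked constant only when every one of its children is constant.
   
   ```java
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.List;
   import java.util.Set;
   import java.util.TreeSet;
   
   public class ConstantStructSketch {
     // Hypothetical stand-in for a schema field; this is NOT Iceberg's Types.NestedField.
     static final class Field {
       final int id;
       final String name;
       final List<Field> children; // empty for primitive fields
   
       Field(int id, String name, Field... children) {
         this.id = id;
         this.name = name;
         this.children = Arrays.asList(children);
       }
     }
   
     // Post-order DFS. Returns true if the field is constant and records its id in result.
     static boolean collectConstants(Field field, Set<Integer> knownConstantIds, Set<Integer> result) {
       if (knownConstantIds.contains(field.id)) {
         result.add(field.id);
         return true;
       }
       if (field.children.isEmpty()) {
         // A primitive (or childless) field that is not a known constant must be read from the file.
         return false;
       }
       boolean allConstant = true;
       for (Field child : field.children) {
         // Non-short-circuit '&' so every child is visited and nested constant structs are recorded.
         allConstant &= collectConstants(child, knownConstantIds, result);
       }
       if (allConstant) {
         result.add(field.id);
       }
       return allConstant;
     }
   
     // Walks all top-level columns and returns the ids that can be treated as constants.
     static Set<Integer> constantIds(List<Field> columns, Set<Integer> knownConstantIds) {
       Set<Integer> result = new TreeSet<>();
       for (Field column : columns) {
         collectConstants(column, knownConstantIds, result);
       }
       return result;
     }
   
     public static void main(String[] args) {
       // Struct "nested" (id 1) whose only child "data" (id 2) is a partition value.
       Field nested = new Field(1, "nested", new Field(2, "data"));
       Field id = new Field(3, "id");
       Set<Integer> partitionIds = Collections.singleton(2);
       System.out.println(constantIds(Arrays.asList(nested, id), partitionIds)); // prints [1, 2]
     }
   }
   ```
   
   With the struct id in the result, the reader could in principle treat the whole struct as a constant instead of asking ORC for an empty projection. How the constant struct value itself would be built is left open here.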




[GitHub] [iceberg] zhangminglei commented on issue #3139: Read ORC table with nested partition column can lead to ArrayIndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
zhangminglei commented on issue #3139:
URL: https://github.com/apache/iceberg/issues/3139#issuecomment-927855682


   https://github.com/apache/iceberg/pull/3186
   
   @openinx Hi, I've opened a PR to fix this. Could you please review it?




[GitHub] [iceberg] lirui-apache commented on issue #3139: Read ORC table with nested partition column can lead to ArrayIndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
lirui-apache commented on issue #3139:
URL: https://github.com/apache/iceberg/issues/3139#issuecomment-926299889


   @zhangminglei Sure, feel free to propose a PR.




[GitHub] [iceberg] zhangminglei commented on issue #3139: Read ORC table with nested partition column can lead to ArrayIndexOutOfBoundsException

Posted by GitBox <gi...@apache.org>.
zhangminglei commented on issue #3139:
URL: https://github.com/apache/iceberg/issues/3139#issuecomment-925504787


   @lirui-apache Could I take this issue? I would like to try fixing it.

