Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/10 00:45:24 UTC

[GitHub] [iceberg] rdblue commented on issue #1021: Add _file and _pos metadata columns to ORC readers

rdblue commented on issue #1021:
URL: https://github.com/apache/iceberg/issues/1021#issuecomment-656420014


   > How does the ORC reader know that it has to project metadata columns? Will it be part of the expected Iceberg schema provided to the readers?
   
   Yes. Iceberg will pass in a schema that has the metadata column. We have an example implementation in our current branch with [the column definitions](https://github.com/Netflix/iceberg/blob/netflix-spark-2.4/spark/src/main/java/org/apache/iceberg/spark/source/MetadataColumns.java) and an [implementation for `_file`](https://github.com/Netflix/iceberg/blob/netflix-spark-2.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L410-L458) that uses a joined row (the way we used to handle identity partition values).
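
   For reference, a rough sketch of what such metadata column definitions could look like using Iceberg's `Types.NestedField`; the reserved field IDs and class layout here are illustrative, not necessarily what the linked branch uses:

   ```java
   import org.apache.iceberg.types.Types;

   // Illustrative metadata column definitions; IDs are taken from the top of
   // the int range so they cannot collide with regular data column IDs.
   public class MetadataColumns {
     public static final Types.NestedField FILE_PATH =
         Types.NestedField.required(Integer.MAX_VALUE - 1, "_file", Types.StringType.get());
     public static final Types.NestedField ROW_POSITION =
         Types.NestedField.required(Integer.MAX_VALUE - 2, "_pos", Types.LongType.get());

     private MetadataColumns() {
     }
   }
   ```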
   
   We would add detection to the value reader builder that injects the `_pos` reader.
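
   For illustration, that detection could key off the reserved `_pos` field and substitute a position-based reader instead of resolving a physical ORC column. The method names below are hypothetical placeholders, and `RowPositionReader` is sketched after the next quote:

   ```java
   import org.apache.iceberg.types.Types;
   import org.apache.orc.TypeDescription;

   // Hypothetical hook in the value reader builder: the reserved _pos field has
   // no backing ORC column, so inject a position reader for it instead.
   // readerForColumn() stands in for the builder's normal per-column path.
   OrcValueReader<?> readerFor(Types.NestedField field, TypeDescription orcType) {
     if ("_pos".equals(field.name())) {
       return new RowPositionReader();
     }
     return readerForColumn(field, orcType);
   }
   ```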
   
   > I was thinking of passing the starting position of the first row in each VectorizedRowBatch as part of the OrcValueReader interface and then creating a new OrcValueReader that returns baseOffset + currentRowIndex for every row.
   
   Sounds reasonable to me.
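
   A minimal sketch of that idea, with a simplified stand-in for the `OrcValueReader` interface (the real interface and hook names may differ):

   ```java
   import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;

   // Simplified stand-in for OrcValueReader with the proposed batch-offset hook.
   interface OrcValueReader<T> {
     T read(ColumnVector vector, int row);

     // proposed: tell the reader where the current VectorizedRowBatch starts in the file
     default void setBatchContext(long batchOffsetInFile) {
     }
   }

   // Reader for _pos: ignores the column vector and returns
   // baseOffset + currentRowIndex for every row in the batch.
   class RowPositionReader implements OrcValueReader<Long> {
     private long batchOffsetInFile = 0L;

     @Override
     public void setBatchContext(long batchOffsetInFile) {
       this.batchOffsetInFile = batchOffsetInFile;
     }

     @Override
     public Long read(ColumnVector ignored, int row) {
       return batchOffsetInFile + row;
     }
   }
   ```

   The row iterator would call `setBatchContext` with the batch's starting row position (e.g. the running row count) before handing each batch to the readers.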

