You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/15 04:46:44 UTC

[GitHub] [iceberg] shardulm94 opened a new pull request #1207: ORC: Support row position as a metadata column

shardulm94 opened a new pull request #1207:
URL: https://github.com/apache/iceberg/pull/1207


   Completes a part of #1021
   
   Adds support to output the row position within the file as a `_pos` column in the data record through ORC readers when the metadata column is provided in the expected schema 
   Main changes:
   - Add a `MetadataColumns` class to `iceberg-core` which will define the metadata columns supported by Iceberg
   - Expose row batch offset in the file to `OrcValueReader`s so that the we can create a new ValueReader which returns the row position
   - Add logic in the `StructReader` to check if a field is a metadata field and provide the correct reader for it


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on a change in pull request #1207: ORC: Support row position as a metadata column

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on a change in pull request #1207:
URL: https://github.com/apache/iceberg/pull/1207#discussion_r455402002



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkOrcReaders.java
##########
@@ -378,16 +395,20 @@ private StructConverter(Types.StructType structType, List<Converter> fieldConver
     }
 
     @Override
-    public ColumnVector convert(org.apache.orc.storage.ql.exec.vector.ColumnVector vector, int batchSize) {
+    public ColumnVector convert(org.apache.orc.storage.ql.exec.vector.ColumnVector vector, int batchSize,
+                                long batchOffsetInFile) {
       StructColumnVector structVector = (StructColumnVector) vector;
       List<Types.NestedField> fields = structType.fields();
       List<ColumnVector> fieldVectors = Lists.newArrayListWithExpectedSize(fields.size());
       for (int pos = 0, vectorIndex = 0; pos < fields.size(); pos += 1) {
         Types.NestedField field = fields.get(pos);
         if (idToConstant.containsKey(field.fieldId())) {
           fieldVectors.add(new ConstantColumnVector(field.type(), batchSize, idToConstant.get(field.fieldId())));
+        } else if (field.equals(MetadataColumns.ROW_POSITION)) {

Review comment:
       I am checking the current field with the ROW_POSITION field defined in MetadataColumns, we can probably check just the field ID though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on a change in pull request #1207: ORC: Support row position as a metadata column

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on a change in pull request #1207:
URL: https://github.com/apache/iceberg/pull/1207#discussion_r455401422



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcRowReader.java
##########
@@ -29,6 +29,6 @@
   /**
    * Reads a row.
    */
-  T read(VectorizedRowBatch batch, int row);
+  T read(VectorizedRowBatch batch, long batchOffsetInFile, int rowOffsetInBatch);

Review comment:
       Thanks for the suggestion! This reduces the code changes significantly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1207: ORC: Support row position as a metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1207:
URL: https://github.com/apache/iceberg/pull/1207#discussion_r455325073



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcRowReader.java
##########
@@ -29,6 +29,6 @@
   /**
    * Reads a row.
    */
-  T read(VectorizedRowBatch batch, int row);
+  T read(VectorizedRowBatch batch, long batchOffsetInFile, int rowOffsetInBatch);

Review comment:
       This seems to introduce a lot of code churn, when most implementations don't use `batchOffsetInFile`. What about a less intrusive way of passing this by using a context method that is called once for each batch?
   
   Parquet has something similar, where each row group causes new context to be passed to the readers: https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReader.java#L32
   
   This could expose a method like `setBatchContext(long batchOffsetInFile)` with a no-op default. Then only a few implementations would need to change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1207: ORC: Support row position as a metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1207:
URL: https://github.com/apache/iceberg/pull/1207#issuecomment-661323094


   Looks good to me. Thanks for updating to the new approach!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1207: ORC: Support row position as a metadata column

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1207:
URL: https://github.com/apache/iceberg/pull/1207


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org