You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/09/03 08:54:07 UTC

[GitHub] [iceberg] cmathiesen opened a new pull request #1417: Hive: Add column projection

cmathiesen opened a new pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417


   Hello! This PR is to add column projection to the Hive-Iceberg integration. It follows a similar method to how I added the filter pushdown, by retrieving columns projected via Hive in the `getSplits()` method and adding them to the conf so they can be used in `IcebergInputFormat` when planning tasks. 
   
   I haven't added a new test here as its a bit more difficult to test because its not changing any query results, but I did walk through in debug mode to see if the (correct) columns were projected for all our existing tests, and also checked that it didn't break all the HiveRunner tests...
   
   cc: @rdblue @massdosage @pvary @rdsr @guilload 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#discussion_r483104732



##########
File path: mr/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java
##########
@@ -47,6 +47,7 @@ private InputFormatConfig() {
   public static final String CATALOG = "iceberg.mr.catalog";
   public static final String HADOOP_CATALOG_WAREHOUSE_LOCATION = "iceberg.mr.catalog.hadoop.warehouse.location";
   public static final String CATALOG_LOADER_CLASS = "iceberg.mr.catalog.loader.class";
+  public static final String COLUMN_PROJECTIONS = "iceberg.mr.column.projections";

Review comment:
       The wording we've generally settled on is that `project` is used to set a requested schema and `select` is used to select columns like a `SELECT` statement in SQL. Following that convention, when we pass a schema we call it a "projection" and when we pass columns we call them "selected columns".
   
   I think it would be good to make this more clear, since this is passing columns to project and not a schema. How about `SELECTED_COLUMNS = "iceberg.mr.selected.columns"`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cmathiesen commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
cmathiesen commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-693407801


   Apologies for the delay! I've now updated this with changes that @guilload shared which fails all of the tests that create a local MR job (and select columns) because of the mismatch between the OI and the returned records from Iceberg. I'm also trying to find a solution but haven't come up with anything yet


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-732206161


   Any comments/insights? :)
   + @shardulm94 
   
   Thanks!!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cmathiesen commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
cmathiesen commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-696155568


   Hey @guilload, have you had any luck with solutions for problems with the SerDe? Also pinging @pvary to ask if you have any suggestions for a fix for the problem raised [here](https://github.com/apache/iceberg/pull/1417#issuecomment-686830961)?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-734384294


   Thanks for updating this! I merged it.
   
   Nice work, everyone!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cmathiesen commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
cmathiesen commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-696155568


   Hey @guilload, have you had any luck with solutions for problems with the SerDe? Also pinging @pvary to ask if you have any suggestions for a fix for the problem raised [here](https://github.com/apache/iceberg/pull/1417#issuecomment-686830961)?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-724620893


   > > Hi @cmathiesen - thanks a lot for starting to work on this!
   > > I was just wondering if this development is still in progress? Happy to provide any assistance if you happen to be lower on bandwidth at the moment. Thanks! :)
   > 
   > @marton-bod Thanks for pinging on this one, it's been on my TODO list for a while. I'll take a look at sorting out the merge conflicts today or tomorrow and push that. If you have some time to help out based on the comments above that would be really appreciated. We can add you as an external collaborator if you want to contribute directly to this branch or feel free to raise a PR against it that we merge in to our branch and that will then show up in this PR etc.
   
   Thanks @massdosage . I will start to look into this and will raise a PR against the Expedia branch.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-696382951


   > Hey @guilload, have you had any luck with solutions for problems with the SerDe? Also pinging @pvary to ask if you have any suggestions for a fix for the problem raised [here](https://github.com/apache/iceberg/pull/1417#issuecomment-686830961)?
   
   I have checked the TestInputOutputFormat.testInOutFormat test case in Hive, where we test the ORC column projection:
   https://github.com/apache/hive/blob/f9d6e84143261c55442ef723197f5b55efb80633/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java#L1890
   
   Basically ORC returns the full record structure. The fields which are not requested are null:
   ```
   OrctStruct {1, null}
   ```
   
   Based on this I would assume that we should wrap the row returned by Iceberg to a Record which has the original full schema and returns the values for the columns we have value, and returns null for non-projected columns.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#discussion_r483007545



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java
##########
@@ -61,6 +62,9 @@
       }
     }
 
+    String[] readColumns = ColumnProjectionUtils.getReadColumnNames(job);

Review comment:
       I have seen this in Hive code as a comment in ColumnProjectionUtils.getReadColumnIDs:
   ```
         // NOTE: some code uses this list to correlate with column names, and yet these lists may
         //       contain duplicates, which this call will remove and the other won't. As far as I can
         //       tell, no code will actually use these two methods together; all is good if the code
         //       gets the ID list without relying on this method. Or maybe it just works by magic.
   ```
   Not sure if this comment is still true or it is already fixed. If we might get duplicates I am not sure how Iceberg will behave in this case. Might worth to check. Maybe with some query with different UDFs around the same column?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] guilload commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
guilload commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-686830961


   Hi @cmathiesen,
   
   Forwarding the selected columns in `getSplits` is not sufficient. You can take a look at the Parquet logs to check whether projections are correctly pushed down. You can also log the length (i.e. # of fields) of the records returned by the input format.
   
   Conceptually, here's what's need to be done:
   - pass the selected columns in `getSplits` AND in `getRecordReader`
   - modify the object inspector to match the projected schema
   
   [I have also started working on a PR](https://github.com/apache/iceberg/compare/master...guilload:guilload--hive--projection-pushdown) for this but unfortunately I haven't been able to get to a fully working solution. 
   
   I suspect that when the SerDe is instantiated, the `hive.io.file.readcolumn.names` property is not always set and so the "unprojected" objectinspector does not match the projected records, which manifests itself in failing tests with an array out of bound exception when Hive tries to read a field in a projected record with the mismatched object inspector.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod edited a comment on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
marton-bod edited a comment on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-732206161


   Any comments/insights? :)
   cc @shardulm94 
   Thanks!!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-696382951


   > Hey @guilload, have you had any luck with solutions for problems with the SerDe? Also pinging @pvary to ask if you have any suggestions for a fix for the problem raised [here](https://github.com/apache/iceberg/pull/1417#issuecomment-686830961)?
   
   I have checked the TestInputOutputFormat.testInOutFormat test case in Hive, where we test the ORC column projection:
   https://github.com/apache/hive/blob/f9d6e84143261c55442ef723197f5b55efb80633/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java#L1890
   
   Basically ORC returns the full record structure. The fields which are not requested are null:
   ```
   OrctStruct {1, null}
   ```
   
   Based on this I would assume that we should wrap the row returned by Iceberg to a Record which has the original full schema and returns the values for the columns we have value, and returns null for non-projected columns.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-690757420


   Thanks for reviewing @guilload! It sounds like we should definitely take your suggestion and update this. I think the tests I was suggesting should catch those cases, too. It would be good to add those tests and confirm that the existing PR fails and then incorporate your work to fix it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-686608945


   This looks good, but I'd like to have a few test cases that validate how things could go wrong. For example:
   
   * Selecting out of order columns: `SELECT b, a, c FROM abc_table`
   * Selecting all but the first column, all but a middle column, and all but the last column
   * Selecting the same column twice: `SELECT a, a FROM abc_table`
   
   Also, how is Hive getting the final projection that Iceberg produces?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#discussion_r483935362



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -111,6 +111,10 @@
     if (schemaStr != null) {
       scan.project(SchemaParser.fromJson(schemaStr));
     }
+    String[] columnsToProject = conf.getStrings(InputFormatConfig.COLUMN_PROJECTIONS, null);
+    if (columnsToProject != null) {
+      scan.select(columnsToProject);

Review comment:
       In line with @rdblue's comment above, the usage of `columnsToProject` is potentially confusing given on the difference we've settled on for when to use `project` vs `select`. Additionally, there's a `project` call above on the parsed schema followed by this `select`, so the diffference between the two is especially noticeable here.
   
   In addition to updating the added config's name to `SELECTED_COLUMNS`, can you possibly change `columnsToProject` to be `columnsToSelect` or even simpler, just `selectedColumns`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] massdosage commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
massdosage commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-732228795


   @rdsr if you have time to look at this too that would be much appreciated.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] massdosage commented on a change in pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
massdosage commented on a change in pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#discussion_r483020062



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java
##########
@@ -61,6 +62,9 @@
       }
     }
 
+    String[] readColumns = ColumnProjectionUtils.getReadColumnNames(job);

Review comment:
       We stumbled across that same gem of a comment when we originally implemented this in Hiveberg. This seemed to suggest that this method *won't* return duplicates and thus should be safe to use. We haven't seen any problems so far but yes, if there is some way to trigger the potential duplicates via a Hive query we could add a HiveRunner test for this. Then again, maybe it just works by magic ;)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#discussion_r483934981



##########
File path: mr/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java
##########
@@ -93,6 +94,11 @@ public ConfigBuilder schema(Schema schema) {
       return this;
     }
 
+    public ConfigBuilder select(String[] columns) {

Review comment:
       I believe that this is using `String[]` as the `IcebergInputFormat` has been changed in this PR to call `org.apache.hadoop.conf.Configuration#getStrings` method which returns `String[]`.
   
   However, I agree with the usage of collections (or Lists) vs `String[]`. It should be noted that `org.apache.hadoop.conf.Configuration` also has a `getStringCollection` function that could be called instead. 
   
   Documentation on `getStringCollection`:
   https://hadoop.apache.org/docs/current/api/org/apache/hadoop/conf/Configuration.html#getStringCollection-java.lang.String-




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cmathiesen edited a comment on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
cmathiesen edited a comment on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-693407801


   Apologies for the delay! I've now updated this with changes that @guilload shared which fails all of the tests that create a local MR job (and select columns) because of the mismatch between the OI and the returned records from Iceberg. I'm also trying to find a solution but haven't come up with anything yet. I'm seeing the same problem as Adrien where the Hive property isn't set when the SerDe is initiated 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] massdosage commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
massdosage commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-721245449


   This seems to be the cause of the failing `testJoinTables` test:
   ```
   Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
   	at org.apache.iceberg.data.GenericRecord.get(GenericRecord.java:114)
   	at org.apache.iceberg.mr.hive.serde.objectinspector.IcebergRecordObjectInspector.getStructFieldData(IcebergRecordObjectInspector.java:73)
   ```
   I haven't had time to look into the root cause yet.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-696382951


   > Hey @guilload, have you had any luck with solutions for problems with the SerDe? Also pinging @pvary to ask if you have any suggestions for a fix for the problem raised [here](https://github.com/apache/iceberg/pull/1417#issuecomment-686830961)?
   
   I have checked the TestInputOutputFormat.testInOutFormat test case in Hive, where we test the ORC column projection:
   https://github.com/apache/hive/blob/f9d6e84143261c55442ef723197f5b55efb80633/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java#L1890
   
   Basically ORC returns the full record structure. The fields which are not requested are null:
   ```
   OrctStruct {1, null}
   ```
   
   Based on this I would assume that we should wrap the row returned by Iceberg to a Record which has the original full schema and returns the values for the columns we have value, and returns null for non-projected columns.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-729729304


   Ok, so here's a quick summary: MR was relatively simpler to fix, while Tez needs an upstream fix ([TEZ-4248](https://issues.apache.org/jira/browse/TEZ-4248)) to work when vectorization is enabled. Temporarily for now, vectorized Tez executions have been disabled, and once TEZ-4248 has shipped in the next 0.9.3 release, we can upgrade to that version and remove the assertion.
   
   @rdblue @guilload @pvary can you take another look please? Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-720486160


   Hi @cmathiesen  - thanks a lot for starting to work on this! 
   I was just wondering if this development is still in progress? Happy to provide any assistance if you happen to be lower on bandwidth at the moment. Thanks! :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#discussion_r483105151



##########
File path: mr/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java
##########
@@ -93,6 +94,11 @@ public ConfigBuilder schema(Schema schema) {
       return this;
     }
 
+    public ConfigBuilder select(String[] columns) {

Review comment:
       Could we use a `List` instead of an array of strings here? It's usually easier if Iceberg converts to an array instead of expecting the caller to.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-732304310


   I'll take a look this week, thanks for the update! Glad to hear it is working.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] massdosage commented on pull request #1417: Hive: Add column projection

Posted by GitBox <gi...@apache.org>.
massdosage commented on pull request #1417:
URL: https://github.com/apache/iceberg/pull/1417#issuecomment-720538250






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org