You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/28 03:48:12 UTC

[GitHub] [iceberg] qphien opened a new pull request #2171: Hive: Fix schema projection issue on Tez

qphien opened a new pull request #2171:
URL: https://github.com/apache/iceberg/pull/2171


   With [this](https://github.com/apache/hive/blob/113f6af7528f016bf821f7a746bad496cc93f834/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordProcessor.java#L173-L181), `hive.io.file.readcolumn.names` is not overlayed from  `HiveInputforma.jobConf` to `MapOperator.jobConf`. `MapOperator.jobConf` does not has corresponding projection columns, when `HiveIcebergSerde` is initialized, inspector contains all table fields is created.
   
   Due to wrong inspector, `ArrayIndexOutOfBoundsException` will be raised when partial fields are selected.
   
   This also fix vectorization read issue in https://github.com/apache/iceberg/issues/2120
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] qphien commented on pull request #2171: Hive: Fix schema projection issue on Tez

Posted by GitBox <gi...@apache.org>.
qphien commented on pull request #2171:
URL: https://github.com/apache/iceberg/pull/2171#issuecomment-769705387


   @pvary Thanks for your reply.  I agree that creating an extra record is not a good way to solve vectorization problem in hive, we need to find another way to solve this problem.
   
   [TEZ-4248](https://issues.apache.org/jira/browse/TEZ-4248) is a better way to solve  projection issue on tez.  BTW, do we need to change [this assert](https://github.com/apache/iceberg/blob/19622dcfcb426485748fa017a6181e23df5732dc/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L112-L118) currently since this issue happens no matter whether `hive.vectorized.execution.enabled` is enabled ?
   
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] qphien commented on pull request #2171: Hive: Fix schema projection issue on Tez

Posted by GitBox <gi...@apache.org>.
qphien commented on pull request #2171:
URL: https://github.com/apache/iceberg/pull/2171#issuecomment-770616220


   Close the PR. TEZ-4248 is a better way to solve the problem. The problem about vectorization on mr will be discussed in https://github.com/apache/iceberg/issues/2120.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] qphien commented on pull request #2171: Hive: Fix schema projection issue on Tez

Posted by GitBox <gi...@apache.org>.
qphien commented on pull request #2171:
URL: https://github.com/apache/iceberg/pull/2171#issuecomment-769742755


   ```
     public void testScanTable() throws IOException {
       shell.setHiveSessionValue("hive.vectorized.execution.enabled", "false");
   
       testTables.createTable(shell, "customers", HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA, fileFormat,
           HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS);
   
       // Adding the ORDER BY clause will cause Hive to spawn a local MR job this time.
       List<Object[]> descRows =
           shell.executeStatement("SELECT last_name, customer_id FROM default.customers ORDER BY customer_id DESC");
   
       Assert.assertEquals(3, descRows.size());
       Assert.assertArrayEquals(new Object[] {"Pink", 2L}, descRows.get(0));
       Assert.assertArrayEquals(new Object[] {"Green", 1L}, descRows.get(1));
       Assert.assertArrayEquals(new Object[] {"Brown", 0L}, descRows.get(2));
     }
   ```
   
   
   With vectorization disabled, this modified `testScanTable` test still throws the same exception due to missing `hive.io.file.readcolumn.names` in `MapOperator.jonConf`:
   ```
   Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
   	at org.apache.iceberg.data.GenericRecord.get(GenericRecord.java:114)
   	at org.apache.iceberg.mr.hive.serde.objectinspector.IcebergRecordObjectInspector.getStructFieldData(IcebergRecordObjectInspector.java:75)
   	at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator._evaluate(ExprNodeColumnEvaluator.java:95)
   	at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:80)
   	at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:68)
   	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
   	... 25 more
   ```
   
   
   We can add `conf.set("tez.mrreader.config.update.properties", "hive.io.file.readcolumn.names")` in `HiveIcebergSerde.initialize()` or some other place with TEZ-4248 to solve this problem.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #2171: Hive: Fix schema projection issue on Tez

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2171:
URL: https://github.com/apache/iceberg/pull/2171#issuecomment-769711970


   > BTW, do we need to change [this assert](https://github.com/apache/iceberg/blob/19622dcfcb426485748fa017a6181e23df5732dc/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L112-L118) currently since this issue happens no matter whether `hive.vectorized.execution.enabled` is enabled ?
   
   Testing vectorization is hard. Even if we turn on vectorization we have to check the output of `EXPLAIN VECTORIZATION <QUERY>` to check if the vectorization is actually happening or not, but if you can add a test case which fails even with vectorization disabled that's a good indicator that we definitely need TEZ-4248 or some other solution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #2171: Hive: Fix schema projection issue on Tez

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2171:
URL: https://github.com/apache/iceberg/pull/2171#issuecomment-769674534


   @qphien: We have some internal deadlines end of this week, so we will come back to you in more details on the next week.
   
   In the meantime we had discussed this briefly with @marton-bod.
   
   For the current solution I have the following concerns:
   - This issues is better solved by [TEZ-4248](https://issues.apache.org/jira/browse/TEZ-4248). Since it is not yet in any release I understand why you are looking for solution in Iceberg.
   - Creating an extra `Record` object for every row in a table can be costly. We want to avoid running any extra lines of code on a codepath which runs on every row. And minimally we want to avoid creating objects on that codepath. I haven't checked but we might want to push down the stuff to Record generation.
   - When we solve the vectorization problem we can enable it in the tests here: https://github.com/apache/iceberg/blob/19622dcfcb426485748fa017a6181e23df5732dc/mr/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandlerTestUtils.java#L91-L93
   
   Thanks,
   Peter


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] qphien closed pull request #2171: Hive: Fix schema projection issue on Tez

Posted by GitBox <gi...@apache.org>.
qphien closed pull request #2171:
URL: https://github.com/apache/iceberg/pull/2171


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary edited a comment on pull request #2171: Hive: Fix schema projection issue on Tez

Posted by GitBox <gi...@apache.org>.
pvary edited a comment on pull request #2171:
URL: https://github.com/apache/iceberg/pull/2171#issuecomment-769711970


   > BTW, do we need to change [this assert](https://github.com/apache/iceberg/blob/19622dcfcb426485748fa017a6181e23df5732dc/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L112-L118) currently since this issue happens no matter whether `hive.vectorized.execution.enabled` is enabled ?
   
   Testing vectorization is hard. Even if we turn on vectorization we have to check the output of `EXPLAIN VECTORIZATION <QUERY>` to check if the vectorization is actually happening or not, but if you can add a test case which fails even with vectorization disabled that's a good indicator that we definitely need TEZ-4248 or some other solution, and helps investigating the problem


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org