Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/06 22:51:26 UTC

[GitHub] [iceberg] HotSushi opened a new pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

HotSushi opened a new pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557


   Hive queries that spawn MapReduce jobs currently fail on live YARN clusters with the following stack trace:
   ```
   2020-10-02 23:37:01,507 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.lang.NullPointerException
   	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
   	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
   	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:258)
   	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:705)
   	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
   	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
   	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
   	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
   	at java.security.AccessController.doPrivileged(Native Method)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
   	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
   Caused by: java.lang.NullPointerException
   	at java.util.Objects.requireNonNull(Objects.java:203)
   	at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2296)
   	at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:111)
   	at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:54)
   	at org.apache.iceberg.SchemaParser.fromJson(SchemaParser.java:247)
   	at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.initialize(IcebergInputFormat.java:176)
   	at org.apache.iceberg.mr.mapred.MapredIcebergInputFormat$MapredIcebergRecordReader.<init>(MapredIcebergInputFormat.java:92)
   	at org.apache.iceberg.mr.mapred.MapredIcebergInputFormat.getRecordReader(MapredIcebergInputFormat.java:78)
   	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:255)
   	... 9 more
   ```
   This error occurs on the mappers because job configurations such as `TABLE_SCHEMA`, `TABLE_LOCATION`, and `TABLE_IDENTIFIER` are not set correctly. The failure occurs [here](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L184).
   
   With this PR, the job configs are set correctly and the MapReduce job gets past the stage that previously failed.
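
   To illustrate the failure mode in the stack trace, here is a minimal sketch using only the JDK. It is not the Iceberg code: a plain `Map` stands in for the Hadoop job configuration, a `ConcurrentHashMap` stands in for the shaded Caffeine cache, and the class name, key string, and `parseSchema`/`readSchema` helpers are all hypothetical. It only shows that looking up an unset schema property yields `null`, and that passing `null` as a cache key fails with a `NullPointerException`.

   ```java
   import java.util.HashMap;
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;

   public class MissingSchemaSketch {
     // Illustrative key name standing in for the TABLE_SCHEMA job configuration (assumption).
     static final String TABLE_SCHEMA = "iceberg.mr.table.schema";

     // Stand-in for a parsed-schema cache keyed by the schema JSON string.
     static final Map<String, String> SCHEMA_CACHE = new ConcurrentHashMap<>();

     // Hypothetical parser: like many caches, computeIfAbsent rejects a null key
     // with a NullPointerException.
     static String parseSchema(String json) {
       return SCHEMA_CACHE.computeIfAbsent(json, j -> "parsed:" + j);
     }

     // Reads the schema from the job configuration; returns null from get()
     // when the property was never set, which then fails inside parseSchema.
     static String readSchema(Map<String, String> jobConf) {
       return parseSchema(jobConf.get(TABLE_SCHEMA));
     }

     public static void main(String[] args) {
       Map<String, String> jobConf = new HashMap<>();

       try {
         readSchema(jobConf); // property missing, as on the failing mappers
       } catch (NullPointerException e) {
         System.out.println("schema missing from job conf: NullPointerException");
       }

       // Once the property is present in the job configuration, the lookup succeeds.
       jobConf.put(TABLE_SCHEMA, "{\"type\":\"struct\",\"fields\":[]}");
       System.out.println(readSchema(jobConf));
     }
   }
   ```

   In this sketch the exception surfaces far from the place where the property should have been set, which matches how the trace above points at the cache lookup rather than at the job setup.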
   
   I'm not sure why this error is not reproducible in HiveRunner unit tests.
   
   cc: @shardulm94, @omalley


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557#issuecomment-705131982


   > Queries which can run on the driver and don't spawn MR jobs succeed; the problem is only faced by queries such as DESC which need MR jobs.
   
   Makes sense.
   It would be good to have a test case to prevent regression.
   Can we provide a test case that fails before the fix and passes after it?
   HiveIcebergStorageHandlerBaseTest.testJoinTables might be a good starting point.
   
   Thanks, Peter 
   



[GitHub] [iceberg] pvary commented on pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557#issuecomment-704707218


   > This error occurs on the mappers and the reason for this failure is that the job configurations such as `TABLE_SCHEMA`, `TABLE_LOCATION`, `TABLE_IDENTIFIER` are not set correctly.
   
   Which version of Hive are you using? Or is this query dependent?
   
   Thanks for spotting the issue! 
   Peter



[GitHub] [iceberg] massdosage edited a comment on pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

Posted by GitBox <gi...@apache.org>.

massdosage edited a comment on pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557#issuecomment-705584192


   > Looks reasonable to me, but will this affect jobs that run multiple scans in a single MR stage?
   > 
   > @massdosage, do we have HiveRunner tests for joins that run two table scans in a stage?
   
   I think this does it: https://github.com/ExpediaGroup/iceberg/blob/master/mr/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandlerBaseTest.java#L145 as @pvary mentioned above. I know this caught an issue in the past where multiple tables were not configured properly for the scan, but it's possible it doesn't cover every case that can occur.



[GitHub] [iceberg] HotSushi commented on pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

Posted by GitBox <gi...@apache.org>.

HotSushi commented on pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557#issuecomment-705123960


   @pvary we're using Hive 1.1, but I was not able to find any difference in the relevant Hive 2 code that calls `configureJobConf` or `configureInputJobProperties`.
   
   Queries which can run on the driver and don't spawn MR jobs succeed; the problem is only faced by queries such as DESC which need MR jobs.



[GitHub] [iceberg] massdosage commented on pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

Posted by GitBox <gi...@apache.org>.

massdosage commented on pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557#issuecomment-705584192


   > Looks reasonable to me, but will this affect jobs that run multiple scans in a single MR stage?
   > 
   > @massdosage, do we have HiveRunner tests for joins that run two table scans in a stage?
   
   I think this does it: https://github.com/ExpediaGroup/iceberg/blob/master/mr/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandlerBaseTest.java#L145



[GitHub] [iceberg] rdblue commented on pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557#issuecomment-705152103


   Looks reasonable to me, but will this affect jobs that run multiple scans in a single MR stage?
   
   @massdosage, do we have HiveRunner tests for joins that run two table scans in a stage?



[GitHub] [iceberg] rdblue commented on pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557#issuecomment-705708847


   Okay, if we do have a test case that does a simple join, then I think this should be okay. It doesn't sound like we can reproduce the issue with the newer Hive versions, though, so I'll merge this without adding a test for it.



[GitHub] [iceberg] rdblue merged pull request #1557: Hive: Fix for missing table schema in map reduce job configurations

Posted by GitBox <gi...@apache.org>.

rdblue merged pull request #1557:
URL: https://github.com/apache/iceberg/pull/1557


   


