You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/08 21:53:46 UTC

[GitHub] [iceberg] HotSushi opened a new pull request #1564: Hive: Don't use catalog to initialize serde on mappers

HotSushi opened a new pull request #1564:
URL: https://github.com/apache/iceberg/pull/1564


   Hive when it spawns Map Reduce jobs, needs to initialize `HiveIcebergSerDe` on mappers. When HiveIcebergSerDe is initialized, [a catalog instance](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L41)  is created and queried for table schema. This doesn't seem like a good behavior, considering that there can be 1000s of mappers, and that can overload the catalog, for example a hive metastore. 
   
   I accidentally came across this behavior while trying to run it on a yarn cluster, the mappers didn't have access to HiveMetaStore classes ([see the error](https://gist.github.com/HotSushi/fe86bdfe576138aa53d1b6cf4b12a24b)). The reason this is not reproducible in HiveRunner unit tests, is because the classpath already contains HiveMetaStore classes.
   
   With this PR, the jobconf will be checked first to see if serialized schema is already present, if so use that instead of querying the catalog. 
   
   Note that we can't completely get rid of the catalog call because `HiveIcebergSerDe` is also initialized during query analysis time when jobconf is not set (see the stacktrace if we get rid of the call [here](https://gist.github.com/HotSushi/33d8c7bdd59e7dbda202c3172a0be186)). 
   
   I didn't write any tests with this PR because i'm not sure how classpath changes can be tested in HiveRunner, but any ideas are welcome. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1564: Hive: Don't use catalog to initialize serde on mappers

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1564:
URL: https://github.com/apache/iceberg/pull/1564


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1564: Hive: Don't use catalog to initialize serde on mappers

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1564:
URL: https://github.com/apache/iceberg/pull/1564#issuecomment-706611790


   I think existing tests should be sufficient to cover this case, since they will definitely load the serde on mappers. Thanks for fixing this, @HotSushi, good catch!
   
   FYI @massdosage, @pvary, and @guilload.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HotSushi commented on pull request #1564: Hive: Don't use catalog to initialize serde on mappers

Posted by GitBox <gi...@apache.org>.
HotSushi commented on pull request #1564:
URL: https://github.com/apache/iceberg/pull/1564#issuecomment-708040409


   H @pvary, properties set in HiveIcebergStorageHandler.configureInputJobProperties() should propagate to the configuration object. Hive does copying of properties [here](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L408)  
   Another way to set would be through HiveIcebergStorageHandler.configureJobConf() . Were the properties still not transferred after this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #1564: Hive: Don't use catalog to initialize serde on mappers

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #1564:
URL: https://github.com/apache/iceberg/pull/1564#issuecomment-708167706


   > H @pvary, properties set in HiveIcebergStorageHandler.configureInputJobProperties() should propagate to the configuration object. Hive does copying of properties [here](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L408)
   > Another way to set would be through HiveIcebergStorageHandler.configureJobConf() . Were the properties still not transferred after this?
   
   @HotSushi: Thanks for the quick answer! In my last patch I try to load the schema from the `conf` and if not successful then from the `serDeProperties`. This means then one of the checks is unnecessary and I can remove it. I was just afraid that this is again some Hive 1.1 difference which we have to handle.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #1564: Hive: Don't use catalog to initialize serde on mappers

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #1564:
URL: https://github.com/apache/iceberg/pull/1564#issuecomment-707074548


   @HotSushi: Quick question: How do we populate the data in the configuration?
   We found the same issue with @rdblue during the reviews of #1495, and the solution there was to use the serDeProperties to carry over the data to the Mappers. With Hive HiveIcebergStorageHandler.configureInputJobProperties() will allow us to automatically add the schema to the properties. I am not sure how we can do this if we want this info travel in the configuration object.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org