Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/18 17:24:12 UTC

[GitHub] [iceberg] kbendick opened a new issue #2607: Allow pass through of catalog options to the Hadoop configuration

kbendick opened a new issue #2607:
URL: https://github.com/apache/iceberg/issues/2607


   We have users who are trying to use multiple catalogs, but because Spark treats the session catalog differently (and we pull the Hadoop configuration off of the spark context), users cannot pass things like different kerberos principals (even though we do allow for some catalog overrides that users can 
   
   If users want to access data on another HDFS cluster that has a different authority, they cannot do so, because we pull the underlying Hadoop configuration from the Spark session.
   
   Some example configurations that users cannot currently override include:
   
   ```
   "spark.hadoop.hive.metastore.connect.retries"
   "spark.hadoop.hive.metastore.kerberos.principal"
   "spark.hadoop.hive.metastore.sasl.enabled"
   ```
   
   This might be a Spark-specific issue (as the problem is pulling the Hadoop configuration from the Spark session), outside of the currently allowed overrides like numClientConnections and uris.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-844454452


   When looking at this, I too was thinking that it would be anything from `hadoop.*`, as Russell mentioned. This way it's in line with the SparkConf.
   
   So `iceberg.catalog.other_catalog.hadoop.hive.metastore.connect.retries` -> parsed out per catalog and mapped to `hive.metastore.connect.retries` in Spark.
   
   I'm much less familiar with the Hive codebase, but I imagine the same issue would also come up?
   
   For reference, the places in the code where this issue comes up that I can see (at least for Spark) are here for the core Catalog code https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L179-L181 as well as here for the SparkCatalog (where we just pull the active sessions hadoop configuration): https://github.com/apache/iceberg/blob/98011e162b1837bf9153cfe14dfd3277c4fd3d1e/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L96-L101
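
   To make the `hadoop.*` idea concrete, here is a minimal sketch in plain Java. It assumes the per-catalog options have already been parsed into a map, and it uses a plain `Map` as a stand-in for the Hadoop `Configuration` so that the example stays self-contained. The names `HadoopOverrides` and `withOverrides` are hypothetical and are not part of Iceberg or Spark. The point it illustrates is that each catalog gets its own copy of the session-level settings, so an override for one catalog does not leak into another.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical helper, not Iceberg API. A Map stands in for Hadoop's Configuration.
   class HadoopOverrides {
     static final String PREFIX = "hadoop.";

     // Returns a copy of the session-level settings with the catalog's
     // "hadoop."-prefixed options applied on top, with the prefix stripped.
     static Map<String, String> withOverrides(
         Map<String, String> sessionConf, Map<String, String> catalogOptions) {
       // Copy first so the shared session configuration is never mutated.
       Map<String, String> result = new HashMap<>(sessionConf);
       for (Map.Entry<String, String> e : catalogOptions.entrySet()) {
         String key = e.getKey();
         if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
           result.put(key.substring(PREFIX.length()), e.getValue());
         }
       }
       return result;
     }

     public static void main(String[] args) {
       Map<String, String> session = new HashMap<>();
       session.put("hive.metastore.connect.retries", "1");

       Map<String, String> options = new HashMap<>();
       options.put("uri", "thrift://example:9083"); // ordinary catalog option, not propagated
       options.put("hadoop.hive.metastore.connect.retries", "3");

       System.out.println("catalog view: " + withOverrides(session, options));
       System.out.println("session view: " + session);
     }
   }
   ```

   Copying before applying the overrides is a design choice of this sketch; whether the real catalog code should copy or wrap the session configuration is left open here.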




[GitHub] [iceberg] pvary commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-864914887


   @rdblue: The specific examples above are HiveConf configuration values, and from the user's point of view they have nothing to do with `hadoop`. There are also plenty of other configuration values that might be good to set per catalog, so I think it would be good to find a general solution. We also try to keep the same syntax for Spark/Hive/Impala so that users can move seamlessly between the systems.
   
   I think the best solution would be to have a general `override.*` prefix for all of the catalog configurations.
   
   What do you think?
   
   Thanks,
   Peter




[GitHub] [iceberg] kbendick commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-843380880


   cc @RussellSpitzer 




[GitHub] [iceberg] pvary commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-844138938


   Seems reasonable to me, and I do not know about other possibilities ATM




[GitHub] [iceberg] RussellSpitzer commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-844273858


   Since this is Spark-specific, I think I would just want it to be like the SparkConf and have anything set as `hadoop.*` propagated (without the `hadoop.` prefix)




[GitHub] [iceberg] kbendick commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-845429034


   > > I'm much less familiar with the Hive codebase, but I imagine the same issue would also come up?
   > 
   > Yes this will be definitely a problem for Hive as well.
   > 
   > > When looking at this, I too was thinking that it would be anything from `hadoop.*`, as Russell mentioned. This way it's in line with the SparkConf.
   > 
   > That would look quite awkward for a Hive user. It would be strange that a config starting with hadoop will be put to the HiveConf.
   
   Ok. I somewhat pictured that. Given that this will be a problem for several people, another idea that I've had floating around in my head (that might not be a good one so feel free to say so) would be to have possibly multiple hive-site.xml / whatever hive conf files and have THOSE be pointed to as table properties?
   
   Admittedly even just typing it out, I don't love it. But I'd be curious to hear what you think (and why it's as bad of an idea as I suspect it is).




[GitHub] [iceberg] RussellSpitzer edited a comment on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-844097349


   That's why I was thinking at the moment unless we have another way of doing so or there is another solution I haven't imagined




[GitHub] [iceberg] RussellSpitzer closed issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
RussellSpitzer closed issue #2607:
URL: https://github.com/apache/iceberg/issues/2607


   




[GitHub] [iceberg] kbendick commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-851142261


   > In hive you very rarely want to touch the local FS or even the remote FS directly. Not that it is not possible, but again something that would be strange.
   
   Thanks for helping me understand better. I've never had much practical experience with Hive (other than for small scale stuff or for some simple stuff where it was easier for whatever reason, e.g. queries against small datasets in the Airflow UI).
   
   > Currently I still would prefer the `...catalog_in_different_authority.config_to_push.hive...` solution, but I also admit it is not perfect (adding secrets to the public config and making sure everything is hidden correctly seems like a wish for disaster)
   
   Yeah. I can see your point with respect to the secrets in the public config. That does seem like a headache.
   
   I also don't necessarily have an idea for a "great" universal solution.
   
   As we have users who do need this, I think I'm going to give it an initial attempt using the idea that @RussellSpitzer and I had (for just the Spark catalogue at least), to at least see what the pain points are in general and then maybe something better will come to me / us when we've taken a look at the details of it in practice. 👍 
   




[GitHub] [iceberg] pvary commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-845819766


   > Ok. I somewhat pictured that. Given that this will be a problem for several people, another idea that I've had floating around in my head (that might not be a good one so feel free to say so) would be to have possibly multiple hive-site.xml / whatever hive conf files and have THOSE be pointed to as table properties?
   
   In hive you very rarely want to touch the local FS or even the remote FS directly. Not that it is not possible, but again something that would be strange.
   
   Currently I still would prefer the `...catalog_in_different_authority.config_to_push.hive...` solution, but I also admit it is not perfect (adding secrets to the public config and making sure everything is hidden correctly seems like a wish for disaster)
   
   




[GitHub] [iceberg] RussellSpitzer commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-843466537


   @marton-bod or @pvary have you run into this? My thought was we could just propagate Catalog Options into the Hadoop configuration. We currently just move through the URIS and ClientConnections








[GitHub] [iceberg] pvary commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-844268977


   I personally prefer to avoid concatenation because this would lead to parsing problems down the line which I would like to avoid.
   
   Another option could be to add a key which contains the list of keys we want to push. Parsing a list of keys is more straightforward, but then it is more complicated to configure...
   ```
   iceberg.catalog.other_catalog.configs_to_push=hive.metastore.connect.retries,hive.metastore.kerberos.principal,hive.metastore.sasl.enabled
   hive.metastore.connect.retries=3
   hive.metastore.kerberos.principal=aaa
   hive.metastore.sasl.enabled=true
   ```
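
   As a rough illustration of the key-list option, the sketch below reads the comma-separated list under `iceberg.catalog.<name>.configs_to_push` and then looks up each listed key in the same configuration. It is plain Java with a `Map` standing in for the real configuration object, and the names `KeyListPush` and `select` are hypothetical, not Iceberg API.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical helper, not Iceberg API. A Map stands in for the configuration.
   class KeyListPush {
     // Collects the values of every key named in the catalog's configs_to_push list.
     static Map<String, String> select(Map<String, String> conf, String catalogName) {
       String listKey = "iceberg.catalog." + catalogName + ".configs_to_push";
       Map<String, String> selected = new LinkedHashMap<>();
       String list = conf.get(listKey);
       if (list == null) {
         return selected; // nothing configured for this catalog
       }
       for (String raw : list.split(",")) {
         String key = raw.trim();
         // Listed keys with no value in the configuration are skipped silently.
         if (!key.isEmpty() && conf.containsKey(key)) {
           selected.put(key, conf.get(key));
         }
       }
       return selected;
     }

     public static void main(String[] args) {
       Map<String, String> conf = new LinkedHashMap<>();
       conf.put("iceberg.catalog.other_catalog.configs_to_push",
           "hive.metastore.connect.retries, hive.metastore.sasl.enabled");
       conf.put("hive.metastore.connect.retries", "3");
       conf.put("hive.metastore.sasl.enabled", "true");
       conf.put("hive.metastore.kerberos.principal", "aaa"); // not listed, so not pushed
       System.out.println(select(conf, "other_catalog"));
     }
   }
   ```

   One thing the sketch makes visible: with this layout the values come from top-level keys, so two catalogs that list the same key would receive the same value.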




[GitHub] [iceberg] marton-bod commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
marton-bod commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-844224732


   I like this. Maybe a simplification could be to collect all the key-value pairs we want to push under a single key? That way we wouldn't need to iterate through the whole conf object to find these.
   
   Based on the previous example:
   `iceberg.catalog.other_catalog.configs_to_push=hive.metastore.connect.retries=3,hive.metastore.kerberos.principal=aaa,hive.metastore.sasl.enabled=true`
   
   Then if we know our catalog name is `other_catalog`, all we'd need to do is look for this special set of configs under `iceberg.catalog.other_catalog.configs_to_push` (if any) and nothing else.
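
   A minimal sketch of parsing such a single-key value is shown below, in plain Java. The names `PairListPush` and `parse` are hypothetical and not Iceberg API. It assumes pairs are separated by commas and that each pair is split at its first `=`, so values may contain `=` but not commas.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical helper, not Iceberg API.
   class PairListPush {
     // Parses "k1=v1,k2=v2" into an ordered map. Each pair is split at its first '='.
     static Map<String, String> parse(String value) {
       Map<String, String> pairs = new LinkedHashMap<>();
       if (value == null || value.trim().isEmpty()) {
         return pairs;
       }
       for (String pair : value.split(",")) {
         int eq = pair.indexOf('=');
         String key = eq < 0 ? "" : pair.substring(0, eq).trim();
         if (key.isEmpty()) {
           throw new IllegalArgumentException("Expected key=value but got: " + pair);
         }
         pairs.put(key, pair.substring(eq + 1).trim());
       }
       return pairs;
     }

     public static void main(String[] args) {
       System.out.println(parse(
           "hive.metastore.kerberos.principal=aaa,hive.metastore.sasl.enabled=true"));
     }
   }
   ```

   The comma assumption is the weak spot: a value that itself contains commas (for example, a list of URIs) would be split incorrectly, which is the kind of parsing problem raised elsewhere in this thread.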




[GitHub] [iceberg] rdblue commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-860124980


   The idea of using `hadoop.*` properties from the catalog config to set Hadoop conf properties makes sense to me. Sounds like the argument against this is that it would be weird in Hive, but isn't this proposed for Spark where it has a precedent? We could find some other prefix for Hive, like `override.*` maybe? Or just live with `hadoop.*`. I don't think it would be that odd.




[GitHub] [iceberg] pvary commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-844777551


   > I'm much less familiar with the Hive codebase, but I imagine the same issue would also come up?
   
   Yes this will be definitely a problem for Hive as well.
   
   > When looking at this, I too was thinking that it would be anything from `hadoop.*`, as Russell mentioned. This way it's in line with the SparkConf.
   
   That would look quite awkward for a Hive user. It would be strange that a config starting with hadoop will be put to the HiveConf.








[GitHub] [iceberg] kbendick commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-851142970


   When I do that, I'll also try to check out the feasibility of the approaches mentioned by both of you, @pvary @marton-bod 👍.
   
   Thanks for your input thus far.




[GitHub] [iceberg] pvary commented on issue #2607: Allow pass through of catalog options to the Hadoop configuration

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #2607:
URL: https://github.com/apache/iceberg/issues/2607#issuecomment-843847799


   @RussellSpitzer: Do I understand correctly that the proposed solution would be to copy the catalog specific configurations to the HiveConf when trying to access the underlying Iceberg table?
   
   So for example:
   ```
   iceberg.catalog.catalog_in_different_authority.config_to_push.hive.metastore.connect.retries=3
   iceberg.catalog.catalog_in_different_authority.config_to_push.hive.metastore.kerberos.principal=aaa
   iceberg.catalog.catalog_in_different_authority.config_to_push.hive.metastore.sasl.enabled=true
   ```
   
   Should be pushed to the HiveConf used to connect to metastore, like:
   ```
   hive.metastore.connect.retries=3
   hive.metastore.kerberos.principal=aaa
   hive.metastore.sasl.enabled=true
   ```
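
   The mapping in the two listings above can be sketched in plain Java as a prefix scan over the full configuration. A `Map` stands in for the HiveConf so that the example is self-contained, and the names `ConfigToPush` and `extract` are hypothetical, not Iceberg or Hive API.

   ```java
   import java.util.HashMap;
   import java.util.Map;
   import java.util.TreeMap;

   // Hypothetical helper, not Iceberg or Hive API. A Map stands in for the HiveConf.
   class ConfigToPush {
     // Finds every key under iceberg.catalog.<name>.config_to_push. and
     // returns it with that prefix removed, ready to be set on the target conf.
     static Map<String, String> extract(Map<String, String> conf, String catalogName) {
       String prefix = "iceberg.catalog." + catalogName + ".config_to_push.";
       Map<String, String> pushed = new TreeMap<>();
       for (Map.Entry<String, String> e : conf.entrySet()) {
         String key = e.getKey();
         if (key.startsWith(prefix) && key.length() > prefix.length()) {
           pushed.put(key.substring(prefix.length()), e.getValue());
         }
       }
       return pushed;
     }

     public static void main(String[] args) {
       String base = "iceberg.catalog.catalog_in_different_authority.config_to_push.";
       Map<String, String> conf = new HashMap<>();
       conf.put(base + "hive.metastore.connect.retries", "3");
       conf.put(base + "hive.metastore.kerberos.principal", "aaa");
       conf.put(base + "hive.metastore.sasl.enabled", "true");
       conf.put("iceberg.catalog.some_other_catalog.config_to_push.hive.metastore.sasl.enabled", "false");
       System.out.println(extract(conf, "catalog_in_different_authority"));
     }
   }
   ```

   Unlike a single list-valued key, this form needs a pass over the whole configuration, but each value keeps its own key and no list parsing is required.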
   
   

