You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/01 02:26:35 UTC

[GitHub] [iceberg] wypoon opened a new pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

wypoon opened a new pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017


   When creating a table using the `HiveCatalog`, the table will be Hive-enabled if engine.hive.enabled is set to true in the table properties in the `TableMetadata`, or if the Hadoop `Configuration` has iceberg.engine.hive.enabled set to true. Currently, for the latter scenario, we do not write engine.hive.enabled=true in the table properties in the HMS. If  another writer subsequently updates the table, and does not have iceberg.engine.hive.enabled set to true in its Hadoop `Configuration`, then the fact that the table is Hive-enabled will be lost.
   To be consistent, we should write engine.hive.enabled=true in the table properties when committing a Hive-enabled table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon commented on pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1032165351


   @rdblue I realize now that you view this issue differently than I do. You view engine.hive.enabled as a property of the Hive environment, and see it as related to Iceberg classes being available on the classpath. I view engine.hive.enabled as a property of the table, as I view whether or not a table is Hive-enabled to be a trait of the table rather than of the Hive environment. Thus I do not see having the property in the table as "leaking Hive environment". Thus my proposal to write engine.hive.enabled in the table properties when it is created (not "from some point in time") if it is created a being Hive-enabled.
   From what I observe, as a Spark user, I have been able to create, read and write Hive-enabled Iceberg tables without having the Hive Iceberg StorageHandler, SerDe, InputFormat and OutputFormat classes in my classpath; I only need the Iceberg Spark runtime jar in my classpath. So I don't think of whether a table is Hive-enabled or not as tied to my environment and classpath.
   The way we encountered this issue is in a setting where we have a shared HMS, and where I create an Iceberg table using Spark (with iceberg.engine.hive.enabled=true in my conf, as we want to create Hive-enabled tables for interop with Hive). There is no Hive on this cluster. In a different cluster, we have Hive. iceberg.engine.hive.enabled is not set in the environment of this Hive cluster (so it defaults to false), as there did not seem to be a need for it. @pvary or @mbod can speak to the Hive side, since I'm not knowledgeable about it, but I believe that when you create an Iceberg table using Hive QL, then Hive creates it with engine.hive.enabled=true in the properties. It's just that when Spark created the table, it doesn't write engine.hive.enabled=true in the properties (which is what this change fixes). Hive can read the Iceberg table Spark created (because HMS says it has the Iceberg StorageHandler, SerDe, Input/OutputFormat), but if it updates it, in `HiveTableOperations#
 doCommit`, `hiveEngineEnabled` will be false, so the table is turned into a non-Hive-enabled table (the table will then no longer have the Iceberg StorageHandler, SerDe, Input/OutputFormat), and Hive won't be able to read it after that. We could also have a situation where Spark, in another cluster where iceberg.engine.hive.enabled is not set in the conf), updates the table, and that would also turn a Hive-enabled table into a non-Hive-enabled table if we don't have engine.hive.enabled=true in the table properties.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon commented on pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1028233193


   @rdblue it is true that if your Hadoop `Configuration` has iceberg.engine.hive.enabled set to true for all your writers, this is not an issue. I'd argue that as a matter of logical consistency, the table property should be set if the table is created as Hive-enabled, so that a future writer (whose configuration we may not be in control of) that updates the table does not drop that Hive-enabledness.
   
   [In `HiveTableOperations#doCommit`, we check if `hiveEngineEnabled` by checking if "engine.hive.enabled" is in the `TableMetadata`'s properties and using its value if so, else use the value of "iceberg.engine.hive.enabled" in the Hadoop `Configuration`, **defaulting to false**. Thus there are two routes to `hiveEngineEnabled` being true, because "engine.hive.enabled" is in the table properties, or because of the Hadoop `Configuration`. What I'm proposing is that we always then encode it in the table properties, so future commits will keep the table being Hive-enabled, regardless of the writer's Hadoop `Configuration`. Otherwise, `hiveEngineEnabled` could be true due to the `Configuration`, and not write the property to `TableMetadata`, and another writer comes along with a different `Configuration`, and `hiveEngineEnabled` is now false from the `TableMetadata`.]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1030917342


   I understand the proposal, but I still think that the current behavior is correct. The Hive environment comes from Hive. If Hive is enabled in that environment, then it means that you expect to have the Iceberg classes available everywhere. There should be no need for redundant settings at the table level (which overrides the environment). And in addition, if the Hive environment changes, then so should this behavior. Adding this at the table level leaks Hive environment from some point in time into a table forever.
   
   It would be good to hear more about how you encountered this to recommend a way to fix it. I suspect that your Hive environment was out of sync across the clusters? Did that reflect the state of the clusters as well -- meaning were you also inconsistently including Iceberg in the classpath?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1040494238


   I don't think that Hive should be setting `engine.hive.enabled` unless `iceberg.engine.hive.enabled` is false in its environment. We should fall back to the environment settings and avoid setting `engine.hive.enabled` in tables unless it is actually required.
   
   It sounds like we at least understand the issue here now. I think the problem was that your Hive environment was not correctly configured. I'm going to close this issue since I don't think it is the right direction. We can discuss more about Hive if fixing the environment is not possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon commented on pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1026450761


   @pvary @marton-bod please look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1028145839


   I think the current behavior is correct. If you want to use a single table with Hive, then you can set the table property explicitly, but this is not recommended. If you're using Hive and it is enabled in the environment, then there should be no need to also mark the table because all Hive environments should have the property set. I think the problem is when your Hive environments (that is, HiveConf across engines) differs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon edited a comment on pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
wypoon edited a comment on pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1028233193


   @rdblue it is true that if your Hadoop `Configuration` has iceberg.engine.hive.enabled set to true for all your writers, this is not an issue. I'd argue that as a matter of logical consistency, the table property should be set if the table is created as Hive-enabled, so that a future writer (whose configuration we may not be in control of) that updates the table does not drop that Hive-enabledness.
   
   [In `HiveTableOperations#doCommit`, we check if `hiveEngineEnabled` by checking if "engine.hive.enabled" is in the `TableMetadata`'s properties and using its value if so, else use the value of "iceberg.engine.hive.enabled" in the Hadoop `Configuration`, **defaulting to false**. Thus there are two routes to `hiveEngineEnabled` being true, because "engine.hive.enabled" is in the table properties, or because of the Hadoop `Configuration`. What I'm proposing is that we always then encode it in the table properties, so future commits will keep the table being Hive-enabled, regardless of the writer's Hadoop `Configuration`. Otherwise, `hiveEngineEnabled` could be true due to the `Configuration`, and not write the property to `TableMetadata`, and another writer comes along with a different `Configuration`, and `hiveEngineEnabled` is now false from the `TableMetadata`.]
   
   We encountered the issue in a setting with a shared HMS serving multiple clusters.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1040494238


   I don't think that Hive should be setting `engine.hive.enabled` unless `iceberg.engine.hive.enabled` is false in its environment. We should fall back to the environment settings and avoid setting `engine.hive.enabled` in tables unless it is actually required.
   
   It sounds like we at least understand the issue here now. I think the problem was that your Hive environment was not correctly configured. I'm going to close this issue since I don't think it is the right direction. We can discuss more about Hive if fixing the environment is not possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed pull request #4017: Write engine.hive.enabled=true in table properties for Hive-enabled table

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #4017:
URL: https://github.com/apache/iceberg/pull/4017


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org