Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/09/11 16:39:57 UTC

[GitHub] [hudi] rajgowtham24 opened a new issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

rajgowtham24 opened a new issue #2086:
URL: https://github.com/apache/hudi/issues/2086


   Hi, 
   
   I am trying to maintain two identical copies of a Hudi table in different buckets.
   
   To do this, I copied the Hudi table's data files into another bucket and tried to run Hive sync through run_sync_tool.sh, but I am getting the error below.
   
   Environment Details
   emr-6.0.0
   Hudi Version - 0.5.0
   
   Could anyone please take a look and also suggest an alternate approach? Thanks!
   
   2020-09-11 12:57:52,736 ERROR [main] metadata.HiveUtils (HiveUtils.java:createMetaStoreClientFactory(498)) - Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found
   java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found
           at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2541)
           at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClientFactory(HiveUtils.java:491)
           at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:480)
           at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4371)
           at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4351)
           at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:4607)
           at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:287)
           at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:270)
           at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:443)
           at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:371)
           at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:351)
           at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:327)
           at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:111)
           at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:60)
           at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:197)
   2020-09-11 12:57:52,756 WARN  [main] metadata.Hive (Hive.java:registerAllFunctionsOnce(273)) - Failed to register all functions.
   org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)
           at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:4610)
           at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:287)
           at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:270)
           at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:443)
           at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:371)
           at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:351)
           at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:327)
           at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:111)
           at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:60)
           at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:197)
   Caused by: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)
           at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClientFactory(HiveUtils.java:499)
           at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:480)
           at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4371)
           at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4351)
           at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:4607)
           ... 9 more
   Exception in thread "main" org.apache.hudi.hive.HoodieHiveSyncException: Failed to create HiveMetaStoreClient
           at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:113)
           at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:60)
           at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:197)
   Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)
           at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:275)
           at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:443)
           at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:371)
           at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:351)
           at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:327)
           at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:111)
           ... 2 more
   Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)
           at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:4610)
           at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:287)
           at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:270)
           ... 7 more
   Caused by: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)
           at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClientFactory(HiveUtils.java:499)
           at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:480)
           at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4371)
           at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4351)
           at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:4607)
           ... 9 more


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] rajgowtham24 commented on issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

Posted by GitBox <gi...@apache.org>.
rajgowtham24 commented on issue #2086:
URL: https://github.com/apache/hudi/issues/2086#issuecomment-693315044


   @umehrot2 Currently we read real-time files from an inbound bucket and write them into a landing bucket.
   Since the data will be consumed by multiple reporting tools, we thought of having two different buckets: one for writes (using Hudi) and one for reads (using reporting tools).
   
   With this approach, instead of writing twice, once into each bucket, we thought of moving the underlying Hudi table data files into the reporting bucket and running hive_sync on top of them. If both tables are in sync after the Hive sync on the reporting bucket, we plan to use this approach for our ingestion pattern.





[GitHub] [hudi] rajgowtham24 commented on issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

Posted by GitBox <gi...@apache.org>.
rajgowtham24 commented on issue #2086:
URL: https://github.com/apache/hudi/issues/2086#issuecomment-692747232


   Thanks @bvaradar. AWS Support suggested adding the HIVE_METASTORE line below to the run_sync_tool.sh file, and now I'm facing a different exception. I am working with the AWS Support team and will keep this thread posted with updates.
   
   # added for AWS Glue Catalog hive metastore libraries.
   HIVE_METASTORE=/usr/share/aws/aws-java-sdk/aws-java-sdk-glue-1.11.682.jar:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive2-client-1.11.0.jar:$HIVE_METASTORE
   
   Off the top of your head, can you think of any alternate approach for writing Hudi tables into multiple buckets? TIA!
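
The classpath change quoted above can be sketched as follows. This is only an illustration, not the actual contents of run_sync_tool.sh: the helper name `prepend_existing_jars` is hypothetical, and the jar paths are the ones quoted in the comment, so they are specific to that EMR release and may differ elsewhere. The helper skips jars that are missing on disk and warns about them, which makes a wrong path easier to spot than a later ClassNotFoundException.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hudi): prepend each jar that exists on disk
# to a colon-separated classpath and warn about any jar that is missing.
prepend_existing_jars() {
  local classpath="$1"
  shift
  local jar
  for jar in "$@"; do
    if [ -f "$jar" ]; then
      # Add the separator only when the classpath is non-empty.
      classpath="${jar}${classpath:+:${classpath}}"
    else
      echo "WARN: jar not found, skipping: ${jar}" >&2
    fi
  done
  printf '%s\n' "$classpath"
}

# Jar paths as quoted in the comment above (EMR-specific; versions may differ).
GLUE_SDK_JAR=/usr/share/aws/aws-java-sdk/aws-java-sdk-glue-1.11.682.jar
GLUE_CLIENT_JAR=/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive2-client-1.11.0.jar

# Each jar is prepended in turn, so the last argument ends up first.
HIVE_METASTORE=$(prepend_existing_jars "${HIVE_METASTORE:-}" "$GLUE_CLIENT_JAR" "$GLUE_SDK_JAR")
```

On a host where both jars exist, this yields the same ordering as the line suggested by AWS Support: the Glue SDK jar, then the Glue Data Catalog client jar, then the original value of HIVE_METASTORE.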





[GitHub] [hudi] umehrot2 closed issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

Posted by GitBox <gi...@apache.org>.
umehrot2 closed issue #2086:
URL: https://github.com/apache/hudi/issues/2086


   





[GitHub] [hudi] rajgowtham24 edited a comment on issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

Posted by GitBox <gi...@apache.org>.
rajgowtham24 edited a comment on issue #2086:
URL: https://github.com/apache/hudi/issues/2086#issuecomment-693315044


   @umehrot2 Currently we read near real-time files from an inbound bucket and write them into a landing bucket.
   Since the data will be consumed by multiple reporting tools, we thought of having two different buckets: one for writes (using Hudi) and one for reads (using reporting tools).
   
   With this approach, instead of writing twice, once into each bucket, we thought of moving the underlying Hudi table data files into the reporting bucket and running hive_sync on top of them. If both tables are in sync after the Hive sync on the reporting bucket, we plan to use this approach for our ingestion pattern.





[GitHub] [hudi] umehrot2 commented on issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #2086:
URL: https://github.com/apache/hudi/issues/2086#issuecomment-700345425


   @rajgowtham24 I believe that should work, as long as you keep your `reporting` bucket in sync with the `hudi write` bucket by copying/replacing all the data files in the `reporting` bucket on each ingestion.
   
   I will close this issue now. Please re-open if needed.
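
The copy-then-sync flow discussed in this thread could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the bucket paths, database, table, partition column, JDBC URL, and credentials are hypothetical, and the run_sync_tool.sh flags should be checked against the Hudi version in use. The script only prints the commands unless `DRY_RUN=0` is set. It assumes the copy runs after an ingestion has finished, and that the table's `.hoodie` metadata directory is mirrored along with the data files.

```shell
#!/usr/bin/env bash
# Hypothetical locations, used only for illustration.
WRITE_PATH="s3://hudi-write-bucket/my_table"
REPORTING_PATH="s3://reporting-bucket/my_table"

# Print the commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

mirror_and_sync() {
  # Mirror the whole table directory; --delete removes files from the
  # reporting copy that no longer exist in the write copy.
  run aws s3 sync "$WRITE_PATH" "$REPORTING_PATH" --delete

  # Register the reporting copy in the metastore (all values are placeholders).
  run ./run_sync_tool.sh \
    --jdbc-url "jdbc:hive2://localhost:10000" \
    --user hive --pass hive \
    --base-path "$REPORTING_PATH" \
    --database reporting_db --table my_table \
    --partitioned-by partition_col
}

mirror_and_sync
```

Running the mirror step on every ingestion is what keeps the two copies identical. A one-time copy would drift as soon as the write-side table receives updates.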





[GitHub] [hudi] rajgowtham24 edited a comment on issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

Posted by GitBox <gi...@apache.org>.
rajgowtham24 edited a comment on issue #2086:
URL: https://github.com/apache/hudi/issues/2086#issuecomment-692747232


   Thanks @bvaradar. AWS Support suggested adding the HIVE_METASTORE line below to the run_sync_tool.sh file, and now I'm facing a different exception. I am working with the AWS Support team and will keep this thread posted with updates.
   
   # added for AWS Glue Catalog hive metastore libraries.
   HIVE_METASTORE=/usr/share/aws/aws-java-sdk/aws-java-sdk-glue-1.11.682.jar:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive2-client-1.11.0.jar:$HIVE_METASTORE
   
   Off the top of your head, can you think of any alternate approach for writing Hudi tables into multiple buckets? TIA!





[GitHub] [hudi] bvaradar commented on issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2086:
URL: https://github.com/apache/hudi/issues/2086#issuecomment-692485806


   @rajgowtham24 : Please take a look at https://github.com/apache/hudi/issues/1977#issuecomment-678558692 (cc @umehrot2 )





[GitHub] [hudi] umehrot2 commented on issue #2086: [SUPPORT] Hive Sync Not Working through run_sync_tool.sh

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #2086:
URL: https://github.com/apache/hudi/issues/2086#issuecomment-693140427


   @rajgowtham24 Yes, based on the exception, the issue occurs because the Glue SDK is not present on the hive-sync-tool classpath. The solution AWS Support gave is correct.
   
   As for writing Hudi tables to multiple buckets, can you explain the use case in more detail? For the initial write this seems like the way to go, but if you have subsequent updates, how would you keep the two tables in sync? You would likely have to update both tables.
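
One way to confirm a missing-class diagnosis like the one above is to search the candidate jar directories for the class named in the exception. The following is a minimal sketch, assuming `unzip` is installed. The helper name `find_class_in_jars` is hypothetical, and `/usr/share/aws` is taken from the jar paths quoted elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every jar under the given directories that
# contains the given fully qualified class. Returns non-zero if none match.
find_class_in_jars() {
  local class_entry matches
  # Convert com.example.Foo into the archive entry com/example/Foo.class
  class_entry="$(printf '%s' "$1" | tr '.' '/').class"
  shift
  matches=$(find "$@" -type f -name '*.jar' 2>/dev/null | while IFS= read -r jar; do
    if unzip -l "$jar" 2>/dev/null | grep -qF "$class_entry"; then
      printf '%s\n' "$jar"
    fi
  done)
  [ -n "$matches" ] && printf '%s\n' "$matches"
}

# Class name taken from the stack trace in this thread.
find_class_in_jars \
  com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  /usr/share/aws || echo "class not found in any jar under /usr/share/aws"
```

Any jar this prints is a candidate to add to the sync tool's classpath.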

