Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/05/03 19:05:31 UTC

[GitHub] [hudi] codejoyan commented on issue #2852: [SUPPORT] Read Hudi Table from Hive - Hive Sync clarification

codejoyan commented on issue #2852:
URL: https://github.com/apache/hudi/issues/2852#issuecomment-831468728


   @n3nash @bvaradar Please let me know if I would miss out on any features by doing the following:
   
   Step 1: Save the DataFrame as a Hudi table without the hive_sync options (a sketch of the options being skipped follows the code block below).
   Step 2: Connect with beeline and add the hudi-hadoop-mr-bundle jar.
   Step 3: Create an external Hive table pointing to the Hudi table path, including the Hoodie metadata columns, as shown below.
   
   ```
    scala> import org.apache.spark.sql.SaveMode
    scala> import org.apache.hudi.DataSourceWriteOptions
    scala> import org.apache.hudi.QuickstartUtils._
    scala> import org.apache.hudi.config.HoodieWriteConfig
    scala> transformedDF.write.format("org.apache.hudi").
        | options(getQuickstartWriteConfigs).
        | option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col_9").
        | option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "col_2,col_1,col_3").
        | option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        | option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator").
        | option("hoodie.upsert.shuffle.parallelism","2").
        | option("hoodie.insert.shuffle.parallelism","2").
        | option(HoodieWriteConfig.TABLE_NAME, "TestTableHudiHive").
        | mode(SaveMode.Append).
        | save(targetPath)
   
   beeline -u "jdbc:hive2://hiveserver_host:10001/default;principal=hive/_HOST@ABC.COM;transportMode=http;httpPath=cliservice"
   
    add jar hdfs://xxxxxx/user/joyan/hudi-hadoop-mr-bundle-0.7.0.jar;
   
   CREATE EXTERNAL TABLE IF NOT EXISTS stg_wmt_us_fin_us_wm_fin_sales_dl_secure.TestTableHudiHive (
   `_hoodie_commit_time` string,
   `_hoodie_commit_seqno` string,
   `_hoodie_record_key` string,
   `_hoodie_partition_path` string,
   `_hoodie_file_name` string,
   col_1 string,
   col_2 int,
   col_3 int,
   col_4 string,
   col_5 string,
   col_6 int,
   col_7 bigint,
   col_8 string,
   col_9 bigint,
   col_10 string,
   cntry_cd string,
   bus_dt DATE )
   PARTITIONED BY (partitionpath string)
   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION 'gs://xxxxxxxxxxxxxx/test_table_tgt_04142021_1';
   ```
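
   One thing I would have to handle myself by skipping hive_sync: Hive will not see data under new partition directories until those partitions are registered, which hive_sync normally does on each commit. A sketch of the manual alternative (standard Hive partition repair; the partition value below is illustrative):

   ```
   -- Register all partition directories found under the table location
   MSCK REPAIR TABLE stg_wmt_us_fin_us_wm_fin_sales_dl_secure.TestTableHudiHive;

   -- Or register a single partition explicitly (partition value is illustrative)
   ALTER TABLE stg_wmt_us_fin_us_wm_fin_sales_dl_secure.TestTableHudiHive
     ADD IF NOT EXISTS PARTITION (partitionpath='2021-04-14');
   ```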
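
   For reference, the hive_sync options being skipped in Step 1 would look roughly like this (a sketch assuming the 0.7.0 DataSourceWriteOptions constants; the JDBC URL value is illustrative, and the write options are otherwise the same as above):

   ```
   // Sketch only: same write as above, with the Hive sync options added.
   // (Shuffle parallelism options from the write above omitted for brevity.)
   transformedDF.write.format("org.apache.hudi").
     options(getQuickstartWriteConfigs).
     option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col_9").
     option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "col_2,col_1,col_3").
     option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator").
     option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true").
     option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "stg_wmt_us_fin_us_wm_fin_sales_dl_secure").
     option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "TestTableHudiHive").
     option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "partitionpath").
     option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").
     // Illustrative JDBC URL; the real one would point at the HiveServer2 endpoint.
     option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver_host:10000").
     option(HoodieWriteConfig.TABLE_NAME, "TestTableHudiHive").
     mode(SaveMode.Append).
     save(targetPath)
   ```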
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org