
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/09 16:25:19 UTC

[GitHub] [hudi] gunjdesai opened a new issue, #5542: [SUPPORT] Error Querying HUDI through Trino when syncing HMS via Spark

gunjdesai opened a new issue, #5542:
URL: https://github.com/apache/hudi/issues/5542

   **Describe the problem you faced**
   
   I am using HMS to sync my data via Spark and querying that data directly through Trino, but when I try to run the query
   ```
   SELECT * FROM table_name LIMIT 10
   ```
   I get the following error
   ```
   Query 20220509_155204_00012_irn5c failed: Unable to create input format org.apache.hudi.hadoop.HoodieParquetInputFormat
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Spark Job for writing to S3 with HMS added
   2. `hudi-hadoop-mr-0.11.0.jar` bundle added in `<trino_install>/plugin/hive-hadoop2`
   3. Run the query in Trino
   
   **Expected behavior**
   
   Ideally, the result should display 10 rows of data
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.1.1
   
   * Hive version : N/A
   
   * Hadoop version : N/A 
   
   * Storage (HDFS/S3/GCS..) : S3 via Minio
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   These are the config options passed to the Spark Structured Streaming Job
   ```
         df.writeStream
               .format(Format.HUDI)
               .option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE.key(), true)
               .option(HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key(), "updated_at")
               .option(DataSourceWriteOptions.TABLE_TYPE.key(), "COPY_ON_WRITE")
               .option(DataSourceWriteOptions.OPERATION.key(), upsert)
               .option(DataSourceWriteOptions.STREAMING_RETRY_INTERVAL_MS.key(), 2000)
               .option(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key(), "view_id")
               .option(HoodieWriteConfig.TBL_NAME.key(), "question")
               .option(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key(), "created_date")
               .option(Options.CHECKPOINT_LOCATION_KEY, "s3a://warehouse/checkpoints/question")
               .option(Options.ARCHIVE_MIN_COMMITS_KEY, 3)
               .option(Options.HOODIE_METADATA_KEEP_MIN_COMMITS_KEY,  2)
               .option(Options.HOODIE_METADATA_KEEP_MAX_COMMITS_KEY, 4)
               .option(Options.HOODIE_EMBED_TIMELINE_SERVER_KEY, "false")
               .option(HoodieIndexConfig.INDEX_TYPE.key(), "SIMPLE")
               .option(HiveSyncConfig.HIVE_SYNC_MODE.key(), "hms")
               .option(KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key(), "true")
               .option(HiveSyncConfig.METASTORE_URIS.key(), "thrift://hive-metastore.trino.svc.cluster.local:9083")
               .option(HoodieSyncConfig.META_SYNC_DATABASE_NAME.key(), "warehouse")
               .option(HoodieSyncConfig.META_SYNC_TABLE_NAME.key(), "question")
               .option(HoodieSyncConfig.META_SYNC_PARTITION_FIELDS.key(), "created_date")
               .option(HiveSyncConfig.HIVE_SYNC_ENABLED.key(), "true")
               .outputMode("append")
               .queryName("questions")
               .start("s3a://warehouse/transaction-db/questions")
   ```
   
   Querying the Metastore and joining `TBLS` and `DBS` gives this output:
   
   ```
   6797 | org.apache.hudi.hadoop.HoodieParquetInputFormat               | s3a://warehouse/transaction-db/questions                                             |     6797
   ```
   
   **Stacktrace**
   
   ```
   Query 20220509_155204_00012_irn5c failed: Unable to create input format org.apache.hudi.hadoop.HoodieParquetInputFormat
   ```
   
   
   I followed the https://hudi.apache.org/docs/syncing_metastore/ doc to set up HMS sync. Our setup doesn't include Hive or Hadoop.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] gunjdesai closed issue #5542: [SUPPORT] Error Querying HUDI through Trino when syncing HMS via Spark

Posted by GitBox <gi...@apache.org>.

gunjdesai closed issue #5542: [SUPPORT] Error Querying HUDI through Trino when syncing HMS via Spark
URL: https://github.com/apache/hudi/issues/5542



[GitHub] [hudi] gunjdesai commented on issue #5542: [SUPPORT] Error Querying HUDI through Trino when syncing HMS via Spark

Posted by GitBox <gi...@apache.org>.

gunjdesai commented on issue #5542:
URL: https://github.com/apache/hudi/issues/5542#issuecomment-1121618067

   I am sorry about this one. I got confused between `hudi-hadoop-mr-0.11.0.jar` and `hudi-hadoop-mr-bundle-0.11.0.jar`. Closing this one.
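
The two file names in this comment differ only by `-bundle`, so the mix-up is easy to repeat. As a minimal sketch, not part of the original report, the Java helper below lists the Hudi jars in a directory and labels each one. The class name `HudiJarCheck`, the labels, and the name patterns are assumptions for illustration, derived only from the two file names mentioned in this thread; the helper looks at file names and does not inspect jar contents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Hypothetical helper: labels hudi-hadoop-mr jar file names found in a directory. */
final class HudiJarCheck {
    // Assumed naming patterns, based only on the two names quoted in this thread:
    //   hudi-hadoop-mr-bundle-0.11.0.jar  and  hudi-hadoop-mr-0.11.0.jar
    private static final Pattern BUNDLE =
            Pattern.compile("hudi-hadoop-mr-bundle-[0-9].*\\.jar");
    private static final Pattern MODULE =
            Pattern.compile("hudi-hadoop-mr-[0-9].*\\.jar");

    /** Returns "bundle", "module-only", or "other" for a jar file name. */
    static String classify(String fileName) {
        if (BUNDLE.matcher(fileName).matches()) {
            return "bundle";
        }
        if (MODULE.matcher(fileName).matches()) {
            return "module-only";
        }
        return "other";
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory if none is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.list(dir)) {
            files.map(p -> p.getFileName().toString())
                 .filter(name -> !classify(name).equals("other"))
                 .forEach(name -> System.out.println(name + " -> " + classify(name)));
        }
    }
}
```

With Java 11 or later, this can be run as a single source file, for example `java HudiJarCheck.java <trino_install>/plugin/hive-hadoop2`. A line labelled `module-only` with no `bundle` line would match the situation described in this issue.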

