Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/18 07:52:04 UTC

[GitHub] [hudi] masterlemmi opened a new issue #1747: [SUPPORT] HiveSynctool syncs wrong location

masterlemmi opened a new issue #1747:
URL: https://github.com/apache/hudi/issues/1747


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   YES
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   When inserting a record using
   keygenerator.class        -> ComplexKeyGenerator
   partition_extractor_class -> NonPartitionedExtractor
   
   Hudi places the parquet files in a `default` folder under the base path.
   HiveSync, however, sets the table LOCATION to the base path itself, so queries against the table find no data and return zero results.
   
   The workaround is to alter the created table and set its location to basepath/default.
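
The workaround described above can be sketched as a small helper that builds the Hive statement. This is only an illustration: the function name `alter_location_sql` is hypothetical, and the database, table, and base path are copied from the log line in this report. It assumes the data files sit in a single `default` folder directly under the base path.

```python
def alter_location_sql(database: str, table: str, base_path: str,
                       partition_dir: str = "default") -> str:
    """Build a Hive ALTER TABLE statement that repoints the table
    one level down, to the folder that actually holds the parquet files."""
    new_location = base_path.rstrip("/") + "/" + partition_dir
    return f"ALTER TABLE `{database}`.`{table}` SET LOCATION '{new_location}'"


# Values taken from the CREATE EXTERNAL TABLE log line in this issue.
sql = alter_location_sql(
    "default",
    "app",
    "hdfs://slc14xgh.us.oracle.com:9000/hudi/test/tables/app",
)
print(sql)
```

The resulting statement would be run once in Hive after the sync. A later sync may or may not reset the location, so this is a stopgap rather than a fix.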
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   see above 
   
   **Expected behavior**
   HiveSync should set the table location to the folder that actually contains the data files.
   
   **Environment Description**
   
   * Hudi version : 0.5.2-incubating
   
   * Spark version : 2.4.5
   
   * Hive version : 2.3.3
   
   * Hadoop version : 2.8
   
   * Storage (HDFS/S3/GCS..) : hdfs
   
   * Running on Docker? (yes/no) : no
   
   
   **Stacktrace**
   
   From the logs:
   `20/06/18 00:30:33 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL TABLE  IF NOT EXISTS `default`.`app`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `invoiceID` bigint, `vendorID` bigint, `invoiceNum` bigint, `createdAt` string, `invoiceCurrencyCode` bigint) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://slc14xgh.us.oracle.com:9000/hudi/test/tables/app'`
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] bhasudha commented on issue #1747: [SUPPORT] HiveSynctool syncs wrong location

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1747:
URL: https://github.com/apache/hudi/issues/1747#issuecomment-647141194


   @masterlemmi  Will take a look today.



[GitHub] [hudi] vinothchandar commented on issue #1747: [SUPPORT] HiveSynctool syncs wrong location

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1747:
URL: https://github.com/apache/hudi/issues/1747#issuecomment-647072617


   @bhasudha  can you please take a pass



[GitHub] [hudi] bhasudha commented on issue #1747: [SUPPORT] HiveSynctool syncs wrong location

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1747:
URL: https://github.com/apache/hudi/issues/1747#issuecomment-647191850


   @masterlemmi To generate a non-partitioned Hudi dataset you need both NonpartitionedKeyGenerator and NonPartitionedExtractor, as mentioned here - https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIuseDeltaStreamerorSparkDataSourceAPItowritetoaNon-partitionedHudidataset?
   I think yours is a new use case, where you need a complex key type but a non-partitioned Hudi dataset. The complex key type uses a `default` partition path today. In that case, extending the ComplexKeyGenerator as you suggested would work.
   
   @bvaradar  @vinothchandar  I think this is a valid use case, and we should figure out a way to provide a non-partitioned Hudi dataset for any key type. Any thoughts?
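
For reference, the pairing from the FAQ can be sketched as a set of Spark DataSource write options. This is a minimal sketch, not a verified configuration: the helper name `non_partitioned_hudi_options` is hypothetical, and the fully qualified class names assume the `org.apache.hudi.keygen` and `org.apache.hudi.hive` packages, which may differ between Hudi versions. Note that it only covers a single record key field, which is exactly the limitation raised in this issue.

```python
def non_partitioned_hudi_options(table_name: str, record_key_field: str) -> dict:
    """Write options pairing the non-partitioned key generator with the
    non-partitioned Hive partition extractor (class packages are assumed)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key_field,
        # No partition column: leave the partition path field empty.
        "hoodie.datasource.write.partitionpath.field": "",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.NonPartitionedExtractor",
    }


opts = non_partitioned_hudi_options("app", "invoiceID")
for key, value in sorted(opts.items()):
    print(f"{key}={value}")
```

In a PySpark job these options would typically be passed with something like `df.write.format("org.apache.hudi").options(**opts).mode("append").save(base_path)`, where `df` and `base_path` are placeholders for the caller's DataFrame and table path.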



[GitHub] [hudi] bhasudha commented on issue #1747: [SUPPORT] HiveSynctool syncs wrong location

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1747:
URL: https://github.com/apache/hudi/issues/1747#issuecomment-647190008


   @masterlemmi  can you paste the configs passed when doing the insert? I am wondering if anything was set for the `hoodie.datasource.write.partitionpath.field` config.



[GitHub] [hudi] bhasudha closed issue #1747: [SUPPORT] HiveSynctool syncs wrong location

Posted by GitBox <gi...@apache.org>.

bhasudha closed issue #1747:
URL: https://github.com/apache/hudi/issues/1747


   



[GitHub] [hudi] bhasudha commented on issue #1747: [SUPPORT] HiveSynctool syncs wrong location

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1747:
URL: https://github.com/apache/hudi/issues/1747#issuecomment-649509707


   Agreed @bvaradar .
   
   @masterlemmi  closing this task in favor of the Jira issue created here - https://issues.apache.org/jira/browse/HUDI-1053. Let's continue there. Feel free to add to that issue.



[GitHub] [hudi] masterlemmi commented on issue #1747: [SUPPORT] HiveSynctool syncs wrong location

Posted by GitBox <gi...@apache.org>.

masterlemmi commented on issue #1747:
URL: https://github.com/apache/hudi/issues/1747#issuecomment-646403336


   Looks like extending the ComplexKeyGenerator and specifying an empty string as the partition path is a solution.
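
The idea can be illustrated with a plain-Python model. This is not Hudi code: every function name below is hypothetical, and the key format and the `default` fallback are assumptions modeled on the behaviour reported in this thread. An actual fix would subclass ComplexKeyGenerator in Java or Scala.

```python
DEFAULT_PARTITION_PATH = "default"  # fallback folder name seen in this issue


def complex_record_key(record: dict, key_fields: list) -> str:
    """Model of a composite record key: 'field:value' pairs joined by commas."""
    return ",".join(f"{field}:{record[field]}" for field in key_fields)


def complex_partition_path(record: dict, partition_fields: list) -> str:
    """Model of the reported behaviour: a missing or empty partition value
    falls back to the 'default' folder."""
    parts = []
    for field in partition_fields:
        value = record.get(field)
        parts.append(str(value) if value not in (None, "") else DEFAULT_PARTITION_PATH)
    return "/".join(parts)


def non_partitioned_partition_path(record: dict, partition_fields: list) -> str:
    """Model of the proposed override: always an empty partition path, so
    files land directly under the base path that HiveSync registers."""
    return ""


record = {"invoiceID": 1, "vendorID": 2}
print(complex_record_key(record, ["invoiceID", "vendorID"]))
print(repr(complex_partition_path(record, [""])))
print(repr(non_partitioned_partition_path(record, [""])))
```

With an empty partition path the composite record key is unchanged; only the folder the files are written to differs, which is what lets the synced table location and the data location line up.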



[GitHub] [hudi] bvaradar commented on issue #1747: [SUPPORT] HiveSynctool syncs wrong location

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1747:
URL: https://github.com/apache/hudi/issues/1747#issuecomment-647351218


   @bhasudha : Yes, agree we should make it explicit for users to configure this setting. I am wondering if we can have a config that lets users explicitly say whether the dataset is partitioned or not when using complex keys, as opposed to having a separate implementation.

