Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/05 14:35:39 UTC

[GitHub] [hudi] joao-miranda opened a new issue, #6601: [SUPPORT] "default" folder not outputted by Hudi for non-partitioned tables when used with Spark

joao-miranda opened a new issue, #6601:
URL: https://github.com/apache/hudi/issues/6601

   **Describe the problem you faced**
   We are using Hudi in a Scala Glue job. We then need to crawl the generated data to get a table in the Glue Data Catalog. We need this for both partitioned and non-partitioned data.
   
   For partitioned data the output is as follows:
   ```
   .../database_name/table_name/partition=partition_key/<data files>
   ```
   
   After crawling we get a table in the data catalog with the correct name (table_name). That's the desired behavior.
   
   For non-partitioned data the output is as follows:
   ```
   .../database_name/table_name/<data files>
   ```
   
   The crawler then generates one table per file. This is not what we want.
   
   We know the format we need for the crawler to work correctly: a default folder needs to exist before the data files:
   ```
   .../database_name/table_name/default/<data files>
   ```
   
   with the following structure for Hudi support files:
   ```
   .../database_name/table_name/.hoodie
   .../database_name/table_name/default/.hoodie_partition_metadata
   ```
   
   This seems to have been the behavior up to Hudi 0.9.0, but we can no longer reproduce it from 0.10.0 onwards.
   
   Is there any configuration we could possibly be missing?
   
   
   **Steps to reproduce the behavior**
   **Dependencies:**
   ```
   "org.apache.hudi" %% "hudi-spark-bundle" % "2.12-0.10.0"
   "org.apache.hudi" %% "hudi-utilities-bundle" % "2.12-0.10.0"
   ```
   
   **Configuration used:**
   ```
   // A mutable map is used so that partition options can be added later with put().
   val hudiOptions = scala.collection.mutable.Map[String, String](
         HoodieWriteConfig.TABLE_NAME -> "hudiTableName",
         HoodieWriteConfig.COMBINE_BEFORE_INSERT.key() -> "true",
         DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
         DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
         DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKeyField",
         DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "ts",
         DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[AWSDmsAvroPayload].getName,
         DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[CustomKeyGenerator].getName,
         // Empty partition path field: no partition key is defined by default.
         DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> ""
       )
   ```
   
   **The following options are added if a partition key is defined:**
   ```
         hudiOptions.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionKeyField")
         hudiOptions.put(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
         hudiOptions.put(HoodieIndexConfig.INDEX_TYPE.key(), "GLOBAL_BLOOM")
         hudiOptions.put(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key(), "true")
         hudiOptions.put(DataSourceWriteOptions.DROP_PARTITION_COLUMNS.key(), "true")
   ```
   
   **Saved into a file:**
   ```
       // Write the DataFrame as a Hudi dataset
       mappedDF
         .dropDuplicates()
         .write
         .format("org.apache.hudi")
         .options(hudiOptions)
         .mode(SaveMode.Append)
         .save("targetDirectory")
   ```
   
   **Expected behavior**
   Output of Hudi is compatible with AWS Glue Crawler, with or without partitions.
   
   **Environment Description**
   
   - Hudi version : 0.10.0
   - Spark version : 3.1.1
   - Scala version: 2.12.15
   - AWS Glue version : 3.0.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #6601: [SUPPORT] "default" folder not outputted by Hudi for non-partitioned tables when used with Spark

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6601:
URL: https://github.com/apache/hudi/issues/6601#issuecomment-1239804231

   Hey, I am a bit confused. You say you are interested in non-partitioned tables, but I see you are using CustomKeyGenerator. I would expect you to use NonPartitionedKeyGenerator in that case.
   https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator
   
   I tried our quick start example on both 0.9.0 and 0.11. Both show similar behavior: I don't see any default folder under the table base path.
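
   As a rough sketch of that suggestion, the write options for a non-partitioned table could look like the following. This is an illustration, not the reporter's actual configuration: it uses plain string config keys instead of the `DataSourceWriteOptions` constants so it compiles without Hudi on the classpath, and the object name `NonPartitionedWriteOptions` and its `build` parameters are made up for the example.

   ```scala
   // Sketch only: builds the option map with plain string keys.
   object NonPartitionedWriteOptions {
     // Fully qualified name of Hudi's key generator for non-partitioned tables.
     val KeyGenerator = "org.apache.hudi.keygen.NonpartitionedKeyGenerator"

     // tableName, recordKey and precombineField are placeholders supplied by the caller.
     def build(tableName: String, recordKey: String, precombineField: String): Map[String, String] =
       Map(
         "hoodie.table.name" -> tableName,
         "hoodie.datasource.write.operation" -> "upsert",
         "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
         "hoodie.datasource.write.recordkey.field" -> recordKey,
         "hoodie.datasource.write.precombine.field" -> precombineField,
         "hoodie.datasource.write.keygenerator.class" -> KeyGenerator,
         // No partition column is configured for a non-partitioned table.
         "hoodie.datasource.write.partitionpath.field" -> ""
       )
   }
   ```

   The resulting map can be passed to the writer the same way as in the issue, for example `.options(NonPartitionedWriteOptions.build("hudiTableName", "primaryKeyField", "ts"))`. With this key generator the data files are expected to land directly under the table base path, so it does not by itself produce a `default` folder.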
   
   
   



[GitHub] [hudi] joao-miranda commented on issue #6601: [SUPPORT] "default" folder not outputted by Hudi for non-partitioned tables when used with Spark

Posted by GitBox <gi...@apache.org>.

joao-miranda commented on issue #6601:
URL: https://github.com/apache/hudi/issues/6601#issuecomment-1240849925

   Thank you for replying.
   
   You are correct that we should be using a different key generator. However, that ended up not being necessary, since our solution was simply to add an empty column and partition the non-partitioned tables by that new column.
   
   With that, we can use AWS Glue Crawler with both non-partitioned and partitioned tables.
   
   The only suggestion I'd give to avoid this workaround would be to add an option so that, when there is no partition key, a fallback folder is still added.
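
   For readers who want to try the same workaround, here is a minimal sketch of the option-selection part. It is an assumption-laden illustration rather than our exact job code: the object `FallbackPartition` and the column name `fallback_partition` are hypothetical, and only plain string config keys are used so the snippet needs nothing beyond the Scala standard library.

   ```scala
   // Sketch only: chooses the partition path field, falling back to a placeholder column.
   object FallbackPartition {
     // Hypothetical name for the placeholder column; not taken from the thread.
     val PlaceholderColumn = "fallback_partition"

     // True when no usable partition key is configured, so the caller still has to
     // add the placeholder column to the DataFrame before writing.
     def needsPlaceholder(partitionKey: Option[String]): Boolean =
       partitionKey.forall(_.trim.isEmpty)

     // Returns the partition path option, using the placeholder column when
     // the table has no partition key of its own.
     def partitionOptions(partitionKey: Option[String]): Map[String, String] = {
       val field = partitionKey.map(_.trim).filter(_.nonEmpty).getOrElse(PlaceholderColumn)
       Map("hoodie.datasource.write.partitionpath.field" -> field)
     }
   }
   ```

   When `needsPlaceholder` returns true, the column itself would be added on the Spark side before the write, for instance with `mappedDF.withColumn(FallbackPartition.PlaceholderColumn, lit(""))` using `org.apache.spark.sql.functions.lit`. The name of the folder Hudi then creates for the empty value depends on the Hudi version and key generator in use.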



[GitHub] [hudi] joao-miranda closed issue #6601: [SUPPORT] "default" folder not outputted by Hudi for non-partitioned tables when used with Spark

Posted by GitBox <gi...@apache.org>.

joao-miranda closed issue #6601: [SUPPORT] "default" folder not outputted by Hudi for non-partitioned tables when used with Spark
URL: https://github.com/apache/hudi/issues/6601



[GitHub] [hudi] nsivabalan commented on issue #6601: [SUPPORT] "default" folder not outputted by Hudi for non-partitioned tables when used with Spark

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6601:
URL: https://github.com/apache/hudi/issues/6601#issuecomment-1239806342

   Using the custom key generator but setting an empty value for the partition path may be incorrect usage. Can you fix that and give it another try?
   The reason Hudi could write to a "default" folder is:
   Let's say you configure the partition path field as col5. For records where the col5 value is null, Hudi writes to the "default" folder. By the way, we are changing that in 0.12: our default fallback folder is `__HIVE_DEFAULT_PARTITION__`.
   More details here https://hudi.apache.org/releases/release-0.12.0#fallback-partition
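
   To make the fallback rule concrete, here is a simplified model of it. This is not Hudi's actual key generator code: `PartitionFolderModel` and `folderFor` are names invented for the illustration, and treating an empty string the same as null is an assumption of the sketch.

   ```scala
   // Simplified model of the fallback rule described above; not Hudi's implementation.
   object PartitionFolderModel {
     // Fallback folder names quoted in this comment.
     val LegacyFallback = "default"                  // before 0.12
     val HiveFallback = "__HIVE_DEFAULT_PARTITION__" // from 0.12 onwards

     // A missing (null) or empty partition value maps to the fallback folder;
     // any other value is used as the folder name.
     def folderFor(value: Option[String], fallback: String): String =
       value.map(_.trim).filter(_.nonEmpty).getOrElse(fallback)
   }
   ```

   For example, `PartitionFolderModel.folderFor(None, PartitionFolderModel.LegacyFallback)` yields `default`, which is the folder a record with a null col5 would go to before 0.12 under this model.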
   

