Posted to commits@hudi.apache.org by "Jonathan Vexler (Jira)" <ji...@apache.org> on 2023/03/16 14:03:00 UTC

[jira] [Closed] (HUDI-5871) Bootstrap does not work with partitions with /

     [ https://issues.apache.org/jira/browse/HUDI-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Vexler closed HUDI-5871.
---------------------------------
    Resolution: Not A Problem

When you create a partitioned table in Spark, it removes the partition column from the data files. I thought that the directory names were encoded with %2F instead of /, but that the column values still contained /. Since that is not the case, Hudi is correctly handling the data it is given. In my second example, where I changed the directory structure, I removed the Hive-style partitioning; that is why that example is failing.
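For illustration only (the path and the small example DataFrame below are mine, not from the report), this is the Spark behavior described above, assuming a spark-shell session: partitionBy writes Hive-style directories with the / in the partition value percent-encoded, and the partition column itself is dropped from the data files.

{code:scala}
// Minimal sketch, assuming a spark-shell session; the output path is hypothetical.
import spark.implicits._

val df = Seq(("k1", "2022/1/25", 10L)).toDF("key", "partition", "ts")
df.write.partitionBy("partition").parquet("/tmp/partitioned_parquet_example")

// On disk the partition is a hive-style directory: partition=2022%2F1%2F25
// Reading one partition directory directly shows that the "partition"
// column is not stored in the data files themselves:
spark.read
  .parquet("/tmp/partitioned_parquet_example/partition=2022%2F1%2F25")
  .printSchema()  // only "key" and "ts" remain
{code}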

> Bootstrap does not work with partitions with /
> ----------------------------------------------
>
>                 Key: HUDI-5871
>                 URL: https://issues.apache.org/jira/browse/HUDI-5871
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: bootstrap, spark
>            Reporter: Jonathan Vexler
>            Priority: Major
>         Attachments: scala_output_bootstrap1.txt
>
>
> I have Parquet data that I load into a DataFrame and then save as a partitioned table by doing 
>  
> {code:java}
> df.write.partitionBy("partition").parquet(tablePath) {code}
> In the table, each partition is a directory labeled like partition=2022%2F1%2F25
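> (Editorial aside, not part of the original report: the %2F is just percent-encoding of the / in the partition value; the quick check below illustrates the encoding, even though Spark uses its own path-escaping code rather than URLEncoder.)
> {code:scala}
> // Hypothetical check: "/" in the partition value maps to %2F in the
> // hive-style directory name.
> java.net.URLEncoder.encode("2022/1/25", "UTF-8")  // "2022%2F1%2F25"
> {code}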
>  
> I then do a bootstrap by doing
>  
> {code:scala}
> import org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
> import org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
> import org.apache.hudi.{DataSourceWriteOptions, HoodieDataSourceHelpers}
> import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieWriteConfig}
> import org.apache.hudi.keygen.SimpleKeyGenerator
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.types._
>
> val srcPath = "/Users/jon/Documents/bootstrap_testing/partitioned-parquet-table-fixed"
> val basePath = "/Users/jon/Documents/bootstrap_testing/tables/test8"
>
> val bootstrapDF = spark.emptyDataFrame
> bootstrapDF.write
>   .format("hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
>   .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, classOf[SimpleKeyGenerator].getName)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR, classOf[BootstrapRegexModeSelector].getName)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX, "2022/1/2[4-8]")
>   .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX_MODE, "METADATA_ONLY")
>   .option(HoodieBootstrapConfig.FULL_BOOTSTRAP_INPUT_PROVIDER, classOf[SparkParquetBootstrapDataProvider].getName)
>   .mode(SaveMode.Overwrite)
>   .save(basePath)
> {code}
> That does not create any METADATA_ONLY partitions, because the regex selects on the directory name, not the partition path; this should be clarified in the configs. I then change the regex to
> {code:java}
> partition=2022%2F1%2F2[4-8] {code}
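> (Editorial aside, not part of the original report: a quick way to see why only the second pattern selects anything, assuming the selector matches the regex against the relative partition directory name.)
> {code:scala}
> // Hypothetical check of the two regexes against an on-disk partition directory name.
> "partition=2022%2F1%2F25".matches("2022/1/2[4-8]")                // false
> "partition=2022%2F1%2F25".matches("partition=2022%2F1%2F2[4-8]")  // true
> {code}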
> This works properly, but there is an issue.
> Inside the Hudi table, the directories are
> {code:java}
> 2022			partition=2022%2F1%2F24	partition=2022%2F1%2F25	partition=2022%2F1%2F26	partition=2022%2F1%2F27	partition=2022%2F1%2F28 {code}
> The 2022 directory contains the FULL_BOOTSTRAP partitions, but the METADATA_ONLY partitions are in the other directories.
> Maybe that is OK, so I try to read from the Hudi table. This file contains the output from my attempt: [^scala_output_bootstrap1.txt]
> I go back to my Parquet table, make a copy, and move the partitions into the Hudi-style structure, where
> 2022->1->24
> 2022->1->25
> ...
> 2022->1->31
> 2022->2->1
> ....
> is the directory structure. I change the regex back to how it was originally and run the bootstrap again. This time, the Hudi directory contains 2022, which has the METADATA_ONLY partitions, but there is another directory, __HIVE_DEFAULT_PARTITION__, that contains the FULL_BOOTSTRAP files.
> When I attempt to read from the hudi table I get 
> {code:java}
> scala> spark.read.format("hudi").load(basePath).createOrReplaceTempView("test_table")
> scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/29").count
> 23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
> 23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
> res16: Long = 0
> scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/24").count
> 23/03/02 15:11:51 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
> res17: Long = 0 {code}
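> (Editorial aside, not part of the original report: independent of the bootstrap-index warnings, the unquoted value in the predicate is parsed as arithmetic rather than as a string literal, so a filter of this form would normally need quotes, e.g. the sketch below.)
> {code:scala}
> // Hypothetical form of the same filter with the partition path quoted as a string literal.
> spark.sql("select * from test_table where _hoodie_partition_path = '2022/1/29'").count
> {code}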



--
This message was sent by Atlassian Jira
(v8.20.10#820010)