Posted to commits@hudi.apache.org by "Yann Byron (Jira)" <ji...@apache.org> on 2022/01/11 14:33:00 UTC
[jira] [Created] (HUDI-3214) [UMBRELLA] optimize auto partition in spark
Yann Byron created HUDI-3214:
--------------------------------
Summary: [UMBRELLA] optimize auto partition in spark
Key: HUDI-3214
URL: https://issues.apache.org/jira/browse/HUDI-3214
Project: Apache Hudi
Issue Type: Improvement
Components: Spark Integration, Writer Core
Reporter: Yann Byron
Currently, if a partition value has a format like "pt1=xxxx/pt2=yyyy/pt3=zzzz", with segments separated by slashes, Hudi partitions on each segment automatically, and the table directory ends up with a multi-level partition structure.
I think this behavior is unpredictable, so I am creating this umbrella task to optimize auto partitioning and make the behavior more reasonable.
Also, in Hudi 0.8 the schema holds `pt1`, `pt2`, `pt3`, but in 0.9+ it does not.
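To make the effect concrete, here is a minimal sketch in plain Scala (no Hudi or Spark involved) of how a slash-separated partition value breaks down into one directory level per segment. The object name `PartitionPathSketch` and its helpers are made up for illustration and are not part of Hudi.

{code:scala}
// Illustration only: PartitionPathSketch is a hypothetical helper, not Hudi code.
object PartitionPathSketch {
  // Split "k1=v1/k2=v2" into ordered (column, value) pairs.
  // A segment without '=' is kept with an empty column name.
  def parse(partitionValue: String): Seq[(String, String)] =
    partitionValue.split("/").toSeq.filter(_.nonEmpty).map { segment =>
      segment.split("=", 2) match {
        case Array(column, value) => (column, value)
        case Array(value)         => ("", value)
      }
    }

  // Number of directory levels the value expands to.
  def depth(partitionValue: String): Int = parse(partitionValue).length
}

// One partition value, three directory levels and three candidate columns.
val levels = PartitionPathSketch.parse("pt1=xxxx/pt2=yyyy/pt3=zzzz")
println(levels.map(_._1).mkString(", "))
println(PartitionPathSketch.depth("pt1=xxxx/pt2=yyyy/pt3=zzzz"))
{code}

A single configured partition field thus turns into three nested directories, and the names `pt1`, `pt2`, `pt3` are the ones that may or may not show up in the schema.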
There are a few sub-tasks:
* add a flag that controls whether auto-partitioning is enabled, so that the default behavior is reasonable.
* implement a new key generator designed specifically for this scenario.
* fix the bug where the schema differs depending on whether *hoodie.file.index.enable* is enabled in this case.
Test code:
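As a rough idea of what the first sub-task could look like, the sketch below gates the behavior behind a boolean. Everything here is an assumption for discussion: the `autoPartition` parameter and the `FlagSketch` object are hypothetical, and URL-encoding the value is just one possible way to keep it as a single directory level, not necessarily what the final design should do.

{code:scala}
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Illustration only: FlagSketch and autoPartition are hypothetical, not Hudi API.
object FlagSketch {
  // With autoPartition = true the value is used as-is, so each '/' starts a new level.
  // With autoPartition = false the value is URL-encoded, so it stays one directory name.
  def relativePath(partitionValue: String, autoPartition: Boolean): String =
    if (autoPartition) partitionValue
    else URLEncoder.encode(partitionValue, StandardCharsets.UTF_8.name())
}

val value = "continent=asia/country=india/city=chennai"
println(FlagSketch.relativePath(value, autoPartition = true))
println(FlagSketch.relativePath(value, autoPartition = false))
{code}

With the flag off, the encoded result contains no slash, so the table would get one partition directory per distinct value instead of a nested tree.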
{code:scala}
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
// regexp_replace comes from here
import org.apache.spark.sql.functions._
val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
// rewrite "a/b/c" into "continent=a/country=b/city=c"
val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))
newDf.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
{code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)