
Posted to commits@hudi.apache.org by "Yann Byron (Jira)" <ji...@apache.org> on 2022/01/11 14:33:00 UTC

[jira] [Created] (HUDI-3214) [UMBRELLA] optimize auto partition in spark

Yann Byron created HUDI-3214:
--------------------------------

             Summary: [UMBRELLA] optimize auto partition in spark
                 Key: HUDI-3214
                 URL: https://issues.apache.org/jira/browse/HUDI-3214
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Spark Integration, Writer Core
            Reporter: Yann Byron


Currently, if a partition's value is a slash-separated string in a format like "pt1=xxxx/pt2=yyyy/pt3=zzzz", Hudi partitions on it automatically. The table's directory ends up with a multi-level partition structure.

I think this behavior is unpredictable, so I am creating this umbrella task to optimize auto partitioning and make the behavior more reasonable.
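To make the shape of such a value concrete, here is a minimal sketch in plain Scala. It only illustrates how a slash-separated value maps onto nested directory levels; it does not call any Hudi API and is not how Hudi implements this internally. The object and method names (PartitionPathSketch, parseLevels) are hypothetical.

```scala
// Illustration only: plain Scala, no Hudi APIs involved.
// PartitionPathSketch and parseLevels are hypothetical names.
object PartitionPathSketch {
  // Split a slash-separated partition value into (field, value) pairs,
  // one pair per directory level. A segment without '=' is kept with an
  // empty field name.
  def parseLevels(partitionValue: String): Seq[(String, String)] =
    partitionValue.split("/").toSeq.filter(_.nonEmpty).map { segment =>
      segment.split("=", 2) match {
        case Array(field, value) => (field, value)
        case Array(value)        => ("", value)
      }
    }

  def main(args: Array[String]): Unit = {
    val levels = parseLevels("pt1=xxxx/pt2=yyyy/pt3=zzzz")
    // Print one line per level, indented by depth, to show the nesting.
    levels.zipWithIndex.foreach { case ((field, value), depth) =>
      println(("  " * depth) + s"$field=$value/")
    }
  }
}
```

A single string column thus produces three directory levels, which is the effect described above.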

Also, in Hudi 0.8 the schema contains `pt1`, `pt2`, `pt3`, but in 0.9+ it does not.

There are a few sub-tasks:
 * add a flag to control whether auto-partition is enabled, to make the default behavior reasonable.
 * implement a new key generator designed specifically for this scenario.
 * fix the bug where, in this case, the schema differs depending on whether *hoodie.file.index.enable* is enabled.

 

Test code:
{code:java}
// Run in spark-shell with the Hudi Spark bundle on the classpath.
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions.regexp_replace
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))

val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
// Rewrite a three-level partitionpath "a/b/c" as
// "continent=a/country=b/city=c" to trigger auto partitioning.
val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))

// Write the rewritten data as a Hudi table, partitioned by the single
// "partitionpath" column.
newDf.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)