Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2021/10/08 00:36:00 UTC
[jira] [Updated] (HUDI-2390) KeyGenerator discrepancy between
DataFrame writer and SQL
[ https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-2390:
---------------------------------
Labels: sev:critical user-support-issues (was: sev:critical)
> KeyGenerator discrepancy between DataFrame writer and SQL
> ---------------------------------------------------------
>
> Key: HUDI-2390
> URL: https://issues.apache.org/jira/browse/HUDI-2390
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Spark Integration
> Affects Versions: 0.9.0
> Reporter: renhao
> Assignee: Yann Byron
> Priority: Critical
> Labels: sev:critical, user-support-issues
>
> Test Case:
> {code:java}
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._{code}
> 1. Prepare the data
>
> {code:java}
> spark.sql("create table test1(a int,b string,c string) using hudi partitioned by(b) options(primaryKey='a')")
> spark.sql("insert into table test1 select 1,2,3")
> {code}
>
> 2. Create the Hudi table test2
> {code:java}
> spark.sql("create table test2(a int,b string,c string) using hudi partitioned by(b) options(primaryKey='a')"){code}
> 3. Write data to test2 through the DataSource (DataFrame) writer
>
> {code:java}
> val base_data=spark.sql("select * from testdb.test1")
> base_data.write.format("hudi").
> option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).
> option(RECORDKEY_FIELD_OPT_KEY, "a").
> option(PARTITIONPATH_FIELD_OPT_KEY, "b").
> option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").
> option(OPERATION_OPT_KEY, "bulk_insert").
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,"org.apache.hudi.hive.MultiPartKeysValueExtractor").
> option(HIVE_DATABASE_OPT_KEY, "testdb").
> option(HIVE_TABLE_OPT_KEY, "test2").
> option(HIVE_USE_JDBC_OPT_KEY, "true").
> option("hoodie.bulkinsert.shuffle.parallelism", 4).
> option("hoodie.datasource.write.hive_style_partitioning", "true").
> option(TABLE_NAME, "test2").mode(Append).save(s"/user/hive/warehouse/testdb.db/test2")
> {code}
>
> Querying the table at this point returns:
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
> 4. Delete a record
> {code:java}
> spark.sql("delete from testdb.test2 where a=1"){code}
> 5. Run the query again; the record with a=1 has not been deleted
> {code:java}
> spark.sql("select a,b,c from testdb.test2").show{code}
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
>
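One plausible reading of the title, offered as an assumption rather than a confirmed diagnosis: the DataFrame write in step 3 stores record keys in the simple format (the bare field value), while the SQL delete in step 4 may look keys up in a complex-style format (`field:value`). The sketch below only mimics those two string formats in plain Scala. `KeyFormatSketch`, `simpleStyleKey` and `complexStyleKey` are hypothetical names for illustration and are not Hudi APIs.

```scala
// Hypothetical illustration only: these helpers mimic record-key string
// formats and do not call into Hudi.
object KeyFormatSketch {

  // Simple style: the record key is the bare value of the key field.
  def simpleStyleKey(field: String, value: Any): String = value.toString

  // Complex style: each key field is rendered as "name:value",
  // and multiple fields are joined with commas.
  def complexStyleKey(fields: Seq[(String, Any)]): String =
    fields.map { case (name, value) => s"$name:$value" }.mkString(",")

  def main(args: Array[String]): Unit = {
    // Key as it would be stored by the step 3 write (assumed simple style).
    val writtenKey = simpleStyleKey("a", 1)
    // Key a complex-style generator would produce for the same row (assumption).
    val deleteKey = complexStyleKey(Seq("a" -> 1))

    println(s"written=$writtenKey, delete lookup=$deleteKey, match=${writtenKey == deleteKey}")
  }
}
```

If the two formats really do differ in this way, the delete would find no matching key and the row would survive, as seen in step 5. One way to check on a real table is to select the `_hoodie_record_key` metadata column alongside `a` from testdb.test2 and see which format was actually stored.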
--
This message was sent by Atlassian Jira
(v8.3.4#803005)