Posted to commits@hudi.apache.org by "Yann Byron (Jira)" <ji...@apache.org> on 2021/11/08 07:17:00 UTC

[jira] [Comment Edited] (HUDI-2390) KeyGenerator discrepancy between DataFrame writer and SQL

    [ https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440212#comment-17440212 ] 

Yann Byron edited comment on HUDI-2390 at 11/8/21, 7:16 AM:
------------------------------------------------------------

[~hren] This issue is solved by #2538.

You don't need to provide `hoodie.datasource.write.keygenerator.class` or `hoodie.datasource.write.hive_style_partitioning` if you've already created the table via spark-sql or already written data to it.

For example:

 
{code:java}
base_data.write.format("hudi").
  option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).
  option(RECORDKEY_FIELD_OPT_KEY, "a").
  option(PARTITIONPATH_FIELD_OPT_KEY, "b").
  option(OPERATION_OPT_KEY, "bulk_insert").
  option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
  option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").
  option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").
  option(HIVE_DATABASE_OPT_KEY, "testdb").
  option(HIVE_TABLE_OPT_KEY, "test2").
  option(HIVE_USE_JDBC_OPT_KEY, "true").
  option("hoodie.bulkinsert.shuffle.parallelism", 4).
  option(TABLE_NAME, "test2").
  mode(Append).
  save("/user/hive/warehouse/testdb.db/test2")
{code}
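To make the underlying discrepancy concrete, here is a minimal standalone sketch (plain Python, no Hudi dependency) of why mismatched key generators make a later DELETE miss its target. The two helper functions and the key formats are assumptions based on my reading of Hudi's behavior: a simple-style generator emits the bare field value, while a complex-style generator emits `field:value` pairs, so the same row yields two different record keys.

```python
# Hypothetical illustration, not Hudi code: two key-generation strategies
# applied to the same record produce keys that do not match, so a delete
# issued under one strategy cannot find records written under the other.

def simple_key(record, key_field):
    # Bare-value key, e.g. "1" (simple-style generator).
    return str(record[key_field])

def complex_key(record, key_fields):
    # "field:value" key, e.g. "a:1" (complex-style generator).
    return ",".join(f"{f}:{record[f]}" for f in key_fields)

record = {"a": 1, "b": "2", "c": "3"}

written_key = simple_key(record, "a")    # key stored at write time
delete_key = complex_key(record, ["a"])  # key the delete path looks up

print(written_key, delete_key, written_key == delete_key)
```

Because `written_key` and `delete_key` differ, the delete is a no-op, which matches the symptom in the test case below where `a=1` survives the DELETE statement.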



> KeyGenerator discrepancy between DataFrame writer and SQL
> ---------------------------------------------------------
>
>                 Key: HUDI-2390
>                 URL: https://issues.apache.org/jira/browse/HUDI-2390
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Spark Integration
>    Affects Versions: 0.9.0
>            Reporter: renhao
>            Assignee: Yann Byron
>            Priority: Blocker
>              Labels: sev:critical, user-support-issues
>             Fix For: 0.10.0
>
>
> Test Case:
> {code:java}
>  import org.apache.hudi.QuickstartUtils._
>  import scala.collection.JavaConversions._
>  import org.apache.spark.sql.SaveMode._
>  import org.apache.hudi.DataSourceReadOptions._
>  import org.apache.hudi.DataSourceWriteOptions._
>  import org.apache.hudi.config.HoodieWriteConfig._{code}
> 1. Prepare the data
>  
> {code:java}
> spark.sql("create table test1(a int,b string,c string) using hudi partitioned by(b) options(primaryKey='a')")
> spark.sql("insert into table test1 select 1,2,3")
> {code}
>  
> 2. Create hudi table test2
> {code:java}
> spark.sql("create table test2(a int,b string,c string) using hudi partitioned by(b) options(primaryKey='a')"){code}
> 3. Write data into test2 via the DataFrame datasource
>  
> {code:java}
> val base_data=spark.sql("select * from testdb.test1")
> val base_data = spark.sql("select * from testdb.test1")
> base_data.write.format("hudi").
>   option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).
>   option(RECORDKEY_FIELD_OPT_KEY, "a").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "b").
>   option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").
>   option(OPERATION_OPT_KEY, "bulk_insert").
>   option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
>   option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").
>   option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").
>   option(HIVE_DATABASE_OPT_KEY, "testdb").
>   option(HIVE_TABLE_OPT_KEY, "test2").
>   option(HIVE_USE_JDBC_OPT_KEY, "true").
>   option("hoodie.bulkinsert.shuffle.parallelism", 4).
>   option("hoodie.datasource.write.hive_style_partitioning", "true").
>   option(TABLE_NAME, "test2").
>   mode(Append).
>   save("/user/hive/warehouse/testdb.db/test2")
> {code}
>  
> At this point the query returns:
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
> 4. Delete a record
> {code:java}
> spark.sql("delete from testdb.test2 where a=1"){code}
> 5. Run the query again; the record with a=1 is not deleted
> {code:java}
> spark.sql("select a,b,c from testdb.test2").show{code}
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)