Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/12 13:54:56 UTC

[GitHub] [hudi] levisLi opened a new issue, #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

levisLi opened a new issue, #6931:
URL: https://github.com/apache/hudi/issues/6931

   When I use Spark SQL to create a Hudi table, I find it does not honor the Hudi property 'hoodie.datasource.write.operation = insert'.
   example:
   create table if not exists hudi.h3(
     id bigint, 
     name string, 
     price double
   ) using hudi
   options (
     primaryKey = 'id',
     type = 'mor',
     hoodie.cleaner.fileversions.retained = '1',
     hoodie.datasource.write.operation = 'insert'
   ); 
   When I define hoodie.datasource.write.operation = 'insert', the table actually behaves in upsert mode. How can I solve this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] levisLi commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
levisLi commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1277304441

   @JoshuaZhuCN I have tried setting `hoodie.sql.insert.mode=non-strict`, but it has no effect.
   e.g.:
   create table if not exists hudi.h3(
   id bigint,
   name string,
   price double
   ) using hudi
   options (
   primaryKey = 'id',
   type = 'mor',
   hoodie.cleaner.fileversions.retained = '1',
   hoodie.datasource.write.operation = 'insert',
   hoodie.sql.insert.mode = 'non-strict');




[GitHub] [hudi] YannByron commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
YannByron commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1288475156

   @nsivabalan I think we can close this.
   The Spark SQL side of this issue has been explained by @Zouxxyy and @boneanxs, and @Zouxxyy has provided a PR https://github.com/apache/hudi/pull/6949 describing how to work with `hoodie.datasource.write.operation` and `hoodie.merge.allow.duplicate.on.inserts`. If there is still an issue with Flink SQL, it would be better to create a new issue to follow up.




[GitHub] [hudi] JoshuaZhuCN commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
JoshuaZhuCN commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1277204119

   Try setting `hoodie.sql.insert.mode`='non-strict'?




[GitHub] [hudi] YannByron commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
YannByron commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1276939335

   @Zouxxyy do you have time to follow up this?




[GitHub] [hudi] levisLi commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
levisLi commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1279645408

   @Zouxxyy I set the `hoodie.datasource.write.operation='insert'` and `hoodie.merge.allow.duplicate.on.inserts=true` properties on the Hudi table. When I insert records into the table with spark-sql, the duplicate records show up, but when I insert records with flink-sql, no duplicate records appear.
   
   spark-sql>
   `create table if not exists hudi.hudi_merge_test(
        uuid string
       ,name string
       ,age int
       ,ts timestamp
       ,dt string
   ) using hudi
   tblproperties  (
       type = 'mor'
       ,primaryKey = 'uuid'
       ,hoodie.datasource.write.operation='insert'
       ,hoodie.cleaner.fileversions.retained = '1'
       ,hoodie.merge.allow.duplicate.on.inserts='true'
       ,hive_sync.skip_ro_suffix = 'true' -- remove the ro suffix
       ,write.parquet.max.file.size='120'  -- max file size in MB
       ,hoodie.datasource.write.hive_style_partitioning='true'
       ,hoodie.archive.merge.enable='true' -- automatic small-file merging
       ,hoodie.cleaner.commits.retained='1' -- number of commits to retain
   )
       partitioned by (dt)
       location 'hdfs://namespace-HA-3/hudi/hudi_merge_test';`
   
   flink-sql>
   `CREATE TABLE IF NOT EXISTS hudi_merge_test(
       uuid VARCHAR(20),
       name VARCHAR(10),
       age INT,
       ts TIMESTAMP(3),
       dt VARCHAR(20)
   )
   PARTITIONED BY (dt)
   WITH (
       'connector' ='hudi',
       'table.type' = 'MERGE_ON_READ',
       'write.operation'='insert',
       'hoodie.datasource.write.recordkey.field' = 'uuid',
       'write.precombine.field' = 'ts',
       'path' = 'hdfs://namespace-HA-3/hudi/hudi_merge_test',
       'write.tasks' = '4',
       'compaction.tasks' = '4',
       'hoodie.archive.merge.enable'='true', -- automatic small-file merging
       'hoodie.cleaner.commits.retained'='1', -- number of commits to retain
       'hoodie.datasource.write.hive_style_partitioning'='true',   -- use Hive-style partitioning
       'hoodie.embed.timeline.server'='false',
       'hoodie.parquet.small.file.limit'='0',
       'hoodie.merge.allow.duplicate.on.inserts'='true',
       'hive_sync.enable' = 'true',     -- Required. Enable Hive sync
       'hive_sync.mode' = 'hms',         -- Required. Set hive sync mode to hms (default is jdbc)
       'hive_sync.metastore.uris' = 'thrift://dev2:9083',  -- Required. Metastore port
       'hive_sync.jdbc_url' = 'jdbc:hive2://dev2:10000',
       'hive_sync.skip_ro_suffix' = 'true', -- remove the ro suffix
       'hive_sync.table'='hudi_compacte',                          -- Required. Name of the table to create in Hive
       'hive_sync.db'='hudi',                         -- Required. Name of the database to create in Hive
       'hive_sync.username' = 'hive',
       'hive_sync.password' = '123456'
   )`




[GitHub] [hudi] levisLi commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
levisLi commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1276911694

   Version: hudi-0.11+spark-3.2.1
   Setup: ${SPARK_HOME}/bin/spark-sql
   Create hudi table with spark-sql cmd
   spark-sql >
   create table if not exists hudi.h3(
         id bigint,
        name string,
        price double
      ) using hudi
        options (
           primaryKey = 'id',
           type = 'mor',
           hoodie.cleaner.fileversions.retained = '1',
           hoodie.datasource.write.operation = 'insert');
   spark-sql> insert into hudi.h3 values(1,'name1','0.5');
   spark-sql> insert into hudi.h3 values(1,'name2','0.6');
   spark-sql> select * from hudi.h3;
    
   When I query the records of my Hudi table, it shows only one record instead of two.
   The result I expect is:
      1|name1|0.5
      1|name2|0.6
   But the actual result is:
      1|name2|0.6




[GitHub] [hudi] Zouxxyy commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
Zouxxyy commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1277714356

   > @levisLi I think @Zouxxyy is right, how many files in your hudi table after you did 2 write operations? If there's only one file, it means small files are merged, maybe you can try `set hoodie.merge.allow.duplicate.on.inserts=true`
   
   That's what I missed; this configuration works.




[GitHub] [hudi] boneanxs commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
boneanxs commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1277356788

   @levisLi I think @Zouxxyy is right. How many files are in your Hudi table after you did the two write operations? If there's only one file, it means the small files were merged; maybe you can try
   `set hoodie.merge.allow.duplicate.on.inserts=true`
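   For illustration, here is a minimal sketch of that suggestion, reusing the `hudi.h3` table from the original repro (session-level settings in a spark-sql shell; exact behavior may vary by Hudi version):

```sql
-- Sketch: keep small-file handling on, but tell the merge step to
-- append incoming inserts into existing files instead of deduping them.
set hoodie.merge.allow.duplicate.on.inserts=true;

insert into hudi.h3 values (1, 'name1', 0.5);
insert into hudi.h3 values (1, 'name2', 0.6);  -- same key, expected to be kept as a second row
select * from hudi.h3;
```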




[GitHub] [hudi] Zouxxyy commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
Zouxxyy commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1276979842

   You are actually in insert mode, but since the records land in the same file group, no duplicate-key records show up.
   
   If you add the configuration `set hoodie.parquet.small.file.limit=0`, you will then find duplicate-key records, because the records are written to different file groups.
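   To make the file-group behavior concrete, a sketch against the `hudi.h3` table from the repro above (assuming a spark-sql session as in the original setup):

```sql
-- Sketch: disable small-file merging so each insert writes a new file group.
set hoodie.parquet.small.file.limit=0;

insert into hudi.h3 values (1, 'name1', 0.5);
insert into hudi.h3 values (1, 'name2', 0.6);
-- With the rows in different file groups, both records with id = 1
-- are expected to survive and appear in the query result.
select * from hudi.h3;
```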




[GitHub] [hudi] levisLi commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
levisLi commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1277024969

   If I add the configuration `set hoodie.parquet.small.file.limit=0`, does that mean I will end up writing many small files?




[GitHub] [hudi] nsivabalan closed issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert' 
URL: https://github.com/apache/hudi/issues/6931




[GitHub] [hudi] nsivabalan commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1287876597

   Hey folks, are we good to close out this issue, or is there any pending issue to be addressed?
   




[GitHub] [hudi] yihua commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1276829393

   @levisLi Could you provide more details on your setup?  What Hudi and Spark versions are you using? How do you start the spark-shell or spark-sql?




[GitHub] [hudi] Zouxxyy commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
Zouxxyy commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1278917445

   > @JoshuaZhuCN I have tried setting `hoodie.sql.insert.mode=non-strict`, but it has no effect. e.g.: create table if not exists hudi.h3( id bigint, name string, price double ) using hudi options ( primaryKey = 'id', type = 'mor', hoodie.cleaner.fileversions.retained = '1', hoodie.datasource.write.operation = 'insert', hoodie.sql.insert.mode = 'non-strict');
   
   Hi, have you used `hoodie.merge.allow.duplicate.on.inserts=true` in your testing? I added a test case to verify it: https://github.com/apache/hudi/pull/6949. Also, it is recommended to use `hoodie.sql.insert.mode` to configure the insert mode when you are using spark-sql.
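   Putting the recommended settings together, a sketch of a spark-sql session (table name reused from the repro above; exact semantics may vary by Hudi version):

```sql
-- Sketch: configure insert (not upsert) semantics for spark-sql writes.
set hoodie.sql.insert.mode=non-strict;               -- allow duplicate keys on insert
set hoodie.merge.allow.duplicate.on.inserts=true;    -- keep duplicates when merging into small files

insert into hudi.h3 values (1, 'name3', 0.7);
```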
   




[GitHub] [hudi] levisLi commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
levisLi commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1277026671

   @YannByron I did not set a time field; the default is `preCombineField = ts`.




[GitHub] [hudi] Zouxxyy commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
Zouxxyy commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1280152704

   > @Zouxxyy I set the `hoodie.datasource.write.operation='insert'` and `hoodie.merge.allow.duplicate.on.inserts=true` properties on the Hudi table. When I insert records into the table with spark-sql, the duplicate records show up, but when I insert records with flink-sql, no duplicate records appear.
   > 
   > [Spark SQL and Flink SQL DDL quoted from the previous comment]
   
   Sorry, I don't know much about Flink.




[GitHub] [hudi] nsivabalan commented on issue #6931: SparkSQL create hudi DDL do not support hoodie.datasource.write.operation = 'insert'

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6931:
URL: https://github.com/apache/hudi/issues/6931#issuecomment-1290600659

   cool, thanks!
   

