Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/11/17 16:32:43 UTC

[GitHub] [incubator-doris] sunzhangbin opened a new issue #4917: Routine load loses data

sunzhangbin opened a new issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917


   The routine load job is as follows. The problem: a small amount of data is occasionally lost, but resubmitting the same routine load recovers the missing rows.
   CREATE ROUTINE LOAD xes1v1_db.ods_xes_platform_order_order_detail_bushu_4 ON ods_xes_platform_order_order_detail
           COLUMNS(
           `order_id`,
           `product_id`,
           `promotion_id`,
           `id`,
           `app_id`,
           `user_id`,
           `product_type`,
           `product_name`,
           `product_num`,
           `product_price`,
           `coupon_price`,
           `promotion_price`,
           `promotion_type`,
           `parent_product_id`,
           `parent_product_type`,
           `source_id`,
           `extras`,
           `created_time`,
           `updated_time`,
           `version`,
           `prepaid_card_price`,
           `table`
           ),
           WHERE
           `table` regexp 'order_detail_[0-9]' and app_id=8 and created_time>='2020-11-17 00:00:00'
           PROPERTIES
           (
             "format" = "json",
             "jsonpaths" = 
             "[
             \"$.data.order_id\",
             \"$.data.product_id\",
             \"$.data.promotion_id\",
             \"$.data.id\",
             \"$.data.app_id\",
             \"$.data.user_id\",
             \"$.data.product_type\",
             \"$.data.product_name\",
             \"$.data.product_num\",
             \"$.data.product_price\",
             \"$.data.coupon_price\",
             \"$.data.promotion_price\",
             \"$.data.promotion_type\",
             \"$.data.parent_product_id\",
             \"$.data.parent_product_type\",
             \"$.data.source_id\",
             \"$.data.extras\",
             \"$.data.created_time\",
             \"$.data.updated_time\",
             \"$.data.version\",
             \"$.data.prepaid_card_price\",
             \"$.table\"]"
           )
           FROM KAFKA
           (
           "kafka_broker_list" = "10.20.34.60:9092,10.20.34.62:9092,10.20.34.64:9092",
           "kafka_topic" = "xes_plarform_order_4",
           "property.group.id" = "ods_xes_platform_order_order_detail_bushu",
           "property.client.id" = "ods_xes_platform_order_order_detail_bushu",
           "property.kafka_default_offsets" = "OFFSET_BEGINNING"
                   );
   
    CREATE TABLE `ods_xes_platform_order_order_detail` (
      `order_id` varchar(64) NULL DEFAULT "0" COMMENT "Order ID",
      `product_id` int(11) NULL DEFAULT "0" COMMENT "Product ID",
      `promotion_id` varchar(64) NULL DEFAULT "0" COMMENT "Buy-gift / renewal gift-pack rule ID",
      `id` varchar(64) NULL COMMENT "id",
      `app_id` varchar(64) NULL DEFAULT "0" COMMENT "Business line ID",
      `user_id` int(11) NULL DEFAULT "0" COMMENT "User ID",
      `product_type` int(11) NULL DEFAULT "0" COMMENT "Product category",
      `product_name` varchar(255) NULL DEFAULT "" COMMENT "Product name",
      `product_num` int(11) NULL DEFAULT "0" COMMENT "Product quantity",
      `product_price` int(11) NULL DEFAULT "0" COMMENT "Product sale amount",
      `coupon_price` int(11) NULL DEFAULT "0" COMMENT "Coupon apportioned amount",
      `promotion_price` int(11) NULL DEFAULT "0" COMMENT "Promotion apportioned amount",
      `promotion_type` int(11) NULL DEFAULT "0" COMMENT "Promotion type",
      `parent_product_id` int(11) NULL DEFAULT "0" COMMENT "Parent product ID",
      `parent_product_type` int(11) NULL DEFAULT "0" COMMENT "Parent product category; business lines may define their own",
      `source_id` varchar(30) NULL DEFAULT "" COMMENT "Hotspot data",
      `extras` varchar(3072) NULL DEFAULT "" COMMENT "Auxiliary info stored with order items, e.g. immutable promotion details; not queried or used for retrieval",
      `created_time` varchar(64) NULL DEFAULT "0000-00-00 00:00:00" COMMENT "Creation time",
      `updated_time` varchar(64) NULL DEFAULT "1970-00-00 00:00:00" COMMENT "Update time",
      `version` varchar(64) NULL DEFAULT "" COMMENT "Version control",
      `prepaid_card_price` int(11) NULL DEFAULT "0" COMMENT "Gift card amount",
      `table` varchar(64) NULL DEFAULT "" COMMENT "Source table"
    ) ENGINE=OLAP
    UNIQUE KEY(`order_id`, `product_id`, `promotion_id`)
    COMMENT "Online-school order items table"
    DISTRIBUTED BY HASH(`order_id`) BUCKETS 10
    PROPERTIES (
    "replication_num" = "3",
    "in_memory" = "false",
    "storage_format" = "V2"
    );
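
   For diagnosing the loss, the job's per-partition consumed offsets and any filtered rows can be checked with the standard SHOW statements. A sketch (the exact column set varies by Doris version):

    -- Job-level view: the Progress column shows consumed offsets per partition,
    -- Statistic shows loaded/error row counts, ErrorLogUrls points at rejected rows.
    SHOW ROUTINE LOAD FOR xes1v1_db.ods_xes_platform_order_order_detail_bushu_4;

    -- Task-level view: one row per running consume task of this job.
    SHOW ROUTINE LOAD TASK WHERE JobName = "ods_xes_platform_order_order_detail_bushu_4";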
   



[GitHub] [incubator-doris] stalary commented on issue #4917: Routine load loses data

Posted by GitBox <gi...@apache.org>.
stalary commented on issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917#issuecomment-729732561


   +1, I've hit this problem too. I have two clusters, one offline and one online, and the data keeps coming out inconsistent; rerunning the routine load fixes it.



[GitHub] [incubator-doris] sunzhangbin commented on issue #4917: Routine load loses data

Posted by GitBox <gi...@apache.org>.
sunzhangbin commented on issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917#issuecomment-729738941


   > +1, I've hit this problem too. I have two clusters, one offline and one online, and the data keeps coming out inconsistent; rerunning the routine load fixes it.
   
   Running two routine loads syncing the data in parallel currently keeps us from losing rows, but that doesn't address the root cause.



[GitHub] [incubator-doris] sunzhangbin commented on issue #4917: Routine load loses data

Posted by GitBox <gi...@apache.org>.
sunzhangbin commented on issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917#issuecomment-737185426


   > > If you use the unique model, the data may have been replaced because ordering is not guaranteed.
   > > When routine load runs concurrently, the execution of different tasks is unordered, so the data at an earlier offset in your Kafka topic is not necessarily imported first.
   > > And since it is a unique model, data imported successfully later overwrites the earlier data, which creates the illusion of data loss.
   > 
   > If there is only one routine load, can partition ordering be guaranteed?
   
   I'm using the unique model, but I've confirmed it is not a replace: the entire lost rows are absent from the table.



[GitHub] [incubator-doris] stalary commented on issue #4917: Routine load loses data

Posted by GitBox <gi...@apache.org>.
stalary commented on issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917#issuecomment-737182844


   > If you use the unique model, the data may have been replaced because ordering is not guaranteed.
   > When routine load runs concurrently, the execution of different tasks is unordered, so the data at an earlier offset in your Kafka topic is not necessarily imported first.
   > And since it is a unique model, data imported successfully later overwrites the earlier data, which creates the illusion of data loss.
   
   If there is only one routine load, can partition ordering be guaranteed?



[GitHub] [incubator-doris] stalary commented on issue #4917: Routine load loses data

Posted by GitBox <gi...@apache.org>.
stalary commented on issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917#issuecomment-729794316


   Could this be related to the enable.auto.commit setting in rdKafka? I see it defaults to true.
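
   To rule that out, librdkafka settings can be passed per job through the "property." prefix in the FROM KAFKA clause. A hypothetical sketch (the job name is made up, and whether the FE forwards enable.auto.commit rather than reserving it for its own offset management is an assumption to verify):

    CREATE ROUTINE LOAD xes1v1_db.test_auto_commit ON ods_xes_platform_order_order_detail
    PROPERTIES ("format" = "json")
    FROM KAFKA
    (
      "kafka_broker_list" = "10.20.34.60:9092",
      "kafka_topic" = "xes_plarform_order_4",
      -- assumption: forwarded verbatim to librdkafka, not overridden by the FE
      "property.enable.auto.commit" = "false"
    );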



[GitHub] [incubator-doris] stalary commented on issue #4917: Routine load loses data

Posted by GitBox <gi...@apache.org>.
stalary commented on issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917#issuecomment-729744705


   @sunzhangbin Running two could introduce ordering problems, couldn't it?



[GitHub] [incubator-doris] sunzhangbin commented on issue #4917: Routine load loses data

Posted by GitBox <gi...@apache.org>.
sunzhangbin commented on issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917#issuecomment-737178214


   With 4 routine load jobs syncing into the same table concurrently, data was occasionally lost; after switching to a single routine load job per table, no further loss has been observed. I can't pin down the cause, and the logs show nothing abnormal...



[GitHub] [incubator-doris] EmmyMiao87 commented on issue #4917: Routine load loses data

Posted by GitBox <gi...@apache.org>.
EmmyMiao87 commented on issue #4917:
URL: https://github.com/apache/incubator-doris/issues/4917#issuecomment-737180128


   If you use the unique model, the data may have been replaced because ordering is not guaranteed.
   When routine load runs concurrently, the execution of different tasks is unordered, so the data at an earlier offset in your Kafka topic is not necessarily imported first.
   And since it is a unique model, data imported successfully later overwrites the earlier data, which creates the illusion of data loss.
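
   A minimal illustration of that replace behavior on a UNIQUE KEY table (hypothetical table t; within one key, whichever load transaction commits last wins, regardless of Kafka offsets):

    CREATE TABLE t (k INT, v INT)
    UNIQUE KEY(k)
    DISTRIBUTED BY HASH(k) BUCKETS 1
    PROPERTIES ("replication_num" = "1");

    INSERT INTO t VALUES (1, 2);   -- newer event, but its batch happens to commit first
    INSERT INTO t VALUES (1, 1);   -- older event commits later and overwrites v = 2
    SELECT v FROM t WHERE k = 1;   -- returns 1: the last commit, not the newest event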

