Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/29 12:28:24 UTC

[GitHub] [hudi] RajasekarSribalan opened a new issue #2214: [SUPPORT] Hudi Upsert but with duplicates record for same key

RajasekarSribalan opened a new issue #2214:
URL: https://github.com/apache/hudi/issues/2214


   We have a Hudi Spark pipeline that constantly upserts into a Hudi table. Incoming traffic is 5k records per second on the table. We use the COW table type, but after an upsert we see many duplicate rows for the same record key. We do set the precombine field, which is a date string field. Upsert should always update the record, but instead it creates duplicate entries. Please note that we might get duplicate records in the incoming messages, so the dataframe can contain duplicates.
    Also, we query from Spark SQL, and we set the properties/config according to the Hudi docs.
   
   ****Version details****
   
   Table type : COW
   Operation : Upsert
   Hudi - 0.5.2-incubating
   Spark - 2.2.0
   
   @vinothchandar @bvaradar @bhasudha Please assist! We thought of running repair deduplicate from the Hudi CLI, but it seems to support only partitioned tables, whereas our table is non-partitioned.
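   
   (For reference, a quick way to confirm such duplicates from Spark SQL; a
   minimal sketch with hypothetical database/table names, relying only on the
   Hudi metadata column _hoodie_record_key:)
   
       // Record keys that appear more than once indicate duplicates.
       spark.sql(
         """SELECT _hoodie_record_key, COUNT(*) AS cnt
           |FROM hudi_db.tablename
           |GROUP BY _hoodie_record_key
           |HAVING COUNT(*) > 1
           |ORDER BY cnt DESC""".stripMargin).show(false)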
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] RajasekarSribalan commented on issue #2214: [SUPPORT] Hudi Upsert but with duplicates record for same key

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #2214:
URL: https://github.com/apache/hudi/issues/2214#issuecomment-719155536


   Thanks Balaji for the quick response.
   
   Please find my answers below.
   
   > Do you have hoodie.combine.before.upsert set to true ?
   
   We don't set this flag, so it should be true by default.
   
   > You can also check if the duplicates have the same _hoodie_commit_time
   > value to see if this is the pattern ?
   
   Yes, they have the same _hoodie_commit_time, the same parquet file, and the
   same hoodie record key, with a different commit seqno for each duplicate
   entry.
   
   > It is also possible that you have more than one writer ingesting data to
   > the same dataset concurrently. This will not work as expected.
   
   We have one Hudi pipeline per table, and as far as I know Hudi doesn't
   support concurrent writes/upserts. We consume messages from Kafka, transform
   them, and then upsert into Hudi. So I still don't follow the scenario of
   ingesting into the same dataset concurrently. Can you provide more
   information on this scenario?
   
   Thanks,
   Raj
   
   On Fri, Oct 30, 2020, 3:04 AM Balaji Varadarajan <no...@github.com>
   wrote:
   
   > @RajasekarSribalan <https://github.com/RajasekarSribalan> : Do you have
   > hoodie.combine.before.upsert set to true ? By default, this is true, so
   > unless you have set to false, this should not be a problem ? You can also
   > check if the duplicates have the same _hoodie_commit_time value to see if
   > this is the pattern ?
   >
   > Another question, when you say duplicate record - Do they have same
   > _hoodie_record_key value ?
   >
   > It is also possible that you have more than one writer ingesting data to
   > the same dataset concurrently. This will not work as expected.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/hudi/issues/2214#issuecomment-719037420>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AFMO6I44KNN2RQMYAIBVCJLSNHNVTANCNFSM4TDVGEYA>
   > .
   >
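   
   (For illustration, a minimal sketch of what combine-before-upsert does to an
   incoming batch: keep one row per record key, choosing the latest precombine
   value. The column names id and hudi_ingestion_at, and the dataframe name
   upsertDf, are the ones used later in this thread.)
   
       import org.apache.spark.sql.expressions.Window
       import org.apache.spark.sql.functions.{col, row_number}
   
       // Keep only the newest row per key, mirroring what
       // hoodie.combine.before.upsert=true does inside Hudi before writing.
       val w = Window.partitionBy("id").orderBy(col("hudi_ingestion_at").desc)
       val deduped = upsertDf
         .withColumn("rn", row_number().over(w))
         .filter(col("rn") === 1)
         .drop("rn")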
   





[GitHub] [hudi] RajasekarSribalan closed issue #2214: [SUPPORT] Hudi Upsert but with duplicates record for same key

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan closed issue #2214:
URL: https://github.com/apache/hudi/issues/2214


   





[GitHub] [hudi] RajasekarSribalan commented on issue #2214: [SUPPORT] Hudi Upsert but with duplicates record for same key

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #2214:
URL: https://github.com/apache/hudi/issues/2214#issuecomment-720304578


   Thanks @bvaradar for the response. Please find the data from the table below. I am querying by id and getting many duplicates. I am querying the Hudi table from Spark.
   
   **Hudi Upsert config:**
   
   // Imports assumed for this snippet (class/package paths per Hudi 0.5.x;
   // adjust them for your version):
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
   import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig}
   import org.apache.hudi.hive.NonPartitionedExtractor
   import org.apache.hudi.keygen.NonpartitionedKeyGenerator
   import org.apache.spark.sql.SaveMode.Append
   
   upsertDf.write
     .format("hudi")
     .option(OPERATION_OPT_KEY, "upsert")
     .option(PRECOMBINE_FIELD_OPT_KEY, "hudi_ingestion_at")  // precombine: latest ingestion wins
     .option(RECORDKEY_FIELD_OPT_KEY, hudi_key)
     .option(PARTITIONPATH_FIELD_OPT_KEY, "")                // non-partitioned table
     .option(KEYGENERATOR_CLASS_OPT_KEY, classOf[NonpartitionedKeyGenerator].getName)
     .option(TABLE_NAME, tablename)
     .option(TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
     .option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, "2")
     .option(HoodieCompactionConfig.MIN_COMMITS_TO_KEEP_PROP, "3")
     .option(HoodieCompactionConfig.MAX_COMMITS_TO_KEEP_PROP, "4")
     .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
     .option(HIVE_URL_OPT_KEY, "jdbc:hive2://XXXXXXXX")
     .option(HIVE_DATABASE_OPT_KEY, hudi_db)
     .option(HIVE_TABLE_OPT_KEY, tablename)
     .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[NonPartitionedExtractor].getName)
     .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
     .option("hoodie.upsert.shuffle.parallelism", "100")
     .mode(Append)
     .save("/user/XXXXXXX/hudi/" + path + "/" + tablename)
   
   **Sample data with duplicates**
   
   (The original paste wrapped badly; reconstructed below with the recoverable
   columns. text_01 through text_10 were all null, and the remaining
   slt/mlt/int/decimal/date/boolean text columns did not survive the paste.)
   
   +-------------------+-----------------------+------------------+----------------------+---------------------------------------------------------------------------+--------------+-------------------+--------+----------+-------------+-------------------+-------------------+------------+
   |_hoodie_commit_time|_hoodie_commit_seqno   |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                          |rds_shard_name|hudi_ingestion_at  |id      |account_id|flexifield_id|created_at         |updated_at         |lock_version|
   +-------------------+-----------------------+------------------+----------------------+---------------------------------------------------------------------------+--------------+-------------------+--------+----------+-------------+-------------------+-------------------+------------+
   |20201030005747     |20201030005747_8_219096|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2 |2020-10-30 00:14:24|37599142|108121    |1160018262   |2020-09-22 05:58:05|2020-10-30 00:14:24|38          |
   |20201030005747     |20201030005747_8_219097|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2 |2020-10-30 00:14:24|37599142|108121    |1160018262   |2020-09-22 05:58:05|2020-10-30 00:14:24|38          |
   |20201030005747     |20201030005747_8_219098|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2 |2020-10-30 00:14:24|37599142|108121    |1160018262   |2020-09-22 05:58:05|2020-10-30 00:14:24|38          |
   +-------------------+-----------------------+------------------+----------------------+---------------------------------------------------------------------------+--------------+-------------------+--------+----------+-------------+-------------------+-------------------+------------+
   
   (Nine more rows followed, identical except for _hoodie_commit_seqno running
   through 20201030005747_8_219107: twelve copies of the same record key from
   the same commit and the same parquet file.)
   





[GitHub] [hudi] bvaradar commented on issue #2214: [SUPPORT] Hudi Upsert but with duplicates record for same key

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2214:
URL: https://github.com/apache/hudi/issues/2214#issuecomment-724246492


   Sorry for the delay @RajasekarSribalan. I looked through the 0.5.2 codebase to see how else this could happen. If the "insert" operation was used the first time these records were written to the dataset, and that batch contained duplicates, then this is possible. I do not see any cause other than a misconfiguration.
   
   Would you be able to reproduce this with a self-contained script so that I can dig deeper? I am not able to reproduce it with the local docker setup.
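   
   (A minimal sketch of such a script, under the hypothesis above: write a
   batch that carries the same record key twice using the "insert" operation,
   then inspect the table. Paths and names are illustrative, and the
   key-generator class path may differ across Hudi versions.)
   
       import org.apache.spark.sql.SaveMode
       import spark.implicits._  // spark-shell style; assumes an active SparkSession named spark
   
       val basePath = "/tmp/hudi_dup_repro"
   
       // A batch containing the same record key twice.
       val batch = Seq(
         (37599142, "v1", "2020-10-30 00:00:01"),
         (37599142, "v2", "2020-10-30 00:00:02")
       ).toDF("id", "payload", "hudi_ingestion_at")
   
       batch.write.format("hudi")
         .option("hoodie.datasource.write.operation", "insert")  // not "upsert"
         .option("hoodie.datasource.write.recordkey.field", "id")
         .option("hoodie.datasource.write.precombine.field", "hudi_ingestion_at")
         .option("hoodie.datasource.write.partitionpath.field", "")
         .option("hoodie.datasource.write.keygenerator.class",
                 "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
         .option("hoodie.table.name", "dup_repro")
         .mode(SaveMode.Overwrite)
         .save(basePath)
       // With combine-before-insert left at its default, both copies of the
       // key can land in the table; subsequent upserts then update each copy
       // rather than collapsing them into one.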
   
   
   
   





[GitHub] [hudi] RajasekarSribalan commented on issue #2214: [SUPPORT] Hudi Upsert but with duplicates record for same key

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #2214:
URL: https://github.com/apache/hudi/issues/2214#issuecomment-724447887


   Thanks @bvaradar, you are correct. Our initial snapshot into the Hudi table had duplicates, since id is not the only primary key. Now we understand the issue. Many thanks for your assistance.
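   
   (For readers landing here with the same symptom: when the chosen record key
   is not actually unique, a composite key is the usual fix. A sketch, assuming
   the account_id column from the sample data above and Hudi's
   ComplexKeyGenerator; verify the class name against your Hudi version, and
   note that changing the record key of an already-written table requires
   re-bootstrapping it.)
   
       import org.apache.hudi.DataSourceWriteOptions._
       import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
       import org.apache.hudi.keygen.ComplexKeyGenerator
       import org.apache.spark.sql.SaveMode.Append
   
       upsertDf.write.format("hudi")
         .option(OPERATION_OPT_KEY, "upsert")
         .option(PRECOMBINE_FIELD_OPT_KEY, "hudi_ingestion_at")
         // Comma-separated list of fields forming the composite key.
         .option(RECORDKEY_FIELD_OPT_KEY, "id,account_id")
         .option(KEYGENERATOR_CLASS_OPT_KEY, classOf[ComplexKeyGenerator].getName)
         .option(TABLE_NAME, tablename)
         .mode(Append)
         .save("/user/XXXXXXX/hudi/" + path + "/" + tablename)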





[GitHub] [hudi] bvaradar commented on issue #2214: [SUPPORT] Hudi Upsert but with duplicates record for same key

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2214:
URL: https://github.com/apache/hudi/issues/2214#issuecomment-719037420


   @RajasekarSribalan : Do you have hoodie.combine.before.upsert set to true? It defaults to true, so unless you have explicitly set it to false, this should not be the problem. You can also check whether the duplicates have the same `_hoodie_commit_time` value to see if that is the pattern.
   
   Another question: when you say duplicate records, do they have the same `_hoodie_record_key` value?
   
   It is also possible that you have more than one writer ingesting data into the same dataset concurrently. That will not work as expected.
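   
   (A sketch of that check from Spark, with hypothetical database/table names:
   for each duplicated key, collect the distinct commit times. A single shared
   _hoodie_commit_time points at one write producing all copies; several
   distinct times point at the same key being rewritten across commits.)
   
       import org.apache.spark.sql.functions.{col, collect_set, count, lit}
   
       spark.table("hudi_db.tablename")
         .groupBy("_hoodie_record_key")
         .agg(count(lit(1)).as("cnt"),
              collect_set("_hoodie_commit_time").as("commit_times"))
         .filter(col("cnt") > 1)
         .show(false)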
   
   




