Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/11/03 14:16:01 UTC

[GitHub] [hudi] RajasekarSribalan commented on issue #2214: [SUPPORT] Hudi Upsert but with duplicates record for same key

RajasekarSribalan commented on issue #2214:
URL: https://github.com/apache/hudi/issues/2214#issuecomment-720304578


   Thanks @bvaradar for the response. Please find below the upsert config and sample data from the table. I am querying the Hudi table from Spark, and filtering on a single id returns many duplicate rows for the same key.
   
   **Hudi Upsert config:**
   
   upsertDf.write
     .format("hudi")
     .option(OPERATION_OPT_KEY, "upsert")
     .option(PRECOMBINE_FIELD_OPT_KEY, "hudi_ingestion_at")
     .option(RECORDKEY_FIELD_OPT_KEY, hudi_key)
     .option(PARTITIONPATH_FIELD_OPT_KEY, "")
     .option(KEYGENERATOR_CLASS_OPT_KEY, classOf[NonpartitionedKeyGenerator].getName)
     .option(TABLE_NAME, tablename)
     .option(TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
     .option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, "2")
     .option(HoodieCompactionConfig.MIN_COMMITS_TO_KEEP_PROP, "3")
     .option(HoodieCompactionConfig.MAX_COMMITS_TO_KEEP_PROP, "4")
     .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
     .option(HIVE_URL_OPT_KEY, "jdbc:hive2://XXXXXXXX")
     .option(HIVE_DATABASE_OPT_KEY, hudi_db)
     .option(HIVE_TABLE_OPT_KEY, tablename)
     .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[NonPartitionedExtractor].getName)
     .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
     .option("hoodie.upsert.shuffle.parallelism", "100")
     .mode(Append)
     .save("/user/XXXXXXX/hudi/" + path + "/" + tablename)
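
For context, the config above names `hudi_ingestion_at` as the precombine field. As far as the upsert semantics go, that field is what decides which record wins when several incoming records share a record key. The following is only a conceptual sketch in plain Scala collections, not Hudi's implementation; the `Incoming` case class and the `PrecombineSketch` object are hypothetical names introduced for illustration.

```scala
// Hypothetical, simplified incoming record: a key, the precombine value, and a payload.
final case class Incoming(id: String, hudiIngestionAt: String, payload: String)

object PrecombineSketch {
  /** Keeps one record per id: the one with the greatest hudiIngestionAt.
    * Timestamps in "yyyy-MM-dd HH:mm:ss" form order correctly as plain strings.
    */
  def precombine(batch: Seq[Incoming]): Seq[Incoming] =
    batch
      .groupBy(_.id)                      // all records sharing a key
      .values
      .map(_.maxBy(_.hudiIngestionAt))    // latest record per key wins
      .toSeq
      .sortBy(_.id)                       // stable order for readability
}
```

Under this model a key should appear exactly once after an upsert, which is why twelve rows for one key in a single commit is unexpected.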
   
   **Sample data with duplicates**
   
   +-------------------+-----------------------+------------------+----------------------+---------------------------------------------------------------------------+--------------+-------------------+--------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-----------+-----------+-----------+---------------+------------+---------------+-------------------+-------------------+-----------+-----------+-----------+-----------+-----------+------------+------------+
   |_hoodie_commit_time|_hoodie_commit_seqno   |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                          |rds_shard_name|hudi_ingestion_at  |id      |account_id|flexifield_id|text_01|text_02|text_03|text_04|text_05|text_06|text_07|text_08|text_09|text_10|slt_text_11|slt_text_12|int_text_13|decimal_text_14|date_text_15|boolean_text_16|created_at         |updated_at         |mlt_text_17|mlt_text_18|mlt_text_19|mlt_text_20|mlt_text_21|lock_version|eslt_text_22|
   +-------------------+-----------------------+------------------+----------------------+---------------------------------------------------------------------------+--------------+-------------------+--------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-----------+-----------+-----------+---------------+------------+---------------+-------------------+-------------------+-----------+-----------+-----------+-----------+-----------+------------+------------+
   |20201030005747     |20201030005747_8_219096|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219097|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219098|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219099|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219100|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219101|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219102|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219103|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219104|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219105|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219106|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   |20201030005747     |20201030005747_8_219107|37599142          |                      |613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet|XXXXXX_shard2|2020-10-30 00:14:24|37599142|108121    |1160018262   |null   |null   |null   |null   |null   |null   |null   |null   |null   |null   |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
          |--- {}
   
       |--- {}
   
          |2020-09-22 05:58:05|2020-10-30 00:14:24|--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |--- {}
   
      |38          |--- {}
   
       |
   +-------------------+-----------------------+------------------+----------------------+---------------------------------------------------------------------------+--------------+-------------------+--------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-----------+-----------+-----------+---------------+------------+---------------+-------------------+-------------------+-----------+-----------+-----------+-----------+-----------+------------+------------+
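
The sample rows share one `_hoodie_record_key`, one commit time, and one file name, and differ only in `_hoodie_commit_seqno`. A minimal sketch of the duplicate check this illustrates is given here in plain Scala collections rather than Spark. `HoodieRow` and `DuplicateKeyCheck` are hypothetical names, and only four of the metadata columns are modeled.

```scala
// Hypothetical, trimmed-down view of one row from the sample output.
final case class HoodieRow(
    commitTime: String,
    commitSeqNo: String,
    recordKey: String,
    fileName: String)

object DuplicateKeyCheck {
  /** Returns every record key that occurs more than once, with its row count. */
  def duplicateKeys(rows: Seq[HoodieRow]): Map[String, Int] =
    rows
      .groupBy(_.recordKey)
      .collect { case (key, group) if group.size > 1 => key -> group.size }

  // The sample rows, rebuilt from the table: seqno runs from 219096 to 219107.
  val sampleRows: Seq[HoodieRow] = (219096 to 219107).map { n =>
    HoodieRow(
      "20201030005747",
      s"20201030005747_8_$n",
      "37599142",
      "613096d4-72b2-4c0e-b5af-364c3e2305dd-0_8-1145-226826_20201030005747.parquet")
  }

  def main(args: Array[String]): Unit =
    duplicateKeys(sampleRows).foreach { case (key, n) =>
      println(s"$key appears $n times") // prints "37599142 appears 12 times"
    }
}
```

Against the real table, the same grouping can be expressed in Spark with `groupBy("_hoodie_record_key").count()` followed by a filter on `count > 1`.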
   

