Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/06/22 07:25:09 UTC

[GitHub] [hudi] Akshay2Agarwal opened a new issue #3131: [SUPPORT] Creating non-partitioned table in hudi generates duplicates

Akshay2Agarwal opened a new issue #3131:
URL: https://github.com/apache/hudi/issues/3131


   **Describe the problem you faced**
   
   I am trying out a non-partitioned table in Hudi and am seeing duplicate records. I suspect the cause is that the initial write lands in the base path of the table rather than in the `default` partition. I may be missing some configs.
   
   The configs I am setting are as follows:
   ```
         RECORDKEY_FIELD_OPT_KEY -> "id",
         PRECOMBINE_FIELD_OPT_KEY -> "_hoodie_partition_key",
         PARTITIONPATH_FIELD_OPT_KEY -> "",
         HIVE_STYLE_PARTITIONING_OPT_KEY -> "false",
         HUDI_PARQUET_COMPRESSION_CODEC_KEY -> "snappy",
         TABLE_NAME -> "location_db",
         TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
         KEYGENERATOR_CLASS_OPT_KEY -> classOf[org.apache.hudi.keygen.NonpartitionedKeyGenerator].getName,
         HIVE_SYNC_ENABLED_OPT_KEY -> "true",
         HIVE_URL_OPT_KEY -> hiveSyncAccessCredentials.jdbcUrl,
         HIVE_USER_OPT_KEY -> hiveSyncAccessCredentials.user,
         HIVE_PASS_OPT_KEY -> hiveSyncAccessCredentials.password,
         HIVE_DATABASE_OPT_KEY -> flowConfig.getString("hive.database"),
         HIVE_TABLE_OPT_KEY -> flowConfig.getString("hive.table"),
         HIVE_AUTO_CREATE_DATABASE_OPT_KEY -> "true",
         HIVE_PARTITION_FIELDS_OPT_KEY -> "",
         HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[org.apache.hudi.hive.NonPartitionedExtractor].getName
   ```
   **Expected behavior**
   
   On the first commit, the data is written to the base folder, not to `default`. On the next run, an upsert, the data is written to the `default` partition path. This results in duplicate records, as shown below:
   ```
   scala> spark.sql("select count(id) as c, id  from location_db group by id having c> 1").show
   21/06/21 16:55:49 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
   +---+----+
   |  c|  id|
   +---+----+
   |  2| 912|
   |  2|1432|
   +---+----+
   ```
   ```
   scala> spark.sql("select * from location_db where id = 912").show
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|created_by|    created_date|last_modified_by|last_modified_date|dispatch_enabled|external_loc_id|gst_in|hasapp|is_active|            loc_name|loc_type|ownership_type|address_id|station_name|_hoodie_incremental_key|lake_active_record|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+
   |     20210621173143|20210621173143_0_180|               912|                      |ea9e53ed-e57b-4bd...|912|         1|1588568790066000|               1|  1588568790066000|            true|  lXXX-XXX_XXXX|    NA|  null|     true|ZZZ ZZZZZ _V ZZZ ...|      DP|          null|       ZZZ|        ABC1|    1592477090763000000|              true|
   |     20210621173400| 20210621173400_0_13|               912|               default|cf998a21-cc7b-496...|912|         1|1588568790066000|               1|  1623921111853000|            true|  lXXX-XXX_XXXX|    NA|  null|    false|ZZZ ZZZZZ _V ZZZ ...|      DP|          null|       ZZZ|        ABC1|    1623921111856000001|              true|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+
   ```
   
   **Environment Description**
   
   * Hudi version : 0.8.0
   
   * Spark version : 2.4.7
   
   * Hive version : 2.3.8
   
   * Hadoop version : 2.10.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] Akshay2Agarwal closed issue #3131: [SUPPORT] Creating non-partitioned table in hudi generates duplicates

Posted by GitBox <gi...@apache.org>.
Akshay2Agarwal closed issue #3131:
URL: https://github.com/apache/hudi/issues/3131


   





[GitHub] [hudi] Akshay2Agarwal commented on issue #3131: [SUPPORT] Creating non-partitioned table in hudi generates duplicates

Posted by GitBox <gi...@apache.org>.
Akshay2Agarwal commented on issue #3131:
URL: https://github.com/apache/hudi/issues/3131#issuecomment-865835493


   Sorry, I missed setting `KEYGENERATOR_CLASS_OPT_KEY` in the upsert. Closing the ticket; sorry for the nuisance.
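
   For anyone hitting the same symptom, one way to avoid this kind of mismatch is to build both writes from a single shared options map, so the key generator cannot be set on one write and forgotten on the other. The following is a minimal sketch, not the code from the original job: the option keys are written as plain strings (assumed to be the string forms of the constants quoted in the issue), the values mirror the configs listed there, and the `insert` operation for the first write is an assumption, since the issue does not say which operation was used.

   ```scala
   // Sketch only: keys are plain strings so the snippet needs no Hudi import.
   // Values are copied from the configs quoted in the issue.
   val commonHudiOptions: Map[String, String] = Map(
     "hoodie.table.name" -> "location_db",
     "hoodie.datasource.write.recordkey.field" -> "id",
     "hoodie.datasource.write.precombine.field" -> "_hoodie_partition_key",
     "hoodie.datasource.write.partitionpath.field" -> "",
     // The setting that was missing from the upsert in this issue.
     "hoodie.datasource.write.keygenerator.class" ->
       "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
   )

   // Both writes derive from the same map and differ only in the operation.
   // "insert" for the first write is an assumption, not stated in the issue.
   val initialWriteOptions: Map[String, String] =
     commonHudiOptions + ("hoodie.datasource.write.operation" -> "insert")
   val upsertOptions: Map[String, String] =
     commonHudiOptions + ("hoodie.datasource.write.operation" -> "upsert")
   ```

   Each map would then be passed to the writer in the usual way, for example `df.write.format("hudi").options(upsertOptions).mode("append").save(basePath)`, where `df` and `basePath` are hypothetical names for the DataFrame and the table location.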

