You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/12 21:44:09 UTC

[GitHub] [hudi] rmpifer opened a new issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

rmpifer opened a new issue #1958:
URL: https://github.com/apache/hudi/issues/1958


   **Describe the problem you faced**
   
   When using global indexes (GLOBAL_BLOOM, HBASE) and a records partition value has been updated during an upsert the record gets written to the previous partition folder. This causes issues when querying Hive _ro and _rt tables as it displays the new data but with the previous partition value. We can see this by looking at the result of the query for 'ad' partition column and the Hudi data stored
   ```
   scala> spark.sql("select * from gb_partition_update_ro").show()
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|    cs_ss|     action_date|   ad_updated|           ad|
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
   |     20200812204404|  20200812204404_0_1|          12345678|         2020-08-06-12|5ecf97ef-b9d9-4e3...|12345678|Delivered|1596716921000605|2020-08-06-13|2020-08-06-12|
   
   scala> spark.read.format("parquet").load("s3://ryanpife-emr-dev/hudi/data/delhivery_data_hudi_mor_global_bloom/1/2020-08-06-12/5ecf97ef-b9d9-4e3f-b586-0140f6ebdd3a-0_0-97-39036_20200812204404.parquet").show()
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|    cs_ss|     action_date|           ad|   ad_updated|
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
   |     20200812204404|  20200812204404_0_1|          12345678|         2020-08-06-12|5ecf97ef-b9d9-4e3...|12345678|Delivered|1596716921000605|2020-08-06-13|2020-08-06-13|
   ```
   
   This is probably due to the table definition in hive reading in partition value from folder location
   
   ```
   hive> SHOW CREATE TABLE gb_partition_update_ro;
   OK
   CREATE EXTERNAL TABLE `gb_partition_update_ro`(
     `_hoodie_commit_time` string, 
     `_hoodie_commit_seqno` string, 
     `_hoodie_record_key` string, 
     `_hoodie_partition_path` string, 
     `_hoodie_file_name` string, 
     `wbn` int, 
     `cs_ss` string, 
     `action_date` bigint, 
     `ad_updated` string)
   PARTITIONED BY ( 
     `ad` string)
   ROW FORMAT SERDE 
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
   STORED AS INPUTFORMAT 
     'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
   OUTPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     's3://ryanpife-emr-dev/hudi/data/delhivery_data_hudi_mor_global_bloom/1'
   TBLPROPERTIES (
     'last_commit_time_sync'='20200812204404', 
     'transient_lastDdlTime'='1597264976')
   ```
   
   This does not seem like correct behavior as the query results return new data with stale partition column value
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create table with global index
   2. Insert a record
   3. Upsert same record key with new partition value
   4. Query Hive table (_ro or _rt) for record key and previous partition value will be displayed
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.5.2
   
   * EMR version : 5.30.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-767210126


   https://github.com/apache/hudi/pull/1978 have fixed it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-673274582


   @n3nash : Is this a valid issue in Hbase ? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan closed issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #1958:
URL: https://github.com/apache/hudi/issues/1958


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan closed issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #1958:
URL: https://github.com/apache/hudi/issues/1958


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-731496394


   @rmpifer : guess the patch is landed in master. can you try it out and close if it works. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-673206288


   We have to make some fixes to HbaseIndex in this regards. Guess its a bug. We have a fix in all implicit global indexes, where a upsert record to a different partition will upsert to new partition and also delete from old partition. Don't think we have this fix in HbaseIndex. 
   I have create a ticket to track this https://issues.apache.org/jira/browse/HUDI-1184


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-767210126


   https://github.com/apache/hudi/pull/1978 have fixed it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-673133768


   @nsivabalan : Can you please look at this and respond ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-674457449


   @n3nash : gentle ping. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-699031170


   We're waiting on the PR to be landed after which we can close this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org