Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/16 20:43:57 UTC

[GitHub] [hudi] cb149 opened a new issue #4830: [SUPPORT] HUDIPARQUET missing the last cleaned partition

cb149 opened a new issue #4830:
URL: https://github.com/apache/hudi/issues/4830


   **Describe the problem you faced**
   
   I just switched two of my Hudi tables in Impala from PARQUET to HUDIPARQUET by dropping each old table and then re-creating it with HUDIPARQUET, as described in the [documentation](https://hudi.apache.org/docs/querying_data/#impala-34-or-later).
   
   One table is partitioned with `year=.../month=...` and is never clustered. This table shows no problems at all.
   
   The other table is partitioned with `year=.../month=.../day=...`. Data is ingested hourly, the previous day is clustered every night, and CLEANER_COMMITS_RETAINED is set to 48.
   
   As expected, I no longer see duplicates in either table in Impala. However, I am seeing odd behavior for one day.
   
   If I run
   ```scala
   spark.read.format("hudi").load("<myTable>").count
   ```
   I get `76766150602`
   
   If I run
   ```sql
   SELECT count(*) FROM myTable;
   ```
   in Impala, I get `76614373360` (this is run after ALTER TABLE RECOVER PARTITIONS and REFRESH table, so that should not be the issue).
   
   Looking at the data, Impala is missing all rows from the partition `year=2022/month=2/day=13`. A count in Spark for that partition returns `151777242`, which matches the difference between the two counts above.
   
   If I run 
   ```sql
   select day,count(*) from myTable WHERE year=2022 and month=2 group by day ORDER BY day ASC;
   ```
   it shows every day from 1 to 16 except for day=13.
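   
   To pin down exactly which partitions differ, the per-day counts from both engines can be compared side by side. The following is a minimal sketch, assuming the results of the grouped query have been collected by hand into plain maps from day to row count. `PartitionCountDiff` and `diff` are illustrative names, not part of Hudi or Impala.
   
   ```scala
   object PartitionCountDiff {
     // Assumption: both maps go from day-of-month to row count, e.g. copied
     // from the output of the grouped query in Spark and in Impala.
     // Returns (day, sparkCount, impalaCount) for every day whose counts differ,
     // ordered by day. A day absent from one engine is treated as a count of 0.
     def diff(sparkCounts: Map[Int, Long],
              impalaCounts: Map[Int, Long]): Seq[(Int, Long, Long)] = {
       val days = (sparkCounts.keySet ++ impalaCounts.keySet).toSeq.sorted
       days
         .map(d => (d, sparkCounts.getOrElse(d, 0L), impalaCounts.getOrElse(d, 0L)))
         .filter { case (_, s, i) => s != i }
     }
   }
   ```
   
   For example, `PartitionCountDiff.diff(Map(12 -> 100L, 13 -> 151777242L), Map(12 -> 100L))` returns `Seq((13, 151777242L, 0L))`; the count for day 12 is made up for illustration.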
   
   The first time I ran 
   ```sql
   select count(*) from myTable WHERE year=2022 and month=1 and day=13;
   ```
   I got 0; now I am getting `150407032`, but the count should be `151777242`. However, `day=13` is still missing from the count grouped by day.
   
   Looking at the files, `day=14` is clustered but no file slices have been cleaned yet, whereas `day=13` and all prior partitions have already been clustered and their non-clustered files have been cleaned.
   
   Is it a coincidence, or could this somehow be caused by the fact that `day=13` is the last partition that has been both clustered and cleaned, with the following days having been clustered but not cleaned (except for the latest day, which hasn't been clustered yet)?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   unknown
   
   **Expected behavior**
   
   HUDIPARQUET should not result in rows missing when querying the table in Impala.
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 2.4.7
   
   * Hive version :
   
   * Hadoop version : 3.1.1
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] cb149 edited a comment on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
cb149 edited a comment on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1043158298


   As expected, today the data for `day=14` is missing (even though it was there yesterday), while the data for `day=13` is available, so it is somehow always missing the last partition that was cleaned. 
   It seems this issue only shows up for tables with clustering.





[GitHub] [hudi] cb149 commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
cb149 commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1047891127


   > We know of a bug wrt pending clustering that is being fixed #4810. I will let you do the correlation.
   
   Not sure if this is related, since there are no pending clusterings and no inflight commits.





[GitHub] [hudi] cb149 edited a comment on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
cb149 edited a comment on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1048874783


   > `SHOW TABLE STATS myTable ; shows 0 files and size 0B for that partition.`. -> This makes me wonder whether the partition is somehow not registered. Since this is an external table, we can check whether the data is physically present by checking the partition path. Does Impala depend on the Hive metastore to gather these stats, or does it gather them on its own? If the former, we need to check the HMS to see whether this partition is registered with the table.
   
   @bhasudha The weird part is that every other partition shows up, e.g. today I can see data for days 23, 22, 21, 19, 18, etc.
   Tonight when `day=21` gets cleaned, `day=20` will show up and `day=21` will be missing from Impala for the next 24 hours.
   
   So the data is physically present and registered, and there are no issues when I query the data with Spark or with `spark.sql("select day from myTable where year=2022 and month = 2 group by day ORDER BY day asc")` (although using spark.sql needs hudi-spark-bundle, it still reads the data as plain parquet and returns duplicates).





[GitHub] [hudi] garyli1019 commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1054984922


   @nsivabalan I implemented the Hudi connector in Impala, but I haven't touched the Impala codebase since then (almost two years). It's quite complicated to set up the Impala dev environment; IIRC we need a Linux machine. I might not be able to work on this at this point, but I am happy to help if any contributors are interested. Filed a ticket here: https://issues.apache.org/jira/browse/HUDI-3537
   





[GitHub] [hudi] garyli1019 commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
garyli1019 commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1050626414


   I think this could be related to clustering. The Impala query path is straightforward for `HUDI_PARQUET`: it lists all the parquet files and uses the filter method provided by Hudi to get the latest snapshot. If a partition is missing, the filtering logic probably does not work well with clustering.
   BTW, the Hudi version in the Impala codebase is still 0.5.0-incubating: https://github.com/apache/impala/blob/fe04c500d7c32606c9024259a972f6843fab678e/bin/impala-config.sh#L204
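   
   As a rough illustration of what "get the latest snapshot" means here, the sketch below keeps only the newest base file per file group. It assumes base file names of the form `<fileId>_<writeToken>_<instantTime>.parquet`; the object, its methods, and the sample file names are hypothetical, and this is not the actual Hudi or Impala filter code. Note that the sketch never consults the timeline, so it knows nothing about file groups replaced by clustering.
   
   ```scala
   object LatestSnapshotSketch {
     // Pieces parsed out of a base file name. Assumed layout:
     //   <fileId>_<writeToken>_<instantTime>.parquet
     final case class BaseFile(name: String, fileId: String, instantTime: String)
   
     // Returns None for anything that does not match the assumed layout,
     // such as partition metadata files.
     def parse(name: String): Option[BaseFile] = {
       val parts = name.stripSuffix(".parquet").split("_")
       if (name.endsWith(".parquet") && parts.length == 3)
         Some(BaseFile(name, parts(0), parts(2)))
       else None
     }
   
     // Keep, per file group (fileId), only the file with the greatest
     // instant time. Instant times are compared as strings.
     def latestPerFileGroup(names: Seq[String]): Seq[String] =
       names
         .flatMap(n => parse(n))
         .groupBy(_.fileId)
         .values
         .map(_.maxBy(_.instantTime).name)
         .toSeq
         .sorted
   }
   ```
   
   Given the names `f1_0-1-1_20220220010000.parquet`, `f1_0-2-2_20220220020000.parquet` and `f2_0-1-1_20220220010000.parquet`, the sketch keeps the second and third.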





[GitHub] [hudi] nsivabalan commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1047293800


   @bhasudha: can you assist here, please?





[GitHub] [hudi] bhasudha commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1048326642


   @cb149 Are you still seeing this issue, where every day the last partition that was cleaned does not show up in `select count(*)`? I have a couple of questions, just to make sure you have already checked these:
   - Do the missing partitions show up in the `show partitions` statement?
   - Clustering and cleaning have both happened and this is the last such partition, correct? Is this happening consistently every day in a rolling fashion?
   Can you also paste the corresponding .commit/.replacecommit/.clean files for the missing partition? Have you looked into those files to see if anything strange is happening?





[GitHub] [hudi] nsivabalan commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1047295523


   We know of a bug wrt pending clustering that is being fixed in https://github.com/apache/hudi/pull/4810. I will let you do the correlation.
   
   





[GitHub] [hudi] nsivabalan commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1050387876


   @garyli1019: Can you please help these folks? I don't have any experience with Impala.





[GitHub] [hudi] bhasudha commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1048341628


   ```SHOW TABLE STATS myTable ; shows 0 files and size 0B for that partition.```. -> This makes me wonder whether the partition is somehow not registered. Since this is an external table, we can check whether the data is physically present by checking the partition path. Does Impala depend on the Hive metastore to gather these stats, or does it gather them on its own? If the former, we need to check the HMS to see whether this partition is registered with the table.





[GitHub] [hudi] nsivabalan commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1054642479


   Hey @garyli1019: have you worked on upgrading the Hudi version in Impala before? Can you take up this work item, since you have context? Or if you know someone who can help us with that, that would be great!





[GitHub] [hudi] cb149 edited a comment on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
cb149 edited a comment on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1048927415


   > @cb149 Are you still seeing this issue, where every day the last partition that was cleaned does not show up in `select count(*)`? I have a couple of questions, just to make sure you have already checked these:
   > 
   > * Do the missing partitions show up in the `show partitions` statement?
   > * Clustering and cleaning have both happened and this is the last such partition, correct? Is this happening consistently every day in a rolling fashion?
   >   Can you also paste the corresponding .commit/.replacecommit/.clean files for the missing partition? Have you looked into those files to see if anything strange is happening?
   
   - Yes, the missing partition shows up in `show partitions myTable`, but with 0 files and 0 bytes.
   - Yes, it happens in a rolling fashion: every day the partition from 3 days ago is missing, while all other partitions show up.
   
   Today, where `day=20` is missing, using hudi-cli I can see the following for the latest clean:
   ```
   clean showpartitions --clean 20220223030128861
   ╔══════════════════════════╤═════════════════════╤══════════════════════════════════╤════════════════════════╗
   ║ Partition Path           │ Cleaning policy     │ Total Files Successfully Deleted │ Total Failed Deletions ║
   ╠══════════════════════════╪═════════════════════╪══════════════════════════════════╪════════════════════════╣
   ║ year=2022/month=2/day=20 │ KEEP_LATEST_COMMITS │ 50                               │ 0                      ║
   ╟──────────────────────────┼─────────────────────┼──────────────────────────────────┼────────────────────────╢
   ║ year=2022/month=2/day=21 │ KEEP_LATEST_COMMITS │ 0                                │ 0                      ║
   ╚══════════════════════════╧═════════════════════╧══════════════════════════════════╧════════════════════════╝
   ```
   Not sure if this is relevant, but I am using inline clustering and async incremental cleaning
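   
   Since the missing partition has so far coincided with the partition in which the latest clean actually deleted files, that candidate can be flagged directly from the `clean showpartitions` output. This is a minimal sketch, assuming the rows have been copied by hand into a case class; `CleanStat` and `partitionsWithDeletions` are illustrative names and not part of hudi-cli.
   
   ```scala
   object CleanCorrelation {
     // One row of the `clean showpartitions` table, transcribed manually.
     final case class CleanStat(partitionPath: String,
                                policy: String,
                                deleted: Int,
                                failed: Int)
   
     // Partitions where the clean deleted at least one file. In this thread
     // these have matched the partition missing from Impala; that link is an
     // observed correlation, not a confirmed cause.
     def partitionsWithDeletions(stats: Seq[CleanStat]): Seq[String] =
       stats.filter(_.deleted > 0).map(_.partitionPath)
   }
   ```
   
   With the two rows above, `partitionsWithDeletions` returns only `year=2022/month=2/day=20`, the partition currently missing in Impala.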





[GitHub] [hudi] nsivabalan commented on issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4830:
URL: https://github.com/apache/hudi/issues/4830#issuecomment-1061351839


   Closing the GitHub issue as we have a tracking Jira. @cb149: it looks like we don't have many resources on our side. Would you be able to assist us here? Gary can guide you on how to go about it.
   





[GitHub] [hudi] nsivabalan closed issue #4830: [SUPPORT] Impala HUDIPARQUET missing the last cleaned partition

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4830:
URL: https://github.com/apache/hudi/issues/4830


   

