
Posted to dev@hudi.apache.org by joyan sil <jo...@gmail.com> on 2022/04/04 18:28:10 UTC

GetLatestBaseFiles API Query

Hi Team,

I would appreciate your opinion when you get a chance. I am trying to use
the getLatestBaseFiles API to list the base files. There are two commits:
the first has 197 distinct record keys and the second has 99 distinct
record keys. The keys in the second commit are a subset of those in the
first. However, while testing I see a difference in counts between a
snapshot query and a query that selects only from the latest base files.
As far as I can tell, the snapshot query also uses the getLatestFiles API
to list the files (HoodieBaseRelation.scala). What might be the reason for
this discrepancy, and why is the getLatestBaseFiles API returning only the
*latest commit data*? Any insights would help greatly.

scala> spark.sql("select date, count(1) from stock_tick_cow group by date").show(false)
+----------+--------+
|date      |count(1)|
+----------+--------+
|2019/08/31|197     |
|2018/08/31|197     |
+----------+--------+

scala> spark.sql("""
     |   select date, count(1) from stock_tick_cow
     |   where _hoodie_file_name in (
     |     '4163329d-d2a1-4797-957f-80f76dfb78eb-0_0-35-36_20220404123406720.parquet',
     |     'ff92e184-f3af-45f5-a480-449ebe6f78c6-0_0-21-22_20220404132921439.parquet')
     |   group by date""").show(false)
+----------+--------+
|date      |count(1)|
+----------+--------+
|2019/08/31|197     |
|2018/08/31|99      |
+----------+--------+
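
For reference when comparing the two files in the filter above, here is a
minimal sketch that splits a base file name into its parts. It assumes the
usual Hudi naming pattern <fileId>_<writeToken>_<instantTime>.<extension>;
BaseFileNameSketch, BaseFileName and parse are hypothetical helpers written
for this illustration and are not part of the Hudi API.

```scala
// Hypothetical helper, not Hudi code. Assumes base file names look like
// <fileId>_<writeToken>_<instantTime>.<extension>.
object BaseFileNameSketch {
  final case class BaseFileName(fileId: String, writeToken: String, instantTime: String)

  // Returns None when the name does not split into exactly three parts.
  def parse(name: String): Option[BaseFileName] = {
    // Drop the extension (everything after the last dot), if any.
    val stem = name.lastIndexOf('.') match {
      case -1 => name
      case i  => name.substring(0, i)
    }
    stem.split("_") match {
      case Array(fileId, token, instant) => Some(BaseFileName(fileId, token, instant))
      case _                             => None
    }
  }

  // The two file names used in the query above.
  val names: Seq[String] = Seq(
    "4163329d-d2a1-4797-957f-80f76dfb78eb-0_0-35-36_20220404123406720.parquet",
    "ff92e184-f3af-45f5-a480-449ebe6f78c6-0_0-21-22_20220404132921439.parquet"
  )

  def main(args: Array[String]): Unit =
    names.flatMap(parse).foreach { p =>
      println(s"fileId=${p.fileId} writeToken=${p.writeToken} instant=${p.instantTime}")
    }
}
```

Under that assumption, the two names carry different file IDs and different
instant times (20220404123406720 and 20220404132921439), which can be checked
against the commit timeline when comparing the two counts.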