Posted to dev@hudi.apache.org by joyan sil <jo...@gmail.com> on 2022/04/04 18:28:10 UTC
GetLatestBaseFiles API Query
Hi Team,
I would appreciate your opinion when you get a chance. I am trying to use the
getLatestBaseFiles API to list the base files of a table. The table has two
commits: the first has 197 distinct record keys and the second has 99
distinct record keys, which are a subset of the keys in the first. However,
while testing I see a difference in counts between a snapshot query and a
query that selects only from the latest base files. As far as I can tell, the
snapshot query also uses the getLatestFiles API to list the files (see
HoodieBaseRelation.scala). What might be the reason for this discrepancy,
and why does the getLatestBaseFiles API appear to return only the *latest
commit data*? Any insights would be a great help.
scala> spark.sql("select date, count(1) from stock_tick_cow group by date").show(false)
+----------+--------+
|date      |count(1)|
+----------+--------+
|2019/08/31|197     |
|2018/08/31|197     |
+----------+--------+
scala> spark.sql("select date, count(1) from stock_tick_cow where _hoodie_file_name in ('4163329d-d2a1-4797-957f-80f76dfb78eb-0_0-35-36_20220404123406720.parquet', 'ff92e184-f3af-45f5-a480-449ebe6f78c6-0_0-21-22_20220404132921439.parquet') group by date").show(false)
+----------+--------+
|date      |count(1)|
+----------+--------+
|2019/08/31|197     |
|2018/08/31|99      |
+----------+--------+
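To make the two counts easier to reason about, here is a toy model in plain Scala that reproduces 197 versus 99 for a single partition. It is only a sketch under a stated assumption, not confirmed Hudi behaviour: it assumes that when a copy-on-write commit rewrites a base file, records that were not touched by that commit are carried into the new file with their previously stored file-name value unchanged. The types Record and BaseFile, the object LatestBaseFileModel, and the names f1.parquet and f2.parquet are all hypothetical and are not Hudi classes or real files; only the two commit timestamps are taken from the file names above.

```scala
// Toy model only: none of these types are Hudi classes.
final case class Record(key: String, storedFileName: String)
final case class BaseFile(name: String, commitTime: String, records: Seq[Record])

object LatestBaseFileModel {
  // First commit: every record is written fresh, so it stores this file's name.
  def firstCommit(fileName: String, commitTime: String, keys: Seq[String]): BaseFile =
    BaseFile(fileName, commitTime, keys.map(k => Record(k, fileName)))

  // Later commit: the whole file is rewritten. ASSUMPTION: only the updated
  // records get the new file name; the rest keep the value they already had.
  def rewrite(previous: BaseFile, fileName: String, commitTime: String,
              updatedKeys: Set[String]): BaseFile =
    BaseFile(fileName, commitTime, previous.records.map { r =>
      if (updatedKeys.contains(r.key)) r.copy(storedFileName = fileName) else r
    })

  // Latest version of a file group: the one with the greatest commit time
  // (timestamps have equal length, so string ordering is sufficient here).
  def latest(versions: Seq[BaseFile]): BaseFile = versions.maxBy(_.commitTime)

  def main(args: Array[String]): Unit = {
    val keys = (1 to 197).map(i => s"key-$i")
    val v1 = firstCommit("f1.parquet", "20220404123406720", keys)
    val v2 = rewrite(v1, "f2.parquet", "20220404132921439", keys.take(99).toSet)

    val latestFile = latest(Seq(v1, v2))
    // Reading the latest base file as a whole, as a snapshot query would.
    println(s"rows in latest base file: ${latestFile.records.size}")
    // Filtering on the stored file-name value instead.
    val matching = latestFile.records.count(_.storedFileName == latestFile.name)
    println(s"rows whose stored file name matches: $matching")
  }
}
```

If this assumption holds for the real table, the file listing and the snapshot query would agree, and the lower count would come from the filter on the stored column rather than from getLatestBaseFiles. One way to check this on the actual data would be to compare _hoodie_file_name with Spark's input_file_name() for the rows in the 2018/08/31 partition.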