Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/06 22:53:07 UTC

[GitHub] [incubator-hudi] bhasudha edited a comment on issue #689: [WIP] [HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries

URL: https://github.com/apache/incubator-hudi/pull/689#issuecomment-571349958
 
 
   I rebased onto the latest master and verified the Hive queries in the Docker Demo using the new patch. All queries in the Demo work as expected, and incremental queries leverage the optimizations in this patch when hive.fetch.task.conversion is disabled (as desired); see the sketch below.
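   
   For reference, a minimal sketch of disabling fetch-task conversion in a Hive session before running the queries (hive.fetch.task.conversion is a standard Hive property; "more" is the usual default, and "none" disables the conversion entirely):
   
   -- run in beeline / Hive CLI before issuing the query
   SET hive.fetch.task.conversion=none;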
   
   I was also able to run tests using spark.sql() against some of the production tables (both MOR and COW types). I passed --conf spark.sql.hive.convertMetastoreParquet=false so that the Hive SerDe is used instead of Spark's built-in Parquet reader; a sketch of the launch command follows. Below is a flavor of the queries I tested. The results match between the pre-fix and post-fix hudi-spark-bundle jars.
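   
   A minimal spark-shell invocation along these lines (the bundle jar path is hypothetical; point it at whichever pre-fix or post-fix bundle is being compared):
   
   # hypothetical invocation; adjust the --jars path to the bundle under test
   spark-shell \
     --jars /path/to/hudi-spark-bundle.jar \
     --conf spark.sql.hive.convertMetastoreParquet=false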
   
   Snapshot queries
   =============
   simple count:
   spark.sql("select count(*) from tableA where datestr = '2019-12-10'").show()
   
   non-hudi / hudi datasets join:
   spark.sql("select m.col1 as colA, t.col2 as colB from table1 m left join table2 as t on t._row_key = m.col1 and t.datestr >= '2016-01-01' join table3 c on m.col4 = c.col5 where c.col6 = 'XYZ'").show()
   
   non-hudi / non-hudi datasets join:
   spark.sql("select o.col1, count(distinct e.col2) from tableA o join tableB e on o.id = e.id where to_date(e.col1) >= date_sub(current_date, 10) or to_date(e.col3) >= date_sub(current_date, 10) group by 1 order by 2 desc").show()
   
   hudi / hudi datasets join:
   spark.sql("select t.id, count(t.load) as total_count FROM tableT t LEFT JOIN tableO o on t.id = o.id AND o.datestr > '2019-12-28' AND NOT o.isactive WHERE t.datestr > '2019-12-28' AND NOT t.isactive group by 1 order by 1,2 desc").show()
   
   group by, order and rank:
   spark.sql("select * from ( select *, rank() over ( partition by rg order by total_items desc ) as row_number from ( select rg, usr, count(*) as total_items from tableA where date(datestr) >= date('2019-10-11') and date(datestr) < date('2019-10-16') and event = 'complete' and  SUBSTRING_INDEX(rg, '.',1) = 'adhoc' group by 1,2 order by 1, count(*) desc ) ) where row_number <= 1").show()
   
   Incremental queries
   ==============
   spark.sql("select name, count(*) from  tableA where event_status = 'complete'  and `_hoodie_commit_time` > '20200101235440' group by 1").show()
