Posted to commits@hudi.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2020/04/23 00:00:00 UTC

[jira] [Comment Edited] (HUDI-829) Efficiently reading hudi tables through spark-shell

    [ https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090098#comment-17090098 ] 

Udit Mehrotra edited comment on HUDI-829 at 4/22/20, 11:59 PM:
---------------------------------------------------------------

You may also want to look at my implementation of a custom relation in Spark to read bootstrapped tables: [https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]. Here I am building my own file index using Spark's InMemoryFileIndex, but the filtering part is now just one operation, because once I have all the files, the Hudi filesystem view is created just once to get the latest files. It's still a work in progress, and I have yet to see how fast this will be. We could consider moving to a model where our reads in Spark happen through our own relations, with the native readers used underneath.
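To make the "filter once to latest files" idea concrete, here is a minimal, hypothetical sketch of that step in plain Scala. It assumes files have already been listed (e.g. via Spark's InMemoryFileIndex) and models the filesystem view's job as a single pass that groups files by file group ID and keeps only the file from the most recent commit. The BaseFile class, the latestFiles helper, and the paths below are illustrative inventions, not actual Hudi APIs.

```scala
// Illustrative model of a listed base file: file group ID, commit
// timestamp, and path. Not a real Hudi class.
case class BaseFile(fileId: String, commitTime: String, path: String)

// One pass over the full listing: group by file group, keep the file
// written by the latest commit. Hudi commit times are fixed-width
// timestamps, so lexical ordering matches chronological ordering.
def latestFiles(allFiles: Seq[BaseFile]): Seq[BaseFile] =
  allFiles
    .groupBy(_.fileId)
    .values
    .map(_.maxBy(_.commitTime))
    .toSeq

// Example listing: file group f1 was rewritten by a later commit.
val listed = Seq(
  BaseFile("f1", "20200401000000", "/tbl/f1_20200401000000.parquet"),
  BaseFile("f1", "20200422000000", "/tbl/f1_20200422000000.parquet"),
  BaseFile("f2", "20200410000000", "/tbl/f2_20200410000000.parquet")
)

val latest = latestFiles(listed)
// latest holds two files: the 2020-04-22 version of f1, and f2.
```

The point of the design is that the expensive part (listing) happens once up front, and the latest-file selection is then a cheap in-memory reduction rather than per-partition sequential work.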


> Efficiently reading hudi tables through spark-shell
> ---------------------------------------------------
>
>                 Key: HUDI-829
>                 URL: https://issues.apache.org/jira/browse/HUDI-829
>             Project: Apache Hudi (incubating)
>          Issue Type: Task
>          Components: Spark Integration
>            Reporter: Nishith Agarwal
>            Assignee: Nishith Agarwal
>            Priority: Major
>
> [~uditme] Created this ticket to track some discussion on read/query path of spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some of your queries are slower due to sequential activity performed by Spark when interacting with Hudi tables (even with spark.sql.hive.convertMetastoreParquet enabled, which should give you the same data reading speed and all the vectorization benefits). Is this slowness observed during Spark query planning? Can you please elaborate on this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)