Posted to commits@hudi.apache.org by "Rajesh (Jira)" <ji...@apache.org> on 2022/07/14 16:07:00 UTC

[jira] [Assigned] (HUDI-4046) spark.read.load API

     [ https://issues.apache.org/jira/browse/HUDI-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh reassigned HUDI-4046:
----------------------------

    Assignee: Rajesh

> spark.read.load API
> -------------------
>
>                 Key: HUDI-4046
>                 URL: https://issues.apache.org/jira/browse/HUDI-4046
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.10.1
>            Reporter: Istvan Darvas
>            Assignee: Rajesh
>            Priority: Minor
>
> Hi Guys!
> I would like to control the number of partitions which HUDI will read.
>  
> base_path: str
> partition_paths: List[str] = ["prefix/part1", "prefix/part2", "prefix/part3"]
> table_df = (spark.read
>     .format('org.apache.hudi')
>     .option("basePath", base_path)
>     .option("hoodie.datasource.read.paths", ",".join(partition_paths))  # comma-separated list
>     .load(partition_paths))
>  
> This works if I explicitly set "hoodie.datasource.read.paths"; in practice I have to generate a comma-separated list for that parameter.
> If I do not set it, I get a HUDI exception telling me I need to set it.
>  
> It would be great if HUDI used the partition_paths passed to the Spark read API via .load(partition_paths).
>  
> One more thing:
>  I do not get an exception if I do not set "hoodie.datasource.read.paths" and use load(base_path), but in that case the HUDI read scans the whole table, which can be very time-consuming for a very big table with lots of partitions.
>  
> Darvi
> Connected SLACK thread: [https://apache-hudi.slack.com/archives/C4D716NPQ/p1651667472584579]
>  
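The workaround described in the issue boils down to building the comma-separated value for "hoodie.datasource.read.paths" yourself. A minimal sketch of that string-building step, assuming a hypothetical base path and partition names (only the join logic is shown; the actual spark.read call is as quoted above):

```python
# Sketch of the manual workaround from the issue: constructing the
# comma-separated string that "hoodie.datasource.read.paths" expects.
# base_path and the partition names are hypothetical placeholders.
from typing import List

def build_read_paths(base_path: str, partitions: List[str]) -> str:
    """Join fully qualified partition paths into one comma-separated string."""
    full_paths = [f"{base_path.rstrip('/')}/{p}" for p in partitions]
    return ",".join(full_paths)

paths = build_read_paths("s3://bucket/table", ["prefix/part1", "prefix/part2"])
# paths == "s3://bucket/table/prefix/part1,s3://bucket/table/prefix/part2"
```

The reporter's request is that Hudi derive this list itself from the paths already passed to .load(), making the extra option redundant.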



--
This message was sent by Atlassian Jira
(v8.20.10#820010)