Posted to commits@hudi.apache.org by "Bhavani Sudha (Jira)" <ji...@apache.org> on 2020/08/14 18:31:00 UTC

[jira] [Resolved] (HUDI-783) Add official python support to create hudi datasets using pyspark

     [ https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha resolved HUDI-783.
--------------------------------
    Resolution: Fixed

Based on the PRs, it looks like this is done. Please feel free to open new issues if any further work is needed!

> Add official python support to create hudi datasets using pyspark
> -----------------------------------------------------------------
>
>                 Key: HUDI-783
>                 URL: https://issues.apache.org/jira/browse/HUDI-783
>             Project: Apache Hudi
>          Issue Type: Wish
>          Components: Utilities
>            Reporter: Vinoth Govindarajan
>            Assignee: Vinoth Govindarajan
>            Priority: Major
>              Labels: features, pull-request-available
>             Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components needed to achieve this goal:
>  # Create a hudi-pyspark package that users can import and start reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the instructions.
>  # Make the package available as part of the [spark packages index|https://spark-packages.org/] and [python package index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache Spark, so that Hudi files can be read as DataFrames and written to any Hadoop-supported file system.
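> A minimal pyspark write/read sketch of the intended usage is shown below. It assumes the Hudi Spark bundle is on the classpath and reuses the existing "hudi" data source format name and hoodie.* write options; the table name, path, and field names are placeholders, not part of this proposal.
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("hudi-pyspark-example").getOrCreate()
>
> # Placeholder table name and base path, for illustration only.
> table_name = "hudi_trips"
> base_path = "file:///tmp/hudi_trips"
>
> # Hudi Spark data source write options (hoodie.* keys from the existing
> # Hudi DataSource API); the field names here are placeholders.
> hudi_options = {
>     "hoodie.table.name": table_name,
>     "hoodie.datasource.write.recordkey.field": "uuid",
>     "hoodie.datasource.write.partitionpath.field": "partitionpath",
>     "hoodie.datasource.write.precombine.field": "ts",
>     "hoodie.datasource.write.operation": "upsert",
> }
>
> df = spark.createDataFrame(
>     [("id-1", "2020-08-01", 1597000000, 10.0)],
>     ["uuid", "partitionpath", "ts", "fare"],
> )
>
> # Write the DataFrame as a Hudi dataset.
> df.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)
>
> # Read it back as a DataFrame (older Hudi versions may need a path glob
> # such as base_path + "/*/*" instead of the base path itself).
> trips_df = spark.read.format("hudi").load(base_path)
> trips_df.show()
> {code}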
> The usage pattern after we launch this feature should look something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
Include the hudi-pyspark package in your Spark application using spark-shell, pyspark, or spark-submit:
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
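> A pyspark application could also pull the package in through the SparkSession builder. The sketch below is illustrative only; the org.apache.hudi:hudi-pyspark_2.11:0.5.2 coordinates are the ones proposed in this issue, not a published artifact.
> {code:python}
> from pyspark.sql import SparkSession
>
> # spark.jars.packages resolves Maven coordinates at startup; the coordinates
> # below are the ones proposed in this issue and may change once published.
> spark = (
>     SparkSession.builder
>     .appName("hudi-pyspark-app")
>     .config("spark.jars.packages", "org.apache.hudi:hudi-pyspark_2.11:0.5.2")
>     # Hudi recommends the Kryo serializer for Spark.
>     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     .getOrCreate()
> )
> {code}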



--
This message was sent by Atlassian Jira
(v8.3.4#803005)