Posted to commits@hudi.apache.org by "Josiah Berkebile (Jira)" <ji...@apache.org> on 2021/09/01 15:21:00 UTC
[jira] [Commented] (HUDI-1407) Pandas(python) integration w/ Apache Hudi

    [ https://issues.apache.org/jira/browse/HUDI-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408232#comment-17408232 ] 

Josiah Berkebile commented on HUDI-1407:
----------------------------------------

[~afrinjamanbd] 

I am pretty sure the main thought behind this ticket is to provide people with a way to use Hudi without the need to fire up a Spark or Flink cluster.  A lot of places choose to do their data transformations in a simple Pandas script that runs in a short-lived container on a small node in a cluster, or even in some sort of cloud container service, because tools like Spark or Flink are too heavy to warrant their use given the comparative cost of EMR, Databricks, and the like.

So far, the docs I have read only provide examples for using Hudi with Spark and/or Flink. It would be nice to be able to get all the goodies out of Hudi through a simple Python script in the future that doesn't require such heavy frameworks. Pandas fits that bill since it is just a regular Python library that can be pip-installed, and no additional overhead is needed to use it (unlike Spark or Flink, where you need a configured cluster and the Spark or Flink runtime installed).

I know I would appreciate a solution like this that enables me to avoid depending on a heavy big data tool that I don't perceive the need for.

*In this vein, however, I do have one additional opinion to provide:*

I think you would get more mileage and bang-for-the-buck out of this if, instead of integrating with Pandas, you integrated with Apache Arrow.  Arrow is a C++ project to implement efficient columnar storage of datasets in memory and provide convenient ways to read and write columnar formats like Parquet. The reason I think Arrow might be a better place to start is that Arrow is being adopted by many, many languages (Python, C/C++, Go, Rust, Java, JavaScript, Julia, MATLAB, R, and Ruby are mentioned on the project page), so if Hudi is integrated with Arrow, it would then be a smaller step to provide Hudi to all of those languages, not just Python. To boot, Arrow can freely convert back and forth between its Table format and the Pandas DataFrame format, so you get a sort of Pandas compatibility out of the box, although not a direct one, if you support Arrow first.  Direct Pandas compatibility could then be kicked down the road a bit more if desired.

> Pandas(python) integration w/ Apache Hudi
> -----------------------------------------
>
>                 Key: HUDI-1407
>                 URL: https://issues.apache.org/jira/browse/HUDI-1407
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: DeltaStreamer
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: gsoc, gsoc2021, mentor
>



