Posted to jira@arrow.apache.org by "Felix (Jira)" <ji...@apache.org> on 2022/01/24 02:37:00 UTC

[jira] [Updated] (ARROW-15421) Need a pip install option for out-of-the-box HDFS support

     [ https://issues.apache.org/jira/browse/ARROW-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix updated ARROW-15421:
--------------------------
    Description: 
Hi folks! And thank you for your great work.

I want to use PyArrow to develop a simple client application that connects to HDFS clusters and exchanges data with them.

But to use HDFS in PyArrow today, I have to manually download the full Hadoop distribution, locate {{libhdfs.so}} inside it, and manually provide Hadoop's CLASSPATH as an environment variable.
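For illustration, the boilerplate described above looks roughly like this today. The paths, host, and port are hypothetical examples; {{ARROW_LIBHDFS_DIR}} and {{CLASSPATH}} are the environment variables Arrow's HDFS driver reads:

```python
import os

# Hypothetical path to a manually downloaded Hadoop distribution.
HADOOP_HOME = "/opt/hadoop"

def hadoop_env(hadoop_home):
    """Environment variables Arrow's HDFS driver needs before connecting."""
    return {
        "HADOOP_HOME": hadoop_home,
        # Directory containing the libhdfs.so found inside the distro.
        "ARROW_LIBHDFS_DIR": os.path.join(hadoop_home, "lib", "native"),
    }

os.environ.update(hadoop_env(HADOOP_HOME))

# Hadoop's JARs must also be put on the CLASSPATH, typically via:
#   os.environ["CLASSPATH"] = subprocess.check_output(
#       [os.path.join(HADOOP_HOME, "bin", "hadoop"), "classpath", "--glob"],
#       text=True).strip()

# Only after all of the above does the connection work (example host/port):
# from pyarrow import fs
# hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
```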

I need something like *{{pip3 install pyarrow[hdfs]}}* that would give me PyArrow with a pre-built libhdfs and the minimal set of Hadoop JARs needed to run it, so that pyarrow.hdfs.* classes could be called without additional boilerplate code.

Can you please add it in future releases of PyArrow?

  was:
Hi folks! And thank you for your great work.

I want to use PyArrow to develop a simple client application that connects to HDFS clusters and exchanges data with them.

But to use HDFS in PyArrow today, I have to manually download the full Hadoop distribution, locate {{libhdfs.so}} inside it, and manually provide Hadoop's CLASSPATH as an environment variable.

I need something like *{{pip3 install pyarrow[hdfs]}}* that would give me PyArrow with a pre-built libhdfs and the minimal set of Hadoop JARs needed to run it, so that pyarrow.hdfs.* classes could be called without additional boilerplate code.

Can you please add it in future releases of PyArrow?


> Need a pip install option for out-of-the-box HDFS support
> ---------------------------------------------------------
>
>                 Key: ARROW-15421
>                 URL: https://issues.apache.org/jira/browse/ARROW-15421
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Felix
>            Priority: Major
>
> Hi folks! And thank you for your great work.
> I want to use PyArrow to develop a simple client application that connects to HDFS clusters and exchanges data with them.
> But to use HDFS in PyArrow today, I have to manually download the full Hadoop distribution, locate {{libhdfs.so}} inside it, and manually provide Hadoop's CLASSPATH as an environment variable.
> I need something like *{{pip3 install pyarrow[hdfs]}}* that would give me PyArrow with a pre-built libhdfs and the minimal set of Hadoop JARs needed to run it, so that pyarrow.hdfs.* classes could be called without additional boilerplate code.
> Can you please add it in future releases of PyArrow?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)