Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/22 11:31:00 UTC

[jira] [Updated] (ARROW-7486) [Python] Allow HDFS FileSystem to be created without Hadoop present

     [ https://issues.apache.org/jira/browse/ARROW-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-7486:
-----------------------------------------
    Labels: filesystem hadoop hdfs  (was: hadoop)

> [Python] Allow HDFS FileSystem to be created without Hadoop present
> -------------------------------------------------------------------
>
>                 Key: ARROW-7486
>                 URL: https://issues.apache.org/jira/browse/ARROW-7486
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Matthew Rocklin
>            Priority: Minor
>              Labels: filesystem, hadoop, hdfs
>
> I would like to be able to construct an HDFS FileSystem object on a machine without Hadoop installed.  It does not need to be able to actually do anything; I just need its creation not to fail.
> This would enable Dask users to run computations on an HDFS-enabled cluster from outside of that cluster.  This almost works today.  We send a small computation to a worker (which has HDFS access) to generate the task graph for loading data, then bring that task graph back to the local machine, continue building on it, and finally submit everything to the workers for execution.
> The flaw is in bringing the task graph from the worker back to the client.  The graph contains a reference to a PyArrow HDFSFileSystem object, which upon de-serialization calls _maybe_set_hadoop_classpath().  I suspect that if this call were allowed to fail, things would work out OK for us.
> Downstream issue originally reported here: https://github.com/dask/dask/issues/5758
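The behaviour requested above could be sketched roughly as follows. This is a minimal illustration, not PyArrow's implementation: LazyHDFS, locate_hadoop and HadoopNotFound are hypothetical stand-ins, and the host and port values are made up. The idea is that construction and unpickling only record connection parameters, and the Hadoop lookup is deferred until the object is first used.

```python
import pickle


class HadoopNotFound(RuntimeError):
    """Hypothetical error for a machine with no local Hadoop install."""


def locate_hadoop():
    # Hypothetical stand-in for the classpath setup step. It always
    # fails here, simulating a client machine without Hadoop.
    raise HadoopNotFound("Hadoop is not installed on this machine")


class LazyHDFS:
    """Illustrative stand-in for an HDFS filesystem handle (not PyArrow's)."""

    def __init__(self, host, port):
        # Only store parameters; nothing touches Hadoop at construction.
        self.host = host
        self.port = port
        self._connected = False

    def _ensure_connected(self):
        # Environment setup is deferred to the first real operation.
        if not self._connected:
            locate_hadoop()
            self._connected = True

    def ls(self, path):
        self._ensure_connected()
        return []  # placeholder; a real handle would list the directory

    def __reduce__(self):
        # Rebuild from connection parameters only, so unpickling on the
        # client never needs a working Hadoop install.
        return (LazyHDFS, (self.host, self.port))


# Round trip as it would happen when a task graph returns to the client.
fs = pickle.loads(pickle.dumps(LazyHDFS("namenode", 8020)))
```

With this shape, the client can hold and forward the object inside a task graph, and an error only surfaces if the client itself calls an operation such as fs.ls("/").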



--
This message was sent by Atlassian Jira
(v8.3.4#803005)