Posted to issues@arrow.apache.org by "Matthew Rocklin (Jira)" <ji...@apache.org> on 2019/12/31 22:18:00 UTC

[jira] [Created] (ARROW-7486) Allow HDFS FileSystem to be created without Hadoop present

Matthew Rocklin created ARROW-7486:
--------------------------------------

             Summary: Allow HDFS FileSystem to be created without Hadoop present
                 Key: ARROW-7486
                 URL: https://issues.apache.org/jira/browse/ARROW-7486
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Matthew Rocklin


I would like to be able to construct an HDFS FileSystem object on a machine without Hadoop installed.  I don't need it to be able to actually do anything.  I just need creating it to not fail.

This would enable Dask users to run computations on an HDFS enabled cluster from outside of that cluster.  This almost works today.  We send a small computation to a worker (which has HDFS access) to generate the task graph for loading data, and then we bring that task graph back to the local machine, continue building on it, and then finally submit everything off to the workers for execution.

The flaw is when we bring the task graph from the worker back to the client. It contains a reference to a PyArrow HDFSFileSystem object, which upon de-serialization calls _maybe_set_hadoop_classpath(). I suspect that if this call were allowed to fail, things would work out OK for us.
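To illustrate the proposed behavior, here is a minimal, self-contained sketch (not PyArrow's actual implementation): a stand-in HDFSFileSystem whose pickle reconstruction tolerates a failing Hadoop classpath lookup, so the object round-trips through serialization on a machine without Hadoop. The _maybe_set_hadoop_classpath stub and the _reconstruct_hdfs helper are hypothetical names used only for this example; real I/O calls would still fail later on a Hadoop-less machine.

```python
import pickle


def _maybe_set_hadoop_classpath():
    # Stand-in for PyArrow's helper: on a machine without Hadoop installed,
    # classpath detection raises because the `hadoop` executable is missing.
    raise FileNotFoundError("hadoop executable not found")


def _reconstruct_hdfs(host, port):
    # Proposed behavior: let classpath detection fail softly during
    # de-serialization instead of propagating the error. Construction
    # succeeds; only actual filesystem operations would fail later.
    try:
        _maybe_set_hadoop_classpath()
    except FileNotFoundError:
        pass
    return HDFSFileSystem(host, port)


class HDFSFileSystem:
    # Hypothetical minimal stand-in for pyarrow's HDFS filesystem object.
    def __init__(self, host="default", port=0):
        self.host = host
        self.port = port

    def __reduce__(self):
        # Route unpickling through the tolerant reconstructor above.
        return (_reconstruct_hdfs, (self.host, self.port))


# Round-trip through pickle on a "Hadoop-less" machine: no exception raised.
fs = HDFSFileSystem("namenode", 8020)
fs2 = pickle.loads(pickle.dumps(fs))
```

This mirrors the Dask workflow described above: the worker (with HDFS access) builds and pickles the graph, and the client only needs de-serialization to succeed, not working HDFS I/O.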

Downstream issue originally reported here: https://github.com/dask/dask/issues/5758



--
This message was sent by Atlassian Jira
(v8.3.4#803005)