Posted to jira@arrow.apache.org by "Juan Galvez (Jira)" <ji...@apache.org> on 2020/12/10 15:23:00 UTC

[jira] [Updated] (ARROW-10872) pyarrow.fs.HadoopFileSystem cannot access Azure Data Lake (ADLS)

     [ https://issues.apache.org/jira/browse/ARROW-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juan Galvez updated ARROW-10872:
--------------------------------
    Description: 
It's not possible to open an {{abfs://}} or {{abfss://}} URI with {{pyarrow.fs.HadoopFileSystem}}.

Using {{HadoopFileSystem.from_uri(path)}} does not work: libhdfs throws an error saying that the authority is invalid (I checked that this happens because the authority string is empty).

Note that the legacy {{pyarrow.hdfs.HadoopFileSystem}} interface works, for example via:
 * {{pyarrow.hdfs.HadoopFileSystem(host="abfs://xxx@xxx.dfs.core.windows.net")}}
 * {{pyarrow.hdfs.connect(host="abfs://xxx@xxx.dfs.core.windows.net")}}

I believe the new interface should work the same way, by passing the full URI as "host" to the {{pyarrow.fs.HadoopFileSystem}} constructor. However, the constructor wrongly prepends "hdfs://" to the host: [https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/python/pyarrow/_hdfs.pyx#L64]
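The failure mode can be sketched as follows. This is a minimal illustration, not the actual Arrow source: {{build_hdfs_uri}} is a hypothetical stand-in for the URI construction at the linked {{_hdfs.pyx}} line, and the default port is assumed for illustration only.

```python
# Hypothetical sketch of the URI construction done by the
# pyarrow.fs.HadoopFileSystem constructor (simplified; see the
# linked _hdfs.pyx line for the real code).
def build_hdfs_uri(host: str, port: int = 8020) -> str:
    # Bug: "hdfs://" is prepended unconditionally, even when `host`
    # already carries a scheme such as abfs:// or abfss://.
    return "hdfs://{}:{}".format(host, port)

host = "abfs://xxx@xxx.dfs.core.windows.net"
print(build_hdfs_uri(host))
# -> hdfs://abfs://xxx@xxx.dfs.core.windows.net:8020
```

The resulting URI has a doubled scheme, so libhdfs cannot parse a valid authority from it, which would explain the "authority is invalid" error above.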


> pyarrow.fs.HadoopFileSystem cannot access Azure Data Lake (ADLS)
> ----------------------------------------------------------------
>
>                 Key: ARROW-10872
>                 URL: https://issues.apache.org/jira/browse/ARROW-10872
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Juan Galvez
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)