Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/06/16 14:16:00 UTC

[jira] [Comment Edited] (ARROW-13011) [Python] Using fs.HadoopFileSystem in the dask tests crashes

    [ https://issues.apache.org/jira/browse/ARROW-13011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364322#comment-17364322 ] 

Joris Van den Bossche edited comment on ARROW-13011 at 6/16/21, 2:15 PM:
-------------------------------------------------------------------------

I could reproduce it with the dask docker image, and could also slim it down further to a small reproducer. The segfault seems to occur when creating the HadoopFileSystem multiple times without the proper environment variable (CLASSPATH) set up:

{code:python}
>>> from pyarrow.fs import HadoopFileSystem
>>> hdfs = HadoopFileSystem(host="localhost", port=8020)
Environment variable CLASSPATH not set!
getJNIEnv: getGlobalJNIEnv failed
../src/arrow/filesystem/hdfs.cc:51: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed, errno: 9 (Bad file descriptor)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_hdfs.pyx", line 83, in pyarrow._hdfs.HadoopFileSystem.__init__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS connection failed
>>> hdfs = HadoopFileSystem(host="localhost", port=8020)
Segmentation fault (core dumped)
{code}
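
As an aside, the initial connection error itself can be avoided by populating CLASSPATH before constructing the filesystem. A minimal sketch, assuming the {{hadoop}} CLI is on the PATH (this only avoids the trigger; it does not address the segfault on the second construction):

{code:python}
import os
import subprocess

# Workaround sketch (assumes the `hadoop` CLI is installed): set CLASSPATH
# to the Hadoop jars, as libhdfs requires, before creating the filesystem.
os.environ["CLASSPATH"] = subprocess.check_output(
    ["hadoop", "classpath", "--glob"]).decode().strip()

from pyarrow.fs import HadoopFileSystem
hdfs = HadoopFileSystem(host="localhost", port=8020)
{code}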


> [Python] Using fs.HadoopFileSystem in the dask tests crashes
> ------------------------------------------------------------
>
>                 Key: ARROW-13011
>                 URL: https://issues.apache.org/jira/browse/ARROW-13011
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: filesystem, hdfs
>
> See [https://github.com/dask/dask/pull/7752#issuecomment-856231163] and discussion below.
>  
> I haven't investigated yet (and I also think dask cannot yet use the new filesystem, since it has a different API, but it should nonetheless not crash ...). There is a docker workflow to reproduce the tests that I can try.
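
For context on the API difference mentioned above, a rough sketch of the two interfaces (the path here is hypothetical; the legacy API lives in pyarrow.hdfs, the new one in pyarrow.fs):

{code:python}
# Legacy filesystem API (deprecated), the one dask was using at the time:
import pyarrow.hdfs
fs = pyarrow.hdfs.connect(host="localhost", port=8020)
files = fs.ls("/data")                    # list a directory
with fs.open("/data/file.parquet") as f:  # open a file for reading
    payload = f.read()

# New filesystem API, with different method names and semantics:
from pyarrow.fs import HadoopFileSystem, FileSelector
fs = HadoopFileSystem(host="localhost", port=8020)
infos = fs.get_file_info(FileSelector("/data"))  # list a directory
with fs.open_input_file("/data/file.parquet") as f:
    payload = f.read()
{code}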



--
This message was sent by Atlassian Jira
(v8.3.4#803005)