Posted to issues@arrow.apache.org by "Michal Danko (JIRA)" <ji...@apache.org> on 2018/02/08 11:27:00 UTC
[jira] [Updated] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"

     [ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michal Danko updated ARROW-2113:
--------------------------------
    Description: 
Steps to replicate the issue:

mkdir /tmp/test
 cd /tmp/test
 mkdir jars
 cd jars
 touch test1.jar
 mkdir -p ../lib/zookeeper
 cd ../lib/zookeeper
 ln -s ../../jars/test1.jar ./test1.jar
 ln -s test1.jar test.jar
 mkdir -p ../hadoop/lib
 cd ../hadoop/lib
 ln -s ../../../lib/zookeeper/test.jar ./test.jar
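
As a cross-check of the layout above, the same symlink chain can be rebuilt from Python in a scratch directory. This is only a sketch: the helper name build_symlink_layout is made up for illustration, and it works in a temporary directory rather than /tmp/test. It is meant to show that both test.jar links resolve to the same jars/test1.jar file.

```python
import os
import shutil
import tempfile


def build_symlink_layout(root):
    """Recreate jars/, lib/zookeeper/ and lib/hadoop/lib/ under root.

    Hypothetical helper mirroring the shell steps; returns the two
    test.jar paths that are later used as CLASSPATH values.
    """
    jars = os.path.join(root, "jars")
    zookeeper = os.path.join(root, "lib", "zookeeper")
    hadoop_lib = os.path.join(root, "lib", "hadoop", "lib")
    for directory in (jars, zookeeper, hadoop_lib):
        os.makedirs(directory)

    # touch test1.jar
    open(os.path.join(jars, "test1.jar"), "w").close()

    # Same relative link targets as in the shell steps.
    os.symlink("../../jars/test1.jar", os.path.join(zookeeper, "test1.jar"))
    os.symlink("test1.jar", os.path.join(zookeeper, "test.jar"))
    os.symlink("../../../lib/zookeeper/test.jar",
               os.path.join(hadoop_lib, "test.jar"))

    return (os.path.join(zookeeper, "test.jar"),
            os.path.join(hadoop_lib, "test.jar"))


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    try:
        zookeeper_jar, hadoop_jar = build_symlink_layout(scratch)
        # Both lines should print the same canonical path.
        print(os.path.realpath(zookeeper_jar))
        print(os.path.realpath(hadoop_jar))
    finally:
        shutil.rmtree(scratch)
```

If the two printed paths are identical, the difference in behaviour between the two CLASSPATH values cannot come from the file itself, only from how the path is resolved.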

(This part depends on your configuration; pyarrow.hdfs needs these values in order to work:)

(path to libjvm:)

(export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera)

(path to libhdfs:)

(export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/)

export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"

python
 import pyarrow.hdfs as hdfs;
 fs = hdfs.connect(user="hdfs")

 

Ends with error:

------------
 loadFileSystems error:
 (unable to get root cause for java.lang.NoClassDefFoundError)
 (unable to get stack trace for java.lang.NoClassDefFoundError)
 hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error:
 (unable to get root cause for java.lang.NoClassDefFoundError)
 (unable to get stack trace for java.lang.NoClassDefFoundError)
 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect
 kerb_ticket=kerb_ticket, driver=driver)
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__
 self._connect(host, port, user, kerb_ticket, driver)
 File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
 File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
 pyarrow.lib.ArrowIOError: HDFS connection failed
 -------------

 

export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
 python
 import pyarrow.hdfs as hdfs;
 fs = hdfs.connect(user="hdfs")

 

Works properly.

 

I can't find the reason why the first CLASSPATH fails and the second one works: both are paths to the same .jar, and the first merely goes through one extra symlink. To me, it looks like pyarrow.lib.check has a problem with symlinks defined with many ../../.. components.

I would expect pyarrow to work with any form of path to the .jar.
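
A possible workaround, not verified against this bug, is to canonicalize every CLASSPATH entry before the first hdfs.connect() call, so that the JVM only ever sees symlink-free paths. The sketch below assumes that the symlink chain is what trips up class loading; the helper name resolve_classpath is hypothetical and not part of pyarrow.

```python
import os


def resolve_classpath(classpath):
    """Return classpath with each existing entry replaced by its real path.

    Entries that do not exist on disk (for example wildcard entries such
    as /some/dir/*) are passed through unchanged.
    """
    resolved = []
    for entry in classpath.split(os.pathsep):
        if entry and os.path.exists(entry):
            # realpath follows every symlink in the chain.
            entry = os.path.realpath(entry)
        resolved.append(entry)
    return os.pathsep.join(resolved)


# Intended use (assumption: CLASSPATH is read when the JVM is started,
# so it has to be rewritten before the first connection attempt):
#
#   os.environ["CLASSPATH"] = resolve_classpath(os.environ.get("CLASSPATH", ""))
#   import pyarrow.hdfs as hdfs
#   fs = hdfs.connect(user="hdfs")
```

With the layout from the steps above, /tmp/test/lib/hadoop/lib/test.jar would be rewritten to the path of jars/test1.jar, which removes the relative symlinks from the picture entirely.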

Please note that these paths are not made up; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar).

Because of this issue, our customer currently can't use the pyarrow library for Oozie workflows.

  was:
Steps to replicate the issue:

mkdir /tmp/test
cd /tmp/test
mkdir jars
cd jars
touch test1.jar
mkdir -p ../lib/zookeeper
cd ../lib/zookeeper
ln -s ../../jars/test1.jar ./test1.jar
ln -s test1.jar test.jar
mkdir -p ../hadoop/lib
cd ../hadoop/lib
ln -s ../../../lib/zookeeper/test.jar ./test.jar

export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"

python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

 

Ends with error:

------------
loadFileSystems error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect
 kerb_ticket=kerb_ticket, driver=driver)
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__
 self._connect(host, port, user, kerb_ticket, driver)
 File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
 File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
pyarrow.lib.ArrowIOError: HDFS connection failed
-------------

 

export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

 

Works properly.

 

I can't find the reason why the first CLASSPATH fails and the second one works: both are paths to the same .jar, and the first merely goes through one extra symlink. To me, it looks like pyarrow.lib.check has a problem with symlinks defined with many ../../.. components.

I would expect pyarrow to work with any form of path to the .jar.

Please note that these paths are not made up; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar).

Because of this issue, our customer currently can't use the pyarrow library for Oozie workflows.


> [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-2113
>                 URL: https://issues.apache.org/jira/browse/ARROW-2113
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>         Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 5.13.1
>            Reporter: Michal Danko
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)