Posted to issues@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2019/11/14 12:05:00 UTC

[jira] [Updated] (ARROW-6389) java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]

     [ https://issues.apache.org/jira/browse/ARROW-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-6389:
----------------------------------
    Priority: Major  (was: Blocker)

> java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
> ----------------------------------------------------------------
>
>                 Key: ARROW-6389
>                 URL: https://issues.apache.org/jira/browse/ARROW-6389
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java, Python
>    Affects Versions: 0.14.1
>         Environment: Hadoop 2.8.5
> EMR 5.24.1
> python version: 3.7.4
> skein version: 0.8.0
>            Reporter: Ben Schreck
>            Priority: Major
>
> I can't access HDFS through pyarrow (from inside a YARN container created by Skein).
> This code works in a Jupyter notebook running on the master node, or in an IPython terminal on a worker, when given the {{ARROW_LIBHDFS_DIR}} env var:
> {{import pyarrow; pyarrow.hdfs.connect()}}
>  
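> For context, {{pyarrow.hdfs.connect()}} drives libhdfs, which starts an embedded JVM whose classpath is taken from the {{CLASSPATH}} environment variable; wildcard entries are not expanded there, which is why {{hadoop classpath --glob}} is the usual recommendation. A minimal sketch of the moving parts (the paths are assumptions for this EMR layout, not verified here):
>  
> {{import os
> import subprocess
> import pyarrow
> # Native libhdfs location on EMR (assumed for this cluster layout)
> os.environ["ARROW_LIBHDFS_DIR"] = "/usr/lib/hadoop/lib/native"
> # Fully expanded Hadoop classpath; the embedded JVM does not expand wildcards
> os.environ["CLASSPATH"] = subprocess.check_output(
>     ["hadoop", "classpath", "--glob"]).decode().strip()
> fs = pyarrow.hdfs.connect()
> print(fs.ls("/"))}}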
> However, when running on yarn by submitting the following skein application, I get a Java error.
>  
> {{name: test_conn
> queue: default
> master:
>   env:
>     ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
>     JAVA_HOME: /etc/alternatives/jre
>   resources:
>     vcores: 1
>     memory: 10 GiB
>   files:
>     conda_env: /home/hadoop/environment.tar.gz
>   script: |
>     echo $HADOOP_HOME
>     echo $JAVA_HOME
>     echo $HADOOP_CLASSPATH
>     echo $ARROW_LIBHDFS_DIR
>     source conda_env/bin/activate
>     python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
>     echo "Hello World!"}}
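> A variant of the script section that exports a fully expanded classpath before the Python call would look like this (an untested sketch; it assumes {{hadoop}} is on {{PATH}} inside the container):
>  
> {{  script: |
>     export CLASSPATH="$(hadoop classpath --glob)"
>     source conda_env/bin/activate
>     python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"}}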
> FYI I tried with/without all those extra env vars, to no effect. I also tried modifying the EMR cluster configuration with each of the following:
>  
> {{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
> "fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
> "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}
> The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error: it could determine which class to use for the "hdfs://" scheme, namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but it could not load that class.
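> For reference, the two keys expect different classes: {{fs.hdfs.impl}} takes a {{FileSystem}} subclass, while {{fs.AbstractFileSystem.hdfs.impl}} takes an {{AbstractFileSystem}} subclass, so the conventional pairing (still assuming the hadoop-hdfs jar is on the classpath at all) would be:
>  
> {{"fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
> "fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.fs.Hdfs"}}
> That said, "No FileSystem for scheme: hdfs" typically means the hadoop-hdfs jar is not visible to the JVM at all, rather than a misconfigured key.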
> Logs:
>  
> {{=========================================================================================
> LogType:application.driver.log
> Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
> LogLength:2635
> Log Contents:
> /usr/lib/hadoop
> /usr/lib/jvm/java-openjdk
> :/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
> hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
> java.io.IOException: No FileSystem for scheme: hdfs
>         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
>         at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
>         at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
>         at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
>         at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>         at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
>     extra_conf=extra_conf)
>   File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS connection failed
> Hello World!
> End of LogType:application.driver.log
> LogType:application.master.log
> Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
> LogLength:1588
> Log Contents:
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Running as user hadoop
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Application specification successfully loaded
> 19/08/29 20:51:56 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8030
> 19/08/29 20:51:56 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
> 19/08/29 20:51:56 INFO skein.ApplicationMaster: gRPC server started at IP.ec2.internal:39361
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: WebUI server started at IP.ec2.internal:36511
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Registering application with resource manager
> 19/08/29 20:51:57 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8032
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Starting application driver
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
> 19/08/29 20:51:57 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://IP.ec2.internal:8020/user/hadoop/.skein/application_1567110830725_0001
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: WebUI server shut down
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: gRPC server shut down
> End of LogType:application.master.log}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)