Posted to issues@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2019/11/14 12:05:00 UTC
[jira] [Updated] (ARROW-6389) java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
[ https://issues.apache.org/jira/browse/ARROW-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-6389:
----------------------------------
Priority: Major (was: Blocker)
> java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
> ----------------------------------------------------------------
>
> Key: ARROW-6389
> URL: https://issues.apache.org/jira/browse/ARROW-6389
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java, Python
> Affects Versions: 0.14.1
> Environment: Hadoop 2.85
> EMR 5.24.1
> python version: 3.7.4
> skein version: 0.8.0
> Reporter: Ben Schreck
> Priority: Major
>
> I can't access HDFS through pyarrow from inside a YARN container created by skein.
> This code works in a Jupyter notebook running on the master node, or in an IPython terminal on a worker when given the ARROW_LIBHDFS_DIR env var:
> {{import pyarrow; pyarrow.hdfs.connect()}}
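A side note on why the snippet above is so sensitive to the environment: libhdfs starts the JVM through the JNI invocation API, and an embedded JVM, unlike the `java` launcher, does not expand wildcard classpath entries such as `/usr/lib/hadoop/lib/*`. A minimal sketch of the expansion step that goes missing (the `expand_classpath` helper is hypothetical, for illustration only):

```python
import glob
import os

def expand_classpath(cp, pathsep=os.pathsep):
    """Expand Java-style wildcard entries ("dir/*") into concrete .jar paths.

    The `java` launcher does this expansion automatically; a JVM embedded
    via JNI (as libhdfs uses) receives java.class.path verbatim, so a
    wildcard entry silently matches no jars at all.
    """
    expanded = []
    for entry in cp.split(pathsep):
        if not entry:
            continue  # skip empty entries, e.g. a leading ":"
        if entry.endswith("*"):
            # "dir/*" in Java classpath terms means every jar in dir
            expanded.extend(sorted(glob.glob(entry + ".jar")))
        else:
            expanded.append(entry)
    return pathsep.join(expanded)
```

If this is the root cause, the echoed HADOOP_CLASSPATH in the logs below (all wildcard entries, with a leading ":") would be effectively empty from the embedded JVM's point of view.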
>
> However, when running on YARN by submitting the following skein application, I get a Java error.
>
> {{name: test_conn
> queue: default
> master:
>   env:
>     ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
>     JAVA_HOME: /etc/alternatives/jre
>   resources:
>     vcores: 1
>     memory: 10 GiB
>   files:
>     conda_env: /home/hadoop/environment.tar.gz
>   script: |
>     echo $HADOOP_HOME
>     echo $JAVA_HOME
>     echo $HADOOP_CLASSPATH
>     echo $ARROW_LIBHDFS_DIR
>     source conda_env/bin/activate
>     python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
>     echo "Hello World!"}}
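If unexpanded wildcards are indeed the problem, one workaround sketch is to export an already-expanded CLASSPATH before the Python process starts. This assumes the `hadoop` CLI is on PATH inside the YARN container, which the report does not confirm; the `export_hadoop_classpath` helper below is hypothetical:

```python
import os
import subprocess

def export_hadoop_classpath(env=os.environ):
    """Put a wildcard-expanded Hadoop classpath into CLASSPATH.

    Assumption: the `hadoop` CLI is available inside the container.
    `hadoop classpath --glob` prints the classpath with wildcard entries
    expanded to concrete jar paths, which is what the embedded libhdfs
    JVM needs.  Call this before `import pyarrow`.
    """
    env["CLASSPATH"] = subprocess.check_output(
        ["hadoop", "classpath", "--glob"]).decode().strip()
```

The shell equivalent would be `export CLASSPATH=$(hadoop classpath --glob)` as the first line of the skein script.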
> FYI, I tried with and without all those extra env vars, to no effect. I also tried modifying the EMR cluster configuration with each of the following:
>
> {{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
> "fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
> "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}
> The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error: it was able to resolve which class name to use for the "hdfs://" scheme, namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but it could not load that class.
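For reference, the conventional Hadoop mapping appears to be the reverse of the first two values tried above: `org.apache.hadoop.fs.Hdfs` extends AbstractFileSystem, while `org.apache.hadoop.hdfs.DistributedFileSystem` extends FileSystem. A hedged sketch of what `extra_conf` could carry, keeping in mind that these keys only help once the class itself is on the classpath:

```python
# Sketch of the conventional mapping: FileSystem implementations are
# configured under fs.<scheme>.impl, AbstractFileSystem implementations
# under fs.AbstractFileSystem.<scheme>.impl.
hdfs_impl_conf = {
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.fs.Hdfs",
}

# This dict could be passed as pyarrow.hdfs.connect(extra_conf=hdfs_impl_conf);
# the traceback in the logs shows that pyarrow/hdfs.py forwards extra_conf
# to the underlying libhdfs connection.
```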
> Logs:
>
> {{=========================================================================================
> LogType:application.driver.log
> Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
> LogLength:2635
> Log Contents:
> /usr/lib/hadoop
> /usr/lib/jvm/java-openjdk
> :/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
> hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
> java.io.IOException: No FileSystem for scheme: hdfs
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
> at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
> at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
> at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
> at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
> Traceback (most recent call last):
> File "<string>", line 1, in <module>
> File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
> extra_conf=extra_conf)
> File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
> self._connect(host, port, user, kerb_ticket, driver, extra_conf)
> File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS connection failed
> Hello World!
> End of LogType:application.driver.log
> LogType:application.master.log
> Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
> LogLength:1588
> Log Contents:
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Running as user hadoop
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Application specification successfully loaded
> 19/08/29 20:51:56 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8030
> 19/08/29 20:51:56 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
> 19/08/29 20:51:56 INFO skein.ApplicationMaster: gRPC server started at IP.ec2.internal:39361
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: WebUI server started at IP.ec2.internal:36511
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Registering application with resource manager
> 19/08/29 20:51:57 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8032
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Starting application driver
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
> 19/08/29 20:51:57 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://IP.ec2.internal:8020/user/hadoop/.skein/application_1567110830725_0001
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: WebUI server shut down
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: gRPC server shut down
> End of LogType:application.master.log}}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)