Posted to issues@spark.apache.org by "Dhruve Ashar (JIRA)" <ji...@apache.org> on 2019/05/24 19:48:01 UTC

[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

    [ https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847854#comment-16847854 ] 

Dhruve Ashar commented on SPARK-24149:
--------------------------------------

[~mgaido] [~vanzin],

The change this PR introduced tries to explicitly figure out the list of namenodes from the Hadoop configs. I think we are duplicating that logic here, which makes the code confusing to follow: resolving the necessary namenodes should be transparent to the client.
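
For illustration, the kind of config-driven discovery being described might look roughly like this (a hypothetical Scala sketch, not the actual patch; "dfs.nameservices" is the standard HDFS federation key):

    import org.apache.hadoop.conf.Configuration

    // Hypothetical sketch: enumerate the configured nameservices and
    // derive one logical URI per nameservice. Hadoop already knows how
    // to do this resolution itself, hence the duplication concern.
    def namenodesFromConfig(conf: Configuration): Seq[String] =
      conf.getTrimmedStrings("dfs.nameservices").toSeq
        .map(ns => s"hdfs://$ns")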


Rationale:

- HDFS federation is used to store data from different namespaces on the same datanodes (mostly used with unrelated namespaces).

- ViewFS, on the other hand, is used for better namespace management by placing different namespaces on different namenodes. In that case you should always access the data through viewfs://, which takes care of getting the tokens for you (note: this may or may not use HDFS federation); see the sketch below.
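
A minimal sketch of that flow (assuming a viewfs mount table named "clusterX" is configured in core-site.xml; the path and the "yarn" renewer are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.security.Credentials

    val conf = new Configuration()
    val creds = new Credentials()
    // Hadoop resolves the mount table behind viewfs://, and
    // addDelegationTokens collects tokens from every file system
    // backing the view, so the client never lists namenodes itself.
    val fs = new Path("viewfs://clusterX/user/data").getFileSystem(conf)
    fs.addDelegationTokens("yarn", creds)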

In either case we should rely on Hadoop to resolve the requested namenodes for us.

In the use case where we want to access unrelated namespaces (often scenarios where different Hive tables are stored in different namespaces), we already have a config to pass in the other namenodes, as shown below, so we really don't need this change.
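
For example (the namenode URIs below are placeholders):

    spark-submit \
      --conf spark.yarn.access.hadoopFileSystems=hdfs://nn1.example.com:8020,hdfs://nn2.example.com:8020 \
      ...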


There was a follow-up PR that fixed an issue caused by this behavior, by getting the FS only for the specified namenodes. IMHO both of these changes are unnecessary and we should revert to the original behavior.


Thoughts, comments?

> Automatic namespaces discovery in HDFS federation
> -------------------------------------------------
>
>                 Key: SPARK-24149
>                 URL: https://issues.apache.org/jira/browse/SPARK-24149
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 2.4.0
>            Reporter: Marco Gaido
>            Assignee: Marco Gaido
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write to different namespaces when Hadoop federation is turned on and the cluster is secure. This happens because Spark looks for the delegation token only for the configured defaultFS and not for all the available namespaces. A workaround is to use the property {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org