Posted to issues@nifi.apache.org by "BELUGA BEHR (JIRA)" <ji...@apache.org> on 2019/01/05 01:40:00 UTC

[jira] [Commented] (NIFI-5452) Enable HDFS-13448 in HDFS Sink

    [ https://issues.apache.org/jira/browse/NIFI-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734726#comment-16734726 ] 

BELUGA BEHR commented on NIFI-5452:
-----------------------------------

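NiFi needs to pass the {{IGNORE_CLIENT_LOCALITY}} create flag to the HDFS client when it creates files. See the flag definition in {{CreateFlag}} and the corresponding code in {{DistributedFileSystem}}: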

 

https://github.com/apache/hadoop/blob/7b57f2f71fbaa5af4897309597cca70a95b04edd/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CreateFlag.java#L126

 

https://github.com/apache/hadoop/blob/788e7473a404fa074b3af522416ee3d2fae865a0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L525

> Enable HDFS-13448 in HDFS Sink
> ------------------------------
>
>                 Key: NIFI-5452
>                 URL: https://issues.apache.org/jira/browse/NIFI-5452
>             Project: Apache NiFi
>          Issue Type: New Feature
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> Now that [HDFS-13448] is available, add a new boolean property to the HDFS Sink configuration that enables it.
> The basic issue, as it currently stands, is the following:
> Imagine a cluster that has four racks of hardware:
> # Rack A is half management nodes and half DataNodes
> # Racks B, C, and D are all DataNodes
> Now consider the following scenarios:
> If an instance of NiFi is located on a server outside of these racks, the data will be evenly distributed to each DataNode.
> If an instance of NiFi is running on Rack A, and is running co-located with a DataNode, then all of the HDFS Sink writes will first go to the local DataNode, thus overloading this single DataNode and filling it faster than all other DataNodes in the cluster.
> If an instance of NiFi is running on Rack A, on its own server, then all of the HDFS Sink writes will first go to a DataNode on Rack A, thus overloading the DataNodes on Rack A and filling those DataNodes faster than all other DataNodes in the cluster.  The issue is compounded when the cluster has many racks: Rack A will always receive one copy of each block, while the other two copies are scattered equally across the other racks.
> [HDFS-13448] adds a new flag to the HDFS client that asks the NameNode to always place the first replica of each block randomly.  Thus, if a NiFi instance is located on Rack A, the local node (or local rack) will not be overloaded.
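
The rack-level arithmetic in the description above can be sketched with a small simulation. This is only an illustration under simplifying assumptions, not the NameNode's actual placement code: replication factor 3, the first replica goes either to the writer's rack or to a random rack, and the other two replicas go together to one random rack other than the first. The class name {{PlacementSketch}} and its {{simulate}} method are hypothetical and do not exist in NiFi or Hadoop.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical rack-level model of block placement; not Hadoop code. */
final class PlacementSketch {

    /**
     * Counts replicas per rack after writing the given number of blocks
     * with replication factor 3.
     *
     * @param racks                number of racks in the cluster
     * @param writerRack           index of the rack the writer (NiFi) sits on
     * @param ignoreClientLocality if true, the first replica lands on a random rack
     * @param blocks               number of blocks written
     * @param seed                 seed for reproducible runs
     */
    static long[] simulate(int racks, int writerRack, boolean ignoreClientLocality,
                           int blocks, long seed) {
        Random rng = new Random(seed);
        long[] replicasPerRack = new long[racks];
        for (int b = 0; b < blocks; b++) {
            // First replica: writer's rack, unless client locality is ignored.
            int first = ignoreClientLocality ? rng.nextInt(racks) : writerRack;
            replicasPerRack[first]++;

            // Assumption: the two remaining replicas share one random rack
            // that differs from the rack holding the first replica.
            int remote = rng.nextInt(racks - 1);
            if (remote >= first) {
                remote++;
            }
            replicasPerRack[remote] += 2;
        }
        return replicasPerRack;
    }

    public static void main(String[] args) {
        int blocks = 100_000;
        // Rack A is index 0 and hosts the writer.
        long[] localFirst = simulate(4, 0, false, blocks, 42L);
        long[] randomFirst = simulate(4, 0, true, blocks, 42L);
        System.out.println("local first replica:  " + Arrays.toString(localFirst));
        System.out.println("random first replica: " + Arrays.toString(randomFirst));
    }
}
```

Under this model, a writer on Rack A puts exactly one replica of every block on Rack A, while each other rack receives about two thirds of a replica per block. Because Rack A has only half as many DataNodes, the per-node imbalance is larger still. With a random first replica, every rack converges to about three quarters of a replica per block.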


