Posted to issues@storm.apache.org by "Tibor Kiss (JIRA)" <ji...@apache.org> on 2017/02/12 10:14:41 UTC
[jira] [Comment Edited] (STORM-2355) Storm-HDFS: inotify support
[ https://issues.apache.org/jira/browse/STORM-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862722#comment-15862722 ]
Tibor Kiss edited comment on STORM-2355 at 2/12/17 10:13 AM:
-------------------------------------------------------------
An initial implementation for the 1.0.x-branch can be found here: https://github.com/tibkiss/storm/commit/d916d6f904ea085ebdaf5ada2a9c0607794d3c50
Note that I needed to lower the Guava version to 14.0.1 to be compatible with HDFS.
I have also bumped the Hadoop version to 2.7.3.
The implementation was tested with unit tests and in a three-node dockerized cluster, using Flux and a simple passthrough topology via Storm-Spout & Storm Bolt.
Using inotify, the load on HDFS was reduced by 15%. Nonetheless, a more precise performance measurement in a non-dockerized environment is still needed.
was (Author: tibor.kiss@gmail.com):
An initial implementation for the 1.0.x-branch can be found here: https://github.com/tibkiss/storm/commit/d916d6f904ea085ebdaf5ada2a9c0607794d3c50
Note that I needed to lower the Guava version to 14.0.1 to be compatible with HDFS.
I have also bumped the Hadoop version to 2.7.3.
> Storm-HDFS: inotify support
> ---------------------------
>
> Key: STORM-2355
> URL: https://issues.apache.org/jira/browse/STORM-2355
> Project: Apache Storm
> Issue Type: New Feature
> Components: storm-hdfs
> Reporter: Tibor Kiss
> Assignee: Tibor Kiss
> Fix For: 2.0.0, 1.1.0
>
>
> This is a proposal to implement inotify-based watch directory monitoring in the Storm-HDFS Spout.
> *Motivation*
> Storm-HDFS currently polls the input directory using Hadoop's {{FileSystem.listFiles}}. This operation is expensive because it returns the block locations and all stat information of the files inside the watch directory. Storm-HDFS currently uses the Path of only one element of the returned list, which is inefficient.
> *Proposed improvement*
> Provide a way to monitor the input directory through HDFS's inotify API.
> To stay backward compatible with the poll-based solution, I propose a new class ({{HdfsDirectoryMonitor}}) that implements both the inotify-based and the poll-based solution behind an iterator. The user can enable inotify-based monitoring through a configuration parameter.
> *Caveat*
> HDFS inotify is currently only available to the root user, but there is an ongoing discussion in the Hadoop community about extending its support to other users. See: HDFS-8940
> *Testing related changes*
> The {{TestHdfsSpout}} test case should be parameterized to check both the poll-based and the inotify-based solution.
> *Further work*
> If the design is accepted, the poll-based solution could easily be improved through {{HdfsDirectoryMonitor}} to properly use all the returned items from the work directory (similar to the inotify-based solution). Such an improvement would reduce the number of calls made to {{FileSystem.listFiles}}.
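The iterator design described in the proposal can be illustrated with a minimal, Hadoop-free sketch. This is not the code from the linked commit: the class name DirectoryMonitorSketch, the create factory, and the two Supplier parameters are hypothetical stand-ins. In a real implementation the poll source would presumably wrap FileSystem.listFiles and the inotify source would drain the HDFS inotify event stream; here both are plain suppliers of path strings so the example runs with the JDK alone.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of a directory monitor that hides the discovery
 * mechanism (polling or inotify) behind a single Iterator.
 */
final class DirectoryMonitorSketch implements Iterator<String> {

    // Assumed stand-in: returns the paths discovered since the last call.
    // Poll mode: a full directory listing. Inotify mode: drained events.
    private final Supplier<List<String>> source;

    // Paths already fetched but not yet handed to the caller.
    private final Queue<String> pending = new ArrayDeque<>();

    private DirectoryMonitorSketch(Supplier<List<String>> source) {
        this.source = source;
    }

    /** Selects the back end from a flag, as a configuration parameter might. */
    static DirectoryMonitorSketch create(boolean useInotify,
                                         Supplier<List<String>> listDirectory,
                                         Supplier<List<String>> drainEvents) {
        return new DirectoryMonitorSketch(useInotify ? drainEvents : listDirectory);
    }

    @Override
    public boolean hasNext() {
        // Ask the source again only when every buffered path has been
        // consumed, so one listing serves many next() calls.
        if (pending.isEmpty()) {
            pending.addAll(source.get());
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no new files in watch directory");
        }
        return pending.remove();
    }
}
```

Because the buffer is refilled only when it is empty, a listing that returns three files costs one call to the source instead of three, which is the saving described under "Further work". The sketch assumes the caller moves each consumed file out of the watch directory before the next listing; otherwise poll mode would return the same paths again.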