Posted to issues@storm.apache.org by "Tibor Kiss (JIRA)" <ji...@apache.org> on 2017/02/12 10:14:41 UTC

[jira] [Comment Edited] (STORM-2355) Storm-HDFS: inotify support

    [ https://issues.apache.org/jira/browse/STORM-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862722#comment-15862722 ] 

Tibor Kiss edited comment on STORM-2355 at 2/12/17 10:13 AM:
-------------------------------------------------------------

An initial implementation for the 1.0.x-branch can be found here: https://github.com/tibkiss/storm/commit/d916d6f904ea085ebdaf5ada2a9c0607794d3c50

Note that I needed to lower the Guava version (to 14.0.1) to be compatible with HDFS.
I have also bumped the Hadoop version to 2.7.3.

The implementation was tested with unit tests and in a three-node dockerized cluster, using Flux and a simple passthrough topology consisting of a spout and a bolt.
With inotify, the load on HDFS was reduced by 15%. Nonetheless, a more precise performance measurement in a non-dockerized environment is still needed.


was (Author: tibor.kiss@gmail.com):
Initial implementation for 1.0.x-branch could be found here: https://github.com/tibkiss/storm/commit/d916d6f904ea085ebdaf5ada2a9c0607794d3c50

Note that I needed to lower guava version to be hdfs compatible (14.0.1).
I have also bumped Hadoop version to 2.7.3.


> Storm-HDFS: inotify support
> ---------------------------
>
>                 Key: STORM-2355
>                 URL: https://issues.apache.org/jira/browse/STORM-2355
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-hdfs
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>             Fix For: 2.0.0, 1.1.0
>
>
> This is a proposal to implement inotify-based watch directory monitoring in the Storm-HDFS spout.
> *Motivation*
> Storm-HDFS currently polls the input directory using Hadoop's {{FileSystem.listFiles}}. This operation is expensive because it returns the block locations and all stat information for the files inside the watch directory. Storm-HDFS then uses the Path of only one element of the returned list, which is inefficient.
> *Proposed improvement*
> Provide a way to monitor the input directory through HDFS's inotify API.
> To stay backward compatible with the poll-based solution, I propose a new class ({{HdfsDirectoryMonitor}}) that implements both the inotify-based and the poll-based solution behind an iterator. The user can enable inotify-based monitoring through a configuration parameter.
> *Caveat*
> HDFS inotify is currently available only to the root user, but there is an ongoing discussion in the Hadoop community about extending its support to other users. See: HDFS-8940
> *Testing related changes*
> The {{TestHdfsSpout}} test case should be parameterized to cover both the poll-based and the inotify-based solution.
> *Further work*
> If the design is accepted, the poll-based solution could easily be improved through {{HdfsDirectoryMonitor}} to properly use all the items returned from the work directory (similar to the inotify-based solution). Such an improvement would reduce the number of calls made to {{FileSystem.listFiles}}.
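
The iterator idea from the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the linked commit: the class name DirectoryMonitorSketch, the create() factory, the boolean flag standing in for the configuration parameter, and the Supplier-based sources are all hypothetical, and no Hadoop API is called. Each source is assumed to return only paths that have not been handed out yet. The sketch also shows the "Further work" point: every item returned by one call to the source is buffered and consumed before the source is asked again.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: a single iterator facade over either a poll-based
 * or an event-based (inotify-like) source of new file paths.
 */
final class DirectoryMonitorSketch implements Iterator<String> {
    // Assumption: each call returns newly discovered paths (possibly none).
    private final Supplier<List<String>> source;
    // Buffer of paths already fetched but not yet handed to the caller.
    private final Deque<String> pending = new ArrayDeque<>();

    private DirectoryMonitorSketch(Supplier<List<String>> source) {
        this.source = source;
    }

    /** The boolean stands in for the configuration parameter from the proposal. */
    static DirectoryMonitorSketch create(boolean useInotify,
                                         Supplier<List<String>> pollSource,
                                         Supplier<List<String>> inotifySource) {
        return new DirectoryMonitorSketch(useInotify ? inotifySource : pollSource);
    }

    @Override
    public boolean hasNext() {
        // Only go back to the source once the buffer is drained, so one
        // listing (or one batch of events) serves many next() calls.
        if (pending.isEmpty()) {
            pending.addAll(source.get());
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no new files in the watch directory");
        }
        return pending.poll();
    }
}
```

In a real spout, the poll source would presumably wrap {{FileSystem.listFiles}} and the inotify source would wrap the HDFS inotify event stream; the caller only sees the iterator and does not need to know which one is active.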


