Posted to dev@apex.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/03/09 09:28:40 UTC
[jira] [Commented] (APEXMALHAR-2008) Create hdfs file input module

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186748#comment-15186748 ] 

ASF GitHub Bot commented on APEXMALHAR-2008:
--------------------------------------------

GitHub user DT-Priyanka opened a pull request:

    https://github.com/apache/incubator-apex-malhar/pull/207

    APEXMALHAR-2008: Create HDFS File Reader module

    Adds an HDFS file reader module.
    1. The module reads a file or a list of files (a directory is also accepted) and emits the file blocks.
    2. The module can be configured to emit blocks in order or out of order.
    3. The module reads file blocks in parallel. The number of parallel readers is configurable; if it is not configured, the module increases or decreases the number of readers dynamically according to the input data rate.
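
The split-then-read-in-parallel idea described in the list above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this pull request: `BlockReadSketch`, `Block`, `split`, and `readInOrder` are hypothetical names, the "read" is simulated by building a string instead of touching HDFS, and a fixed-size thread pool stands in for the configurable number of readers (dynamic scaling is not shown).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names come from Malhar.
class BlockReadSketch {

    // Hypothetical block descriptor: which file, where the block starts, how long it is.
    static final class Block {
        final String file;
        final long offset;
        final long length;

        Block(String file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }
    }

    // Split a file of fileLength bytes into consecutive blocks of at most blockSize bytes.
    static List<Block> split(String file, long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            blocks.add(new Block(file, offset, Math.min(blockSize, fileLength - offset)));
        }
        return blocks;
    }

    // Process blocks with a fixed number of parallel readers. Results are collected
    // in submission order, which corresponds to the "in order" emission mode.
    static List<String> readInOrder(List<Block> blocks, int readers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (Block b : blocks) {
                // Simulated read: a real reader would fetch the block's bytes here.
                pending.add(pool.submit(() -> b.file + "@" + b.offset + "+" + b.length));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Block> blocks = split("example.dat", 250, 100);
        for (String line : readInOrder(blocks, 2)) {
            System.out.println(line);
        }
    }
}
```

An out-of-order variant could hand each result downstream as soon as its reader finishes (for example with `java.util.concurrent.ExecutorCompletionService`) instead of waiting on the futures in submission order.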
    
    Also updates FileSplitterInput with some improvements:
    1. Tracks the last file reference times of each folder separately to avoid duplicates (duplicates could arise when multiple files or subdirectories share the same relative path).
    2. Small code improvements.
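
One way to read the per-folder tracking described above is a map keyed first by scanned directory and then by relative path, so that identical relative paths under different directories do not shadow each other. The sketch below shows that reading only; `ReferenceTimeSketch` and `shouldProcess` are hypothetical names, and the actual change to FileSplitterInput may be structured differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the FileSplitterInput implementation.
class ReferenceTimeSketch {

    // scanned directory -> (relative path -> last modification time seen)
    private final Map<String, Map<String, Long>> referenceTimes = new HashMap<>();

    // Returns true if the file is new, or newer than the last time recorded for it
    // under this directory, and records the new time. Returns false otherwise.
    boolean shouldProcess(String directory, String relativePath, long modifiedTime) {
        Map<String, Long> perDirectory =
            referenceTimes.computeIfAbsent(directory, d -> new HashMap<>());
        Long lastSeen = perDirectory.get(relativePath);
        if (lastSeen != null && lastSeen >= modifiedTime) {
            return false;
        }
        perDirectory.put(relativePath, modifiedTime);
        return true;
    }

    public static void main(String[] args) {
        ReferenceTimeSketch tracker = new ReferenceTimeSketch();
        // Same relative path under two different scanned directories.
        System.out.println(tracker.shouldProcess("/input/a", "data/part-0", 10L));
        System.out.println(tracker.shouldProcess("/input/b", "data/part-0", 10L));
        // Re-scan of an unchanged file.
        System.out.println(tracker.shouldProcess("/input/a", "data/part-0", 10L));
    }
}
```

With a single map keyed only by relative path, the second call in `main` would be treated as already seen; keying by directory first keeps the two files distinct.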

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DT-Priyanka/incubator-apex-malhar APEXMALHAR-2008-hdfs-input-module

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-apex-malhar/pull/207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #207
    
----
commit 8ffb34abe48f525d401c3932d79ada6c71214e88
Author: Priyanka Gugale <pr...@datatorrent.com>
Date:   2016-03-08T08:42:13Z

    APEXMALHAR-2008: Create HDFS File Reader module

----


> Create hdfs file input module 
> ------------------------------
>
>                 Key: APEXMALHAR-2008
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2008
>             Project: Apache Apex Malhar
>          Issue Type: Task
>            Reporter: Priyanka Gugale
>            Assignee: Priyanka Gugale
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> To read HDFS files in parallel using Apex, we normally use the FileSplitter and FileReader operators. It would be a good idea to combine those operators into a single module. Such a module would give us a readily usable set of operators for reading HDFS files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)