Posted to dev@apex.apache.org by Priyanka Gugale <pr...@datatorrent.com> on 2016/03/02 12:20:46 UTC

Re: HDFS File Reader Module

I am planning to put this module in the malhar-library project, in
package com.datatorrent.lib.io.fs.
Let me know if this is acceptable.

-Priyanka

On Tue, Feb 23, 2016 at 6:45 PM, Priyanka Gugale <pr...@datatorrent.com>
wrote:

> I haven't created a branch yet; I will share it with you as soon as I
> add the code for the module.
> I would surely be happy to help :)
>
> -Priyanka
>
> On Tue, Feb 23, 2016 at 6:26 PM, Yogi Devendra <yo...@apache.org>
> wrote:
>
>> Priyanka,
>>
>> Thanks for the update. I will consider these ports during the design phase
>> of my proposal for HDFS file copy module.
>>
>> I believe you are planning to add this to Apex Malhar. Please post any link
>> / private branch (if any) where I can have a look at the first cut.
>>
>> I will ask for your help if I come across any questions, uncertainties etc.
>>
>> ~ Yogi
>>
>> On 23 February 2016 at 17:59, Priyanka Gugale <pr...@datatorrent.com>
>> wrote:
>>
>> > I am planning to have the following ports on this module:
>> >
>> > Ports
>> > Input port: None
>> >
>> > Output ports:
>> >
>> >    1. FileMetadata
>> >    2. BlockMetadata
>> >    3. Block bytes
>> >
>> > -Priyanka
>> >
>> > On Tue, Feb 23, 2016 at 2:16 PM, Yogi Devendra <yogidevendra@apache.org>
>> > wrote:
>> >
>> > > Priyanka,
>> > >
>> > > Can you please share details about what would be the output ports from
>> > > this module?
>> > >
>> > > I am thinking of HDFS File Copy Module which can be used in conjunction
>> > > with this module to copy files from HDFS to HDFS.
>> > >
>> > > ~ Yogi
>> > >
>> > > On 18 February 2016 at 10:29, Mohit Jotwani <mo...@datatorrent.com>
>> > > wrote:
>> > >
>> > > > +1 to add this.
>> > > >
>> > > > Regards,
>> > > > Mohit
>> > > > On 17 Feb 2016 23:30, "Pramod Immaneni" <pr...@datatorrent.com>
>> > > > wrote:
>> > > >
>> > > > > +1 to add this module
>> > > > >
>> > > > > On Wed, Feb 17, 2016 at 9:21 AM, Priyanka Gugale
>> > > > > <priyanka@datatorrent.com> wrote:
>> > > > >
>> > > > > > We need partitions for parallel read, but how will the reader
>> > > > > > partition know which offset of the file it should read from?
>> > > > > > Normally FileSplitter creates this metadata (let's call them
>> > > > > > reader tasks) and forwards it to the next operator, which is the
>> > > > > > block reader. The block reader will receive one of the tasks and
>> > > > > > read from the specified offset in the file. If FileSplitter is
>> > > > > > absent, one reader partition will have to consume one file
>> > > > > > entirely, which means we can't have parallel reading over one
>> > > > > > file. I hope this answers your question.
>> > > > > >
>> > > > > > The advantage of having this module is having a reusable component
>> > > > > > made up of operators which are frequently used together to do file
>> > > > > > reading.
>> > > > > >
>> > > > > > -Priyanka
>> > > > > >
>> > > > > > On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra
>> > > > > > <yogidevendra@apache.org> wrote:
>> > > > > >
>> > > > > > > Let me rephrase Ram's question to make it clear:
>> > > > > > >
>> > > > > > > For an application developer using Malhar:
>> > > > > > > What are the advantages / disadvantages of using the proposed
>> > > > > > > HDFS File input Module as compared to directly using
>> > > > > > > FileSplitter, BlockReader Operators available in Malhar?
>> > > > > > >
>> > > > > > > ~ Yogi
>> > > > > > >
>> > > > > > > On 16 February 2016 at 21:56, Munagala Ramanath
>> > > > > > > <ram@datatorrent.com> wrote:
>> > > > > > >
>> > > > > > > > Can parallel read not be achieved by partitioning ?
>> > > > > > > >
>> > > > > > > > Ram
>> > > > > > > >
>> > > > > > > > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale
>> > > > > > > > <priyanka@datatorrent.com> wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > >
>> > > > > > > > > It is a common use case to read big files on HDFS in
>> > > > > > > > > parallel, i.e. many reader threads are used to read the
>> > > > > > > > > file in parallel. We can achieve this on top of Apex using
>> > > > > > > > > the following Malhar operators:
>> > > > > > > > >
>> > > > > > > > > 1. AbstractFileSplitter
>> > > > > > > > > 2. AbstractBlockReader
>> > > > > > > > >
>> > > > > > > > > where FileSplitter, as per file metadata, creates small
>> > > > > > > > > reader tasks (to read the file in parts). Those reader
>> > > > > > > > > tasks are run by BlockReaders in parallel to read the file.
>> > > > > > > > >
>> > > > > > > > > As these operators are generally used together to achieve
>> > > > > > > > > the file read operation, I propose we create a module
>> > > > > > > > > called HDFSFileReader for this.
>> > > > > > > > >
>> > > > > > > > > Please provide your suggestions on the same.
>> > > > > > > > >
>> > > > > > > > > -Priyanka
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
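
The splitter/reader division described in the quoted thread above can be
illustrated with a small sketch. This is not Malhar code: it is a
dependency-free approximation that assumes a local file instead of HDFS and
a thread pool instead of operator partitions. The names ParallelReadSketch,
BlockTask, split, readBlock and readInParallel are hypothetical and chosen
only for this illustration. The point it tries to show is the one made in
the thread: once the splitter has emitted (offset, length) tasks, each
reader needs nothing more than its task to read its part of the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names are Apex or Malhar classes.
final class ParallelReadSketch {

  /** One "reader task": which file to read, where to start, how many bytes. */
  static final class BlockTask {
    final String path;
    final long offset;
    final int length;

    BlockTask(String path, long offset, int length) {
      this.path = path;
      this.offset = offset;
      this.length = length;
    }
  }

  /** Splitter role: turn file metadata (path, length) into block-sized tasks. */
  static List<BlockTask> split(String path, long fileLength, int blockSize) {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    List<BlockTask> tasks = new ArrayList<>();
    for (long offset = 0; offset < fileLength; offset += blockSize) {
      int length = (int) Math.min(blockSize, fileLength - offset);
      tasks.add(new BlockTask(path, offset, length));
    }
    return tasks;
  }

  /** Reader role: read exactly the byte range named by one task. */
  static byte[] readBlock(BlockTask task) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(task.path, "r")) {
      byte[] buffer = new byte[task.length];
      file.seek(task.offset);
      file.readFully(buffer);
      return buffer;
    }
  }

  /** Runs the tasks on a pool of readers; blocks are returned in task order. */
  static List<byte[]> readInParallel(List<BlockTask> tasks, int readers) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(readers);
    try {
      List<Future<byte[]>> pending = new ArrayList<>();
      for (BlockTask task : tasks) {
        pending.add(pool.submit(() -> readBlock(task)));
      }
      List<byte[]> blocks = new ArrayList<>();
      for (Future<byte[]> result : pending) {
        blocks.add(result.get());
      }
      return blocks;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // Demo on a 10-byte temporary file with a 4-byte block size.
    Path tmp = Files.createTempFile("parallel-read-sketch", ".txt");
    try {
      Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
      List<BlockTask> tasks = split(tmp.toString(), Files.size(tmp), 4);
      for (byte[] block : readInParallel(tasks, 3)) {
        System.out.println(new String(block, StandardCharsets.US_ASCII));
      }
    } finally {
      Files.delete(tmp);
    }
  }
}
```

In this sketch the file path and length play the part of the file metadata,
the BlockTask list plays the part of the block metadata, and the byte arrays
play the part of the block bytes, loosely mirroring the three output ports
proposed in the thread. Without the split step, a single reader would have
to consume the whole file, which is the limitation the thread describes.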

Re: HDFS File Reader Module

Posted by Thomas Weise <th...@datatorrent.com>.

For new code we should use org.apache.apex.

I prefer not to use "module" in the package name but to keep modules together
with related operators (modules and operators are not different from the
user's perspective).

On Wed, Mar 2, 2016 at 9:59 PM, Chinmay Kolhatkar <ch...@apache.org>
wrote:

> +1 for separate namespace for modules.

Re: HDFS File Reader Module

Posted by Chinmay Kolhatkar <ch...@apache.org>.

+1 for separate namespace for modules.

On Thu, Mar 3, 2016 at 10:58 AM, Priyanka Gugale <pr...@datatorrent.com>
wrote:

> That is also an option, but then I have a question: do we want to treat
> modules separately, or is a module just a type of operator, maybe a super
> operator?
> Also, I believe it would be better to have feature-wise packages than to use
> our custom terms to create packages, so anyone can easily locate the
> classes.
>
>
> -Priyanka

Re: HDFS File Reader Module

Posted by Priyanka Gugale <pr...@datatorrent.com>.

That is also an option, but then I have a question: do we want to treat
modules separately, or is a module just a type of operator, maybe a super
operator?
Also, I believe it would be better to have feature-wise packages than to use
our custom terms to create packages, so anyone can easily locate the classes.


-Priyanka

On Thu, Mar 3, 2016 at 12:20 AM, Sandesh Hegde <sa...@datatorrent.com>
wrote:

> My vote is to have a separate namespace for modules.
>
> Is it time to introduce
> org.apache.apex.module.io.fs ?

Re: HDFS File Reader Module

Posted by Sandesh Hegde <sa...@datatorrent.com>.

My vote is to have a separate namespace for modules.

Is it time to introduce
org.apache.apex.module.io.fs ?

On Wed, Mar 2, 2016 at 3:25 AM Priyanka Gugale <pr...@datatorrent.com>
wrote:

> I am planning to put this module in the malhar-library project, in
> package com.datatorrent.lib.io.fs.
> Let me know if this is acceptable.
>
> -Priyanka