Posted to dev@apex.apache.org by Yogi Devendra <de...@gmail.com> on 2016/04/28 12:59:43 UTC

Reading large HDFS files record by record

Hi,

My use case involves reading from HDFS and emitting each record as a separate
tuple. A record can be either fixed-length or separator-based (such as
newline-delimited). The expected output is a byte[] for each record.

I am planning to solve this as follows:
- A new operator which extends BlockReader.
- It will have a configuration option to select the mode: FIXED_LENGTH or
SEPARATOR_BASED.
- It will use the appropriate ReaderContext based on the mode.

The reason for having a separate operator rather than BlockReader itself is
that the output port signature differs from BlockReader's. This new operator
can be used in conjunction with FileSplitter.
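
Roughly, I have something like this in mind (class, enum, and port names
are illustrative, not final; FixedBytesReaderContext and
ReadAheadLineReaderContext already exist under ReaderContext in the block
reader library):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.lib.io.block.BlockMetadata;
    import com.datatorrent.lib.io.block.BlockReader;
    import com.datatorrent.lib.io.block.ReaderContext;

    public class FSRecordReader extends BlockReader
    {
      public enum Mode
      {
        FIXED_LENGTH, SEPARATOR_BASED
      }

      private Mode mode = Mode.SEPARATOR_BASED;
      private int recordLength; // used only in FIXED_LENGTH mode

      // Output port signature differs from BlockReader:
      // one byte[] per record instead of ReaderRecord messages.
      public final transient DefaultOutputPort<byte[]> records =
          new DefaultOutputPort<byte[]>();

      @Override
      public void setup(OperatorContext context)
      {
        super.setup(context);
        // Pick the ReaderContext matching the configured mode.
        if (mode == Mode.FIXED_LENGTH) {
          ReaderContext.FixedBytesReaderContext<FSDataInputStream> fixedContext =
              new ReaderContext.FixedBytesReaderContext<FSDataInputStream>();
          fixedContext.setLength(recordLength);
          readerContext = fixedContext;
        } else {
          // Read-ahead handles records that cross block boundaries.
          readerContext =
              new ReaderContext.ReadAheadLineReaderContext<FSDataInputStream>();
        }
      }

      @Override
      protected void readBlock(BlockMetadata blockMetadata) throws IOException
      {
        readerContext.initialize(stream, blockMetadata, consecutiveBlock);
        ReaderContext.Entity entity;
        // Emit every complete record in this block; a null record
        // (partial record at a block boundary) is skipped.
        while ((entity = readerContext.next()) != null) {
          byte[] record = entity.getRecord();
          if (record != null) {
            records.emit(record);
          }
        }
      }
    }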

Any feedback?

~ Yogi

Re: Reading large HDFS files record by record

Posted by Yogi Devendra <de...@gmail.com>.
Created APEXMALHAR-2116 for this functionality. Please give your feedback
on the JIRA ticket.

~ Yogi

Re: Reading large HDFS files record by record

Posted by Yogi Devendra <de...@gmail.com>.
Yes. Reading a single file in parallel will be supported, similar to the
FileSplitter + BlockReader combination: FileSplitter emits per-block
metadata, and partitioned readers consume blocks of the same file
concurrently.
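
For example, the wiring would look just like the existing
FileSplitter + BlockReader pattern (FSRecordReader is the proposed
operator from my earlier mail, so its name and ports are illustrative):

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.lib.io.fs.FileSplitterInput;

    public class RecordReaderApp implements StreamingApplication
    {
      @Override
      public void populateDAG(DAG dag, Configuration conf)
      {
        // FileSplitterInput scans directories and emits one metadata
        // tuple per block of each discovered file.
        FileSplitterInput splitter =
            dag.addOperator("Splitter", new FileSplitterInput());

        // Proposed operator (sketched earlier). Partitions of the
        // reader pick up blocks of the same file concurrently, so a
        // single large file is read in parallel.
        FSRecordReader reader =
            dag.addOperator("RecordReader", new FSRecordReader());

        dag.addStream("blocks", splitter.blocksMetadataOutput,
            reader.blocksMetadataInput);
      }
    }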

~ Yogi

On 29 April 2016 at 15:58, Sandeep Deshmukh <sa...@datatorrent.com> wrote:

> +1
>
> Will this support reading a single file in parallel?

Re: Reading large HDFS files record by record

Posted by Sandeep Deshmukh <sa...@datatorrent.com>.
+1

Will this support reading a single file in parallel?

Re: Reading large HDFS files record by record

Posted by Mohit Jotwani <mo...@datatorrent.com>.
+1

Regards,
Mohit
