Posted to dev@apex.apache.org by Devendra Tagare <de...@datatorrent.com> on 2016/03/23 19:11:43 UTC

Aligning FileSplitter and BlockReader with hadoop.mapreduce InputFormats

Hi All,

Initiating this thread to get the community's opinion on aligning the
FileSplitter with org.apache.hadoop.mapreduce.InputSplit and the
BlockReader with org.apache.hadoop.mapreduce.RecordReader, respectively.

Some more details and the rationale behind the approach:

InputFormat lets MR create InputSplits, i.e. individual chunks of bytes.
The ability to correctly create these splits is determined by the
InputFormat itself, e.g. the SequenceFile format or Avro.

Internally, these formats are organized as a sequence of blocks. Each block
can be compressed with a compression codec, and it does not matter whether
the codec itself is splittable.
When such a format is set as the input format, the MR framework creates
input splits based on the block boundaries given by the metadata packed
with the file.
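
To make this concrete, below is a minimal sketch of asking an InputFormat
for its splits the same way the MR framework does. It uses only the vanilla
Hadoop mapreduce API; the input path is a placeholder.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector
{
  public static void main(String[] args) throws Exception
  {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path("/data/input"));  // placeholder path

    // The InputFormat decides how files are carved into splits; a format
    // like Avro or SequenceFile would cut on its internal block boundaries.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit split : splits) {
      System.out.println(split + ", length=" + split.getLength());
    }
  }
}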

Each InputFormat has a specific block definition, e.g. for Avro the block
definition is as below:

Avro file data block consists of:

- A long indicating the count of objects in this block.
- A long indicating the size in bytes of the serialized objects in the
  current block, after any codec is applied.
- The serialized objects. If a codec is specified, this is compressed by
  that codec.
- The file's 16-byte sync marker.
Thus, each block's binary data can be efficiently extracted or skipped
without deserializing the contents. The combination of block size, object
counts, and sync markers enables detection of corrupt blocks and helps
ensure data integrity.
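
For illustration, Avro's file reader exposes these sync markers directly:
given an arbitrary byte offset, it can resynchronize on the next block
boundary, which is exactly what a per-block reader needs. A minimal sketch
(the file path and block offsets are placeholders):

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroBlockScan
{
  public static void main(String[] args) throws Exception
  {
    long blockStart = 0L;                // placeholder: start offset of the assigned block
    long blockEnd = 128L * 1024 * 1024;  // placeholder: end offset of the assigned block

    DataFileReader<GenericRecord> reader = new DataFileReader<>(
        new File("/data/input/events.avro"), new GenericDatumReader<GenericRecord>());

    reader.sync(blockStart);  // jump to the first sync marker at or after blockStart

    // Read until we cross the first sync marker past blockEnd; records
    // beyond that point belong to the next block/split.
    while (reader.hasNext() && !reader.pastSync(blockEnd)) {
      GenericRecord record = reader.next();
      System.out.println(record);
    }
    reader.close();
  }
}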

Each map task gets an entire block to read. A RecordReader is used to read
the individual records from the block and generate key/value pairs.
The records could be fixed length or use a schema, as in the case of
Parquet or Avro.
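
The consuming loop on the framework side looks roughly like this. It's a
sketch only; the reader is assumed to have been created by the InputFormat
and already initialized with its split, e.g. a LineRecordReader.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordReader;

public class RecordLoop
{
  // Drain all key/value pairs from an already-initialized RecordReader.
  static void drain(RecordReader<LongWritable, Text> reader) throws Exception
  {
    while (reader.nextKeyValue()) {
      LongWritable key = reader.getCurrentKey();  // e.g. the record's byte offset
      Text value = reader.getCurrentValue();      // the record itself
      System.out.println(key + " -> " + value);
    }
    reader.close();
  }
}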

We can extend the BlockReader to work with a RecordReader, using the sync
markers to correctly identify and parse the individual records.
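
A very rough sketch of the shape this could take is below. The class and
method names are illustrative only, not the actual Malhar API; the idea is
simply that the block reader maps its block metadata onto an InputSplit and
lets the format's own RecordReader resolve record boundaries via the sync
markers.

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative skeleton: a block reader that delegates record extraction
// to a mapreduce RecordReader instead of parsing bytes itself.
public abstract class RecordReaderBlockReader<K, V>
{
  // Stand-in for the block metadata emitted by the splitter.
  public static class BlockInfo
  {
    public String filePath;
    public long offset;
    public long length;
  }

  protected RecordReader<K, V> recordReader;  // created from the configured InputFormat

  // Map our block onto a mapreduce InputSplit (for Avro, the RecordReader
  // will seek to the first sync marker inside the split on initialize()).
  protected abstract InputSplit toInputSplit(BlockInfo block);

  // Emit a record downstream (in Apex terms, onto an output port).
  protected abstract void emit(K key, V value);

  public void readBlock(BlockInfo block, TaskAttemptContext context) throws Exception
  {
    recordReader.initialize(toInputSplit(block), context);
    while (recordReader.nextKeyValue()) {
      emit(recordReader.getCurrentKey(), recordReader.getCurrentValue());
    }
    recordReader.close();
  }
}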

Please send across your thoughts on the same.

Thanks,
Dev

Re: Aligning FileSplitter and BlockReader with hadoop.mapreduce InputFormats

Posted by Shubham Pathak <sh...@datatorrent.com>.
+1. This will certainly be a good addition.

Re: Aligning FileSplitter and BlockReader with hadoop.mapreduce InputFormats

Posted by Chandni Singh <ch...@datatorrent.com>.
+1 for the idea

Re: Aligning FileSplitter and BlockReader with hadoop.mapreduce InputFormats

Posted by Thomas Weise <th...@datatorrent.com>.
+1 for the idea in general and for extending the existing implementation.

In case this introduces a MapReduce dependency, we will also need to
consider a separate module.

Thomas

Re: Aligning FileSplitter and BlockReader with hadoop.mapreduce InputFormats

Posted by Devendra Tagare <de...@datatorrent.com>.
Hi,

We are thinking of extending the FileSplitter and BlockReader.
Changing the existing code could have side effects.

Thanks,
Dev

Re: Aligning FileSplitter and BlockReader with hadoop.mapreduce InputFormats

Posted by Tushar Gosavi <tu...@datatorrent.com>.
My suggestion is to extend from FileSplitter and BlockReader without
changing them, and add support for InputFormat in derived classes.
FileSplitter and BlockReader already provide enough hooks to define splits
and read records.
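
Something along these lines, perhaps. The sketch below shows only the
InputFormat-delegation part, using the plain Hadoop API; the Block class is
an illustrative stand-in for the block metadata our splitter emits, not the
actual Malhar classes.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.ReflectionUtils;

public class InputFormatSplitHelper
{
  // Illustrative stand-in for the splitter's block metadata.
  public static class Block
  {
    public final String path;
    public final long offset;
    public final long length;

    public Block(String path, long offset, long length)
    {
      this.path = path;
      this.offset = offset;
      this.length = length;
    }
  }

  // A derived splitter could call something like this from its
  // split-definition hook: delegate to the configured InputFormat and map
  // the splits back to blocks.
  public static List<Block> splitsFor(String inputDir,
      Class<? extends InputFormat> formatClass) throws Exception
  {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    FileInputFormat.addInputPath(job, new Path(inputDir));

    InputFormat<?, ?> format = ReflectionUtils.newInstance(formatClass, conf);
    List<Block> blocks = new ArrayList<>();
    for (InputSplit split : format.getSplits(job)) {
      FileSplit fileSplit = (FileSplit)split;  // file-based formats return FileSplits
      blocks.add(new Block(fileSplit.getPath().toString(),
          fileSplit.getStart(), fileSplit.getLength()));
    }
    return blocks;
  }
}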

- Tushar.

Re: Aligning FileSplitter and BlockReader with hadoop.mapreduce InputFormats

Posted by Yogi Devendra <yo...@apache.org>.
Aligning FileSplitter and BlockReader with their respective counterparts
from mapreduce will be an excellent value addition.

IMO, it has 2 advantages:

1. It will allow us to plug in more formats for FileSplitter+BlockReader
pattern use-cases.
2. It will be easy for end-users coming from a mapreduce background if they
get something equivalent in Apex.

One question:
Are you planning to refactor the existing FileSplitter and BlockReader, or
is the plan to have this implementation as fresh classes?
If these are fresh classes, are we saying that they will eventually
deprecate the existing FileSplitter and BlockReader?

We have a few other components dependent on the existing FileSplitter and
BlockReader. Hence, we would like to know about the future direction for
these classes.

~ Yogi

Re: Aligning FileSplitter and BlockReader with hadoop.mapreduce InputFormats

Posted by Priyanka Gugale <pr...@datatorrent.com>.
So as I understand it, the splitter would be format aware. In that case,
would we still need the different kinds of parsers we have right now, or
will the format-aware splitter take care of parsing the different file
formats, e.g. CSV etc.?

-Priyanka
