Posted to dev@apex.apache.org by Yogi Devendra <yo...@apache.org> on 2015/12/11 09:37:10 UTC

AbstractFileOutputOperator maxLength roll over handling

Hi,

I am using AbstractFileOutputOperator in my application for writing
incoming tuples into a file on HDFS.

Since failover scenarios are possible, I am using
fileOutputOperator.setMaxLength() to roll over the files after a specified
length. The assumption is that rolled-over files allow faster recovery from a
failure, since recovery is needed only for the last part of the file and not
for the entire file.

There is no specific recommended value for maxLength from the use case, so I
would prefer the rolled-over file sizes to be equal to the HDFS block size
(say 64 MB).
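
For context, here is roughly how I am wiring this up. This is a simplified
sketch for this mail only: the subclass name, file name, and output path are
made up, and the method names are from memory, so treat them as approximate.

import com.datatorrent.lib.io.fs.AbstractFileOutputOperator;

// Hypothetical concrete subclass, shown only to illustrate the configuration.
public class StringFileOutputOperator extends AbstractFileOutputOperator<String>
{
  @Override
  protected String getFileName(String tuple)
  {
    return "tuples.txt";               // all tuples go to one logical (rolling) file
  }

  @Override
  protected byte[] getBytesForTuple(String tuple)
  {
    return (tuple + "\n").getBytes();  // one line per tuple
  }
}

// In populateDAG() of the application:
StringFileOutputOperator out =
    dag.addOperator("fileOut", new StringFileOutputOperator());
out.setFilePath("/user/app/output");   // made-up output directory
out.setMaxLength(64L * 1024 * 1024);   // 64 MB, assumed equal to dfs.blocksize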

With the current implementation of AbstractFileOutputOperator, the actual
size of a rolled-over file would be slightly greater than 64 MB. This is
because the file is rolled over only after the incoming tuple has been
written to it; the file-size check (for roll over) happens after the write.

I believe that files slightly greater than 64 MB would result in two block
entries on the NameNode. This can be avoided if we flip the sequence: first
check the file size (including the size of the incoming tuple), and roll over
to a new file *before* writing the incoming tuple. A simplified sketch of
both orderings follows.
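
This is not the actual Malhar code; the helper names writeBytes() and
rotate() are invented for illustration.

import java.io.IOException;

// Simplified model of the roll-over logic, for discussion only.
abstract class RollOverSketch
{
  long maxLength;       // as configured via setMaxLength()
  long currentLength;   // bytes already written to the current part file

  abstract void writeBytes(byte[] bytes) throws IOException;  // append to the current part
  abstract void rotate() throws IOException;                  // close current part, open the next

  // Current ordering (write first, then check): the rolled part can exceed maxLength.
  void processTupleCurrent(byte[] tupleBytes) throws IOException
  {
    writeBytes(tupleBytes);
    currentLength += tupleBytes.length;
    if (currentLength >= maxLength) {
      rotate();
      currentLength = 0;
    }
  }

  // Proposed ordering (check first, then write): the rolled part stays within maxLength.
  void processTupleProposed(byte[] tupleBytes) throws IOException
  {
    if (currentLength > 0 && currentLength + tupleBytes.length > maxLength) {
      rotate();
      currentLength = 0;
    }
    writeBytes(tupleBytes);
    currentLength += tupleBytes.length;
  }
}

The currentLength > 0 guard only avoids rotating an empty file when a single
tuple is itself larger than maxLength; such a tuple still has to go somewhere,
so that part file would exceed the limit regardless.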

Do you think this improvement should be considered? If yes, I will create a
JIRA and work on it.

Also, does this code change break backward compatibility? The signature of
the API remains the same, but there is a slight change in semantics, so I
wanted to get feedback from the community.

~ Yogi

Re: AbstractFileOutputOperator maxLength roll over handling

Posted by Sandeep Deshmukh <sa...@datatorrent.com>.
A file size just above the block size will create two HDFS blocks and hence a
slight performance hit. The second block is likely to be very small, a few
bytes, which is not advisable on HDFS.
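
For example, with a 64 MB block size (a rough illustration, not from the
actual code):

long blockSize = 64L * 1024 * 1024;
long fileSize  = blockSize + 512;                        // part rolled just past 64 MB
long numBlocks = (fileSize + blockSize - 1) / blockSize; // = 2; second block holds only 512 bytes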

I would vote for flipping the check, but taking Ram's point into account.

Re: AbstractFileOutputOperator maxLength roll over handling

Posted by Munagala Ramanath <ra...@datatorrent.com>.
Guess we don't need to worry about the case when the tuple size itself is
larger than the HDFS block size :-)

Ram


Re: AbstractFileOutputOperator maxLength roll over handling

Posted by Pramod Immaneni <pr...@datatorrent.com>.
Yes, we should flip the check.
