Posted to user@flume.apache.org by "Pritchard, Charles X. -ND" <Ch...@disney.com> on 2014/04/09 03:35:58 UTC

flume and hadoop append

Exploring the idea of using “append” instead of creating new files with HDFS every few minutes.
Are there particular design decisions / considerations?

There’s certainly a history of append with HDFS; notably, earlier versions of Hadoop warned strongly against using file append semantics.


-Charles

Re: flume and hadoop append

Posted by Brock Noland <br...@cloudera.com>.
On Wed, Apr 9, 2014 at 12:54 PM, Pritchard, Charles X. -ND <
Charles.X.Pritchard.-ND@disney.com> wrote:

>
> On Apr 9, 2014, at 8:06 AM, Brock Noland <br...@cloudera.com> wrote:
>
> Hi Charles,
>
> > Exploring the idea of using "append" instead of creating new files with
> > HDFS every few minutes.
> ...
> it's possible the client would write a partial line without a newline.
> Then the client on restart would append to that existing line. The
> subsequent line would be correctly formatted.
>
>
> Is this an issue with Hadoop architecture or an issue with the way flume
> calls/does not call some kind of fsync/sync interface?
> Hadoop has append but there's no merge; it would be wonderful to just write
> data then atomically call "merge this". Never a corrupt file!
>
> Having a partially appended record would have that unfortunate consequence
> of causing fastidious MR jobs to throw errors on occasion.
>

"Atomic Record Append" is a feature gap between GFS and HDFS. AFAIK there
is nothing in HDFS that precludes implementing the feature. As with most
items in the storage layer, it's a sizable amount of implementation work.
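
For readers who haven't seen the GFS feature: a hypothetical sketch of the
contract atomic record append would provide. No such API exists in HDFS, and
every name here is invented purely for illustration.

    // Hypothetical only: HDFS has no such interface today. This sketches
    // the contract GFS-style atomic record append provides.
    public interface RecordAppendable {
        /**
         * Atomically appends one whole record: concurrent writers never
         * interleave bytes, and the filesystem (not the caller) picks the
         * offset, which is returned. GFS guarantees at-least-once append,
         * so readers must tolerate duplicate records.
         */
        long appendRecord(byte[] record) throws java.io.IOException;
    }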


>
>
> On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon <cshannon108@gmail.com
> > wrote:
>
>> Not sure what you are trying to do, but the HDFS sink appends. It's just
>> that you have to determine what your roll-over strategy will be. Instead of
>> rolling every few minutes, you can set hdfs.rollInterval=0 (which disables
>> the time-based roll) and set hdfs.rollSize to however large you want your
>> files to grow before rolling over to a new file. You can also use
>> hdfs.rollCount to roll over after a certain number of records. I use
>> rollSize for my roll-over strategy.
>>
>
> Sounds like a good strategy. Do you also access those HDFS files while
> they're still being written to -- that is -- do you hit the edge case that
> Brock brought up?
>
>
> -Charles
>

Re: flume and hadoop append

Posted by "Pritchard, Charles X. -ND" <Ch...@disney.com>.
On Apr 9, 2014, at 8:06 AM, Brock Noland <br...@cloudera.com> wrote:

Hi Charles,

> Exploring the idea of using “append” instead of creating new files with
> HDFS every few minutes.
...
it's possible the client would write a partial line without a newline. Then the client on restart would append to that existing line. The subsequent line would be correctly formatted.

Is this an issue with Hadoop architecture or an issue with the way flume calls/does not call some kind of fsync/sync interface?
Hadoop has append but there’s no merge; it would be wonderful to just write data then atomically call “merge this”. Never a corrupt file!

Having a partially appended record would have that unfortunate consequence of causing fastidious MR jobs to throw errors on occasion.
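
For reference, the sync interface HDFS actually exposes is the hflush()/hsync()
pair on FSDataOutputStream (Hadoop 2). A minimal sketch of an append that
flushes a complete line; the path and record text are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendFlushSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Illustrative path: re-open an existing file for append (HDFS 2).
            FSDataOutputStream out = fs.append(new Path("/flume/events/app.log"));
            out.writeBytes("one complete record\n");
            // hflush(): makes the data visible to new readers.
            // hsync(): additionally forces it to disk on the datanodes.
            // Neither is atomic per record: a crash between writes can still
            // leave a partial line at the end of the file.
            out.hflush();
            out.close();
        }
    }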


On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon <cs...@gmail.com> wrote:
Not sure what you are trying to do, but the HDFS sink appends. It's just that you have to determine what your roll-over strategy will be. Instead of rolling every few minutes, you can set hdfs.rollInterval=0 (which disables the time-based roll) and set hdfs.rollSize to however large you want your files to grow before rolling over to a new file. You can also use hdfs.rollCount to roll over after a certain number of records. I use rollSize for my roll-over strategy.

Sounds like a good strategy. Do you also access those HDFS files while they’re still being written to — that is — do you hit the edge case that Brock brought up?


-Charles

Re: flume and hadoop append

Posted by Brock Noland <br...@cloudera.com>.
Hi Charles,

> Exploring the idea of using "append" instead of creating new files with
> HDFS every few minutes.

I wonder if this is doable by setting rollCount to 0 and then using
rollInterval (or alternatively rollSize)?

> There's certainly a history of append with HDFS; notably, earlier
> versions of Hadoop warned strongly against using file append semantics.

Correct, HDFS 1 append did not work and would result in corrupt data. Many
users have been using append in HDFS 2 for some time. The only
consideration with append is that in certain scenarios a small portion of
the file can be corrupted. For example, when writing to a text file, it's
possible the client would write a partial line without a newline. Then the
client on restart would append to that existing line. The subsequent line
would be correctly formatted.
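
To make that concrete, here is a hedged sketch of a reader that skips the rare
malformed line instead of throwing; the path and the tab-delimited record
format are illustrative assumptions, not anything Flume or HDFS prescribes:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TolerantLineReader {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    fs.open(new Path("/flume/events/app.log"))))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // A crash mid-append can leave a truncated line, with the
                    // rest of that record glued onto the next line. Skip
                    // anything that does not parse rather than failing the job.
                    if (line.split("\t").length < 2) {
                        continue;
                    }
                    System.out.println(line); // stand-in for real processing
                }
            }
        }
    }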

Cheers!
Brock


On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon
<cs...@gmail.com> wrote:

> Not sure what you are trying to do, but the HDFS sink appends. It's just
> that you have to determine what your roll-over strategy will be. Instead of
> rolling every few minutes, you can set hdfs.rollInterval=0 (which disables
> the time-based roll) and set hdfs.rollSize to however large you want your
> files to grow before rolling over to a new file. You can also use
> hdfs.rollCount to roll over after a certain number of records. I use
> rollSize for my roll-over strategy.
>
>
> On Tue, Apr 8, 2014 at 8:35 PM, Pritchard, Charles X. -ND <
> Charles.X.Pritchard.-ND@disney.com> wrote:
>
>> Exploring the idea of using "append" instead of creating new files with
>> HDFS every few minutes.
>> Are there particular design decisions / considerations?
>>
>> There's certainly a history of append with HDFS; notably, earlier versions
>> of Hadoop warned strongly against using file append semantics.
>>
>>
>> -Charles
>
>
>

Re: flume and hadoop append

Posted by Christopher Shannon <cs...@gmail.com>.
Not sure what you are trying to do, but the HDFS sink appends. It's just
that you have to determine what your roll-over strategy will be. Instead of
rolling every few minutes, you can set hdfs.rollInterval=0 (which disables
the time-based roll) and set hdfs.rollSize to however large you want your
files to grow before rolling over to a new file. You can also use
hdfs.rollCount to roll over after a certain number of records. I use
rollSize for my roll-over strategy.
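
For concreteness, a size-only roll policy looks something like this in the
agent's properties file; the agent, channel, and sink names are placeholders,
and the hdfs.roll* lines are the point:

    # Illustrative agent/sink names; only the roll settings matter here.
    agent.sinks.k1.type = hdfs
    agent.sinks.k1.channel = c1
    agent.sinks.k1.hdfs.path = /flume/events
    # 0 disables the time-based roll
    agent.sinks.k1.hdfs.rollInterval = 0
    # 0 disables the event-count roll
    agent.sinks.k1.hdfs.rollCount = 0
    # roll once the open file reaches ~128 MB (value is in bytes)
    agent.sinks.k1.hdfs.rollSize = 134217728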


On Tue, Apr 8, 2014 at 8:35 PM, Pritchard, Charles X. -ND <
Charles.X.Pritchard.-ND@disney.com> wrote:

> Exploring the idea of using "append" instead of creating new files with
> HDFS every few minutes.
> Are there particular design decisions / considerations?
>
> There's certainly a history of append with HDFS; notably, earlier versions
> of Hadoop warned strongly against using file append semantics.
>
>
> -Charles