Posted to user@flume.apache.org by バーチャル クリストファー <bi...@infoscience.co.jp> on 2012/08/02 04:07:08 UTC

Writing reliably to HDFS

Hi,

I'm trying to write events to HDFS using Flume 1.2.0 and I have a couple
of questions.

Firstly, about the reliability semantics of the HdfsEventSink.

My number one requirement is reliability, i.e. not losing any events.
Ideally, by the time the HdfsEventSink commits the transaction, all
events should be safely written to HDFS and visible to other clients, so
that no data is lost even if the agent dies after that point. But what
is actually happening in my tests is as follows:

1. The HDFS sink takes some events from the FileChannel and writes them
to a SequenceFile on HDFS
2. The sink commits the transaction, and the FileChannel updates its
checkpoint. As far as FileChannel is concerned, the events have been
safely written to the sink.
3. Kill the agent.

Result: I'm left with a weird tmp file on HDFS that looks empty but isn't.
The SequenceFile has not yet been closed and rolled over, so it is still
a ".tmp" file. The data is actually in the HDFS blocks, but because the
file was not closed, the NameNode thinks it has a length of 0 bytes. I'm
not sure how to recover from this.
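
For reference, the relevant part of my setup is just a FileChannel feeding
an HDFS sink that writes SequenceFiles. A minimal configuration along those
lines (the names and paths below are placeholders rather than my exact
settings, and the source definition is omitted) looks something like this:

  agent.channels = fc
  agent.sinks = hdfsSink

  agent.channels.fc.type = file
  agent.channels.fc.checkpointDir = /var/flume/checkpoint
  agent.channels.fc.dataDirs = /var/flume/data

  agent.sinks.hdfsSink.type = hdfs
  agent.sinks.hdfsSink.channel = fc
  agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
  agent.sinks.hdfsSink.hdfs.fileType = SequenceFile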

Is this the expected behaviour of the HDFS sink, or am I doing something
wrong? Do I need to explicitly enable HDFS append? (I am using HDFS
2.0.0-alpha)

I guess the problem is that data is not "safely" written until file
rollover occurs, but the timing of file rollover (by time, log count,
file size, etc.) is unrelated to the timing of transactions. Is there
any way to put these in sync with each other?

Second question: Could somebody please explain the reasoning behind the
default values of the HDFS sink configuration? If I use the defaults,
the sink generates zillions of tiny files (max 10 events per file),
which as I understand it is not a recommended way to use HDFS.

Is it OK to change these settings to generate much larger files (MB, GB
scale)? Or should I write a script that periodically combines these tiny
files into larger ones?

Thanks for any advice,

Chris Birchall.




Re: Writing reliably to HDFS

Posted by バーチャル クリストファー <bi...@infoscience.co.jp>.
Juhani,

Thanks for the advice.

Just to clarify, when I talk about the agent "dying", I mean crashing or
being killed unexpectedly. I'm worried about how the HDFS writing works
in these cases. When the agent is shut down cleanly, I can confirm that
all HDFS files are closed correctly and no .tmp files are left lying around.

In the case where the agent dies suddenly and zero-byte .tmp files are
left over, I still haven't found a way to get Hadoop to fix those files
for me.
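
One avenue I have not been able to verify yet is HDFS lease recovery: in
theory, asking the NameNode to recover the lease on the orphaned file should
eventually finalise it and make its real length visible. A rough, untested
sketch (the path is only an example, and I don't know how this behaves on
2.0.0-alpha):

  // Untested sketch: ask the NameNode to recover the lease on an orphaned
  // .tmp file left behind by a crashed agent, so that its real length
  // becomes visible once recovery completes.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class RecoverTmpFile {
      public static void main(String[] args) throws Exception {
          // e.g. /flume/events/FlumeData.1343871234567.tmp
          Path orphan = new Path(args[0]);
          FileSystem fs = FileSystem.get(new Configuration());
          if (fs instanceof DistributedFileSystem) {
              // Returns true if the file is already closed or recovery finished
              // immediately; otherwise recovery continues in the background.
              boolean closed = ((DistributedFileSystem) fs).recoverLease(orphan);
              System.out.println("recoverLease(" + orphan + ") -> " + closed);
          }
      }
  }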

Chris.




Re: Writing reliably to HDFS

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
Hi Chris,

Answers inline

On 08/02/2012 11:07 AM, バーチャル クリストファー wrote:
> Hi,
>
> I'm trying to write events to HDFS using Flume 1.2.0 and I have a couple
> of questions.
>
> Firstly, about the reliability semantics of the HdfsEventSink.
>
> My number one requirement is reliability, i.e. not losing any events.
> Ideally, by the time the HdfsEventSink commits the transaction, all
> events should be safely written to HDFS and visible to other clients, so
> that no data is lost even if the agent dies after that point. But what
> is actually happening in my tests is as follows:
>
> 1. The HDFS sink takes some events from the FileChannel and writes them
> to a SequenceFile on HDFS
> 2. The sink commits the transaction, and the FileChannel updates its
> checkpoint. As far as FileChannel is concerned, the events have been
> safely written to the sink.
> 3. Kill the agent.
>
> Result: I'm left with a weird tmp file on HDFS that looks empty but isn't.
> The SequenceFile has not yet been closed and rolled over, so it is still
> a ".tmp" file. The data is actually in the HDFS blocks, but because the
> file was not closed, the NameNode thinks it has a length of 0 bytes. I'm
> not sure how to recover from this.
>
> Is this the expected behaviour of the HDFS sink, or am I doing something
> wrong? Do I need to explicitly enable HDFS append? (I am using HDFS
> 2.0.0-alpha)
>
> I guess the problem is that data is not "safely" written until file
> rollover occurs, but the timing of file rollover (by time, log count,
> file size, etc.) is unrelated to the timing of transactions. Is there
> any way to put these in sync with each other?
Regarding reliability, I believe that while the file may not be closed,
you're not actually at risk of losing data. I suspect that adding some
code to the sink shutdown to close up any temp files would be a good
idea. To deal with unexpected failures, it might even be worth scanning
the dest path for any unclosed files on startup.

I'm not really too familiar with the workings of the hdfs sink, so maybe
someone else can add more detail. In our test setup we have yet to have
any data loss from it.
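
To illustrate the startup-scan idea, something along these lines could work.
This is not anything the sink does today, just a sketch; the ".tmp" suffix
and the simple rename are assumptions, and depending on the HDFS version the
lease may also need to be recovered before the length shows up correctly:

  // Sketch only: on startup, look for files left with the ".tmp" suffix by a
  // crashed agent and rename them to their final name so they get picked up.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class TmpFileScan {
      public static void main(String[] args) throws Exception {
          Path destDir = new Path(args[0]); // e.g. /flume/events
          FileSystem fs = FileSystem.get(new Configuration());
          for (FileStatus stat : fs.listStatus(destDir)) {
              String name = stat.getPath().getName();
              if (name.endsWith(".tmp")) {
                  Path finalName = new Path(destDir, name.substring(0, name.length() - 4));
                  System.out.println("Recovering " + stat.getPath() + " -> " + finalName);
                  fs.rename(stat.getPath(), finalName);
              }
          }
      }
  }
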
> Second question: Could somebody please explain the reasoning behind the
> default values of the HDFS sink configuration? If I use the defaults,
> the sink generates zillions of tiny files (max 10 events per file),
> which as I understand it is not a recommended way to use HDFS.
>
> Is it OK to change these settings to generate much larger files (MB, GB
> scale)? Or should I write a script that periodically combines these tiny
> files into larger ones?
>
> Thanks for any advice,
>
> Chris Birchall.
>
There's no harm in changing those defaults and I'd strongly recommend
doing so. We have most of the rolls switched off (set to 0) and we just
roll hourly (because that's how we want to separate our logs). You may
also want to change hdfs.batchSize, which defaults to 1... which is
going to cause a bottleneck if you have even a moderate amount of
traffic. One thing to note is that with large batches, it's possible for
events to be duplicated (if the batch got partially written and then hit
an error, it will get rolled back at the channel and then rewritten).
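
For a concrete example, the relevant sink settings for that kind of setup
would look roughly like this (the agent and sink names are placeholders,
and 1000 is just an illustrative batch size, not a recommendation):

  # Roll once an hour; disable size- and count-based rolls.
  agent.sinks.hdfsSink.hdfs.rollInterval = 3600
  agent.sinks.hdfsSink.hdfs.rollSize = 0
  agent.sinks.hdfsSink.hdfs.rollCount = 0
  # Flush to HDFS every 1000 events instead of after every single event.
  agent.sinks.hdfsSink.hdfs.batchSize = 1000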