Posted to user@flume.apache.org by Mike Percy <mp...@apache.org> on 2013/02/01 10:56:29 UTC

Re: SpoolDir marks item as completed, when sink fails

Tzur, that is expected, because the data is committed by the source onto
the channel. Sources and sinks are decoupled; they interact only via the
channel, which buffers the data and serves to mitigate impedance mismatches.



On Thu, Jan 31, 2013 at 2:35 PM, Tzur Turkenitz <tz...@vision.bi> wrote:

> Hello all,
>
> I am running HDP 1.2 and Flume 1.3. I have a Flume setup that includes
> (1) a Load Balancer that uses the SpoolDir source and sends events to Avro
> sinks, and
> (2) Agents that consume the data using an Avro source and write to HDFS.
>
> During testing I noticed that there is a dissonance between the Load
> Balancer and the Consumers:
> when the Load Balancer processes a file, it marks it as COMPLETED, even if
> the consumer has crashed while writing to HDFS.
>
> A preferred behavior would be for the Load Balancer to wait until the
> consumer commits its transaction and reports it as successful before the
> file is marked as COMPLETED. The current behavior does not allow me to
> verify which files have been loaded successfully if an agent has crashed
> and recovery is in progress.
>
> Have I misconfigured my Agents, or is this actually the desired behavior?
>
>
> Kind Regards,
> Tzur
>

Re: SpoolDir marks item as completed, when sink fails

Posted by Tzur Turkenitz <tz...@vision.bi>.
Thank you, Mike; you've been a great help.
I have conducted additional tests and verified that event data is not lost,
as you stated in your prior comment.

I appreciate it.

Kind Regards,
Tzur


On Tue, Feb 5, 2013 at 3:31 AM, Mike Percy <mp...@apache.org> wrote:

> Hmm, in case I didn't answer the whole question:
>
> Yes, the file channel is durable and the data will persist across restarts.
>
> Any data successfully written out by the sink will be removed from the
> channel. Since Flume is event-oriented, the remaining events in the channel
> will be drained when they are taken by the sink at the next opportunity.
>
> Regards
> Mike
>
>
> On Tuesday, February 5, 2013, Mike Percy wrote:
>
>> Tzur,
>> The source and sink are decoupled completely. The source will fill the
>> channel until there is no more work or the channel is full. So the data is
>> sitting buffered in the channel until the sink removes it.
>>
>> Hope that explains things. Let me know if anything is unclear.
>>
>> Regards,
>> Mike
>>
>> On Friday, February 1, 2013, Tzur Turkenitz wrote:
>>
>>> Mike, so when the data is committed to the channel, and the channel is
>>> of type "file", then when the agent is restarted the data will continue
>>> to flow to the sink?
>>> And if only 20% of the data passed to the sink before it crashed, will
>>> a "replay" be done to resend the whole data set?
>>>
>>> Just trying to grasp the basics....
>>>
>>>
>>>
>>>
>>> On Fri, Feb 1, 2013 at 4:56 AM, Mike Percy <mp...@apache.org> wrote:
>>>
>>>> Tzur, that is expected, because the data is committed by the source
>>>> onto the channel. Sources and sinks are decoupled; they interact only
>>>> via the channel, which buffers the data and serves to mitigate impedance
>>>> mismatches.
>>>>
>>>>
>>>>
>>>> On Thu, Jan 31, 2013 at 2:35 PM, Tzur Turkenitz <tz...@vision.bi> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I am running HDP 1.2 and Flume 1.3. I have a Flume setup that includes
>>>>> (1) a Load Balancer that uses the SpoolDir source and sends events to
>>>>> Avro sinks, and
>>>>> (2) Agents that consume the data using an Avro source and write to
>>>>> HDFS.
>>>>>
>>>>> During testing I noticed that there is a dissonance between the Load
>>>>> Balancer and the Consumers:
>>>>> when the Load Balancer processes a file, it marks it as COMPLETED, even
>>>>> if the consumer has crashed while writing to HDFS.
>>>>>
>>>>> A preferred behavior would be for the Load Balancer to wait until the
>>>>> consumer commits its transaction and reports it as successful before the
>>>>> file is marked as COMPLETED. The current behavior does not allow me to
>>>>> verify which files have been loaded successfully if an agent has crashed
>>>>> and recovery is in progress.
>>>>>
>>>>> Have I misconfigured my Agents, or is this actually the desired
>>>>> behavior?
>>>>>
>>>>>
>>>>> Kind Regards,
>>>>> Tzur
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Tzur Turkenitz
>>> Vision.BI
>>> http://www.vision.bi/
>>>
>>> "*Facts are stubborn things, but statistics are more pliable*"
>>> -Mark Twain
>>>
>>


-- 
Regards,
Tzur Turkenitz
Vision.BI
http://www.vision.bi/

"*Facts are stubborn things, but statistics are more pliable*"
-Mark Twain

Re: SpoolDir marks item as completed, when sink fails

Posted by Mike Percy <mp...@apache.org>.
Hmm, in case I didn't answer the whole question:

Yes, the file channel is durable and the data will persist across restarts.

Any data successfully written out by the sink will be removed from the
channel. Since Flume is event-oriented, the remaining events in the channel
will be drained when they are taken by the sink at the next opportunity.
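
As a rough illustration of the durability piece, the file channel keeps
committed events on disk between the source's put and the sink's take. The
agent name and paths below are made-up examples:

  agent1.channels = file-ch
  agent1.channels.file-ch.type = file
  # The channel persists its state here; on restart it replays these logs
  # to rebuild its queue, so committed-but-not-yet-taken events survive a
  # crash.
  agent1.channels.file-ch.checkpointDir = /var/flume/checkpoint
  agent1.channels.file-ch.dataDirs = /var/flume/data

The replay rebuilds the channel's state; events the sink had already taken
and committed before the crash are not put back, so only the undelivered
portion flows again.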

Regards
Mike

On Tuesday, February 5, 2013, Mike Percy wrote:

> Tzur,
> The source and sink are decoupled completely. The source will fill the
> channel until there is no more work or the channel is full. So the data is
> sitting buffered in the channel until the sink removes it.
>
> Hope that explains things. Let me know if anything is unclear.
>
> Regards,
> Mike
>
> On Friday, February 1, 2013, Tzur Turkenitz wrote:
>
>> Mike, so when the data is committed to the channel, and the channel is of
>> type "file", then when the agent is restarted the data will continue to
>> flow to the sink?
>> And if only 20% of the data passed to the sink before it crashed, will a
>> "replay" be done to resend the whole data set?
>>
>> Just trying to grasp the basics....
>>
>>
>>
>>
>> On Fri, Feb 1, 2013 at 4:56 AM, Mike Percy <mp...@apache.org> wrote:
>>
>>> Tzur, that is expected, because the data is committed by the source onto
>>> the channel. Sources and sinks are decoupled; they interact only via the
>>> channel, which buffers the data and serves to mitigate impedance mismatches.
>>>
>>>
>>>
>>> On Thu, Jan 31, 2013 at 2:35 PM, Tzur Turkenitz <tz...@vision.bi> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am running HDP 1.2 and Flume 1.3. I have a Flume setup that includes
>>>> (1) a Load Balancer that uses the SpoolDir source and sends events to
>>>> Avro sinks, and
>>>> (2) Agents that consume the data using an Avro source and write to
>>>> HDFS.
>>>>
>>>> During testing I noticed that there is a dissonance between the Load
>>>> Balancer and the Consumers:
>>>> when the Load Balancer processes a file, it marks it as COMPLETED, even
>>>> if the consumer has crashed while writing to HDFS.
>>>>
>>>> A preferred behavior would be for the Load Balancer to wait until the
>>>> consumer commits its transaction and reports it as successful before the
>>>> file is marked as COMPLETED. The current behavior does not allow me to
>>>> verify which files have been loaded successfully if an agent has crashed
>>>> and recovery is in progress.
>>>>
>>>> Have I misconfigured my Agents, or is this actually the desired
>>>> behavior?
>>>>
>>>>
>>>> Kind Regards,
>>>> Tzur
>>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Tzur Turkenitz
>> Vision.BI
>> http://www.vision.bi/
>>
>> "*Facts are stubborn things, but statistics are more pliable*"
>> -Mark Twain
>>
>

Re: SpoolDir marks item as completed, when sink fails

Posted by Mike Percy <mp...@apache.org>.
Tzur,
The source and sink are decoupled completely. The source will fill the
channel until there is no more work or the channel is full. So the data is
sitting buffered in the channel until the sink removes it.
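
To make the buffering concrete, these are the channel knobs involved; the
numbers below are arbitrary examples, not recommendations:

  # Total number of events the channel will hold; once it is full, the
  # source's puts fail with a ChannelException until the sink drains it.
  agent1.channels.file-ch.capacity = 100000
  # Maximum number of events in a single put or take transaction.
  agent1.channels.file-ch.transactionCapacity = 1000

The source and sink each run their own transactions against the channel at
their own pace; neither ever talks to the other directly.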

Hope that explains things. Let me know if anything is unclear.

Regards,
Mike

On Friday, February 1, 2013, Tzur Turkenitz wrote:

> Mike, so when the data is committed to the channel, and the channel is of
> type "file", then when the agent is restarted the data will continue to
> flow to the sink?
> And if only 20% of the data passed to the sink before it crashed, will a
> "replay" be done to resend the whole data set?
>
> Just trying to grasp the basics....
>
>
>
>
> On Fri, Feb 1, 2013 at 4:56 AM, Mike Percy <mpercy@apache.org> wrote:
>
>> Tzur, that is expected, because the data is committed by the source onto
>> the channel. Sources and sinks are decoupled; they interact only via the
>> channel, which buffers the data and serves to mitigate impedance mismatches.
>>
>>
>>
>> On Thu, Jan 31, 2013 at 2:35 PM, Tzur Turkenitz <tzurt@vision.bi> wrote:
>>
>>> Hello all,
>>>
>>> I am running HDP 1.2 and Flume 1.3. I have a Flume setup that includes
>>> (1) a Load Balancer that uses the SpoolDir source and sends events to Avro
>>> sinks, and
>>> (2) Agents that consume the data using an Avro source and write to HDFS.
>>>
>>> During testing I noticed that there is a dissonance between the Load
>>> Balancer and the Consumers:
>>> when the Load Balancer processes a file, it marks it as COMPLETED, even
>>> if the consumer has crashed while writing to HDFS.
>>>
>>> A preferred behavior would be for the Load Balancer to wait until the
>>> consumer commits its transaction and reports it as successful before the
>>> file is marked as COMPLETED. The current behavior does not allow me to
>>> verify which files have been loaded successfully if an agent has crashed
>>> and recovery is in progress.
>>>
>>> Have I misconfigured my Agents, or is this actually the desired
>>> behavior?
>>>
>>>
>>> Kind Regards,
>>> Tzur
>>>
>>
>>
>
>
> --
> Regards,
> Tzur Turkenitz
> Vision.BI
> http://www.vision.bi/
>
> "*Facts are stubborn things, but statistics are more pliable*"
> -Mark Twain
>

Re: SpoolDir marks item as completed, when sink fails

Posted by Tzur Turkenitz <tz...@vision.bi>.
Mike, so when the data is committed to the channel, and the channel is of
type "file", then when the agent is restarted the data will continue to
flow to the sink?
And if only 20% of the data passed to the sink before it crashed, will a
"replay" be done to resend the whole data set?

Just trying to grasp the basics....




On Fri, Feb 1, 2013 at 4:56 AM, Mike Percy <mp...@apache.org> wrote:

> Tzur, that is expected, because the data is committed by the source onto
> the channel. Sources and sinks are decoupled; they interact only via the
> channel, which buffers the data and serves to mitigate impedance mismatches.
>
>
>
> On Thu, Jan 31, 2013 at 2:35 PM, Tzur Turkenitz <tz...@vision.bi> wrote:
>
>> Hello all,
>>
>> I am running HDP 1.2 and Flume 1.3. I have a Flume setup that includes
>> (1) a Load Balancer that uses the SpoolDir source and sends events to Avro
>> sinks, and
>> (2) Agents that consume the data using an Avro source and write to HDFS.
>>
>> During testing I noticed that there is a dissonance between the Load
>> Balancer and the Consumers:
>> when the Load Balancer processes a file, it marks it as COMPLETED, even if
>> the consumer has crashed while writing to HDFS.
>>
>> A preferred behavior would be for the Load Balancer to wait until the
>> consumer commits its transaction and reports it as successful before the
>> file is marked as COMPLETED. The current behavior does not allow me to
>> verify which files have been loaded successfully if an agent has crashed
>> and recovery is in progress.
>>
>> Have I misconfigured my Agents, or is this actually the desired behavior?
>>
>>
>> Kind Regards,
>> Tzur
>>
>
>


-- 
Regards,
Tzur Turkenitz
Vision.BI
http://www.vision.bi/

"*Facts are stubborn things, but statistics are more pliable*"
-Mark Twain