Posted to user@flume.apache.org by David Sinclair <ds...@chariotsolutions.com> on 2013/10/04 20:59:34 UTC

File Sink/Source

Hi,

I have a question regarding the RollingFileSink and
SpoolingDirectorySource. I was trying to write everything from an AMQP
source to a file sink, then have the spooling directory source pick up
these files. This won't work as the files aren't immutable.

If I use a File Channel to store the events between my source and sink, is
there a concern about the number of events in the channel if the sink is
unable to deliver said events? For example, I will be getting around 5K
messages/sec at about 2 KB each, so roughly 10 MB a second. If the sink is
unable to deliver the messages for 2 hours, that would be 36 million events
in the channel.
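
The arithmetic above can be checked with a quick sketch. It assumes "2K" means 2,000 bytes per event and ignores any per-event overhead the channel adds on disk; the function name `backlog` is only illustrative.

```python
def backlog(events_per_sec, event_bytes, outage_secs):
    """Return (event_count, total_bytes) accumulated while the sink is down."""
    count = events_per_sec * outage_secs
    return count, count * event_bytes

# 5K messages/sec, about 2K each (assumed to mean 2,000 bytes), 2-hour outage
count, total = backlog(5_000, 2_000, 2 * 60 * 60)
print(f"{count:,} events, {total / 1e9:.0f} GB")  # 36,000,000 events, 72 GB
```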

Is the file channel designed to handle this? Or should I have a file sink
in between?

thanks

dave

Re: File Sink/Source

Posted by David Sinclair <ds...@chariotsolutions.com>.
Thanks much Jeff. This is exactly what I needed to know. Much appreciated.

I have also been experimenting with multiple flows on the same agent, each
writing to a different disk, to improve throughput.
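
A sketch of what the file channel settings for such flows could look like, rendered from Python so the capacity follows from the numbers in this thread. `type`, `checkpointDir`, `dataDirs`, `capacity` and `transactionCapacity` are File Channel properties from the Flume user guide; the agent name, channel names, mount points and the even split of the backlog are placeholders, not taken from a real setup.

```python
def file_channel_props(agent, channel, disk, capacity, txn_capacity=10_000):
    """Render Flume File Channel properties for one flow on one disk."""
    prefix = f"{agent}.channels.{channel}"
    return "\n".join([
        f"{prefix}.type = file",
        f"{prefix}.checkpointDir = {disk}/flume/{channel}/checkpoint",
        f"{prefix}.dataDirs = {disk}/flume/{channel}/data",
        f"{prefix}.capacity = {capacity}",  # counted in events, not bytes
        f"{prefix}.transactionCapacity = {txn_capacity}",
    ])

# Two flows on one agent, each buffering half of a 36-million-event backlog
# on its own disk. Agent name, channel names and mount points are hypothetical.
for channel, disk in [("ch1", "/data1"), ("ch2", "/data2")]:
    print(file_channel_props("agent1", channel, disk, capacity=18_000_000))
```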


On Mon, Oct 7, 2013 at 10:16 PM, Jeff Lord <jl...@cloudera.com> wrote:

> Yes, the file channel is designed to handle this and is what you should be
> using.
> You are also on the right track in sizing your file channel to account for
> the number of events that could accumulate if your terminal sink is unable
> to complete transactions. For the amount of data you would like to buffer
> (36 million events at about 2 KB each), you will need a file channel of
> around 72 GB.
> Some other things to consider are the size of your hard drives, the drain
> rate of a single sink on that channel once the terminal destination is up
> again, durability in the event of a drive failure, and so on. For these
> reasons you may decide to run a few agents on separate hosts to spread the
> load.
>
> Hope this is helpful.
>
> -Jeff

Re: File Sink/Source

Posted by Jeff Lord <jl...@cloudera.com>.
Yes, the file channel is designed to handle this and is what you should be
using.
You are also on the right track in sizing your file channel to account for
the number of events that could accumulate if your terminal sink is unable
to complete transactions. For the amount of data you would like to buffer
(36 million events at about 2 KB each), you will need a file channel of
around 72 GB.
Some other things to consider are the size of your hard drives, the drain
rate of a single sink on that channel once the terminal destination is up
again, durability in the event of a drive failure, and so on. For these
reasons you may decide to run a few agents on separate hosts to spread the
load.
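
To make the drain-rate point concrete, here is a rough sketch of how long a backlog takes to clear once the destination is back. The 8,000 events/sec sink rate is a made-up figure for illustration, not a measured Flume number; the sketch assumes the source keeps delivering 5,000 events/sec while the sink catches up.

```python
def drain_seconds(backlog_events, sink_rate, source_rate):
    """Seconds to empty the channel; None if the sink never catches up."""
    surplus = sink_rate - source_rate  # net events removed per second
    if surplus <= 0:
        return None
    return backlog_events / surplus

# 36 million buffered events; the sink rate of 8,000/sec is hypothetical
secs = drain_seconds(36_000_000, sink_rate=8_000, source_rate=5_000)
print(f"{secs / 3600:.1f} hours to drain")  # 3.3 hours to drain
```

A sink that only matches the incoming rate never clears the backlog, which is one reason to spread the load over several sinks or agents.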

Hope this is helpful.

-Jeff


On Mon, Oct 7, 2013 at 6:54 AM, David Sinclair <
dsinclair@chariotsolutions.com> wrote:

> I am using an AMQP Source, so I don't see how changing to a JMS source
> would make any difference.
>
> I am concerned about the volume of data and the file channel. Even if I
> switched to JMS, my question would be the same.

Re: File Sink/Source

Posted by David Sinclair <ds...@chariotsolutions.com>.
I am using an AMQP Source, so I don't see how changing to a JMS source would
make any difference.

I am concerned about the volume of data and the file channel. Even if I
switched to JMS, my question would be the same.


On Fri, Oct 4, 2013 at 4:46 PM, Hari Shreedharan
<hs...@cloudera.com> wrote:

>  Have you tried the JMS Source? It can pick up data directly into Flume.
>
>
> Thanks,
> Hari

Re: File Sink/Source

Posted by Hari Shreedharan <hs...@cloudera.com>.
Have you tried the JMS Source? It can pick up data directly into Flume. 


Thanks,
Hari


On Friday, October 4, 2013 at 11:59 AM, David Sinclair wrote:

> Hi,
> 
> I have a question regarding the RollingFileSink and SpoolingDirectorySource. I was trying to write everything from an AMQP source to a file sink, then have the spooling directory source pick up these files. This won't work as the files aren't immutable.  
> 
> If I use a File Channel to store the events between my source and sink, is there a concern about the number of events in the channel if the sink is unable to deliver said events? For example, I will be getting around 5K messages/sec at about 2 KB each, so roughly 10 MB a second. If the sink is unable to deliver the messages for 2 hours, that would be 36 million events in the channel.
>
> Is the file channel designed to handle this? Or should I have a file sink in between?
> 
> thanks
> 
> dave