Posted to user@flume.apache.org by Abhijeet Shipure <ab...@gmail.com> on 2013/10/10 07:33:50 UTC

How to read multiples files getting continuously updated

Hi,

I am looking for a Flume NG source that can be used for reading many files
that are being continuously updated.
I tried the Spool Dir source, but it does not work if a file being read
gets modified.

Here is the scenario:
100 files are generated at one time and are continuously updated for a
fixed interval, say 5 minutes. After 5 minutes, a new set of 100 files is
generated and written to for the next 5 minutes.

Which Flume source is most suitable, and how should it be used effectively
without any data loss?

Any help is greatly appreciated.


Thanks
Abhijeet Shipure

Re: How to read multiples files getting continuously updated

Posted by DSuiter RDX <ds...@rdx.com>.
Abhijeet,

The built-in Avro client might be good to look into. In my experience, it
grabs what is in a directory and converts it to Avro format to send to a
Flume Avro source, which can then be used to sink it to storage in Avro
format. You might need to set a cron or trigger event to make it run every
time your files roll.

Alternatively, if the application writing these files has the ability to
send the same output over TCP, you could look into the syslogTCP source -
you might need to override parts of it to force it to accept the
non-syslog-format data, but it will catch anything that comes in to a TCP
port assigned to it and Flume it.
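
As a rough illustration of the TCP route, here is a minimal Python sketch
that wraps each output line in an RFC 3164 style syslog header before
sending it, so the receiving side sees syslog-formatted data. The function
names, the default hostname and tag, and the host and port are all
hypothetical, and whether a given Flume version parses this exact framing
is an assumption to verify. A plain TCP send also gives the writer no
confirmation that an event reached the channel.

```python
import socket
import time


def syslog_line(message, hostname="localhost", tag="app",
                facility=1, severity=6, when=None):
    """Wrap message in an RFC 3164 style syslog line ending in a newline."""
    pri = facility * 8 + severity  # facility 1 (user), severity 6 (info) gives 14
    t = time.localtime(when)
    # RFC 3164 pads the day of month with a space rather than a zero.
    stamp = f"{time.strftime('%b', t)} {t.tm_mday:2d} {time.strftime('%H:%M:%S', t)}"
    return f"<{pri}>{stamp} {hostname} {tag}: {message}\n".encode("utf-8")


def send_lines(lines, host, port):
    """Send each line as one newline-terminated syslog message over a single TCP connection."""
    with socket.create_connection((host, port)) as sock:
        for line in lines:
            sock.sendall(syslog_line(line.rstrip("\n")))
```

Usage would look like send_lines(open("some.log"), "flume-host", 5140),
where the host name and port are placeholders for wherever the syslogTCP
source is listening.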

Hope that helps,

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Oct 10, 2013 at 2:48 AM, Steve Morin <st...@stevemorin.com> wrote:

> I think that would be the best option

Re: How to read multiples files getting continuously updated

Posted by Steve Morin <st...@stevemorin.com>.
I think that would be the best option


On Wed, Oct 9, 2013 at 11:27 PM, Abhijeet Shipure <ab...@gmail.com> wrote:

> Yes, new files are created at a fixed interval, but the write times are not
> fixed; files are written as and when requests come in.
> I was thinking of creating a utility to copy files to a new directory and
> use the Spool Dir source.
>
> Regards
> Abhijeet

RE: How to read multiples files getting continuously updated

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
That is exactly what I do for a similar scenario. In my case it's one big log file that gets written to all day on each server, so I developed a script that runs once a minute, grabs the lines added to the file since the last run, writes them to an incremental file, and drops that file in a directory for the spoolDir source to pick up.
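
A minimal Python sketch of that kind of incremental script follows. It is an illustration under stated assumptions, not the actual script described above: the function name, the JSON state file, and the ".tmp" suffix are hypothetical. It remembers a byte offset between runs, ships only complete lines, and renames the file into place so the spooling source never sees a half-written file. The source would need to skip ".tmp" files (for example through an ignore pattern), or the temporary file could live in another directory on the same filesystem.

```python
import json
import os
from pathlib import Path


def ship_new_lines(log_path, spool_dir, state_path):
    """Copy complete lines appended to log_path since the last run into a new spool file.

    Returns the path of the spool file created, or None if nothing new was found.
    """
    log_path, spool_dir, state_path = Path(log_path), Path(spool_dir), Path(state_path)

    # Load the byte offset recorded by the previous run (0 on the first run).
    offset = 0
    if state_path.exists():
        offset = json.loads(state_path.read_text())["offset"]

    # If the log shrank, assume it was rotated or truncated and start over.
    if log_path.stat().st_size < offset:
        offset = 0

    with log_path.open("rb") as f:
        f.seek(offset)
        data = f.read()

    # Ship only complete lines; a partial last line waits for the next run.
    end = data.rfind(b"\n") + 1
    if end == 0:
        return None

    # Write under a temporary suffix, then rename into place.
    spool_dir.mkdir(parents=True, exist_ok=True)
    name = f"{log_path.name}.{offset}-{offset + end}"
    tmp = spool_dir / (name + ".tmp")
    final = spool_dir / name
    tmp.write_bytes(data[:end])
    os.replace(tmp, final)

    # Record the new offset only after the spool file is in place.
    state_path.write_text(json.dumps({"offset": offset + end}))
    return final
```

Run from cron once a minute, this gives at-least-once behaviour: if the script dies between the rename and the state update, the same range is shipped again on the next run.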

Paul


From: Abhijeet Shipure [mailto:abhi.shipure@gmail.com]
Sent: Wednesday, October 09, 2013 11:28 PM
To: user@flume.apache.org
Subject: Re: How to read multiples files getting continuously updated

Yes, new files are created at a fixed interval, but the write times are not fixed; files are written as and when requests come in.
I was thinking of creating a utility to copy files to a new directory and use the Spool Dir source.

Regards
Abhijeet



Re: How to read multiples files getting continuously updated

Posted by Abhijeet Shipure <ab...@gmail.com>.
Yes, new files are created at a fixed interval, but the write times are not
fixed; files are written as and when requests come in.
I was thinking of creating a utility to copy files to a new directory and
use the Spool Dir source.
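
One possible shape for such a utility, as a minimal Python sketch. It
moves files rather than copying them, and it assumes that a file left
unmodified for quiet_seconds is finished, which has to hold for the
application writing the files. The function name and the 10 minute default
are hypothetical.

```python
import os
import time
from pathlib import Path


def move_finished_files(source_dir, spool_dir, quiet_seconds=600, now=None):
    """Move files not modified for quiet_seconds from source_dir into spool_dir.

    Returns the list of destination paths, in file name order.
    """
    source_dir, spool_dir = Path(source_dir), Path(spool_dir)
    spool_dir.mkdir(parents=True, exist_ok=True)
    now = time.time() if now is None else now

    moved = []
    for path in sorted(source_dir.iterdir()):
        if not path.is_file():
            continue
        # Skip files that may still be receiving writes.
        if now - path.stat().st_mtime < quiet_seconds:
            continue
        dest = spool_dir / path.name
        # The rename is atomic only if both directories share a filesystem.
        os.replace(path, dest)
        moved.append(dest)
    return moved
```

Because the file names carry a timestamp, names should not collide in the
spool directory. The quiet period would need to be longer than the 5
minute write window.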

Regards
Abhijeet



On Thu, Oct 10, 2013 at 11:41 AM, Steve Morin <st...@stevemorin.com> wrote:

> If the files are continually written to, I don't think there is a good
> option.  Can new files be written at every time interval?

Re: How to read multiples files getting continuously updated

Posted by Steve Morin <st...@stevemorin.com>.
If the files are continually written to, I don't think there is a good
option.  Can new files be written at every time interval?


On Wed, Oct 9, 2013 at 11:09 PM, Abhijeet Shipure <ab...@gmail.com> wrote:

> Hi Steve,
>
> Thanks for the quick reply. As you pointed out, Exec Source does not
> provide reliability, which is required in my case, so it is not suitable.
>
> So which other built-in source could be used to read from many files?
> One other requirement: the file names are also dynamically generated using
> a timestamp every 5 minutes.
>
>
> Regards
> Abhijeet

Re: How to read multiples files getting continuously updated

Posted by Abhijeet Shipure <ab...@gmail.com>.
Hi Steve,

Thanks for the quick reply. As you pointed out, Exec Source does not
provide reliability, which is required in my case, so it is not suitable.

So which other built-in source could be used to read from many files?
One other requirement: the file names are also dynamically generated using
a timestamp every 5 minutes.


Regards
Abhijeet


On Thu, Oct 10, 2013 at 11:22 AM, Steve Morin <st...@stevemorin.com> wrote:

> If you read the Flume manual, you'll see it doesn't support a tail source
>
> http://flume.apache.org/FlumeUserGuide.html#exec-source
>
> Warning
>
>
> The problem with ExecSource and other asynchronous sources is that the
> source can not guarantee that if there is a failure to put the event into
> the Channel the client knows about it. In such cases, the data will be
> lost. As a for instance, one of the most commonly requested features is the
> tail -F [file]-like use case where an application writes to a log file on
> disk and Flume tails the file, sending each line as an event. While this is
> possible, there’s an obvious problem; what happens if the channel fills up
> and Flume can’t send an event? Flume has no way of indicating to the
> application writing the log file that it needs to retain the log or that
> the event hasn’t been sent, for some reason. If this doesn’t make sense,
> you need only know this: Your application can never guarantee data has been
> received when using a unidirectional asynchronous interface such as
> ExecSource! As an extension of this warning - and to be completely clear -
> there is absolutely zero guarantee of event delivery when using this
> source. For stronger reliability guarantees, consider the Spooling
> Directory Source or direct integration with Flume via the SDK.

Re: How to read multiples files getting continuously updated

Posted by Steve Morin <st...@stevemorin.com>.
If you read the Flume manual, you'll see it doesn't support a tail source

http://flume.apache.org/FlumeUserGuide.html#exec-source

Warning


The problem with ExecSource and other asynchronous sources is that the
source can not guarantee that if there is a failure to put the event into
the Channel the client knows about it. In such cases, the data will be
lost. As a for instance, one of the most commonly requested features is the
tail -F [file]-like use case where an application writes to a log file on
disk and Flume tails the file, sending each line as an event. While this is
possible, there’s an obvious problem; what happens if the channel fills up
and Flume can’t send an event? Flume has no way of indicating to the
application writing the log file that it needs to retain the log or that
the event hasn’t been sent, for some reason. If this doesn’t make sense,
you need only know this: Your application can never guarantee data has been
received when using a unidirectional asynchronous interface such as
ExecSource! As an extension of this warning - and to be completely clear -
there is absolutely zero guarantee of event delivery when using this
source. For stronger reliability guarantees, consider the Spooling
Directory Source or direct integration with Flume via the SDK.



On Wed, Oct 9, 2013 at 10:33 PM, Abhijeet Shipure <ab...@gmail.com> wrote:

> Hi,
>
> I am looking for a Flume NG source that can be used for reading many files
> that are being continuously updated.
> I tried the Spool Dir source, but it does not work if a file being read
> gets modified.
>
> Here is the scenario:
> 100 files are generated at one time and are continuously updated for a
> fixed interval, say 5 minutes. After 5 minutes, a new set of 100 files is
> generated and written to for the next 5 minutes.
>
> Which Flume source is most suitable, and how should it be used effectively
> without any data loss?
>
> Any help is greatly appreciated.
>
>
> Thanks
> Abhijeet Shipure
>
>