Posted to user@flume.apache.org by SaravanaKumar TR <sa...@gmail.com> on 2014/07/22 15:15:44 UTC

how spooling directory source identifies the complete file

Hi,

I am planning to use the spooling directory source to move log files to an
HDFS sink.

I would like to know how Flume determines whether a file being moved into
the spool directory is complete, or is partial with its move still in
progress.

For example, if a file is large and we have started moving it into the
spool directory, how does Flume tell whether the complete file has been
transferred or the transfer is still in progress?

Please help me out here.

Thanks,
saravana

Re: how spooling directory source identifies the complete file

Posted by SaravanaKumar TR <sa...@gmail.com>.
Thanks a lot.

This answer sounds perfect for my question. Let me have a try with mv
instead of cp.



RE: how spooling directory source identifies the complete file

Posted by "Needham, Guy" <Gu...@virginmedia.co.uk>.
Hi Saravana,

Flume will check the size and the time of the last edit to the file when it starts reading it and when it has finished reading. If the two sets of values differ between the start and end of the file reading process, Flume will fail noisily. This means that you must move a fully written file to the directory or it will not be ingested into your workflow. If you're running it on a Unix system, you shouldn't use a cp command to drop the file into the directory, as cp writes the file incrementally, whereas mv within the same filesystem renames the file in one go.
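
The check described above can be illustrated with a minimal Python sketch. This is not Flume's actual code; the function names snapshot and read_if_stable are hypothetical and only show the idea of comparing a file's size and modification time before and after reading it.

```python
import os


def snapshot(path):
    """Return (size in bytes, modification time in ns) for the file."""
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns


def read_if_stable(path):
    """Read the whole file, failing loudly if it changed during the read.

    Hypothetical helper: it mirrors the idea of comparing size and
    last-modified time at the start and end of reading, nothing more.
    """
    before = snapshot(path)
    with open(path, "rb") as f:
        data = f.read()
    after = snapshot(path)
    if before != after:
        raise RuntimeError(f"{path} was modified while being read")
    return data
```

A file that is still being written by cp keeps growing, so its size and modification time differ between the two snapshots and the read is rejected.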



Regards,
Guy Needham | Data Discovery
Virgin Media | Enterprise Data, Design & Management
Bartley Wood Business Park, Hook, Hampshire RG27 9UP
D 01256 75 3362
I welcome VSRE emails. Learn more at http://vsre.info/



--------------------------------------------------------------------
Save Paper - Do you really need to print this e-mail?

Visit www.virginmedia.com for more information, and more fun.

This email and any attachments are or may be confidential and legally privileged
and are sent solely for the attention of the addressee(s). If you have received this
email in error, please delete it from your system: its use, disclosure or copying is
unauthorised. Statements and opinions expressed in this email may not represent
those of Virgin Media. Any representations or commitments in this email are
subject to contract. 

Registered office: Media House, Bartley Wood Business Park, Hook, Hampshire, RG27 9UP
Registered in England and Wales with number 2591237

Re: how spooling directory source identifies the complete file

Posted by SaravanaKumar TR <sa...@gmail.com>.
Thanks Ashish, I have already referred to this information.

But I couldn't find any explanation in the Flume User Guide of how Flume
differentiates between a file whose copy is still in progress and a fully
copied file.



Re: how spooling directory source identifies the complete file

Posted by Ashish <pa...@gmail.com>.
This is specified in Flume's User Guide

"Unlike the Exec source, this source is reliable and will not miss data,
even if Flume is restarted or killed. In exchange for this reliability,
only immutable, uniquely-named files must be dropped into the spooling
directory. Flume tries to detect these problem conditions and will fail
loudly if they are violated:

   1. If a file is written to after being placed into the spooling
   directory, Flume will print an error to its log file and stop processing.
   2. If a file name is reused at a later time, Flume will print an error
   to its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier
(such as a timestamp) to log file names when they are moved into the
spooling directory."
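
One way to follow this advice is sketched below in Python, under stated assumptions: the function drop_into_spool and the staging directory are hypothetical, the staging directory is assumed to sit outside the directory Flume watches, and both directories are assumed to be on the same filesystem so that the final rename is a single atomic step.

```python
import os
import shutil
import time
import uuid


def drop_into_spool(src, spool_dir, staging_dir):
    """Copy src into staging_dir, then rename it into spool_dir.

    Assumption: staging_dir and spool_dir are on the same filesystem,
    so os.replace() is an atomic rename rather than a copy.
    """
    base = os.path.basename(src)
    # Timestamp plus a short random suffix keeps file names unique.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    unique = f"{base}.{stamp}.{uuid.uuid4().hex[:8]}"
    staged = os.path.join(staging_dir, unique)
    shutil.copy2(src, staged)   # the slow, incremental write happens here
    final = os.path.join(spool_dir, unique)
    os.replace(staged, final)   # the file appears in spool_dir complete
    return final
```

Because the slow copy happens in the staging directory, the spooling directory only ever sees fully written, uniquely named files.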




-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: how spooling directory source identifies the complete file

Posted by SaravanaKumar TR <sa...@gmail.com>.
Hi Jeff,

Thanks for your comments. But what I am really looking for is this: suppose
we are copying a 1 GB file into the spool directory and the copy is still in
progress. How does Flume recognize that the complete file has been copied
into the spool directory and is ready for processing?

How does Flume make sure it doesn't start processing a partially copied
file?



Re: how spooling directory source identifies the complete file

Posted by Jeff Lord <jl...@cloudera.com>.
I believe the way this works is that Flume creates a meta directory to
track which file is being read.
In the event of a restart of the agent, the entire file will be re-read,
which will create some duplicate events.

https://github.com/apache/flume/blob/flume-1.5/flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java#L474

