Posted to user@flume.apache.org by Something Something <ma...@gmail.com> on 2014/04/17 02:14:18 UTC

Import files from a directory on remote machine

Hello,

Needless to say, I am a newbie to Flume, but I've got a basic flow working in
which I am importing a log file from my Linux box to HDFS.  I am using

a1.sources.r1.command = tail -F /var/log/xyz.log

which is working like a stream of messages.  This is good!
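
(For context, the line above is only the source command; a minimal agent
configuration around it might look roughly like the sketch below.  The names
a1/r1/c1/k1 echo the property shown, but the channel, the HDFS sink settings
and the target path are assumptions, not taken from the actual setup.)

# Sketch of a minimal exec-source-to-HDFS agent; paths and sizes are placeholders.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/xyz.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/xyz
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300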

Now what I want to do is copy log files from a directory on a remote
machine on a regular basis.  For example:

username@machinename:/var/log/logdir/<multiple files>

One way to do it is to simply 'scp' files from the remote directory into my
box on a regular basis, but what's the best way to do this in Flume?
Please let me know.

Thanks for the help.

RE: Import files from a directory on remote machine

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
I would recommend using a scheduled script to create diff files off the log files. I have one that runs against large log files that roll over on the UTC day. It runs once a minute, checkpoints the log, creates a diff, drops it in the spool directory, and then cleans up any completed files.
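
A rough sketch of that kind of script, run from cron, might look like the
following; every path, file name and retention choice here is an assumption
for illustration rather than the actual script. The idea is to remember how
many bytes have already been shipped, copy only the newly appended bytes to a
staging file, and then move that file atomically into the spooling directory.

#!/bin/sh
# Sketch: ship newly appended log bytes into a Flume spooling directory.
LOG=/var/log/app/xyz.log                 # the live log being appended to
CHECKPOINT=/var/lib/logship/xyz.offset   # byte offset already shipped
STAGE=/var/flume/stage                   # staging dir on the same filesystem as the spool dir
SPOOL=/var/flume/spool                   # directory the spooling directory source watches

SIZE=$(wc -c < "$LOG")
LAST=$(cat "$CHECKPOINT" 2>/dev/null || echo 0)

# If the log rolled over (the file shrank), start again from the beginning.
[ "$SIZE" -lt "$LAST" ] && LAST=0

if [ "$SIZE" -gt "$LAST" ]; then
    TMP=$(mktemp "$STAGE/xyz.XXXXXX")
    # Copy only the bytes appended since the last run.
    tail -c +"$((LAST + 1))" "$LOG" > "$TMP"
    echo "$SIZE" > "$CHECKPOINT"
    # Rename into the spool dir so Flume never sees a half-written file.
    mv "$TMP" "$SPOOL/xyz.$(date +%Y%m%d%H%M%S).log"
fi

# Remove files the source has finished with (it renames them to *.COMPLETED by default).
find "$SPOOL" -name '*.COMPLETED' -mmin +60 -delete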

I agree it would be nice if there were a source that implemented this type of functionality in Flume (checkpointing and picking new events off a file in use), but this works for now and is a pattern I’ve seen recommended on this list before.

Hope that helps,
Paul


From: Otis Gospodnetic [mailto:otis.gospodnetic@gmail.com]
Sent: Wednesday, April 23, 2014 9:14 AM
To: user@flume.apache.org
Subject: Re: Import files from a directory on remote machine

Hi,

Doesn't the Spooling Directory Source require one to drop in files only once they are no longer being written to?  In other words, this is OK if you need to process data from files periodically, but what if you want to get and process data in real time, right after it's added to a file?

Is there an alternative to ExecSource + tail -F for this use case?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Wed, Apr 23, 2014 at 11:14 AM, Jeff Lord <jl...@cloudera.com> wrote:
Hi Otis,

This is pretty clearly stated in the docs.
For production we would typically recommend the spooling directory source as an alternative.

http://flume.apache.org/FlumeUserGuide.html#exec-source

"Warning The problem with ExecSource and other asynchronous sources is that the source can not guarantee that if there is a failure to put the event into the Channel the client knows about it. In such cases, the data will be lost. As a for instance, one of the most commonly requested features is the tail -F [file]-like use case where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there’s an obvious problem; what happens if the channel fills up and Flume can’t send an event? Flume has no way of indicating to the application writing the log file that it needs to retain the log or that the event hasn’t been sent, for some reason. If this doesn’t make sense, you need only know this: Your application can never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource! As an extension of this warning - and to be completely clear - there is absolutely zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source or direct integration with Flume via the SDK."

-Jeff

On Wed, Apr 23, 2014 at 6:48 AM, Otis Gospodnetic <ot...@gmail.com> wrote:
Hi Jeff,

On Thu, Apr 17, 2014 at 1:11 PM, Jeff Lord <jl...@cloudera.com> wrote:
Using the exec source with a tail -f is not considered a production solution.
It mainly exists for testing purposes.

This statement surprised me.  Is that the general consensus among Flume developers or users or at Cloudera?

Is there an alternative recommended for production that provides equivalent functionality?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/





On Thu, Apr 17, 2014 at 7:03 AM, Laurance George <la...@gmail.com> wrote:
If you can NFS-mount that directory onto your local machine running Flume, it sounds like what you've listed out would work well.

On Thu, Apr 17, 2014 at 2:54 AM, Something Something <ma...@gmail.com> wrote:
If I am going to 'rsync' a file from the remote host & copy it to HDFS via Flume, then why use Flume?  I can rsync & then just do a 'hadoop fs -put', no?  I must be missing something.  I guess the only benefit of using Flume is that I can add Interceptors if I want to.  Current requirements don't need that.  We just want to copy the data as is.
Here's the real use case:  An application is writing to the xyz.log file.  Once this file gets over a certain size it gets rolled over to xyz1.log & so on, kinda like Log4j.  What we really want is for each line written to xyz.log to go to HDFS via Flume as soon as it is written.
Can I do something like this?
1)  Share the log directory under Linux.
2)  Use
test1.sources.mylog.type = exec
test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log
I believe this will work, but is this the right way?  Thanks for your help.
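
(For comparison, the non-Flume route mentioned above is only a couple of
commands; the host, staging directory and HDFS paths below are placeholders.)

# Sketch: pull the remote logs, then push them to HDFS without Flume.
rsync -az username@machinename:/var/log/logdir/ /data/staging/logdir/
hadoop fs -mkdir -p /ingest/logdir
# A real job would skip files that have already been uploaded.
hadoop fs -put /data/staging/logdir/*.log /ingest/logdir/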



On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <la...@gmail.com> wrote:
Agreed with Jeff.  Rsync + cron (if it needs to be regular) is probably your best bet to ingest files from a remote machine that you only have read access to.  But then again, you're sort of stepping outside the use case of Flume at some level here, as rsync is now basically a part of your Flume topology.  However, if you just need to back-fill old log data then this is perfect!  In fact, it's what I do myself.
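
A crontab sketch of that approach, with the schedule, host and paths as
assumptions:

# Every 5 minutes, mirror the remote log directory into a local staging area.
*/5 * * * * rsync -az username@machinename:/var/log/logdir/ /data/staging/logdir/
# A second job can then move files that are no longer growing into the Flume
# spooling directory, or push them to HDFS directly with 'hadoop fs -put'.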

On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com> wrote:
The spooling directory source runs as part of the agent.
The source also needs write access to the files as it renames them upon completion of ingest. Perhaps you could use rsync to copy the files somewhere that you have write access to?
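
(A sketch of what the source side of that might look like once rsync has
landed the files somewhere writable; the agent/source names and the directory
are assumptions.)

# fileSuffix is the default; the source renames each file with it once fully ingested.
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/staging/logdir
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.channels = c1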

On Wed, Apr 16, 2014 at 5:26 PM, Something Something <ma...@gmail.com> wrote:
Thanks, Jeff.  This is useful.  Can the spoolDir be on a different machine?  We may have to set up a different process to copy files into 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any recommendations about this?

On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com> wrote:
http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source

On Wed, Apr 16, 2014 at 5:14 PM, Something Something <ma...@gmail.com> wrote:
Hello,
Needless to say, I am a newbie to Flume, but I've got a basic flow working in which I am importing a log file from my Linux box to HDFS.  I am using
a1.sources.r1.command = tail -F /var/log/xyz.log
which is working like a stream of messages.  This is good!

Now what I want to do is copy log files from a directory on a remote machine on a regular basis.  For example:
username@machinename:/var/log/logdir/<multiple files>
One way to do it is to simply 'scp' files from the remote directory into my box on a regular basis, but what's the best way to do this in Flume?  Please let me know.

Thanks for the help.






--
Laurance George



--
Laurance George





Re: Import files from a directory on remote machine

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

Doesn't the Spooling Directory Source require one to drop in files only once
they are no longer being written to?  In other words, this is OK if you
need to process data from files periodically, but what if you want to get
and process data in real time, right after it's added to a file?

Is there an alternative to ExecSource + tail -F for this use case?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Apr 23, 2014 at 11:14 AM, Jeff Lord <jl...@cloudera.com> wrote:

> Hi Otis,
>
> This is pretty clearly stated in the docs.
> For production we would typically recommend the spooling directory source
> as an alternative.
>
> http://flume.apache.org/FlumeUserGuide.html#exec-source
>
> "Warning The problem with ExecSource and other asynchronous sources is
> that the source can not guarantee that if there is a failure to put the
> event into the Channel the client knows about it. In such cases, the data
> will be lost. As a for instance, one of the most commonly requested
> features is the tail -F [file]-like use case where an application writes to
> a log file on disk and Flume tails the file, sending each line as an event.
> While this is possible, there’s an obvious problem; what happens if the
> channel fills up and Flume can’t send an event? Flume has no way of
> indicating to the application writing the log file that it needs to retain
> the log or that the event hasn’t been sent, for some reason. If this
> doesn’t make sense, you need only know this: Your application can never
> guarantee data has been received when using a unidirectional asynchronous
> interface such as ExecSource! As an extension of this warning - and to be
> completely clear - there is absolutely zero guarantee of event delivery
> when using this source. For stronger reliability guarantees, consider the
> Spooling Directory Source or direct integration with Flume via the SDK."
>
> -Jeff
>
>
> On Wed, Apr 23, 2014 at 6:48 AM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> On Thu, Apr 17, 2014 at 1:11 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>
>>> Using the exec source with a tail -f is not considered a production
>>> solution.
>>> It mainly exists for testing purposes.
>>>
>>
>> This statement surprised me.  Is that the general consensus among Flume
>> developers or users or at Cloudera?
>>
>> Is there an alternative recommended for production that provides
>> equivalent functionality?
>>
>> Thanks,
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>>
>>
>>
>>>
>>>
>>> On Thu, Apr 17, 2014 at 7:03 AM, Laurance George <
>>> laurance.w.george@gmail.com> wrote:
>>>
>>>> If you can NFS mount that directory to your local machine with flume it
>>>> sounds like what you've listed out would work well.
>>>>
>>>>
>>>> On Thu, Apr 17, 2014 at 2:54 AM, Something Something <
>>>> mailinglists19@gmail.com> wrote:
>>>>
>>>>> If I am going to 'rsync' a file from remote host & copy it to hdfs via
>>>>> Flume, then why use Flume?  I can rsync & then just do a 'hadoop fs -put',
>>>>> no?  I must be missing something.  I guess, the only benefit of using Flume
>>>>> is that I can add Interceptors if I want to.  Current requirements don't
>>>>> need that.  We just want to copy data as is.
>>>>>
>>>>> Here's the real use case:   An application is writing to xyz.log
>>>>> file.  Once this file gets over certain size it gets rolled over to
>>>>> xyz1.log & so on.  Kinda like Log4j.  What we really want is as soon as a
>>>>> line gets written to xyz.log, it should go to HDFS via Flume.
>>>>>
>>>>> Can I do something like this?
>>>>>
>>>>> 1)  Share the log directory under Linux.
>>>>> 2)  Use
>>>>> test1.sources.mylog.type = exec
>>>>> test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log
>>>>>
>>>>> I believe this will work, but is this the right way?  Thanks for your
>>>>> help.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <
>>>>> laurance.w.george@gmail.com> wrote:
>>>>>
>>>>>> Agreed with Jeff.  Rsync + cron ( if it needs to be regular) is
>>>>>> probably your best bet to ingest files from a remote machine that you only
>>>>>> have read access to.  But then again you're sorta stepping outside of the
>>>>>> use case of flume at some level here as rsync is now basically a part of
>>>>>> your flume topology.  However, if you just need to back-fill old log data
>>>>>> then this is perfect!  In fact, it's what I do myself.
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com>wrote:
>>>>>>
>>>>>>> The spooling directory source runs as part of the agent.
>>>>>>> The source also needs write access to the files as it renames them
>>>>>>> upon completion of ingest. Perhaps you could use rsync to copy the files
>>>>>>> somewhere that you have write access to?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
>>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
>>>>>>>> machine?  We may have to setup a different process to copy files into
>>>>>>>> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
>>>>>>>> recommendations about this?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com>wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>>>>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Needless to say I am newbie to Flume, but I've got a basic flow
>>>>>>>>>> working in which I am importing a log file from my linux box to hdfs.  I am
>>>>>>>>>> using
>>>>>>>>>>
>>>>>>>>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>>>>>>>>
>>>>>>>>>> which is working like a stream of messages.  This is good!
>>>>>>>>>>
>>>>>>>>>> Now what I want to do is copy log files from a directory on a
>>>>>>>>>> remote machine on a regular basis.  For example:
>>>>>>>>>>
>>>>>>>>>> username@machinename:/var/log/logdir/<multiple files>
>>>>>>>>>>
>>>>>>>>>> One way to do it is to simply 'scp' files from the remote
>>>>>>>>>> directory into my box on a regular basis, but what's the best way to do
>>>>>>>>>> this in Flume?  Please let me know.
>>>>>>>>>>
>>>>>>>>>> Thanks for the help.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Laurance George
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Laurance George
>>>>
>>>
>>>
>>
>

Re: Import files from a directory on remote machine

Posted by Jeff Lord <jl...@cloudera.com>.
Hi Otis,

This is pretty clearly stated in the docs.
For production we would typically recommend the spooling directory source
as an alternative.

http://flume.apache.org/FlumeUserGuide.html#exec-source

"Warning The problem with ExecSource and other asynchronous sources is that
the source can not guarantee that if there is a failure to put the event
into the Channel the client knows about it. In such cases, the data will be
lost. As a for instance, one of the most commonly requested features is the
tail -F [file]-like use case where an application writes to a log file on
disk and Flume tails the file, sending each line as an event. While this is
possible, there's an obvious problem; what happens if the channel fills up
and Flume can't send an event? Flume has no way of indicating to the
application writing the log file that it needs to retain the log or that
the event hasn't been sent, for some reason. If this doesn't make sense,
you need only know this: Your application can never guarantee data has been
received when using a unidirectional asynchronous interface such as
ExecSource! As an extension of this warning - and to be completely clear -
there is absolutely zero guarantee of event delivery when using this
source. For stronger reliability guarantees, consider the Spooling
Directory Source or direct integration with Flume via the SDK."

-Jeff


On Wed, Apr 23, 2014 at 6:48 AM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi Jeff,
>
> On Thu, Apr 17, 2014 at 1:11 PM, Jeff Lord <jl...@cloudera.com> wrote:
>
>> Using the exec source with a tail -f is not considered a production
>> solution.
>> It mainly exists for testing purposes.
>>
>
> This statement surprised me.  Is that the general consensus among Flume
> developers or users or at Cloudera?
>
> Is there an alternative recommended for production that provides
> equivalent functionality?
>
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
>
>>
>>
>> On Thu, Apr 17, 2014 at 7:03 AM, Laurance George <
>> laurance.w.george@gmail.com> wrote:
>>
>>> If you can NFS mount that directory to your local machine with flume it
>>> sounds like what you've listed out would work well.
>>>
>>>
>>> On Thu, Apr 17, 2014 at 2:54 AM, Something Something <
>>> mailinglists19@gmail.com> wrote:
>>>
>>>> If I am going to 'rsync' a file from remote host & copy it to hdfs via
>>>> Flume, then why use Flume?  I can rsync & then just do a 'hadoop fs -put',
>>>> no?  I must be missing something.  I guess, the only benefit of using Flume
>>>> is that I can add Interceptors if I want to.  Current requirements don't
>>>> need that.  We just want to copy data as is.
>>>>
>>>> Here's the real use case:   An application is writing to xyz.log file.
>>>> Once this file gets over certain size it gets rolled over to xyz1.log & so
>>>> on.  Kinda like Log4j.  What we really want is as soon as a line gets
>>>> written to xyz.log, it should go to HDFS via Flume.
>>>>
>>>> Can I do something like this?
>>>>
>>>> 1)  Share the log directory under Linux.
>>>> 2)  Use
>>>> test1.sources.mylog.type = exec
>>>> test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log
>>>>
>>>> I believe this will work, but is this the right way?  Thanks for your
>>>> help.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <
>>>> laurance.w.george@gmail.com> wrote:
>>>>
>>>>> Agreed with Jeff.  Rsync + cron ( if it needs to be regular) is
>>>>> probably your best bet to ingest files from a remote machine that you only
>>>>> have read access to.  But then again you're sorta stepping outside of the
>>>>> use case of flume at some level here as rsync is now basically a part of
>>>>> your flume topology.  However, if you just need to back-fill old log data
>>>>> then this is perfect!  In fact, it's what I do myself.
>>>>>
>>>>>
>>>>> On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>>>>
>>>>>> The spooling directory source runs as part of the agent.
>>>>>> The source also needs write access to the files as it renames them
>>>>>> upon completion of ingest. Perhaps you could use rsync to copy the files
>>>>>> somewhere that you have write access to?
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
>>>>>>> machine?  We may have to setup a different process to copy files into
>>>>>>> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
>>>>>>> recommendations about this?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com>wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>>>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Needless to say I am newbie to Flume, but I've got a basic flow
>>>>>>>>> working in which I am importing a log file from my linux box to hdfs.  I am
>>>>>>>>> using
>>>>>>>>>
>>>>>>>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>>>>>>>
>>>>>>>>> which is working like a stream of messages.  This is good!
>>>>>>>>>
>>>>>>>>> Now what I want to do is copy log files from a directory on a
>>>>>>>>> remote machine on a regular basis.  For example:
>>>>>>>>>
>>>>>>>>> username@machinename:/var/log/logdir/<multiple files>
>>>>>>>>>
>>>>>>>>> One way to do it is to simply 'scp' files from the remote
>>>>>>>>> directory into my box on a regular basis, but what's the best way to do
>>>>>>>>> this in Flume?  Please let me know.
>>>>>>>>>
>>>>>>>>> Thanks for the help.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Laurance George
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Laurance George
>>>
>>
>>
>

Re: Import files from a directory on remote machine

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Jeff,

On Thu, Apr 17, 2014 at 1:11 PM, Jeff Lord <jl...@cloudera.com> wrote:

> Using the exec source with a tail -f is not considered a production
> solution.
> It mainly exists for testing purposes.
>

This statement surprised me.  Is that the general consensus among Flume
developers or users or at Cloudera?

Is there an alternative recommended for production that provides equivalent
functionality?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/





>
>
> On Thu, Apr 17, 2014 at 7:03 AM, Laurance George <
> laurance.w.george@gmail.com> wrote:
>
>> If you can NFS mount that directory to your local machine with flume it
>> sounds like what you've listed out would work well.
>>
>>
>> On Thu, Apr 17, 2014 at 2:54 AM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>>> If I am going to 'rsync' a file from remote host & copy it to hdfs via
>>> Flume, then why use Flume?  I can rsync & then just do a 'hadoop fs -put',
>>> no?  I must be missing something.  I guess, the only benefit of using Flume
>>> is that I can add Interceptors if I want to.  Current requirements don't
>>> need that.  We just want to copy data as is.
>>>
>>> Here's the real use case:   An application is writing to xyz.log file.
>>> Once this file gets over certain size it gets rolled over to xyz1.log & so
>>> on.  Kinda like Log4j.  What we really want is as soon as a line gets
>>> written to xyz.log, it should go to HDFS via Flume.
>>>
>>> Can I do something like this?
>>>
>>> 1)  Share the log directory under Linux.
>>> 2)  Use
>>> test1.sources.mylog.type = exec
>>> test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log
>>>
>>> I believe this will work, but is this the right way?  Thanks for your
>>> help.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <
>>> laurance.w.george@gmail.com> wrote:
>>>
>>>> Agreed with Jeff.  Rsync + cron ( if it needs to be regular) is
>>>> probably your best bet to ingest files from a remote machine that you only
>>>> have read access to.  But then again you're sorta stepping outside of the
>>>> use case of flume at some level here as rsync is now basically a part of
>>>> your flume topology.  However, if you just need to back-fill old log data
>>>> then this is perfect!  In fact, it's what I do myself.
>>>>
>>>>
>>>> On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>>>
>>>>> The spooling directory source runs as part of the agent.
>>>>> The source also needs write access to the files as it renames them
>>>>> upon completion of ingest. Perhaps you could use rsync to copy the files
>>>>> somewhere that you have write access to?
>>>>>
>>>>>
>>>>> On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
>>>>> mailinglists19@gmail.com> wrote:
>>>>>
>>>>>> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
>>>>>> machine?  We may have to setup a different process to copy files into
>>>>>> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
>>>>>> recommendations about this?
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com>wrote:
>>>>>>
>>>>>>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Needless to say I am newbie to Flume, but I've got a basic flow
>>>>>>>> working in which I am importing a log file from my linux box to hdfs.  I am
>>>>>>>> using
>>>>>>>>
>>>>>>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>>>>>>
>>>>>>>> which is working like a stream of messages.  This is good!
>>>>>>>>
>>>>>>>> Now what I want to do is copy log files from a directory on a
>>>>>>>> remote machine on a regular basis.  For example:
>>>>>>>>
>>>>>>>> username@machinename:/var/log/logdir/<multiple files>
>>>>>>>>
>>>>>>>> One way to do it is to simply 'scp' files from the remote directory
>>>>>>>> into my box on a regular basis, but what's the best way to do this in
>>>>>>>> Flume?  Please let me know.
>>>>>>>>
>>>>>>>> Thanks for the help.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Laurance George
>>>>
>>>
>>>
>>
>>
>> --
>> Laurance George
>>
>
>

Re: Import files from a directory on remote machine

Posted by Something Something <ma...@gmail.com>.
Hmm... yeah, I felt a bit uncomfortable with the 'tail -F' solution.  But
then, going back to your original suggestion:

'Perhaps you could use rsync to copy the files somewhere that you have
write access to?'  This will work for files that have been populated
completely and will no longer change, correct?  What about the file that is
currently being written to?  Is there some sort of 'file watching'
mechanism in Flume equivalent to 'tail -F'?


On Thu, Apr 17, 2014 at 10:11 AM, Jeff Lord <jl...@cloudera.com> wrote:

> Using the exec source with a tail -f is not considered a production
> solution.
> It mainly exists for testing purposes.
>
>
> On Thu, Apr 17, 2014 at 7:03 AM, Laurance George <
> laurance.w.george@gmail.com> wrote:
>
>> If you can NFS mount that directory to your local machine with flume it
>> sounds like what you've listed out would work well.
>>
>>
>> On Thu, Apr 17, 2014 at 2:54 AM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>>> If I am going to 'rsync' a file from remote host & copy it to hdfs via
>>> Flume, then why use Flume?  I can rsync & then just do a 'hadoop fs -put',
>>> no?  I must be missing something.  I guess, the only benefit of using Flume
>>> is that I can add Interceptors if I want to.  Current requirements don't
>>> need that.  We just want to copy data as is.
>>>
>>> Here's the real use case:   An application is writing to xyz.log file.
>>> Once this file gets over certain size it gets rolled over to xyz1.log & so
>>> on.  Kinda like Log4j.  What we really want is as soon as a line gets
>>> written to xyz.log, it should go to HDFS via Flume.
>>>
>>> Can I do something like this?
>>>
>>> 1)  Share the log directory under Linux.
>>> 2)  Use
>>> test1.sources.mylog.type = exec
>>> test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log
>>>
>>> I believe this will work, but is this the right way?  Thanks for your
>>> help.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <
>>> laurance.w.george@gmail.com> wrote:
>>>
>>>> Agreed with Jeff.  Rsync + cron ( if it needs to be regular) is
>>>> probably your best bet to ingest files from a remote machine that you only
>>>> have read access to.  But then again you're sorta stepping outside of the
>>>> use case of flume at some level here as rsync is now basically a part of
>>>> your flume topology.  However, if you just need to back-fill old log data
>>>> then this is perfect!  In fact, it's what I do myself.
>>>>
>>>>
>>>> On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>>>
>>>>> The spooling directory source runs as part of the agent.
>>>>> The source also needs write access to the files as it renames them
>>>>> upon completion of ingest. Perhaps you could use rsync to copy the files
>>>>> somewhere that you have write access to?
>>>>>
>>>>>
>>>>> On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
>>>>> mailinglists19@gmail.com> wrote:
>>>>>
>>>>>> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
>>>>>> machine?  We may have to setup a different process to copy files into
>>>>>> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
>>>>>> recommendations about this?
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com>wrote:
>>>>>>
>>>>>>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Needless to say I am newbie to Flume, but I've got a basic flow
>>>>>>>> working in which I am importing a log file from my linux box to hdfs.  I am
>>>>>>>> using
>>>>>>>>
>>>>>>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>>>>>>
>>>>>>>> which is working like a stream of messages.  This is good!
>>>>>>>>
>>>>>>>> Now what I want to do is copy log files from a directory on a
>>>>>>>> remote machine on a regular basis.  For example:
>>>>>>>>
>>>>>>>> username@machinename:/var/log/logdir/<multiple files>
>>>>>>>>
>>>>>>>> One way to do it is to simply 'scp' files from the remote directory
>>>>>>>> into my box on a regular basis, but what's the best way to do this in
>>>>>>>> Flume?  Please let me know.
>>>>>>>>
>>>>>>>> Thanks for the help.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Laurance George
>>>>
>>>
>>>
>>
>>
>> --
>> Laurance George
>>
>
>

Re: Import files from a directory on remote machine

Posted by Jeff Lord <jl...@cloudera.com>.
Using the exec source with a tail -f is not considered a production
solution.
It mainly exists for testing purposes.


On Thu, Apr 17, 2014 at 7:03 AM, Laurance George <
laurance.w.george@gmail.com> wrote:

> If you can NFS mount that directory to your local machine with flume it
> sounds like what you've listed out would work well.
>
>
> On Thu, Apr 17, 2014 at 2:54 AM, Something Something <
> mailinglists19@gmail.com> wrote:
>
>> If I am going to 'rsync' a file from remote host & copy it to hdfs via
>> Flume, then why use Flume?  I can rsync & then just do a 'hadoop fs -put',
>> no?  I must be missing something.  I guess, the only benefit of using Flume
>> is that I can add Interceptors if I want to.  Current requirements don't
>> need that.  We just want to copy data as is.
>>
>> Here's the real use case:   An application is writing to xyz.log file.
>> Once this file gets over certain size it gets rolled over to xyz1.log & so
>> on.  Kinda like Log4j.  What we really want is as soon as a line gets
>> written to xyz.log, it should go to HDFS via Flume.
>>
>> Can I do something like this?
>>
>> 1)  Share the log directory under Linux.
>> 2)  Use
>> test1.sources.mylog.type = exec
>> test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log
>>
>> I believe this will work, but is this the right way?  Thanks for your
>> help.
>>
>>
>>
>>
>>
>> On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <
>> laurance.w.george@gmail.com> wrote:
>>
>>> Agreed with Jeff.  Rsync + cron ( if it needs to be regular) is probably
>>> your best bet to ingest files from a remote machine that you only have read
>>> access to.  But then again you're sorta stepping outside of the use case of
>>> flume at some level here as rsync is now basically a part of your flume
>>> topology.  However, if you just need to back-fill old log data then this is
>>> perfect!  In fact, it's what I do myself.
>>>
>>>
>>> On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>>
>>>> The spooling directory source runs as part of the agent.
>>>> The source also needs write access to the files as it renames them upon
>>>> completion of ingest. Perhaps you could use rsync to copy the files
>>>> somewhere that you have write access to?
>>>>
>>>>
>>>> On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
>>>> mailinglists19@gmail.com> wrote:
>>>>
>>>>> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
>>>>> machine?  We may have to setup a different process to copy files into
>>>>> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
>>>>> recommendations about this?
>>>>>
>>>>>
>>>>> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>>>>
>>>>>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Needless to say I am newbie to Flume, but I've got a basic flow
>>>>>>> working in which I am importing a log file from my linux box to hdfs.  I am
>>>>>>> using
>>>>>>>
>>>>>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>>>>>
>>>>>>> which is working like a stream of messages.  This is good!
>>>>>>>
>>>>>>> Now what I want to do is copy log files from a directory on a remote
>>>>>>> machine on a regular basis.  For example:
>>>>>>>
>>>>>>> username@machinename:/var/log/logdir/<multiple files>
>>>>>>>
>>>>>>> One way to do it is to simply 'scp' files from the remote directory
>>>>>>> into my box on a regular basis, but what's the best way to do this in
>>>>>>> Flume?  Please let me know.
>>>>>>>
>>>>>>> Thanks for the help.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Laurance George
>>>
>>
>>
>
>
> --
> Laurance George
>

Re: Import files from a directory on remote machine

Posted by Laurance George <la...@gmail.com>.
If you can NFS-mount that directory onto your local machine running Flume, it
sounds like what you've listed out would work well.


On Thu, Apr 17, 2014 at 2:54 AM, Something Something <
mailinglists19@gmail.com> wrote:

> If I am going to 'rsync' a file from remote host & copy it to hdfs via
> Flume, then why use Flume?  I can rsync & then just do a 'hadoop fs -put',
> no?  I must be missing something.  I guess, the only benefit of using Flume
> is that I can add Interceptors if I want to.  Current requirements don't
> need that.  We just want to copy data as is.
>
> Here's the real use case:   An application is writing to xyz.log file.
> Once this file gets over certain size it gets rolled over to xyz1.log & so
> on.  Kinda like Log4j.  What we really want is as soon as a line gets
> written to xyz.log, it should go to HDFS via Flume.
>
> Can I do something like this?
>
> 1)  Share the log directory under Linux.
> 2)  Use
> test1.sources.mylog.type = exec
> test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log
>
> I believe this will work, but is this the right way?  Thanks for your help.
>
>
>
>
>
> On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <
> laurance.w.george@gmail.com> wrote:
>
>> Agreed with Jeff.  Rsync + cron ( if it needs to be regular) is probably
>> your best bet to ingest files from a remote machine that you only have read
>> access to.  But then again you're sorta stepping outside of the use case of
>> flume at some level here as rsync is now basically a part of your flume
>> topology.  However, if you just need to back-fill old log data then this is
>> perfect!  In fact, it's what I do myself.
>>
>>
>> On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>
>>> The spooling directory source runs as part of the agent.
>>> The source also needs write access to the files as it renames them upon
>>> completion of ingest. Perhaps you could use rsync to copy the files
>>> somewhere that you have write access to?
>>>
>>>
>>> On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
>>> mailinglists19@gmail.com> wrote:
>>>
>>>> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
>>>> machine?  We may have to setup a different process to copy files into
>>>> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
>>>> recommendations about this?
>>>>
>>>>
>>>> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>>>
>>>>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>>>>
>>>>>
>>>>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>>>>> mailinglists19@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Needless to say I am newbie to Flume, but I've got a basic flow
>>>>>> working in which I am importing a log file from my linux box to hdfs.  I am
>>>>>> using
>>>>>>
>>>>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>>>>
>>>>>> which is working like a stream of messages.  This is good!
>>>>>>
>>>>>> Now what I want to do is copy log files from a directory on a remote
>>>>>> machine on a regular basis.  For example:
>>>>>>
>>>>>> username@machinename:/var/log/logdir/<multiple files>
>>>>>>
>>>>>> One way to do it is to simply 'scp' files from the remote directory
>>>>>> into my box on a regular basis, but what's the best way to do this in
>>>>>> Flume?  Please let me know.
>>>>>>
>>>>>> Thanks for the help.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Laurance George
>>
>
>


-- 
Laurance George

Re: Import files from a directory on remote machine

Posted by Something Something <ma...@gmail.com>.
If I am going to 'rsync' a file from the remote host & copy it to HDFS via
Flume, then why use Flume?  I can rsync & then just do a 'hadoop fs -put',
no?  I must be missing something.  I guess the only benefit of using Flume
is that I can add Interceptors if I want to.  Current requirements don't
need that.  We just want to copy the data as is.

Here's the real use case:  An application is writing to the xyz.log file.
Once this file gets over a certain size it gets rolled over to xyz1.log & so
on, kinda like Log4j.  What we really want is for each line written to
xyz.log to go to HDFS via Flume as soon as it is written.

Can I do something like this?

1)  Share the log directory under Linux.
2)  Use
test1.sources.mylog.type = exec
test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log

I believe this will work, but is this the right way?  Thanks for your help.





On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <
laurance.w.george@gmail.com> wrote:

> Agreed with Jeff.  Rsync + cron ( if it needs to be regular) is probably
> your best bet to ingest files from a remote machine that you only have read
> access to.  But then again you're sorta stepping outside of the use case of
> flume at some level here as rsync is now basically a part of your flume
> topology.  However, if you just need to back-fill old log data then this is
> perfect!  In fact, it's what I do myself.
>
>
> On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com> wrote:
>
>> The spooling directory source runs as part of the agent.
>> The source also needs write access to the files as it renames them upon
>> completion of ingest. Perhaps you could use rsync to copy the files
>> somewhere that you have write access to?
>>
>>
>> On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>>> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
>>> machine?  We may have to setup a different process to copy files into
>>> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
>>> recommendations about this?
>>>
>>>
>>> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>>
>>>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>>>
>>>>
>>>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>>>> mailinglists19@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Needless to say I am newbie to Flume, but I've got a basic flow
>>>>> working in which I am importing a log file from my linux box to hdfs.  I am
>>>>> using
>>>>>
>>>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>>>
>>>>> which is working like a stream of messages.  This is good!
>>>>>
>>>>> Now what I want to do is copy log files from a directory on a remote
>>>>> machine on a regular basis.  For example:
>>>>>
>>>>> username@machinename:/var/log/logdir/<multiple files>
>>>>>
>>>>> One way to do it is to simply 'scp' files from the remote directory
>>>>> into my box on a regular basis, but what's the best way to do this in
>>>>> Flume?  Please let me know.
>>>>>
>>>>> Thanks for the help.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Laurance George
>

Re: Import files from a directory on remote machine

Posted by Laurance George <la...@gmail.com>.
Agreed with Jeff.  Rsync + cron (if it needs to be regular) is probably
your best bet to ingest files from a remote machine that you only have read
access to.  But then again, you're sort of stepping outside the use case of
Flume at some level here, as rsync is now basically a part of your Flume
topology.  However, if you just need to back-fill old log data then this is
perfect!  In fact, it's what I do myself.


On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jl...@cloudera.com> wrote:

> The spooling directory source runs as part of the agent.
> The source also needs write access to the files as it renames them upon
> completion of ingest. Perhaps you could use rsync to copy the files
> somewhere that you have write access to?
>
>
> On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
> mailinglists19@gmail.com> wrote:
>
>> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
>> machine?  We may have to setup a different process to copy files into
>> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
>> recommendations about this?
>>
>>
>> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>
>>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>>
>>>
>>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>>> mailinglists19@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Needless to say I am newbie to Flume, but I've got a basic flow working
>>>> in which I am importing a log file from my linux box to hdfs.  I am using
>>>>
>>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>>
>>>> which is working like a stream of messages.  This is good!
>>>>
>>>> Now what I want to do is copy log files from a directory on a remote
>>>> machine on a regular basis.  For example:
>>>>
>>>> username@machinename:/var/log/logdir/<multiple files>
>>>>
>>>> One way to do it is to simply 'scp' files from the remote directory
>>>> into my box on a regular basis, but what's the best way to do this in
>>>> Flume?  Please let me know.
>>>>
>>>> Thanks for the help.
>>>>
>>>>
>>>>
>>>
>>
>


-- 
Laurance George

Re: Import files from a directory on remote machine

Posted by Jeff Lord <jl...@cloudera.com>.
The spooling directory source runs as part of the agent.
The source also needs write access to the files as it renames them upon
completion of ingest. Perhaps you could use rsync to copy the files
somewhere that you have write access to?


On Wed, Apr 16, 2014 at 5:26 PM, Something Something <
mailinglists19@gmail.com> wrote:

> Thanks Jeff.  This is useful.  Can the spoolDir be on a different
> machine?  We may have to setup a different process to copy files into
> 'spoolDir', right?  Note:  We have 'read only' access to these files.  Any
> recommendations about this?
>
>
> On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com> wrote:
>
>> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>>
>>
>> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Needless to say I am newbie to Flume, but I've got a basic flow working
>>> in which I am importing a log file from my linux box to hdfs.  I am using
>>>
>>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>>
>>> which is working like a stream of messages.  This is good!
>>>
>>> Now what I want to do is copy log files from a directory on a remote
>>> machine on a regular basis.  For example:
>>>
>>> username@machinename:/var/log/logdir/<multiple files>
>>>
>>> One way to do it is to simply 'scp' files from the remote directory into
>>> my box on a regular basis, but what's the best way to do this in Flume?
>>> Please let me know.
>>>
>>> Thanks for the help.
>>>
>>>
>>>
>>
>

Re: Import files from a directory on remote machine

Posted by Something Something <ma...@gmail.com>.
Thanks, Jeff.  This is useful.  Can the spoolDir be on a different machine?
We may have to set up a different process to copy files into 'spoolDir',
right?  Note:  We have 'read only' access to these files.  Any
recommendations about this?


On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jl...@cloudera.com> wrote:

> http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
>
>
> On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
> mailinglists19@gmail.com> wrote:
>
>> Hello,
>>
>> Needless to say I am newbie to Flume, but I've got a basic flow working
>> in which I am importing a log file from my linux box to hdfs.  I am using
>>
>> a1.sources.r1.command = tail -F /var/log/xyz.log
>>
>> which is working like a stream of messages.  This is good!
>>
>> Now what I want to do is copy log files from a directory on a remote
>> machine on a regular basis.  For example:
>>
>> username@machinename:/var/log/logdir/<multiple files>
>>
>> One way to do it is to simply 'scp' files from the remote directory into
>> my box on a regular basis, but what's the best way to do this in Flume?
>> Please let me know.
>>
>> Thanks for the help.
>>
>>
>>
>

Re: Import files from a directory on remote machine

Posted by Jeff Lord <jl...@cloudera.com>.
http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source


On Wed, Apr 16, 2014 at 5:14 PM, Something Something <
mailinglists19@gmail.com> wrote:

> Hello,
>
> Needless to say I am newbie to Flume, but I've got a basic flow working in
> which I am importing a log file from my linux box to hdfs.  I am using
>
> a1.sources.r1.command = tail -F /var/log/xyz.log
>
> which is working like a stream of messages.  This is good!
>
> Now what I want to do is copy log files from a directory on a remote
> machine on a regular basis.  For example:
>
> username@machinename:/var/log/logdir/<multiple files>
>
> One way to do it is to simply 'scp' files from the remote directory into
> my box on a regular basis, but what's the best way to do this in Flume?
> Please let me know.
>
> Thanks for the help.
>
>
>