Posted to user@flume.apache.org by Sadananda Hegde <sa...@gmail.com> on 2012/10/16 06:37:07 UTC

picking up new files in Flume NG

Hello,

I have a scenario wherein the client application is continuously pushing
XML messages. The application writes these messages to files (new files,
same directory), so we will keep getting new files throughout the day. I am
trying to configure Flume agents on these application servers (4 of them)
to pick up the new data and transfer it to HDFS on a Hadoop cluster. How
should I configure my source to pick up new files (and exclude the files
that have already been processed)? I don't think the Exec source with
tail -F will work in this scenario, because data is not being added to
existing files; rather, new files get created.

Thank you very much for your time and support.

Sadu

Re: picking up new files in Flume NG

Posted by Brock Noland <br...@cloudera.com>.
We are planning a flume-1.3.0 release very soon. I hope spooldir will be in
that release, but I cannot make guarantees on that since I am not working
on the JIRA.

Brock

On Fri, Oct 19, 2012 at 9:04 AM, Sadananda Hegde <sa...@gmail.com> wrote:

> Hey Roshan,
>
> What version would be certified? Any idea on the timing of 'spooldir'
> availability? Is there any workaround or alternate solution that can be
> used with Flume NG 1.2.0? We may have to initially build our solution in
> Flume NG 1.2.0 and then upgrade to use spooldir later.
>
> Thanks,
> Sadu


-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: picking up new files in Flume NG

Posted by Sadananda Hegde <sa...@gmail.com>.
Hey Roshan,

What version would be certified? Any idea on the timing of 'spooldir'
availability? Is there any workaround or alternate solution that can be
used with Flume NG 1.2.0? We may have to initially build our solution in
Flume NG 1.2.0 and then upgrade to use spooldir later.

Thanks,
Sadu

On Thu, Oct 18, 2012 at 1:13 PM, Roshan Naik <ro...@hortonworks.com> wrote:

> Sadu,
>     That combination won't be tested/certified even if it happens to work.
> -roshan

Re: picking up new files in Flume NG

Posted by Roshan Naik <ro...@hortonworks.com>.
Sadu,
    That combination won't be tested/certified even if it happens to work.
-roshan

On Wed, Oct 17, 2012 at 4:51 PM, Sadananda Hegde <sa...@gmail.com> wrote:

> That's awesome, Patrick! Thank you so much. That would tremendously help
> us.
>
> We are currently using Flume NG 1.2.0. Will we be able to use spooldir on
> that version, or do we have to upgrade to the latest version?
>
> Thanks,
> Sadu
>
>

Re: picking up new files in Flume NG

Posted by Sadananda Hegde <sa...@gmail.com>.
That's awesome, Patrick! Thank you so much. That would tremendously help
us.

We are currently using Flume NG 1.2.0. Will we be able to use spooldir on
that version, or do we have to upgrade to the latest version?

Thanks,
Sadu

On Tue, Oct 16, 2012 at 11:37 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Sadu, your use case is exactly what I'm writing this for. I'm
> hoping this patch will get committed within a few days; we're on a
> last rev of reviews.
>
> - Patrick

Re: picking up new files in Flume NG

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Sadu, your use case is exactly what I'm writing this for. I'm
hoping this patch will get committed within a few days; we're on a
last rev of reviews.

- Patrick

On Tue, Oct 16, 2012 at 10:47 AM, Brock Noland <br...@cloudera.com> wrote:
> Correct, it's only available in that patch. From the RB it looks like
> it's not too far off from being committed.
>
> Brock

Re: picking up new files in Flume NG

Posted by Brock Noland <br...@cloudera.com>.
Correct, it's only available in that patch. From the RB it looks like
it's not too far off from being committed.
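
If you want to experiment with it before it's committed, the usual route
is to apply the patch attached to the JIRA to a checkout of Flume trunk
and build it yourself; roughly like this (the patch level and file name
depend on how the attachment was generated):

    # from the root of a Flume trunk checkout
    patch -p0 < FLUME-1425.patch    # patch file attached to the JIRA
    mvn clean install -DskipTests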

Brock

On Tue, Oct 16, 2012 at 12:00 PM, Sadananda Hegde <sa...@gmail.com> wrote:
> Yes, it is very similar.
>
> The spool directory will keep getting new files. We need to scan through the
> directory, send the data in the existing files to HDFS, clean up the files
> (delete/move/rename, etc.), and scan for new files again. The spooldir
> source is not available yet, right?
>
> Thanks,
> Sadu



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: picking up new files in Flume NG

Posted by Sadananda Hegde <sa...@gmail.com>.
Yes, it is very similar.

The spool directory will keep getting new files. We need to scan through
the directory, send the data in the existing files to HDFS, clean up the
files (delete/move/rename, etc.), and scan for new files again. The
spooldir source is not available yet, right?
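
In the meantime, the loop itself is simple enough to script as a stopgap.
A rough sketch of what I mean (the directory names and the hadoop CLI call
are placeholders, and it assumes the application writes each file
atomically, e.g. writes to a temporary name and renames it into the
directory once complete):

    import os
    import subprocess
    import time

    SPOOL_DIR = "/var/app/xml-out"   # hypothetical local spool directory
    HDFS_DIR = "hdfs:///flume/xml"   # hypothetical HDFS target directory
    DONE_SUFFIX = ".COMPLETED"       # marks files that were already shipped

    while True:
        for name in sorted(os.listdir(SPOOL_DIR)):
            if name.endswith(DONE_SUFFIX):
                continue  # already processed on an earlier scan
            path = os.path.join(SPOOL_DIR, name)
            # ship the file to HDFS; the rename below only happens if the
            # put succeeded, so failures are retried on the next scan
            subprocess.check_call(["hadoop", "fs", "-put", path, HDFS_DIR])
            os.rename(path, path + DONE_SUFFIX)
        time.sleep(60)  # poll for new files once a minute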

Thanks,
Sadu

On Tue, Oct 16, 2012 at 10:11 AM, Brock Noland <br...@cloudera.com> wrote:

> Sounds like https://issues.apache.org/jira/browse/FLUME-1425 ?
>
> Brock

Re: picking up new files in Flume NG

Posted by Brock Noland <br...@cloudera.com>.
Sounds like https://issues.apache.org/jira/browse/FLUME-1425 ?
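
For reference, once it lands I'd expect wiring it up to look roughly like
this (property names are taken from the patch under review and could
change before release; the agent name and paths are placeholders):

    a1.sources = spool
    a1.channels = ch
    a1.sinks = sink

    # watch a local directory; finished files get renamed with a suffix
    # so they are never picked up twice
    a1.sources.spool.type = spooldir
    a1.sources.spool.spoolDir = /var/app/xml-out
    a1.sources.spool.fileSuffix = .COMPLETED
    a1.sources.spool.channels = ch

    a1.channels.ch.type = memory
    a1.channels.ch.capacity = 10000

    # deliver events to HDFS as plain (uncompressed) text
    a1.sinks.sink.type = hdfs
    a1.sinks.sink.hdfs.path = hdfs://namenode/flume/xml
    a1.sinks.sink.hdfs.fileType = DataStream
    a1.sinks.sink.channel = ch

One caveat: the source expects files to be complete and immutable once
they appear in the spool directory, so the producer should write to a
temporary name and rename the file in when it's done.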

Brock

On Mon, Oct 15, 2012 at 11:37 PM, Sadananda Hegde <sa...@gmail.com> wrote:
> Hello,
>
> I have a scenario wherein the client application is continuously pushing
> XML messages. The application writes these messages to files (new files,
> same directory), so we will keep getting new files throughout the day. I am
> trying to configure Flume agents on these application servers (4 of them)
> to pick up the new data and transfer it to HDFS on a Hadoop cluster. How
> should I configure my source to pick up new files (and exclude the files
> that have already been processed)? I don't think the Exec source with
> tail -F will work in this scenario, because data is not being added to
> existing files; rather, new files get created.
>
> Thank you very much for your time and support.
>
> Sadu



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/