Posted to mapreduce-user@hadoop.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/03/24 18:09:45 UTC
A way to monitor HDFS for a file to come live, and then kick off a job?
I am not sure if this is the right listserv, forgive me if it is not. My
goal is this: monitor HDFS until a file is created, and then kick off a job.
Ideally I'd want to do this continuously, but the file would be created
hourly (with some sort of variance). I guess I could make a script that
would ping the server every 5 minutes or something, but I was wondering if
there might be a more elegant way?
Thanks
Jon
Re: A way to monitor HDFS for a file to come live, and then kick off
a job?
Posted by Eric <er...@gmail.com>.
You can also use a FUSE mount and a cron job to check whether new files
have arrived. Make sure the script creates and checks a pid file, so that
it won't run again before the previous run has finished.
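
A rough sketch of what such a cron-driven check might look like in Python,
assuming HDFS is FUSE-mounted so the file can be tested with ordinary
filesystem calls. The function names and the "busy"/"absent"/"started"
return values are made up for illustration, and the pid check is POSIX-only:

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_pid_file(path):
    """Create the pid file; return False if a live run already holds it."""
    if os.path.exists(path):
        try:
            with open(path) as f:
                old_pid = int(f.read().strip())
        except (OSError, ValueError):
            old_pid = None  # unreadable or garbage content: treat as stale
        if old_pid is not None and pid_is_running(old_pid):
            return False
        os.remove(path)  # stale pid file left by a crashed run
    try:
        # O_EXCL makes creation fail if another run won the race.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def run_once(watch_path, pid_file, start_job):
    """One cron invocation: returns 'busy', 'absent' or 'started'."""
    if not acquire_pid_file(pid_file):
        return "busy"
    try:
        if not os.path.exists(watch_path):
            return "absent"
        start_job(watch_path)  # hypothetical callback that submits the job
        return "started"
    finally:
        os.remove(pid_file)
```

Cron would call run_once() every few minutes with the path of the expected
file on the FUSE mount. Note that this only prevents overlapping runs; it
does not by itself remember which files have already been processed.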
2011/3/25 Allen Wittenauer <aw...@apache.org>
>
> On Mar 24, 2011, at 10:09 AM, Jonathan Coveney wrote:
>
> > I am not sure if this is the right listserv, forgive me if it is not.
>
> A better choice would likely be hdfs-user@, since this is really
> about watching files in HDFS.
>
>
> > My goal is this: monitor HDFS until a file is created, and then kick
> > off a job. Ideally I'd want to do this continuously, but the file
> > would be created hourly (with some sort of variance). I guess I could
> > make a script that would ping the server every 5 minutes or something,
> > but I was wondering if there might be a more elegant way?
>
> Two ways off the top of my head:
>
> 1) Read/watch the edits stream
>
> 2) Read/watch the HDFS audit log
>
> Given the latter is text built by log4j, that should be relatively
> simple to implement.
>
> There was a JIRA asking for this functionality to be built in recently, btw.
Re: A way to monitor HDFS for a file to come live, and then kick off
a job?
Posted by Lance Norskog <go...@gmail.com>.
Hamake does exactly this:
http://code.google.com/p/hamake/
On Fri, Mar 25, 2011 at 9:46 AM, Allen Wittenauer <aw...@apache.org> wrote:
>
> On Mar 24, 2011, at 10:09 AM, Jonathan Coveney wrote:
>
>> I am not sure if this is the right listserv, forgive me if it is not.
>
> A better choice would likely be hdfs-user@, since this is really about watching files in HDFS.
>
>
>> My
>> goal is this: monitor HDFS until a file is created, and then kick off a job.
>> Ideally I'd want to do this continuously, but the file would be created
>> hourly (with some sort of variance). I guess I could make a script that
>> would ping the server every 5 minutes or something, but I was wondering if
>> there might be a more elegant way?
>
> Two ways off the top of my head:
>
> 1) Read/watch the edits stream
>
> 2) Read/watch the HDFS audit log
>
> Given the latter is text built by log4j, that should be relatively simple to implement.
>
> There was a JIRA asking for this functionality to be built in recently, btw.
--
Lance Norskog
goksron@gmail.com
Re: A way to monitor HDFS for a file to come live, and then kick off a job?
Posted by Allen Wittenauer <aw...@apache.org>.
On Mar 24, 2011, at 10:09 AM, Jonathan Coveney wrote:
> I am not sure if this is the right listserv, forgive me if it is not.
A better choice would likely be hdfs-user@, since this is really about watching files in HDFS.
> My
> goal is this: monitor HDFS until a file is created, and then kick off a job.
> Ideally I'd want to do this continuously, but the file would be created
> hourly (with some sort of variance). I guess I could make a script that
> would ping the server every 5 minutes or something, but I was wondering if
> there might be a more elegant way?
Two ways off the top of my head:
1) Read/watch the edits stream
2) Read/watch the HDFS audit log
Given the latter is text built by log4j, that should be relatively simple to implement.
There was a JIRA asking for this functionality to be built in recently, btw.
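
To illustrate option 2, here is a minimal Python sketch that picks file
creations out of audit log lines. It assumes the usual key=value layout
(ugi=, ip=, cmd=, src=, dst=, perm=) and that paths contain no whitespace;
check the format your NameNode actually writes before relying on it. The
function names are invented for this example:

```python
import re

# key=value pairs as they appear in an audit line; values end at whitespace
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_audit_line(line):
    """Return the key=value fields of an audit line as a dict, or None."""
    if "audit" not in line:
        return None
    fields = dict(FIELD_RE.findall(line))
    if "cmd" in fields and "src" in fields:
        return fields
    return None


def created_paths(lines, prefix):
    """Yield the src path of every create event under the given prefix."""
    for line in lines:
        fields = parse_audit_line(line)
        if fields and fields["cmd"] == "create" and fields["src"].startswith(prefix):
            yield fields["src"]
```

created_paths() accepts any iterable of lines, so it can be fed an open log
file or a tail-style follower. One caveat: as far as I know, the create
event is logged when the file is opened for writing, not when the writer
closes it, so watching for a rename into the final location may be a safer
trigger.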
Re: A way to monitor HDFS for a file to come live, and then kick off
a job?
Posted by Mapred Learn <ma...@gmail.com>.
Does the Oozie coordinator work? Last time I tried it, it had a lot of problems:
i) jobs from start to end_timestamp were all submitted at once, not
at their actual wall-clock times.
ii) The links to the jobs in a particular coordinator workflow were
not working, i.e. you could not see the progress of the running jobs.
-JJ
On Fri, Mar 25, 2011 at 7:25 AM, Bai, Gang <de...@baigang.net> wrote:
> Hi Jon,
>
> Oozie could handle this nicely. You can just specify an Oozie coordinator
> job. But if you don't have an Oozie server handy, cron jobs could also meet
> your needs.
>
> Regards,
> -BaiGang
>
>
> On Fri, Mar 25, 2011 at 1:09 AM, Jonathan Coveney <jc...@gmail.com>wrote:
>
>> I am not sure if this is the right listserv, forgive me if it is not. My
>> goal is this: monitor HDFS until a file is created, and then kick off a job.
>> Ideally I'd want to do this continuously, but the file would be created
>> hourly (with some sort of variance). I guess I could make a script that
>> would ping the server every 5 minutes or something, but I was wondering if
>> there might be a more elegant way?
>>
>> Thanks
>> Jon
>>
>
>
Re: A way to monitor HDFS for a file to come live, and then kick off
a job?
Posted by "Bai, Gang" <de...@baigang.net>.
Hi Jon,
Oozie could handle this nicely. You can just specify an Oozie coordinator
job. But if you don't have an Oozie server handy, cron jobs could also meet
your needs.
Regards,
-BaiGang
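
For the cron route, a small Python sketch of the check itself. It shells out
to "hadoop fs -test -e", which exits with status 0 when the path exists, and
keeps a local marker file so the same hourly file does not launch the job
twice. The year/month/day/hour directory layout, the marker naming and the
launch callback are all assumptions made for the example:

```python
import os
import subprocess
from datetime import datetime


def hourly_path(base, when):
    """Build the expected path for one hour (hypothetical layout)."""
    return "%s/%s" % (base.rstrip("/"), when.strftime("%Y/%m/%d/%H"))


def hdfs_path_exists(path):
    """Exit status 0 from `hadoop fs -test -e` means the path exists."""
    return subprocess.call(["hadoop", "fs", "-test", "-e", path]) == 0


def check_and_launch(hdfs_path, marker_dir, launch, exists=hdfs_path_exists):
    """Launch the job once per path; return True if it was launched now."""
    marker_name = hdfs_path.strip("/").replace("/", "_") + ".done"
    marker = os.path.join(marker_dir, marker_name)
    if os.path.exists(marker):
        return False  # already handled on an earlier cron run
    if not exists(hdfs_path):
        return False  # not there yet; the next cron run will retry
    launch(hdfs_path)  # hypothetical callback that submits the job
    # The marker is written only after launch() returns without raising.
    open(marker, "w").close()
    return True
```

A crontab entry running this every five minutes with
hourly_path(base, datetime.now()) would cover the "hourly, with some
variance" case, though files that arrive after the hour has rolled over
would need the previous hour checked as well.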
On Fri, Mar 25, 2011 at 1:09 AM, Jonathan Coveney <jc...@gmail.com>wrote:
> I am not sure if this is the right listserv, forgive me if it is not. My
> goal is this: monitor HDFS until a file is created, and then kick off a job.
> Ideally I'd want to do this continuously, but the file would be created
> hourly (with some sort of variance). I guess I could make a script that
> would ping the server every 5 minutes or something, but I was wondering if
> there might be a more elegant way?
>
> Thanks
> Jon
>
Re: A way to monitor HDFS for a file to come live, and then kick
off a job?
Posted by David Rosenstrauch <da...@darose.net>.
On 03/24/2011 01:09 PM, Jonathan Coveney wrote:
> I am not sure if this is the right listserv, forgive me if it is not. My
> goal is this: monitor HDFS until a file is created, and then kick off a job.
> Ideally I'd want to do this continuously, but the file would be created
> hourly (with some sort of variance). I guess I could make a script that
> would ping the server every 5 minutes or something, but I was wondering if
> there might be a more elegant way?
>
> Thanks
> Jon
I suppose you could do this using HDFS, but it sounds to me like
ZooKeeper is much better suited to this type of application. You could
just add a watcher on a particular ZooKeeper node and you'd get notified
about updates to it and its children.
HTH,
DR