Posted to mapreduce-user@hadoop.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/03/24 18:09:45 UTC

A way to monitor HDFS for a file to come live, and then kick off a job?

I am not sure if this is the right listserv; forgive me if it is not. My
goal is this: monitor HDFS until a file is created, and then kick off a job.
Ideally I'd want to do this continuously, but the file would be created
hourly (with some variance). I guess I could write a script that
polls the server every 5 minutes or so, but I was wondering if
there might be a more elegant way?

Thanks
Jon
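
For reference, the polling script mentioned in the question could look roughly like the sketch below. It shells out to `hadoop fs -test -e`, which exits with status 0 when the path exists. The function names, the 300-second default interval, and the injectable `exists` and `sleep` parameters are illustrative choices, not anything prescribed in this thread.

```python
import subprocess
import time


def hdfs_path_exists(path):
    """Return True if `hadoop fs -test -e PATH` exits with status 0."""
    return subprocess.call(["hadoop", "fs", "-test", "-e", path]) == 0


def wait_for_path(path, exists=hdfs_path_exists, interval=300,
                  max_checks=None, sleep=time.sleep):
    """Poll until exists(path) is true.

    Returns the number of checks it took, or None if max_checks
    checks were made without the path appearing.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if exists(path):
            return checks
        sleep(interval)
    return None

# Typical use (the path is hypothetical):
#   if wait_for_path("/data/incoming/current/_DONE") is not None:
#       ...submit the job here...
```

Waiting for a marker file that the producer writes last, rather than for the data file itself, avoids starting the job while the data is still being written.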

Re: A way to monitor HDFS for a file to come live, and then kick off a job?

Posted by Eric <er...@gmail.com>.
You can also mount HDFS through FUSE and use a cron job to check whether new
files have arrived. Make sure the script creates and checks a pid file so that
it won't run again before the previous run has finished.
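
A minimal sketch of the cron-plus-pid-file idea, assuming HDFS is already mounted through FUSE so that ordinary filesystem calls work on it. All names here (`run_once`, the pid file location, the `launch_job` callback) are hypothetical, and the liveness check via `os.kill(pid, 0)` is POSIX-only.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_pid_file(pid_path):
    """Create pid_path with our pid; return False if a live run holds it."""
    try:
        fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            with open(pid_path) as f:
                old_pid = int(f.read().strip())
        except ValueError:
            old_pid = None  # unreadable pid file, treated as stale
        if old_pid is not None and pid_is_running(old_pid):
            return False
        os.remove(pid_path)  # stale file left by a crashed run
        return acquire_pid_file(pid_path)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def run_once(pid_path, watched_file, launch_job):
    """One cron tick: launch the job if the file is present and no run is active."""
    if not acquire_pid_file(pid_path):
        return "busy"
    try:
        if not os.path.exists(watched_file):
            return "absent"
        launch_job(watched_file)
        return "launched"
    finally:
        os.remove(pid_path)
```

As written, this launches the job on every tick while the file is present, so the job (or the script) would also need to move or mark the file once it has been processed.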

Re: A way to monitor HDFS for a file to come live, and then kick off a job?

Posted by Lance Norskog <go...@gmail.com>.
Hamake does exactly this:

http://code.google.com/p/hamake/

-- 
Lance Norskog
goksron@gmail.com

Re: A way to monitor HDFS for a file to come live, and then kick off a job?

Posted by Allen Wittenauer <aw...@apache.org>.
On Mar 24, 2011, at 10:09 AM, Jonathan Coveney wrote:

> I am not sure if this is the right listserv, forgive me if it is not.

	A better choice would likely be hdfs-user@, since this is really about watching files in HDFS.


> My
> goal is this: monitor HDFS until a file is created, and then kick off a job.
> Ideally I'd want to do this continuously, but the file would be created
> hourly (with some sort of variance). I guess I could make a script that
> would ping the server every 5 minutes or something, but I was wondering if
> there might be a more elegant way?

	Two ways off the top of my head:

	1) Read/watch the edits stream

	2) Read/watch the HDFS audit log

	Given the latter is text built by log4j, that should be relatively simple to implement.

There was a JIRA asking for this functionality to be built in recently, btw.
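
To illustrate option 2, here is a rough sketch of filtering audit log lines for file creations. It assumes each line carries whitespace-separated key=value fields such as `cmd=create` and `src=/some/path` after the log4j prefix; check this against the audit log of your own Hadoop version before relying on it. The function names and the sample line in the comment are made up for illustration, and paths containing whitespace are not handled.

```python
import re

# Assumed line shape (fields separated by tabs or spaces):
#   <timestamp> INFO ...audit: ugi=jon ip=/10.0.0.5 cmd=create src=/data/in/part-0 dst=null perm=...
FIELD = re.compile(r"(\w+)=(\S*)")


def parse_audit_line(line):
    """Return the key=value fields of one audit line as a dict."""
    return dict(FIELD.findall(line))


def created_paths(lines, prefix):
    """Yield the src path of every cmd=create event under prefix."""
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("cmd") == "create" and fields.get("src", "").startswith(prefix):
            yield fields["src"]
```

Feeding this from something like `tail -F` on the NameNode's audit log would give a stream of new paths. As far as I know the create event is logged when the file is opened for writing, so seeing it does not mean the file is complete.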

Re: A way to monitor HDFS for a file to come live, and then kick off a job?

Posted by Mapred Learn <ma...@gmail.com>.
Does the Oozie coordinator work? Last time I tried it, it had a lot of problems:

i) Jobs from start to end_timestamp were all submitted at once, not
at the actual wall-clock time.

ii) The links to the jobs in a particular coordinator workflow were
not working, i.e. you could not see the progress of the running jobs.

-JJ

Re: A way to monitor HDFS for a file to come live, and then kick off a job?

Posted by "Bai, Gang" <de...@baigang.net>.
Hi Jon,

Oozie could handle this nicely. You can just specify an Oozie coordinator
job. But if you don't have an Oozie server handy, cron jobs could also meet
your needs.

Regards,
-BaiGang

Re: A way to monitor HDFS for a file to come live, and then kick off a job?

Posted by David Rosenstrauch <da...@darose.net>.
I suppose you could do this using HDFS, but it sounds to me like
ZooKeeper is much better suited to this type of application.  You could
just add a watcher to a particular ZooKeeper node and you'd get notified
about updates to it and its children.

HTH,

DR
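
A sketch of the watcher idea in Python, assuming the third-party kazoo client library and assuming the process that writes the HDFS file also creates a znode afterwards to announce it (ZooKeeper does not see HDFS changes by itself). The znode path, the host string, and the function names are hypothetical.

```python
def make_watcher(launch_job):
    """Build a DataWatch-style callback that fires once the znode exists."""
    def on_change(data, stat):
        if stat is None:
            return True   # znode not created yet, keep watching
        launch_job(data)  # data is the znode payload (bytes)
        return False      # returning False stops this kazoo watch
    return on_change


def watch_znode(hosts, znode, launch_job):
    """Register the watcher; needs kazoo and a running ZooKeeper ensemble."""
    # Imported here so make_watcher can be used without kazoo installed.
    from kazoo.client import KazooClient
    zk = KazooClient(hosts=hosts)
    zk.start()
    zk.DataWatch(znode, make_watcher(launch_job))
    return zk

# Typical use (host and znode are hypothetical):
#   zk = watch_znode("zkhost:2181", "/jobs/hourly/ready", submit_job)
```

This version stops after the first notification. For the hourly case the callback could return True instead and compare `stat.version` with the last version it handled, so that each update from the producer triggers one run.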