Posted to user@pig.apache.org by John Farrelly <Jo...@adaptivemobile.com> on 2013/03/27 11:24:05 UTC
Don't process already processed files?
Hi there,
In our system, we have multiple pig scripts that run against a particular HDFS directory. The pig scripts can run at different times, and are scheduled to run regularly. Is there a way to point a pig script at the same directory for multiple executions, but make sure that it only processes new files that it hasn't seen before? I was thinking of using a custom PathFilter for my loader, but I thought I would ask whether there is already a way to do this, rather than reinventing the wheel (!).
Thanks,
John.
****************************************************************************************
This email and any files transmitted with are confidential and intended solely for the
use of the individual or entity to whom they are addressed. If you have received this
email in error then please delete it and notify the sender. Do not make a copy or forward
it to anyone. This footnote also confirms that this email message has been swept for the
presence of computer viruses.

Adaptive Mobile Security Ltd, Ferry House, 48 Lower Mount Street, Dublin 2, Ireland
Directors: B. Collins, G. Maclachlan (UK), N. Grierson (UK), J. Ennis (UK), D. Summers (UK).
Registered in Ireland, Company No. 370343, VAT Reg.No.IE6390343O
****************************************************************************************
RE: Don't process already processed files?
Posted by John Farrelly <Jo...@adaptivemobile.com>.
Thanks Bill.
Option 2 is what I've started coding, as I have multiple pig scripts that need to process the same files, and the pig scripts run at different times.
Regards,
John.
-----Original Message-----
From: Bill Graham [mailto:billgraham@gmail.com]
Sent: 27 March 2013 15:15
To: user@pig.apache.org
Subject: Re: Don't process already processed files?
Yes, the state of what files have been processed needs to be tracked outside of the script somehow. Two other approaches come to mind as well:
- Use the HDFS file system as a work queue. For example, move files from /incoming to /processed after processing them.
- Put files in a time-partitioned directory and run your jobs for explicit time intervals. This approach is more common.
Re: Don't process already processed files?
Posted by Bill Graham <bi...@gmail.com>.
Yes, the state of what files have been processed needs to be tracked
outside of the script somehow. Two other approaches come to mind as well:
- Use the HDFS file system as a work queue. For example, move files from
/incoming to /processed after processing them.
- Put files in a time-partitioned directory and run your jobs for explicit
time intervals. This approach is more common.
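A rough Python sketch of both approaches follows. Everything named here is an assumption for illustration: the /logs/YYYY/MM/DD/HH layout, the hourly granularity, and the helper names are hypothetical and not something the thread prescribes. The only external command referenced is `hadoop fs -mv`, and the sketch only builds the command line rather than running it.

```python
import posixpath
from datetime import datetime, timedelta


def hourly_partitions(base_dir, start, end):
    """Return one directory per hour in the half-open interval [start, end).

    Assumes a hypothetical base_dir/YYYY/MM/DD/HH layout. The start time is
    truncated to the top of its hour, so a partial first hour is included.
    """
    base = base_dir.rstrip("/")
    current = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while current < end:
        paths.append(f"{base}/{current:%Y/%m/%d/%H}")
        current += timedelta(hours=1)
    return paths


def move_command(path, processed_dir="/processed"):
    """Build the `hadoop fs -mv` command that retires one processed file.

    The file keeps its name and lands directly under processed_dir.
    """
    target = posixpath.join(processed_dir, posixpath.basename(path))
    return ["hadoop", "fs", "-mv", path, target]


if __name__ == "__main__":
    # Time-partitioned run: each job names its interval explicitly.
    for directory in hourly_partitions(
        "/logs", datetime(2013, 3, 27, 14), datetime(2013, 3, 27, 16)
    ):
        print(directory)
    # Work-queue run: print the move to issue once a file has been processed.
    print(" ".join(move_command("/incoming/a.log")))
```

Note that the work-queue variant fits a single consumer best: once a file has been moved out of /incoming, other scripts reading that directory no longer see it. The time-partitioned variant has no shared state, so several scripts can each cover the same interval independently.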
On Wed, Mar 27, 2013 at 7:30 AM, John Farrelly <
JohnFarrelly@adaptivemobile.com> wrote:
> Thanks Mike. That's what I was thinking, but I was wondering if (hoping!)
> there was something already to do it :)
>
> Thanks,
> John.
--
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*
RE: Don't process already processed files?
Posted by John Farrelly <Jo...@adaptivemobile.com>.
Thanks Mike. That's what I was thinking, but I was wondering if (hoping!) there was something already to do it :)
Thanks,
John.
-----Original Message-----
From: Mike Sukmanowsky [mailto:mike@parsely.com]
Sent: 27 March 2013 14:05
To: user@pig.apache.org
Subject: Re: Don't process already processed files?
It's probably less work to have some kind of script control Pig execution, keep track of what's been processed, and pass an input path to your Pig script dynamically. For example, you could create a control.py/rb/sh file which would somehow keep track of what's been processed (maybe in a simple file) and then figure out the input path to pass to pig during execution via a parameter: pig --param inputpath="/some/dynamic/input/path/for/pig".
You'd then set up your cron job to run your control script instead of the Pig script directly.
Re: Don't process already processed files?
Posted by Mike Sukmanowsky <mi...@parsely.com>.
It's probably less work to have some kind of script control Pig execution,
keep track of what's been processed, and pass an input path to your Pig
script dynamically. For example, you could create a control.py/rb/sh file
which would somehow keep track of what's been processed (maybe in a simple
file) and then figure out the input path to pass to pig during execution
via a parameter: pig --param inputpath="/some/dynamic/input/path/for/pig".
You'd then set up your cron job to run your control script instead of the
Pig script directly.
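A minimal sketch of such a control script in Python, under several assumptions: the state is a plain text file with one processed path per line, each Pig script gets its own state file (so that scripts running at different times track their progress separately), and the directory listing is handed in as a list of paths, however it is obtained. The function names and file names are hypothetical. The sketch passes the paths with Pig's `-param` option as a comma-separated value, which relies on the load function accepting a comma-separated list of locations; that is worth confirming against the loader in use.

```python
import subprocess
from pathlib import Path


def load_processed(state_file):
    """Return the set of paths recorded in the state file (empty if none yet)."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def select_new_files(listing, processed):
    """Return the listed paths that have not been recorded yet, sorted."""
    return sorted(set(listing) - set(processed))


def build_pig_command(script, new_files):
    """Build the pig invocation with the new files as one inputpath parameter."""
    return ["pig", "-param", "inputpath=" + ",".join(new_files), script]


def record_processed(state_file, new_files):
    """Append the newly processed paths to the state file."""
    with open(state_file, "a") as handle:
        for name in new_files:
            handle.write(name + "\n")


def run_once(script, state_file, listing):
    """Run the Pig script over unseen files and record them on success.

    subprocess.run(..., check=True) raises if pig exits non-zero, so files
    from a failed run are not recorded and will be picked up next time.
    """
    new_files = select_new_files(listing, load_processed(state_file))
    if not new_files:
        return []
    subprocess.run(build_pig_command(script, new_files), check=True)
    record_processed(state_file, new_files)
    return new_files
```

Cron would then call something like `run_once("report.pig", "report.state", listing)` instead of invoking pig directly, and inside the Pig script the value is available as `$inputpath`.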
On Wed, Mar 27, 2013 at 6:24 AM, John Farrelly <
JohnFarrelly@adaptivemobile.com> wrote:
> Hi there,
>
> In our system, we have multiple pig scripts that run against a particular
> HDFS directory. The pig scripts can run at different times, and are
> scheduled to run regularly. Is there a way to point a pig script at the
> same directory for multiple executions, but make sure that it only
> processes new files that it hasn't seen before? I was thinking of using a
> custom PathFilter for my loader, but I thought I would ask to see if there
> is already a way to do this, rather than me reinventing the wheel (!).
>
> Thanks,
> John.
--
Mike Sukmanowsky
Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY 10018
p: +1 (416) 953-4248
e: mike@parsely.com