Posted to user@pig.apache.org by Chris Riccomini <cr...@linkedin.com> on 2009/07/08 19:52:54 UTC

Clear temp files

Hi All,

Is there an easy way to clear temp files that Pig creates when a script
runs?

I tried adding RMF /tmp/tmp* and /tmp/temp*, but it doesn't seem to work
(although it doesn't fail either...).

Thanks!
Chris

Re: Clear temp files

Posted by Yiping Han <yh...@yahoo-inc.com>.
Even a simple solution could work: every time Pig executes, log the
temporary directory to a local log file. Then we could just write a simple
script to clean up the orphaned temp files.


--Yiping
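
For illustration, a minimal sketch of that ledger idea, assuming a wrapper
around Pig that knows each job's temp directory; the log path and helper
names here are made up:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class TempLedger {
        static final Path LEDGER = Paths.get("/var/log/pig-temp-dirs.log");

        // Called by the wrapper when a Pig job is launched.
        public static void record(String hdfsTempDir) throws IOException {
            Files.write(LEDGER, (hdfsTempDir + "\n").getBytes(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        // Called by the cleanup script once the logged jobs are done.
        public static void sweep() throws Exception {
            if (!Files.exists(LEDGER)) return;
            FileSystem fs = FileSystem.get(new Configuration());
            for (String line : Files.readAllLines(LEDGER)) {
                if (line.trim().isEmpty()) continue;
                fs.delete(new org.apache.hadoop.fs.Path(line.trim()), true);
            }
            Files.deleteIfExists(LEDGER); // start fresh for the next run
        }
    }

The sweep should only run once the logged jobs are known to be finished,
since the files are live intermediate data while a script is still executing.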



----------------------
Yiping Han
F-3140 
(408)349-4403
yhan@yahoo-inc.com


Re: Clear temp files

Posted by Dmitriy Ryaboy <dv...@cloudera.com>.
Just thinking out loud: if all running Pig queries registered themselves
with some service (ZooKeeper?), it would become possible to write a "vacuum"
utility that can occasionally scan the namespace and remove temp files that
do not belong to any currently registered job. Then we don't have to rely on
the hooks getting called and completing properly.
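
For illustration, a rough sketch of what such a vacuum could look like using
the ZooKeeper Java API. The /pig-jobs znode layout, the temp-<jobId>
directory naming, and the connection string are all assumptions here, not an
existing utility:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class TempVacuum {
        // A running Pig job would register itself like this; the ephemeral
        // node disappears automatically when the job's session ends.
        static void register(ZooKeeper zk, String jobId) throws Exception {
            zk.create("/pig-jobs/" + jobId, new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {});
            // One sweep: any /tmp entry named temp-<jobId> whose job is no
            // longer registered is an orphan and can be removed.
            Set<String> live = new HashSet<>(zk.getChildren("/pig-jobs", false));
            FileSystem fs = FileSystem.get(new Configuration());
            for (FileStatus stat : fs.listStatus(new Path("/tmp"))) {
                String name = stat.getPath().getName();
                if (name.startsWith("temp-") && !live.contains(name.substring(5))) {
                    fs.delete(stat.getPath(), true);
                }
            }
            zk.close();
        }
    }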


RE: Clear temp files

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Pig will clean up the files if the job fails, but not if it is killed. We tried to address killed jobs by setting a shutdown hook; however, Hadoop also uses a shutdown hook to close HDFS, and the order of the hooks is not guaranteed, so periodically the script would fail.

Olga
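
For illustration, a minimal sketch (not Pig's actual code) of this kind of
cleanup hook and why the ordering matters; the tmpPath parameter is
hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TempCleanupHook {
        public static void register(final Path tmpPath) {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    FileSystem fs = FileSystem.get(new Configuration());
                    fs.delete(tmpPath, true); // recursive delete of the temp dir
                } catch (Exception e) {
                    // If Hadoop's own shutdown hook has already closed the
                    // FileSystem, this call fails -- the JVM gives no ordering
                    // guarantee between the two hooks.
                    System.err.println("Temp cleanup failed: " + e.getMessage());
                }
            }));
        }
    }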



Re: Clear temp files

Posted by Chris Riccomini <cr...@linkedin.com>.
Hi Pallavi,

Yes, it looks like we may be going down the same path. Regarding the hadoop
dfs -rmr, that DID work. What did NOT work was issuing RMF /tmp/tmp* from
Pig itself.

Thanks!
Chris




RE: Clear temp files

Posted by "Palleti, Pallavi" <pa...@corp.aol.com>.
Hi Chris,

We faced a similar issue too, and we ended up writing a cron job that deletes temporary files more than one day old from the /tmp directory of HDFS. I am not sure why "hadoop dfs -rmr /tmp/tmp*" didn't work for you, as it worked for me when I tried it manually.

Thanks
Pallavi
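
A sketch of that kind of age-based sweep using the Hadoop FileSystem Java
API, assuming Pig's temp directories sit directly under /tmp and start with
"tmp" or "temp" (the actual cron job may well have been a shell script
around hadoop dfs commands):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OldTempCleaner {
        public static void main(String[] args) throws Exception {
            long oneDayMs = 24L * 60 * 60 * 1000;
            long cutoff = System.currentTimeMillis() - oneDayMs;
            FileSystem fs = FileSystem.get(new Configuration());
            for (FileStatus stat : fs.listStatus(new Path("/tmp"))) {
                String name = stat.getPath().getName();
                boolean looksTemp = name.startsWith("tmp") || name.startsWith("temp");
                // Only remove entries untouched for more than a day, so the
                // sweep does not delete intermediate data from a script that
                // is still running.
                if (looksTemp && stat.getModificationTime() < cutoff) {
                    fs.delete(stat.getPath(), true);
                }
            }
        }
    }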



Re: Clear temp files

Posted by Yiping Han <yh...@yahoo-inc.com>.
Yes, I second this. There should be an option so that the user can choose to
let Pig delete these temp files when a job fails or is killed.


--Yiping



----------------------
Yiping Han
F-3140 
(408)349-4403
yhan@yahoo-inc.com


Re: Clear temp files

Posted by Chris Riccomini <cr...@linkedin.com>.
Understood. I believe the temp files remain when the script fails or is
killed. This is a bit of a bummer since some of the temp files are 1+ TB,
although I can't think of an easy way to fix the problem.

Thanks for the information.

Chris




RE: Clear temp files

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Pradeep is absolutely right. As for your command, Pig does not support globs for DFS commands - only in the load statement. The reason you don't see an error is that the rmf command does not error out if the file is not found. If you run rm, you would see an error.

Olga


RE: Clear temp files

Posted by Pradeep Kamath <pr...@yahoo-inc.com>.
Temp files created on DFS by Pig during execution store intermediate results used by later statements in the script, and should not be deleted while the script is executing.

Pig cleans up these intermediate files once the script execution completes. Do you see temp files even after a run? If so, can you attach a Pig script with sample data that shows this behavior?

Pradeep
