Posted to user@hadoop.apache.org by Jean-Marc Spaggiari <je...@spaggiari.org> on 2013/03/26 01:16:17 UTC

Auto clean DistCache?

Hi,

Each time my MR job is run, a directory is created on the TaskTracker
under mapred/local/taskTracker/hadoop/distcache (based on my
configuration).

I looked at the directory today, and it's hosting thousands of
directories and more than 8GB of data there.

Is there a way to automatically delete this directory when the job is done?

Thanks,

JM

Re: Auto clean DistCache?

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
You can control the size limit of these cache files; the default is 10 GB (a value of 10737418240L). Try changing local.cache.size or mapreduce.tasktracker.cache.local.size in mapred-site.xml.
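
For illustration, a minimal mapred-site.xml sketch capping the cache at 5 GB could look like this (value in bytes; I'm assuming the TaskTracker is restarted afterwards so it picks up the new limit):

<property>
  <name>local.cache.size</name>
  <value>5368709120</value>
</property>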

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari wrote:

> Hi,
> 
> Each time my MR job is run, a directory is created on the TaskTracker
> under mapred/local/taskTracker/hadoop/distcache (based on my
> configuration).
> 
> I looked at the directory today, and it's hosting thousands of
> directories and more than 8GB of data there.
> 
> Is there a way to automatically delete this directory when the job is done?
> 
> Thanks,
> 
> JM


Re: Auto clean DistCache?

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
I don't think it is documented in mapred-default.xml, where it should
ideally be. I could see it only in the code. You can take a look at it
here, if you are interested: http://goo.gl/k5xsI

Thanks
Hemanth


On Wed, Mar 27, 2013 at 7:07 PM, Jean-Marc Spaggiari
<jean-marc@spaggiari.org> wrote:

> Oh! Good to know! It keeps track even of month-old entries??? There is no
> TTL?
>
> I was not able to find the documentation for local.cache.size or
> mapreduce.tasktracker.cache.local.size in the 1.0.x branch. Do you know
> where I can find that?
>
> Thanks,
>
> JM
>
> 2013/3/27 Koji Noguchi <kn...@yahoo-inc.com>:
> >> Else, I will go for a custom script to delete all directories (and
> >> content) older than 2 or 3 days…
> >>
> > TaskTracker (or NodeManager in 2.*) keeps the list of dist cache
> > entries in memory.
> > So if an external process (like your script) starts deleting dist cache
> > files, there will be inconsistency and you'll start seeing task
> > initialization failures due to file-not-found errors.
> >
> > Koji
> >
> >
> > On Mar 26, 2013, at 9:00 PM, Jean-Marc Spaggiari wrote:
> >
> >> The situation I faced was really a disk space issue, not related
> >> to the number of files. It was writing on a small partition.
> >>
> >> I will try with local.cache.size or
> >> mapreduce.tasktracker.cache.local.size to see if I can keep the final
> >> total size under 5GB... Else, I will go for a custom script to
> >> delete all directories (and content) older than 2 or 3 days...
> >>
> >> Thanks,
> >>
> >> JM
> >>
> >> 2013/3/26 Abdelrahman Shettia <as...@hortonworks.com>:
> >>> Let me clarify: if there are lots of files or directories, up to 32K
> >>> (depending on the user's OS config for the maximum # of files), in
> >>> those distributed cache dirs, the OS will not be able to create any
> >>> more files/dirs, and thus M-R jobs won't get initiated on those
> >>> tasktracker machines. Hope this helps.
> >>>
> >>>
> >>> Thanks
> >>>
> >>>
> >>> On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli
> >>> <vi...@hortonworks.com> wrote:
> >>>>
> >>>>
> >>>> Not all of the files are ever opened at the same time, so you
> >>>> shouldn't see any "# of open files exceeded" errors.
> >>>>
> >>>> Thanks,
> >>>> +Vinod Kumar Vavilapalli
> >>>> Hortonworks Inc.
> >>>> http://hortonworks.com/
> >>>>
> >>>> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
> >>>>
> >>>> Hi JM,
> >>>>
> >>>> Actually these dirs need to be purged by a script that keeps the
> >>>> last 2 days' worth of files; otherwise you may run into a "# of open
> >>>> files exceeded" error.
> >>>>
> >>>> Thanks
> >>>>
> >>>>
> >>>> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari
> >>>> <jean-marc@spaggiari.org> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Each time my MR job is run, a directory is created on the TaskTracker
> >>>> under mapred/local/taskTracker/hadoop/distcache (based on my
> >>>> configuration).
> >>>>
> >>>> I looked at the directory today, and it's hosting thousands of
> >>>> directories and more than 8GB of data there.
> >>>>
> >>>> Is there a way to automatically delete this directory when the job is
> >>>> done?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> JM
> >>>
> >
>

Re: Auto clean DistCache?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Thanks Harsh. My issue was not related to the number of files/folders
but related to the total size of the DistributedCache. The directory
where it's stored only has 7GB available... So I will set the limit
to 5GB with local.cache.size, or move it to the drives where I have
the dfs files stored.
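
To verify the limit is actually being enforced, I can check the directory size on each node from time to time, e.g. (path relative to my mapred.local.dir, so adjust for your own configuration):

du -sh mapred/local/taskTracker/hadoop/distcache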

Thanks,

JM

2013/3/28 Harsh J <ha...@cloudera.com>:
> The DistributedCache is cleaned automatically, and no user intervention
> (aside from size-limit changes, which may be an administrative
> requirement) is generally required to delete older distributed
> cache files.
>
> This is observable in the code and is also noted in TDG, 2nd ed.:
>
> Tom White:
> """
> The tasktracker also maintains a reference count for the number of
> tasks using each file in the cache. Before the task has run, the
> file’s reference count is incremented by one; then after the task has
> run, the count is decreased by one. Only when the count reaches zero
> it is eligible for deletion, since no tasks are using it. Files are
> deleted to make room for a new file when the cache exceeds a certain
> size—10 GB by default. The cache size may be changed by setting the
> configuration property local.cache.size, which is measured in bytes.
> """
>
> Also, the maximum number of allowed dirs is now checked automatically,
> so as not to violate the OS's limits.
>
> On Wed, Mar 27, 2013 at 7:07 PM, Jean-Marc Spaggiari
> <je...@spaggiari.org> wrote:
>> Oh! Good to know! It keeps track even of month-old entries??? There is no TTL?
>>
>> I was not able to find the documentation for local.cache.size or
>> mapreduce.tasktracker.cache.local.size in the 1.0.x branch. Do you know
>> where I can find that?
>>
>> Thanks,
>>
>> JM
>>
>> 2013/3/27 Koji Noguchi <kn...@yahoo-inc.com>:
>>>> Else, I will go for a custom script to delete all directories (and content) older than 2 or 3 days…
>>>>
>>> TaskTracker (or NodeManager in 2.*) keeps the list of dist cache entries in memory.
>>> So if an external process (like your script) starts deleting dist cache files, there will be inconsistency and you'll start seeing task initialization failures due to file-not-found errors.
>>>
>>> Koji
>>>
>>>
>>> On Mar 26, 2013, at 9:00 PM, Jean-Marc Spaggiari wrote:
>>>
>>>> The situation I faced was really a disk space issue, not related
>>>> to the number of files. It was writing on a small partition.
>>>>
>>>> I will try with local.cache.size or
>>>> mapreduce.tasktracker.cache.local.size to see if I can keep the final
>>>> total size under 5GB... Else, I will go for a custom script to
>>>> delete all directories (and content) older than 2 or 3 days...
>>>>
>>>> Thanks,
>>>>
>>>> JM
>>>>
>>>> 2013/3/26 Abdelrahman Shettia <as...@hortonworks.com>:
>>>>> Let me clarify: if there are lots of files or directories, up to 32K
>>>>> (depending on the user's OS config for the maximum # of files), in those
>>>>> distributed cache dirs, the OS will not be able to create any more
>>>>> files/dirs, and thus M-R jobs won't get initiated on those tasktracker
>>>>> machines. Hope this helps.
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli
>>>>> <vi...@hortonworks.com> wrote:
>>>>>>
>>>>>>
>>>>>> Not all of the files are ever opened at the same time, so you
>>>>>> shouldn't see any "# of open files exceeded" errors.
>>>>>>
>>>>>> Thanks,
>>>>>> +Vinod Kumar Vavilapalli
>>>>>> Hortonworks Inc.
>>>>>> http://hortonworks.com/
>>>>>>
>>>>>> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
>>>>>>
>>>>>> Hi JM,
>>>>>>
>>>>>> Actually these dirs need to be purged by a script that keeps the last
>>>>>> 2 days' worth of files; otherwise you may run into a "# of open files
>>>>>> exceeded" error.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <je...@spaggiari.org>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Each time my MR job is run, a directory is created on the TaskTracker
>>>>>> under mapred/local/taskTracker/hadoop/distcache (based on my
>>>>>> configuration).
>>>>>>
>>>>>> I looked at the directory today, and it's hosting thousands of
>>>>>> directories and more than 8GB of data there.
>>>>>>
>>>>>> Is there a way to automatically delete this directory when the job is
>>>>>> done?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>
>>>
>
>
>
> --
> Harsh J

Re: Auto clean DistCache?

Posted by Harsh J <ha...@cloudera.com>.
The DistributedCache is cleaned automatically, and no user intervention
(aside from size-limit changes, which may be an administrative
requirement) is generally required to delete older distributed
cache files.

This is observable in the code and is also noted in TDG, 2nd ed.:

Tom White:
"""
The tasktracker also maintains a reference count for the number of
tasks using each file in the cache. Before the task has run, the
file’s reference count is incremented by one; then after the task has
run, the count is decreased by one. Only when the count reaches zero
it is eligible for deletion, since no tasks are using it. Files are
deleted to make room for a new file when the cache exceeds a certain
size—10 GB by default. The cache size may be changed by setting the
configuration property local.cache.size, which is measured in bytes.
"""

Also, the maximum number of allowed dirs is now checked automatically,
so as not to violate the OS's limits.
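
If it helps to see the shape of it, here is a minimal, hypothetical Java
sketch of that reference-counting scheme (illustrative names only, not the
actual TaskTracker code):

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: entries are pinned while tasks use them; once the
// total size passes the limit, only entries with refCount == 0 are evicted.
class DistCacheSketch {
  private static final long LIMIT = 10L * 1024 * 1024 * 1024; // 10 GB default

  private static final class Entry {
    final long size;
    int refCount; // +1 before a task runs, -1 after it finishes
    Entry(long size) { this.size = size; }
  }

  private final Map<String, Entry> entries = new LinkedHashMap<>(); // oldest first
  private long totalSize = 0;

  // Called when a task needs a cache file localized on this node.
  synchronized void acquire(String path, long size) {
    Entry e = entries.get(path);
    if (e == null) {
      e = new Entry(size);
      entries.put(path, e);
      totalSize += size;
    }
    e.refCount++;
    evictIfOverLimit();
  }

  // Called when a task that used the file finishes.
  synchronized void release(String path) {
    entries.get(path).refCount--;
  }

  private void evictIfOverLimit() {
    Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
    while (totalSize > LIMIT && it.hasNext()) {
      Entry e = it.next().getValue();
      if (e.refCount == 0) { // only entries no task is using are deletable
        totalSize -= e.size;
        it.remove();         // a real implementation deletes the local files here
      }
    }
  }
}

This also shows why deleting the files behind the tracker's back causes
trouble, as Koji pointed out: the in-memory bookkeeping and the on-disk
state would no longer agree.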

On Wed, Mar 27, 2013 at 7:07 PM, Jean-Marc Spaggiari
<je...@spaggiari.org> wrote:
> Oh! Good to know! It keeps track even of month-old entries??? There is no TTL?
>
> I was not able to find the documentation for local.cache.size or
> mapreduce.tasktracker.cache.local.size in the 1.0.x branch. Do you know
> where I can find that?
>
> Thanks,
>
> JM
>
> 2013/3/27 Koji Noguchi <kn...@yahoo-inc.com>:
>>> Else, I will go for a custom script to delete all directories (and content) older than 2 or 3 days…
>>>
>> TaskTracker (or NodeManager in 2.*) keeps the list of dist cache entries in memory.
>> So if an external process (like your script) starts deleting dist cache files, there will be inconsistency and you'll start seeing task initialization failures due to file-not-found errors.
>>
>> Koji
>>
>>
>> On Mar 26, 2013, at 9:00 PM, Jean-Marc Spaggiari wrote:
>>
>>> The situation I faced was really a disk space issue, not related
>>> to the number of files. It was writing on a small partition.
>>>
>>> I will try with local.cache.size or
>>> mapreduce.tasktracker.cache.local.size to see if I can keep the final
>>> total size under 5GB... Else, I will go for a custom script to
>>> delete all directories (and content) older than 2 or 3 days...
>>>
>>> Thanks,
>>>
>>> JM
>>>
>>> 2013/3/26 Abdelrahman Shettia <as...@hortonworks.com>:
>>>> Let me clarify: if there are lots of files or directories, up to 32K
>>>> (depending on the user's OS config for the maximum # of files), in those
>>>> distributed cache dirs, the OS will not be able to create any more
>>>> files/dirs, and thus M-R jobs won't get initiated on those tasktracker
>>>> machines. Hope this helps.
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli
>>>> <vi...@hortonworks.com> wrote:
>>>>>
>>>>>
>>>>> Not all of the files are ever opened at the same time, so you
>>>>> shouldn't see any "# of open files exceeded" errors.
>>>>>
>>>>> Thanks,
>>>>> +Vinod Kumar Vavilapalli
>>>>> Hortonworks Inc.
>>>>> http://hortonworks.com/
>>>>>
>>>>> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
>>>>>
>>>>> Hi JM,
>>>>>
>>>>> Actually these dirs need to be purged by a script that keeps the last
>>>>> 2 days' worth of files; otherwise you may run into a "# of open files
>>>>> exceeded" error.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <je...@spaggiari.org>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Each time my MR job is run, a directory is created on the TaskTracker
>>>>> under mapred/local/taskTracker/hadoop/distcache (based on my
>>>>> configuration).
>>>>>
>>>>> I looked at the directory today, and it's hosting thousands of
>>>>> directories and more than 8GB of data there.
>>>>>
>>>>> Is there a way to automatically delete this directory when the job is
>>>>> done?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> JM
>>>>
>>



-- 
Harsh J

Re: Auto clean DistCache?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Oh! Good to know! It keeps track even of month-old entries??? There is no TTL?

I was not able to find the documentation for local.cache.size or
mapreduce.tasktracker.cache.local.size in the 1.0.x branch. Do you know
where I can find that?

Thanks,

JM

2013/3/27 Koji Noguchi <kn...@yahoo-inc.com>:
>> Else, I will go for a custom script to delete all directories (and content) older than 2 or 3 days…
>>
> TaskTracker (or NodeManager in 2.*) keeps the list of dist cache entries in memory.
> So if an external process (like your script) starts deleting dist cache files, there will be inconsistency and you'll start seeing task initialization failures due to file-not-found errors.
>
> Koji
>
>
> On Mar 26, 2013, at 9:00 PM, Jean-Marc Spaggiari wrote:
>
>> The situation I faced was really a disk space issue, not related
>> to the number of files. It was writing on a small partition.
>>
>> I will try with local.cache.size or
>> mapreduce.tasktracker.cache.local.size to see if I can keep the final
>> total size under 5GB... Else, I will go for a custom script to
>> delete all directories (and content) older than 2 or 3 days...
>>
>> Thanks,
>>
>> JM
>>
>> 2013/3/26 Abdelrahman Shettia <as...@hortonworks.com>:
>>> Let me clarify: if there are lots of files or directories, up to 32K
>>> (depending on the user's OS config for the maximum # of files), in those
>>> distributed cache dirs, the OS will not be able to create any more
>>> files/dirs, and thus M-R jobs won't get initiated on those tasktracker
>>> machines. Hope this helps.
>>>
>>>
>>> Thanks
>>>
>>>
>>> On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli
>>> <vi...@hortonworks.com> wrote:
>>>>
>>>>
>>>> Not all of the files are ever opened at the same time, so you
>>>> shouldn't see any "# of open files exceeded" errors.
>>>>
>>>> Thanks,
>>>> +Vinod Kumar Vavilapalli
>>>> Hortonworks Inc.
>>>> http://hortonworks.com/
>>>>
>>>> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
>>>>
>>>> Hi JM,
>>>>
>>>> Actually these dirs need to be purged by a script that keeps the last
>>>> 2 days' worth of files; otherwise you may run into a "# of open files
>>>> exceeded" error.
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <je...@spaggiari.org>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Each time my MR job is run, a directory is created on the TaskTracker
>>>> under mapred/local/taskTracker/hadoop/distcache (based on my
>>>> configuration).
>>>>
>>>> I looked at the directory today, and it's hosting thousands of
>>>> directories and more than 8GB of data there.
>>>>
>>>> Is there a way to automatically delete this directory when the job is
>>>> done?
>>>>
>>>> Thanks,
>>>>
>>>> JM
>>>
>

Re: Auto clean DistCache?

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
> Else, I will go for a custom script to delete all directories (and content) older than 2 or 3 days…
>
TaskTracker (or NodeManager in 2.*) keeps the list of dist cache entries in memory.
So if an external process (like your script) starts deleting dist cache files, the tracker's bookkeeping and the disk become inconsistent, and you'll start seeing task initialization failures due to file-not-found errors.

Koji


On Mar 26, 2013, at 9:00 PM, Jean-Marc Spaggiari wrote:

> The situation I faced was really a disk space issue, not related
> to the number of files. The cache was writing to a small partition.
> 
> I will try with local.cache.size or
> mapreduce.tasktracker.cache.local.size to see if I can keep the final
> total size under 5GB... Else, I will go for a custom script to
> delete all directories (and content) older than 2 or 3 days...
> 
> Thanks,
> 
> JM
> 
> 2013/3/26 Abdelrahman Shettia <as...@hortonworks.com>:
>> Let me clarify: if there are lots of files or directories, up to 32K
>> (depending on the user's filesystem/OS configuration), in those distributed
>> cache dirs, the OS will not be able to create any more files/dirs, so M-R
>> jobs won't get initiated on those TaskTracker machines. Hope this helps.
>> 
>> 
>> Thanks
>> 
>> 
>> On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli
>> <vi...@hortonworks.com> wrote:
>>> 
>>> 
>>> All the files are not opened at the same time ever, so you shouldn't see
>>> any "# of open files exceeded" error.
>>> 
>>> Thanks,
>>> +Vinod Kumar Vavilapalli
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>> 
>>> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
>>> 
>>> Hi JM ,
>>> 
>>> Actually these dirs need to be purged by a script that keeps the last 2
>>> days' worth of files; otherwise you may run into a "# of open files
>>> exceeded" error.
>>> 
>>> Thanks
>>> 
>>> 
>>> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <je...@spaggiari.org>
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> 
>>> Each time my MR job is run, a directory is created on the TaskTracker
>>> 
>>> under mapred/local/taskTracker/hadoop/distcache (based on my
>>> 
>>> configuration).
>>> 
>>> 
>>> I looked at the directory today, and it's hosting thousands of
>>> 
>>> directories and more than 8GB of data there.
>>> 
>>> 
>>> Is there a way to automatically delete this directory when the job is
>>> done?
>>> 
>>> 
>>> Thanks,
>>> 
>>> 
>>> JM
>>> 
>>> 
>>> 
>> 


Re: Auto clean DistCache?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
The situation I faced was really a disk space issue, not related
to the number of files. The cache was writing to a small partition.

I will try with local.cache.size or
mapreduce.tasktracker.cache.local.size to see if I can keep the final
total size under 5GB... Else, I will go for a custom script to
delete all directories (and content) older than 2 or 3 days...

Thanks,

JM

2013/3/26 Abdelrahman Shettia <as...@hortonworks.com>:
> Let me clarify: if there are lots of files or directories, up to 32K
> (depending on the user's filesystem/OS configuration), in those distributed
> cache dirs, the OS will not be able to create any more files/dirs, so M-R
> jobs won't get initiated on those TaskTracker machines. Hope this helps.
>
>
> Thanks
>
>
> On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli
> <vi...@hortonworks.com> wrote:
>>
>>
>> All the files are not opened at the same time ever, so you shouldn't see
>> any "# of open files exceeded" error.
>>
>> Thanks,
>> +Vinod Kumar Vavilapalli
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
>>
>> Hi JM ,
>>
>> Actually these dirs need to be purged by a script that keeps the last 2
>> days' worth of files; otherwise you may run into a "# of open files
>> exceeded" error.
>>
>> Thanks
>>
>>
>> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <je...@spaggiari.org>
>> wrote:
>>
>> Hi,
>>
>>
>> Each time my MR job is run, a directory is created on the TaskTracker
>>
>> under mapred/local/taskTracker/hadoop/distcache (based on my
>>
>> configuration).
>>
>>
>> I looked at the directory today, and it's hosting thousands of
>>
>> directories and more than 8GB of data there.
>>
>>
>> Is there a way to automatically delete this directory when the job is
>> done?
>>
>>
>> Thanks,
>>
>>
>> JM
>>
>>
>>
>
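
A minimal mapred-site.xml sketch of the 5GB experiment described above (a
sketch only, assuming a 1.x-era TaskTracker; local.cache.size is the older
property name and mapreduce.tasktracker.cache.local.size the newer one, so
setting both is harmless):

<!-- Cap the TaskTracker's local distributed-cache size at 5 GB.   -->
<!-- The cleanup is size-triggered as new files are localized;     -->
<!-- there is no TTL-based or per-job purge.                       -->
<property>
  <name>local.cache.size</name>
  <value>5368709120</value> <!-- 5 * 1024 * 1024 * 1024 bytes -->
</property>
<property>
  <name>mapreduce.tasktracker.cache.local.size</name>
  <value>5368709120</value>
</property>

Since eviction only kicks in once the total size passes the limit, the cache
directory can still sit at up to the configured size indefinitely; the limit
bounds disk usage rather than cleaning up after each job.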

Re: Auto clean DistCache?

Posted by Abdelrahman Shettia <as...@hortonworks.com>.
Let me clarify: if there are lots of files or directories, up to 32K
(depending on the user's filesystem/OS configuration), in those distributed
cache dirs, the OS will not be able to create any more files/dirs, so M-R
jobs won't get initiated on those TaskTracker machines. Hope this helps.


Thanks


On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli <
vinodkv@hortonworks.com> wrote:

>
> All the files are not opened at the same time ever, so you shouldn't see
> any "# of open files exceeded" error.
>
> Thanks,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
>
> Hi JM ,
>
> Actually these dirs need to be purged by a script that keeps the last 2
> days' worth of files; otherwise you may run into a "# of open files
> exceeded" error.
>
> Thanks
>
>
> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <je...@spaggiari.org>
> wrote:
>
> Hi,
>
>
> Each time my MR job is run, a directory is created on the TaskTracker
>
> under mapred/local/taskTracker/hadoop/distcache (based on my
>
> configuration).
>
>
> I looked at the directory today, and it's hosting thousands of
>
> directories and more than 8GB of data there.
>
>
> Is there a way to automatically delete this directory when the job is done?
>
>
> Thanks,
>
>
> JM
>
>
>
>
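
For the directory-count problem specifically (the 32K limit above), later 1.x
TaskTrackers also carry a subdirectory cap alongside the byte-size cap. A
hedged sketch follows; the property name below is recalled from the branch-1
code (it is not documented in mapred-default.xml either), so verify it against
the version in use before relying on it:

<!-- Cap the number of subdirectories the TaskTracker keeps in its  -->
<!-- local distributed cache; unreferenced entries are evicted once -->
<!-- the count is exceeded. The default is believed to be 10000.    -->
<property>
  <name>mapreduce.tasktracker.cache.local.numberdirectories</name>
  <value>10000</value>
</property>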

Re: Auto clean DistCache?

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
All the files are not opened at the same time ever, so you shouldn't see any "# of open files exceeded" error.

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:

> Hi JM ,
> 
> Actually these dirs need to be purged by a script that keeps the last 2 days' worth of files; otherwise you may run into a "# of open files exceeded" error.
> 
> Thanks
> 
> 
> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:
> 
>> Hi,
>> 
>> Each time my MR job is run, a directory is created on the TaskTracker
>> under mapred/local/taskTracker/hadoop/distcache (based on my
>> configuration).
>> 
>> I looked at the directory today, and it's hosting thousands of
>> directories and more than 8GB of data there.
>> 
>> Is there a way to automatically delete this directory when the job is done?
>> 
>> Thanks,
>> 
>> JM
> 


Re: Auto clean DistCache?

Posted by Abdelrhman Shettia <as...@hortonworks.com>.
Hi JM ,

Actually these dirs need to be purged by a script that keeps the last 2 days' worth of files; otherwise you may run into a "# of open files exceeded" error.

Thanks


On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:

> Hi,
> 
> Each time my MR job is run, a directory is created on the TaskTracker
> under mapred/local/taskTracker/hadoop/distcache (based on my
> configuration).
> 
> I looked at the directory today, and it's hosting thousands of
> directories and more than 8GB of data there.
> 
> Is there a way to automatically delete this directory when the job is done?
> 
> Thanks,
> 
> JM

