Posted to common-user@hadoop.apache.org by Andy Doddington <an...@doddington.net> on 2011/11/25 10:05:07 UTC

Passing data files via the distributed cache

I have a series of mappers that I would like to pass data to using the distributed cache mechanism.
At the moment I am using HDFS to pass the data, but this seems wasteful to me, since they all read the same data.

Is there a piece of example code that shows how data files can be placed in the cache and accessed by mappers?

Thanks,

	Andy Doddington


Re: Passing data files via the distributed cache

Posted by Robert Evans <ev...@yahoo-inc.com>.
There is currently no way to delete the data from the cache when you are done. It is garbage collected when the cache starts to fill up (in LRU order if you are on a newer release).

DistributedCache.addCacheFile modifies the JobConf behind the scenes for you. If you want to dig into the details of what it does, you can look at its source code.

--Bobby Evans
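
[Editor's sketch] To make the point about addCacheFile modifying the JobConf more concrete, the following mimics that bookkeeping with java.util.Properties standing in for the JobConf, so it runs without Hadoop. The property name mapred.cache.files is what the 0.20-era source appears to use and should be treated as an assumption, as should the class name and the example URIs; check the DistributedCache source for your release before relying on any of it.

```java
import java.net.URI;
import java.util.Properties;

public class CacheFileConfSketch {
    // Assumed property name (0.20-era); other releases may use a different key.
    public static final String CACHE_FILES_KEY = "mapred.cache.files";

    // Appends the URI to a comma-separated list stored under a single property.
    public static void addCacheFile(URI uri, Properties conf) {
        String existing = conf.getProperty(CACHE_FILES_KEY);
        conf.setProperty(CACHE_FILES_KEY,
                existing == null ? uri.toString() : existing + "," + uri);
    }

    // Splits the property back into URIs, as the task side would need to.
    public static URI[] getCacheFiles(Properties conf) {
        String value = conf.getProperty(CACHE_FILES_KEY);
        if (value == null || value.isEmpty()) {
            return new URI[0];
        }
        String[] parts = value.split(",");
        URI[] uris = new URI[parts.length];
        for (int i = 0; i < parts.length; i++) {
            uris[i] = URI.create(parts[i]);
        }
        return uris;
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // stand-in for a JobConf
        addCacheFile(URI.create("hdfs:///shared/lookup.txt"), conf);    // hypothetical path
        addCacheFile(URI.create("hdfs:///shared/stopwords.txt"), conf); // hypothetical path
        System.out.println(conf.getProperty(CACHE_FILES_KEY));
    }
}
```

Note that in this picture only the file's location travels in the configuration, not its contents. That is the main difference from putting the data itself into the JobConf as a value.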

On 11/28/11 4:46 AM, "Andy Doddington" <an...@doddington.net> wrote:

Thanks for that link Prashant - very useful.

Two brief follow-up questions:

1) Having put data in the cache, I would like to be a good citizen by deleting the data from the cache once
    I've finished - how do I do that?
2) Would it be simpler to pass the data as a value in the jobConf object?

Thanks,

        Andy D.




Re: Passing data files via the distributed cache

Posted by Andy Doddington <an...@doddington.net>.
Thanks for that link Prashant - very useful.

Two brief follow-up questions:

1) Having put data in the cache, I would like to be a good citizen by deleting the data from the cache once
    I’ve finished - how do I do that?
2) Would it be simpler to pass the data as a value in the jobConf object?

Thanks,

	Andy D.



Re: Passing data files via the distributed cache

Posted by Prashant Kommireddi <pr...@gmail.com>.
I believe you want to ship data to each node in your cluster before MR
begins so the mappers can access files local to their machine. Hadoop
tutorial on YDN has some good info on this.

http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata

-Prashant Kommireddi
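
[Editor's sketch] A minimal illustration of the mapper-side half of this pattern, not taken from the tutorial. Once the framework has copied a cached file to the task's local disk, each task can read it into memory once and then do in-memory lookups per record. In the 0.20-era API the file would be registered in the driver with DistributedCache.addCacheFile(uri, conf) and located inside the task with DistributedCache.getLocalCacheFiles(conf). Those calls appear only in comments here so that the example runs without Hadoop. The class name, the tab-separated file format, and the sample data are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class SideDataLookup {
    private final Map<String, String> table = new HashMap<>();

    // In a real mapper this would run once, from configure() or setup(),
    // with the local path taken from DistributedCache.getLocalCacheFiles(conf).
    public SideDataLookup(Path localFile) throws IOException {
        try (BufferedReader in =
                 Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a key/value separator
                }
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
    }

    // Would be called per input record from map(); no file I/O happens here.
    public String lookup(String key) {
        return table.get(key);
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the localized copy of the cached file.
        Path tmp = Files.createTempFile("side-data", ".tsv");
        Files.write(tmp,
                "uk\tUnited Kingdom\nfr\tFrance\n".getBytes(StandardCharsets.UTF_8));
        SideDataLookup lookup = new SideDataLookup(tmp);
        System.out.println(lookup.lookup("fr")); // prints "France"
        Files.delete(tmp);
    }
}
```

Loading in the constructor rather than in lookup() reflects the reason for using the cache in the first place: the shared file is read once per task instead of once per record.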
