Posted to common-user@hadoop.apache.org by akhil1988 <ak...@gmail.com> on 2009/06/25 19:32:21 UTC

Using addCacheArchive

Hi All!

I want a directory to be present in the local working directory of the task
for which I am using the following statements: 

DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
conf);
DistributedCache.createSymlink(conf);

>> Here Config is a directory which I have zipped and put at the given
>> location in HDFS

I have zipped the directory because the API doc of DistributedCache
(http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
archive files are unzipped in the local cache directory :

DistributedCache can be used to distribute simple, read-only data/text files
and/or more complex types such as archives, jars etc. Archives (zip, tar and
tgz/tar.gz files) are un-archived at the slave nodes.

So, from my understanding of the API docs, I expect that the Config.zip file
will be unzipped to a Config directory, and since I have symlinked them I can
access the directory in the following manner from my map function:

FileInputStream fin = new FileInputStream("Config/file1.config");

But I get the FileNotFoundException on the execution of this statement.
Please let me know where I am going wrong.

Thanks,
Akhil
-- 
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24207739.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Using addCacheArchive

Posted by Uri Shani <SH...@il.ibm.com>.
I followed this thread and am happy it finally worked for you. Could you
summarise, for our benefit, the final working alteration of the code from
your initial thread note?
Thanks!!



From:
akhil1988 <ak...@gmail.com>
To:
core-user@hadoop.apache.org
Date:
03/07/2009 06:59 AM
Subject:
Re: Using addCacheArchive




Thanks Tim!

It works now. You are right that I should ask the cache where the files are
and use that path. That was actually the mistake I was making: I assumed that
Config.zip would be unzipped to a Config directory, but Hadoop unzips it to a
directory named Config.zip; that is, it does not change its name.

Thanks,
Akhil



Ted Dunning wrote:
> 
> This code assumes that the files are in the working directory of the
> mapper.
> 
> You should ask the cache where they are instead and use the paths that
> it gives you.
> 
> See the code on this page:
> 
>       http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
> 
> On Fri, Jun 26, 2009 at 5:55 PM, akhil1988 <ak...@gmail.com> wrote:
> 
>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>
>> where,
>> Config is a directory which contains many files/directories, one of which
>> is file1.config
>>
>> It would be helpful to me if you can tell me what statements to use to
>> distribute a directory to the tasktrackers.
>> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html
>> says
>> that archives are unzipped on the tasktrackers but I want an example of
>> how
>> to use this in the case of a directory.
> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24317261.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Using addCacheArchive

Posted by akhil1988 <ak...@gmail.com>.
Thanks Tim!

It works now. You are right that I should ask the cache where the files are
and use that path. That was actually the mistake I was making: I assumed that
Config.zip would be unzipped to a Config directory, but Hadoop unzips it to a
directory named Config.zip; that is, it does not change its name.

Thanks,
Akhil
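
For reference, a minimal sketch of the pattern described above, assuming the
0.20-era org.apache.hadoop.filecache.DistributedCache API and the paths used
in this thread (an illustration of the thread's conclusion, not the poster's
exact final code):

import java.io.FileInputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

// Driver side: register the archive and ask for a symlink in the task's
// working directory.
Configuration conf = new Configuration();
DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);
DistributedCache.createSymlink(conf);

// Map side: the archive is un-archived into a directory that keeps the
// archive's own name ("Config.zip"); the layout underneath depends on how
// the zip was built.
FileInputStream fin = new FileInputStream("Config.zip/file1.config");

// Alternatively, a URI fragment such as ".../Config.zip#Config" (suggested
// later in the thread) names the symlink "Config" instead of "Config.zip".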



Ted Dunning wrote:
> 
> This code assumes that the files are in the working directory of the
> mapper.
> 
> You should ask the cache where they are instead and use the paths that
> it gives you.
> 
> See the code on this page:
> 
>       http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
> 
> On Fri, Jun 26, 2009 at 5:55 PM, akhil1988 <ak...@gmail.com> wrote:
> 
>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>
>> where,
>> Config is a directory which contains many files/directories, one of which
>> is
>> file1.config
>>
>> It would be helpful to me if you can tell me what statements to use to
>> distribute a directory to the tasktrackers.
>> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html
>> says
>> that archives are unzipped on the tasktrackers but I want an example of
>> how
>> to use this in the case of a directory.
> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
> 
> 

-- 
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24317261.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Using addCacheArchive

Posted by Ted Dunning <te...@gmail.com>.
This code assumes that the files are in the working directory of the mapper.

You should ask the cache where they are instead and use the paths that
it gives you.

See the code on this page:

      http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata

On Fri, Jun 26, 2009 at 5:55 PM, akhil1988 <ak...@gmail.com> wrote:

> FileInputStream fin = new FileInputStream("Config/file1.config");
>
> where,
> Config is a directory which contains many files/directories, one of which
> is
> file1.config
>
> It would be helpful to me if you can tell me what statements to use to
> distribute a directory to the tasktrackers.
> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says
> that archives are unzipped on the tasktrackers but I want an example of how
> to use this in the case of a directory.




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
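
A hedged sketch of what "ask the cache" can look like with the 0.20-era mapred
API (the field and configure() hook below belong to a mapper extending
MapReduceBase; the error handling is illustrative only):

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Field in the mapper: the localized locations of all cached archives.
private Path[] localArchives;

public void configure(JobConf job) {
  try {
    // Ask the framework where it actually un-archived the cached archives,
    // rather than assuming a relative path in the working directory.
    localArchives = DistributedCache.getLocalCacheArchives(job);
    // e.g. new Path(localArchives[0], "file1.config") points into the archive.
  } catch (IOException e) {
    throw new RuntimeException("Could not look up cached archives", e);
  }
}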

Re: Using addCacheArchive

Posted by Ted Dunning <te...@gmail.com>.
Change FileReader to InputStreamReader and pass din to it and you should be
set.

BufferedReader wordReader = new BufferedReader(new InputStreamReader(din));


On Wed, Jul 1, 2009 at 9:23 PM, akhil1988 <ak...@gmail.com> wrote:

> FileSystem fs = FileSystem.get(conf);
> FSDataInputStream din = fs.open("/home/akhil1988/sample.txt");
>
> The method (below)that you gave does not work:
> Path cachePath= new Path("hdfs:///home/akhil1988/sample.txt");
> BufferedReader wordReader = new BufferedReader(new
> FileReader(cachePath.toString()));
>
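
Putting the two pieces together, a minimal sketch of the corrected read path
(assuming the 0.20-era FileSystem API; note that FileSystem.open takes a Path,
so the String in the quoted fs.open call needs to be wrapped):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// FSDataInputStream is a plain InputStream, so it can be wrapped in an
// InputStreamReader instead of a FileReader (which only reads local files).
FSDataInputStream din = fs.open(new Path("/home/akhil1988/sample.txt"));
BufferedReader wordReader = new BufferedReader(new InputStreamReader(din));
String firstLine = wordReader.readLine();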

Re: Using addCacheArchive

Posted by Chris Curtin <cu...@gmail.com>.
The code below works fine in my 10-node EC2 cluster, with the 'shared' file
created dynamically by a previous map/reduce job in the same flow.

I also push the file programmatically rather than using the hadoop fs command,
so I don't know whether file paths have anything to do with it. Maybe try it
without the 'hdfs:' prefix? I don't have that anywhere.

Chris

On Thu, Jul 2, 2009 at 12:23 AM, akhil1988 <ak...@gmail.com> wrote:

>
> Hi Chris!
>
> Sorry for the late reply!
>
> To push the file into HDFS is clear to me and it can be done using "hadoop
> fs -put" command also (prior to executing the job), which I generally use.
>
> The method to access a file in HDFS from Mapper/Reducer is the following:
> FileSystem fs = FileSystem.get(conf);
> FSDataInputStream din = fs.open("/home/akhil1988/sample.txt");
>
> The method (below) that you gave does not work:
> Path cachePath= new Path("hdfs:///home/akhil1988/sample.txt");
> BufferedReader wordReader = new BufferedReader(new
> FileReader(cachePath.toString()));
>
> A file in HDFS cannot be accessed through these standard Java functions; it
> has to be accessed via the method I have mentioned above. The API methods of
> the FileSystem class are very limited, and they only let us read a data file
> (containing Java primitives), not arbitrary binary files.
>
> In my specific problem, I am using an API (specific to my research domain)
> which takes a path (String) as input and reads data from this path (which
> points to a binary file). So I just need a way in which I can access files
> (from tasktrackers) as we do via standard java functions. For this, we need
> the files to be present in the local filesystem of the tasktrackers. That
> is
> why I am using DistributedCache.
>
> I hope I am clear?? And if I am wrong anywhere, please let me know.
>
> Thanks,
> Akhil
>
>
>
>
>
> The API provides only this function to read a data file(containing java
> primitives), we cannot read any binary files.
>
>
>
>
> Well, what I wanted was to have a directory in the local filesystem of the
> tasktracker and not the HDFS because of the following reason:
>
>
>
>
> Chris Curtin-2 wrote:
> >
> > To push the file to HDFS (put it in the 'a_hdfsDirectory' directory)
> >
> > Configuration config = new Configuration();
> > FileSystem hdfs = FileSystem.get(config);
> > Path srcPath = new Path(a_directory + "/" + outputName);
> > Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
> > hdfs.copyFromLocalFile(srcPath, dstPath);
> >
> >
> > to read it from HDFS in your mapper or reducer:
> >
> > Configuration config = new Configuration();
> > FileSystem hdfs = FileSystem.get(config);
> > Path cachePath= new Path(a_hdfsDirectory + "/" + outputName);
> > BufferedReader wordReader = new BufferedReader(
> >         new FileReader(cachePath.toString()));
> >
> >
> >
> > On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 <ak...@gmail.com> wrote:
> >
> >>
> >> Thanks Chris for your reply!
> >>
> >> Well, I could not understand much of what has been discussed on that
> >> forum.
> >> I am unaware of Cascading.
> >>
> >> My problem is simple - I want a directory to be present in the local
> working
> >> directory of tasks so that I can access it from my map task in the
> >> following
> >> manner :
> >>
> >> FileInputStream fin = new FileInputStream("Config/file1.config");
> >>
> >> where,
> >> Config is a directory which contains many files/directories, one of
> which
> >> is
> >> file1.config
> >>
> >> It would be helpful to me if you can tell me what statements to use to
> >> distribute a directory to the tasktrackers.
> >> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html
> >> says
> >> that archives are unzipped on the tasktrackers but I want an example of
> >> how
> >> to use this in the case of a directory.
> >>
> >> Thanks,
> >> Akhil
> >>
> >>
> >>
> >> Chris Curtin-2 wrote:
> >> >
> >> > Hi,
> >> >
> >> > I've found it much easier to write the file to HDFS using the API, then
> >> pass
> >> > the 'path' to the file in HDFS as a property. You'll need to remember
> >> to
> >> > clean up the file after you're done with it.
> >> >
> >> > Example details are in this thread:
> >> >
> >>
> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
> >> >
> >> > Hope this helps,
> >> >
> >> > Chris
> >> >
> >> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <ak...@gmail.com>
> >> wrote:
> >> >
> >> >>
> >> >> Please ask any questions if I am not clear above about the problem I
> >> am
> >> >> facing.
> >> >>
> >> >> Thanks,
> >> >> Akhil
> >> >>
> >> >> akhil1988 wrote:
> >> >> >
> >> >> > Hi All!
> >> >> >
> >> >> > I want a directory to be present in the local working directory of
> >> the
> >> >> > task for which I am using the following statements:
> >> >> >
> >> >> > DistributedCache.addCacheArchive(new
> >> URI("/home/akhil1988/Config.zip"),
> >> >> > conf);
> >> >> > DistributedCache.createSymlink(conf);
> >> >> >
> >> >> >>> Here Config is a directory which I have zipped and put at the
> >> given
> >> >> >>> location in HDFS
> >> >> >
> >> >> > I have zipped the directory because the API doc of DistributedCache
> >> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says
> >> that
> >> >> the
> >> >> > archive files are unzipped in the local cache directory :
> >> >> >
> >> >> > DistributedCache can be used to distribute simple, read-only
> >> data/text
> >> >> > files and/or more complex types such as archives, jars etc.
> Archives
> >> >> (zip,
> >> >> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
> >> >> >
> >> >> > So, from my understanding of the API docs I expect that the
> >> Config.zip
> >> >> > file will be unzipped to Config directory and since I have
> SymLinked
> >> >> them
> >> >> > I can access the directory in the following manner from my map
> >> >> function:
> >> >> >
> >> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
> >> >> >
> >> >> > But I get the FileNotFoundException on the execution of this
> >> statement.
> >> >> > Please let me know where I am going wrong.
> >> >> >
> >> >> > Thanks,
> >> >> > Akhil
> >> >> >
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
> >> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >> >>
> >> >>
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Using-addCacheArchive-tp24207739p24300915.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>

Re: Using addCacheArchive

Posted by akhil1988 <ak...@gmail.com>.
Please ignore the last two lines (after Thanks)!

Akhil


akhil1988 wrote:
> 
> Hi Chris!
> 
> Sorry for the late reply!
> 
> To push the file into HDFS is clear to me and it can be done using "hadoop
> fs -put" command also (prior to executing the job), which I generally use.
> 
> The method to access a file in HDFS from Mapper/Reducer is the following:
> FileSystem fs = FileSystem.get(conf);
> FSDataInputStream din = fs.open("/home/akhil1988/sample.txt");
> 
> The method (below) that you gave does not work:
> Path cachePath= new Path("hdfs:///home/akhil1988/sample.txt");
> BufferedReader wordReader = new BufferedReader(new
> FileReader(cachePath.toString()));
>
> A file in HDFS cannot be accessed through these standard Java functions; it
> has to be accessed via the method I have mentioned above. The API methods of
> the FileSystem class are very limited, and they only let us read a data file
> (containing Java primitives), not arbitrary binary files.
>
> In my specific problem, I am using an API (specific to my research domain)
> which takes a path (String) as input and reads data from this path (which
> points to a binary file). So I just need a way in which I can access files
> (from tasktrackers) as we do via standard java functions. For this, we
> need the files to be present in the local filesystem of the tasktrackers.
> That is why I am using DistributedCache. 
> 
> I hope I am clear?? And if I am wrong anywhere, please let me know.
> 
> Thanks,
> Akhil
> 
> 
> 
> 
> 
> The API provides only this function to read a data file(containing java
> primitives), we cannot read any binary files. 
> 
> 
> 
> 
> Well, what I wanted was to have a directory in the local filesystem of the
> tasktracker and not the HDFS because of the following reason:
> 
> 
> 
> 
> Chris Curtin-2 wrote:
>> 
>> To push the file to HDFS (put it in the 'a_hdfsDirectory' directory)
>> 
>> Configuration config = new Configuration();
>> FileSystem hdfs = FileSystem.get(config);
>> Path srcPath = new Path(a_directory + "/" + outputName);
>> Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
>> hdfs.copyFromLocalFile(srcPath, dstPath);
>> 
>> 
>> to read it from HDFS in your mapper or reducer:
>> 
>> Configuration config = new Configuration();
>> FileSystem hdfs = FileSystem.get(config);
>> Path cachePath= new Path(a_hdfsDirectory + "/" + outputName);
>> BufferedReader wordReader = new BufferedReader(
>>         new FileReader(cachePath.toString()));
>> 
>> 
>> 
>> On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 <ak...@gmail.com> wrote:
>> 
>>>
>>> Thanks Chris for your reply!
>>>
>>> Well, I could not understand much of what has been discussed on that
>>> forum.
>>> I am unaware of Cascading.
>>>
>>> My problem is simple - I want a directory to be present in the local
>>> working
>>> directory of tasks so that I can access it from my map task in the
>>> following
>>> manner :
>>>
>>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>>
>>> where,
>>> Config is a directory which contains many files/directories, one of
>>> which
>>> is
>>> file1.config
>>>
>>> It would be helpful to me if you can tell me what statements to use to
>>> distribute a directory to the tasktrackers.
>>> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html
>>> says
>>> that archives are unzipped on the tasktrackers but I want an example of
>>> how
>>> to use this in the case of a directory.
>>>
>>> Thanks,
>>> Akhil
>>>
>>>
>>>
>>> Chris Curtin-2 wrote:
>>> >
>>> > Hi,
>>> >
>>> > I've found it much easier to write the file to HDFS using the API, then
>>> pass
>>> > the 'path' to the file in HDFS as a property. You'll need to remember
>>> to
>>> > clean up the file after you're done with it.
>>> >
>>> > Example details are in this thread:
>>> >
>>> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
>>> >
>>> > Hope this helps,
>>> >
>>> > Chris
>>> >
>>> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <ak...@gmail.com>
>>> wrote:
>>> >
>>> >>
>>> >> Please ask any questions if I am not clear above about the problem I
>>> am
>>> >> facing.
>>> >>
>>> >> Thanks,
>>> >> Akhil
>>> >>
>>> >> akhil1988 wrote:
>>> >> >
>>> >> > Hi All!
>>> >> >
>>> >> > I want a directory to be present in the local working directory of
>>> the
>>> >> > task for which I am using the following statements:
>>> >> >
>>> >> > DistributedCache.addCacheArchive(new
>>> URI("/home/akhil1988/Config.zip"),
>>> >> > conf);
>>> >> > DistributedCache.createSymlink(conf);
>>> >> >
>>> >> >>> Here Config is a directory which I have zipped and put at the
>>> given
>>> >> >>> location in HDFS
>>> >> >
>>> >> > I have zipped the directory because the API doc of DistributedCache
>>> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says
>>> that
>>> >> the
>>> >> > archive files are unzipped in the local cache directory :
>>> >> >
>>> >> > DistributedCache can be used to distribute simple, read-only
>>> data/text
>>> >> > files and/or more complex types such as archives, jars etc.
>>> Archives
>>> >> (zip,
>>> >> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>> >> >
>>> >> > So, from my understanding of the API docs I expect that the
>>> Config.zip
>>> >> > file will be unzipped to Config directory and since I have
>>> SymLinked
>>> >> them
>>> >> > I can access the directory in the following manner from my map
>>> >> function:
>>> >> >
>>> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
>>> >> >
>>> >> > But I get the FileNotFoundException on the execution of this
>>> statement.
>>> >> > Please let me know where I am going wrong.
>>> >> >
>>> >> > Thanks,
>>> >> > Akhil
>>> >> >
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
>>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>> >>
>>> >>
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>> 
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24300929.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Using addCacheArchive

Posted by akhil1988 <ak...@gmail.com>.
Hi Chris!

Sorry for the late reply!

Pushing the file into HDFS is clear to me; it can also be done with the
"hadoop fs -put" command (prior to executing the job), which I generally use.

The method to access a file in HDFS from a Mapper/Reducer is the following:
FileSystem fs = FileSystem.get(conf);
FSDataInputStream din = fs.open("/home/akhil1988/sample.txt");

The method (below) that you gave does not work:
Path cachePath= new Path("hdfs:///home/akhil1988/sample.txt");
BufferedReader wordReader = new BufferedReader(new
FileReader(cachePath.toString()));

A file in HDFS cannot be accessed through these standard Java functions; it
has to be accessed via the method I have mentioned above. The API methods of
the FileSystem class are very limited, and they only let us read a data file
(containing Java primitives), not arbitrary binary files.

In my specific problem, I am using an API (specific to my research domain)
which takes a path (String) as input and reads data from this path (which
points to a binary file). So I just need a way in which I can access files
(from the tasktrackers) as we do via standard Java functions. For this, we need
the files to be present in the local filesystem of the tasktrackers. That is
why I am using DistributedCache.

I hope I am clear? If I am wrong anywhere, please let me know.

Thanks,
Akhil





The API provides only this function to read a data file(containing java
primitives), we cannot read any binary files. 




Well, what I wanted was to have a directory in the local filesystem of the
tasktracker and not the HDFS because of the following reason:




Chris Curtin-2 wrote:
> 
> To push the file to HDFS (put it in the 'a_hdfsDirectory' directory)
> 
> Configuration config = new Configuration();
> FileSystem hdfs = FileSystem.get(config);
> Path srcPath = new Path(a_directory + "/" + outputName);
> Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
> hdfs.copyFromLocalFile(srcPath, dstPath);
> 
> 
> to read it from HDFS in your mapper or reducer:
> 
> Configuration config = new Configuration();
> FileSystem hdfs = FileSystem.get(config);
> Path cachePath= new Path(a_hdfsDirectory + "/" + outputName);
> BufferedReader wordReader = new BufferedReader(
>         new FileReader(cachePath.toString()));
> 
> 
> 
> On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 <ak...@gmail.com> wrote:
> 
>>
>> Thanks Chris for your reply!
>>
>> Well, I could not understand much of what has been discussed on that
>> forum.
>> I am unaware of Cascading.
>>
>> My problem is simple - I want a directory to be present in the local working
>> directory of tasks so that I can access it from my map task in the
>> following
>> manner :
>>
>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>
>> where,
>> Config is a directory which contains many files/directories, one of which
>> is
>> file1.config
>>
>> It would be helpful to me if you can tell me what statements to use to
>> distribute a directory to the tasktrackers.
>> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html
>> says
>> that archives are unzipped on the tasktrackers but I want an example of
>> how
>> to use this in the case of a directory.
>>
>> Thanks,
>> Akhil
>>
>>
>>
>> Chris Curtin-2 wrote:
>> >
>> > Hi,
>> >
>> > I've found it much easier to write the file to HDFS using the API, then
>> pass
>> > the 'path' to the file in HDFS as a property. You'll need to remember
>> to
>> > clean up the file after you're done with it.
>> >
>> > Example details are in this thread:
>> >
>> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
>> >
>> > Hope this helps,
>> >
>> > Chris
>> >
>> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <ak...@gmail.com>
>> wrote:
>> >
>> >>
>> >> Please ask any questions if I am not clear above about the problem I
>> am
>> >> facing.
>> >>
>> >> Thanks,
>> >> Akhil
>> >>
>> >> akhil1988 wrote:
>> >> >
>> >> > Hi All!
>> >> >
>> >> > I want a directory to be present in the local working directory of
>> the
>> >> > task for which I am using the following statements:
>> >> >
>> >> > DistributedCache.addCacheArchive(new
>> URI("/home/akhil1988/Config.zip"),
>> >> > conf);
>> >> > DistributedCache.createSymlink(conf);
>> >> >
>> >> >>> Here Config is a directory which I have zipped and put at the
>> given
>> >> >>> location in HDFS
>> >> >
>> >> > I have zipped the directory because the API doc of DistributedCache
>> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says
>> that
>> >> the
>> >> > archive files are unzipped in the local cache directory :
>> >> >
>> >> > DistributedCache can be used to distribute simple, read-only
>> data/text
>> >> > files and/or more complex types such as archives, jars etc. Archives
>> >> (zip,
>> >> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
>> >> >
>> >> > So, from my understanding of the API docs I expect that the
>> Config.zip
>> >> > file will be unzipped to Config directory and since I have SymLinked
>> >> them
>> >> > I can access the directory in the following manner from my map
>> >> function:
>> >> >
>> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
>> >> >
>> >> > But I get the FileNotFoundException on the execution of this
>> statement.
>> >> > Please let me know where I am going wrong.
>> >> >
>> >> > Thanks,
>> >> > Akhil
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24300915.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Using addCacheArchive

Posted by Chris Curtin <cu...@gmail.com>.
To push the file to HDFS (put it in the 'a_hdfsDirectory' directory)

Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
Path srcPath = new Path(a_directory + "/" + outputName);
Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
hdfs.copyFromLocalFile(srcPath, dstPath);


to read it from HDFS in your mapper or reducer:

Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
Path cachePath= new Path(a_hdfsDirectory + "/" + outputName);
BufferedReader wordReader = new BufferedReader(
        new FileReader(cachePath.toString()));



On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 <ak...@gmail.com> wrote:

>
> Thanks Chris for your reply!
>
> Well, I could not understand much of what has been discussed on that forum.
> I am unaware of Cascading.
>
> My problem is simple - I want a directory to be present in the local working
> directory of tasks so that I can access it from my map task in the
> following
> manner :
>
> FileInputStream fin = new FileInputStream("Config/file1.config");
>
> where,
> Config is a directory which contains many files/directories, one of which
> is
> file1.config
>
> It would be helpful to me if you can tell me what statements to use to
> distribute a directory to the tasktrackers.
> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says
> that archives are unzipped on the tasktrackers but I want an example of how
> to use this in the case of a directory.
>
> Thanks,
> Akhil
>
>
>
> Chris Curtin-2 wrote:
> >
> > Hi,
> >
> > I've found it much easier to write the file to HDFS using the API, then
> pass
> > the 'path' to the file in HDFS as a property. You'll need to remember to
> > clean up the file after you're done with it.
> >
> > Example details are in this thread:
> >
> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
> >
> > Hope this helps,
> >
> > Chris
> >
> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <ak...@gmail.com> wrote:
> >
> >>
> >> Please ask any questions if I am not clear above about the problem I am
> >> facing.
> >>
> >> Thanks,
> >> Akhil
> >>
> >> akhil1988 wrote:
> >> >
> >> > Hi All!
> >> >
> >> > I want a directory to be present in the local working directory of the
> >> > task for which I am using the following statements:
> >> >
> >> > DistributedCache.addCacheArchive(new
> URI("/home/akhil1988/Config.zip"),
> >> > conf);
> >> > DistributedCache.createSymlink(conf);
> >> >
> >> >>> Here Config is a directory which I have zipped and put at the given
> >> >>> location in HDFS
> >> >
> >> > I have zipped the directory because the API doc of DistributedCache
> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
> >> the
> >> > archive files are unzipped in the local cache directory :
> >> >
> >> > DistributedCache can be used to distribute simple, read-only data/text
> >> > files and/or more complex types such as archives, jars etc. Archives
> >> (zip,
> >> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
> >> >
> >> > So, from my understanding of the API docs I expect that the Config.zip
> >> > file will be unzipped to Config directory and since I have SymLinked
> >> them
> >> > I can access the directory in the following manner from my map
> >> function:
> >> >
> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
> >> >
> >> > But I get the FileNotFoundException on the execution of this
> statement.
> >> > Please let me know where I am going wrong.
> >> >
> >> > Thanks,
> >> > Akhil
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>

Re: Using addCacheArchive

Posted by akhil1988 <ak...@gmail.com>.
Thanks Chris for your reply!

Well, I could not understand much of what has been discussed on that forum.
I am unaware of Cascading.

My problem is simple - I want a directory to be present in the local working
directory of tasks so that I can access it from my map task in the following
manner:

FileInputStream fin = new FileInputStream("Config/file1.config"); 

where,
Config is a directory which contains many files/directories, one of which is
file1.config

It would be helpful to me if you can tell me what statements to use to
distribute a directory to the tasktrackers.
The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says
that archives are unzipped on the tasktrackers but I want an example of how
to use this in the case of a directory.

Thanks,
Akhil



Chris Curtin-2 wrote:
> 
> Hi,
> 
> I've found it much easier to write the file to HDFS using the API, then pass
> the 'path' to the file in HDFS as a property. You'll need to remember to
> clean up the file after you're done with it.
> 
> Example details are in this thread:
> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
> 
> Hope this helps,
> 
> Chris
> 
> On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <ak...@gmail.com> wrote:
> 
>>
>> Please ask any questions if I am not clear above about the problem I am
>> facing.
>>
>> Thanks,
>> Akhil
>>
>> akhil1988 wrote:
>> >
>> > Hi All!
>> >
>> > I want a directory to be present in the local working directory of the
>> > task for which I am using the following statements:
>> >
>> > DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>> > conf);
>> > DistributedCache.createSymlink(conf);
>> >
>> >>> Here Config is a directory which I have zipped and put at the given
>> >>> location in HDFS
>> >
>> > I have zipped the directory because the API doc of DistributedCache
>> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
>> the
>> > archive files are unzipped in the local cache directory :
>> >
>> > DistributedCache can be used to distribute simple, read-only data/text
>> > files and/or more complex types such as archives, jars etc. Archives
>> (zip,
>> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
>> >
>> > So, from my understanding of the API docs I expect that the Config.zip
>> > file will be unzipped to Config directory and since I have SymLinked
>> them
>> > I can access the directory in the following manner from my map
>> function:
>> >
>> > FileInputStream fin = new FileInputStream("Config/file1.config");
>> >
>> > But I get the FileNotFoundException on the execution of this statement.
>> > Please let me know where I am going wrong.
>> >
>> > Thanks,
>> > Akhil
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Using addCacheArchive

Posted by Chris Curtin <cu...@gmail.com>.
Hi,

I've found it much easier to write the file to HDFS using the API, then pass
the 'path' to the file in HDFS as a property. You'll need to remember to
clean up the file after you're done with it.

Example details are in this thread:
http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#

Hope this helps,

Chris
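
A hedged sketch of the "pass the path as a property" idea described above (the
property name my.hdfs.config.path and the MyDriver class are made up for
illustration; imports are org.apache.hadoop.mapred.JobConf plus the FileSystem
classes used in the earlier sketches):

// Driver side: record where the file lives in HDFS as a job property.
JobConf job = new JobConf(MyDriver.class);
job.set("my.hdfs.config.path", "/home/akhil1988/Config/file1.config");

// Mapper side: read the property back in configure() and open the file
// through the FileSystem API rather than java.io.
public void configure(JobConf job) {
  String hdfsPath = job.get("my.hdfs.config.path");
  try {
    FSDataInputStream in = FileSystem.get(job).open(new Path(hdfsPath));
    // ... read the configuration data from 'in', then close it ...
  } catch (IOException e) {
    throw new RuntimeException("Could not open " + hdfsPath, e);
  }
}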

On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <ak...@gmail.com> wrote:

>
> Please ask any questions if I am not clear above about the problem I am
> facing.
>
> Thanks,
> Akhil
>
> akhil1988 wrote:
> >
> > Hi All!
> >
> > I want a directory to be present in the local working directory of the
> > task for which I am using the following statements:
> >
> > DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
> > conf);
> > DistributedCache.createSymlink(conf);
> >
> >>> Here Config is a directory which I have zipped and put at the given
> >>> location in HDFS
> >
> > I have zipped the directory because the API doc of DistributedCache
> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
> the
> > archive files are unzipped in the local cache directory :
> >
> > DistributedCache can be used to distribute simple, read-only data/text
> > files and/or more complex types such as archives, jars etc. Archives
> (zip,
> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
> >
> > So, from my understanding of the API docs I expect that the Config.zip
> > file will be unzipped to Config directory and since I have SymLinked them
> > I can access the directory in the following manner from my map function:
> >
> > FileInputStream fin = new FileInputStream("Config/file1.config");
> >
> > But I get the FileNotFoundException on the execution of this statement.
> > Please let me know where I am going wrong.
> >
> > Thanks,
> > Akhil
> >
>
> --
> View this message in context:
> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>

Re: Using addCacheArchive

Posted by akhil1988 <ak...@gmail.com>.
Yes, my HDFS paths are of the form /home/user-name/
And I have used these in DistributedCache's addCacheFiles method
successfully. 

Thanks,
Akhil
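
For completeness, a hedged sketch of the kind of call being referred to (the
file name sample.txt is only an example taken from elsewhere in the thread; the
method in the 0.20 release is org.apache.hadoop.filecache.DistributedCache.addCacheFile):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

// Distribute a single file (rather than an archive); the #sample.txt fragment
// names the symlink created in the task's working directory.
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/home/akhil1988/sample.txt#sample.txt"), conf);
DistributedCache.createSymlink(conf);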



Amareshwari Sriramadasu wrote:
> 
> Is your hdfs path /home/akhil1988/Config.zip? Usually hdfs path is of the
> form /user/akhil1988/Config.zip.
> Just wondering if you are giving the wrong path in the URI!
> 
> Thanks
> Amareshwari
> 
> akhil1988 wrote:
>> Thanks Amareshwari for your reply!
>>
>> The file Config.zip is lying in the HDFS, if it would not have been then
>> the
>> error would be reported by the jobtracker itself while executing the
>> statement:
>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>> conf);
>>
>> But I get error in the map function when I try to access the Config
>> directory. 
>>
>> Now I am using the following statement but still getting the same error: 
>> DistributedCache.addCacheArchive(new
>> URI("/home/akhil1988/Config.zip#Config"), conf);
>>
>> Do you think there could be any problem with distributing a zipped
>> directory and having Hadoop unzip it recursively?
>>
>> Thanks!
>> Akhil
>>
>>
>>
>> Amareshwari Sriramadasu wrote:
>>   
>>> Hi Akhil,
>>>
>>> DistributedCache.addCacheArchive takes path on hdfs. From your code, it
>>> looks like you are passing local path.
>>> Also, if you want to create symlink, you should pass URI as
>>> hdfs://<path>#<linkname>, besides calling  
>>> DistributedCache.createSymlink(conf);
>>>
>>> Thanks
>>> Amareshwari
>>>
>>>
>>> akhil1988 wrote:
>>>     
>>>> Please ask any questions if I am not clear above about the problem I am
>>>> facing.
>>>>
>>>> Thanks,
>>>> Akhil
>>>>
>>>> akhil1988 wrote:
>>>>   
>>>>       
>>>>> Hi All!
>>>>>
>>>>> I want a directory to be present in the local working directory of the
>>>>> task for which I am using the following statements: 
>>>>>
>>>>> DistributedCache.addCacheArchive(new
>>>>> URI("/home/akhil1988/Config.zip"),
>>>>> conf);
>>>>> DistributedCache.createSymlink(conf);
>>>>>
>>>>>     
>>>>>         
>>>>>>> Here Config is a directory which I have zipped and put at the given
>>>>>>> location in HDFS
>>>>>>>         
>>>>>>>             
>>>>> I have zipped the directory because the API doc of DistributedCache
>>>>> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
>>>>> the
>>>>> archive files are unzipped in the local cache directory :
>>>>>
>>>>> DistributedCache can be used to distribute simple, read-only data/text
>>>>> files and/or more complex types such as archives, jars etc. Archives
>>>>> (zip,
>>>>> tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>>>>
>>>>> So, from my understanding of the API docs I expect that the Config.zip
>>>>> file will be unzipped to Config directory and since I have SymLinked
>>>>> them
>>>>> I can access the directory in the following manner from my map
>>>>> function:
>>>>>
>>>>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>>>>
>>>>> But I get the FileNotFoundException on the execution of this
>>>>> statement.
>>>>> Please let me know where I am going wrong.
>>>>>
>>>>> Thanks,
>>>>> Akhil
>>>>>
>>>>>     
>>>>>         
>>>>   
>>>>       
>>>
>>>     
>>
>>   
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24214730.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Using addCacheArchive

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Is your HDFS path /home/akhil1988/Config.zip? Usually an HDFS path is of the form /user/akhil1988/Config.zip.
Just wondering if you are giving the wrong path in the URI!

Thanks
Amareshwari

akhil1988 wrote:
> Thanks Amareshwari for your reply!
>
> The file Config.zip is lying in the HDFS, if it would not have been then the
> error would be reported by the jobtracker itself while executing the
> statement:
> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
> conf);
>
> But I get error in the map function when I try to access the Config
> directory. 
>
> Now I am using the following statement but still getting the same error: 
> DistributedCache.addCacheArchive(new
> URI("/home/akhil1988/Config.zip#Config"), conf);
>
> Do you think there could be any problem with distributing a zipped
> directory and having Hadoop unzip it recursively?
>
> Thanks!
> Akhil
>
>
>
> Amareshwari Sriramadasu wrote:
>   
>> Hi Akhil,
>>
>> DistributedCache.addCacheArchive takes path on hdfs. From your code, it
>> looks like you are passing local path.
>> Also, if you want to create symlink, you should pass URI as
>> hdfs://<path>#<linkname>, besides calling  
>> DistributedCache.createSymlink(conf);
>>
>> Thanks
>> Amareshwari
>>
>>
>> akhil1988 wrote:
>>     
>>> Please ask any questions if I am not clear above about the problem I am
>>> facing.
>>>
>>> Thanks,
>>> Akhil
>>>
>>> akhil1988 wrote:
>>>   
>>>       
>>>> Hi All!
>>>>
>>>> I want a directory to be present in the local working directory of the
>>>> task for which I am using the following statements: 
>>>>
>>>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>>>> conf);
>>>> DistributedCache.createSymlink(conf);
>>>>
>>>>     
>>>>         
>>>>>> Here Config is a directory which I have zipped and put at the given
>>>>>> location in HDFS
>>>>>>         
>>>>>>             
>>>> I have zipped the directory because the API doc of DistributedCache
>>>> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
>>>> the
>>>> archive files are unzipped in the local cache directory :
>>>>
>>>> DistributedCache can be used to distribute simple, read-only data/text
>>>> files and/or more complex types such as archives, jars etc. Archives
>>>> (zip,
>>>> tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>>>
>>>> So, from my understanding of the API docs I expect that the Config.zip
>>>> file will be unzipped to Config directory and since I have SymLinked
>>>> them
>>>> I can access the directory in the following manner from my map function:
>>>>
>>>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>>>
>>>> But I get the FileNotFoundException on the execution of this statement.
>>>> Please let me know where I am going wrong.
>>>>
>>>> Thanks,
>>>> Akhil
>>>>
>>>>     
>>>>         
>>>   
>>>       
>>
>>     
>
>   


Re: Using addCacheArchive

Posted by akhil1988 <ak...@gmail.com>.
Thanks Amareshwari for your reply!

The file Config.zip is in HDFS; if it were not, the error would have been
reported by the jobtracker itself while executing the statement:
DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
conf);

But I get the error in the map function when I try to access the Config
directory.

Now I am using the following statement but still getting the same error:
DistributedCache.addCacheArchive(new
URI("/home/akhil1988/Config.zip#Config"), conf);

Do you think there could be any problem with distributing a zipped
directory and having Hadoop unzip it recursively?

Thanks!
Akhil



Amareshwari Sriramadasu wrote:
> 
> Hi Akhil,
> 
> DistributedCache.addCacheArchive takes a path on HDFS. From your code, it
> looks like you are passing a local path.
> Also, if you want to create a symlink, you should pass the URI as
> hdfs://<path>#<linkname>, besides calling
> DistributedCache.createSymlink(conf);
> 
> Thanks
> Amareshwari
> 
> 
> akhil1988 wrote:
>> Please ask any questions if I am not clear above about the problem I am
>> facing.
>>
>> Thanks,
>> Akhil
>>
>> akhil1988 wrote:
>>   
>>> Hi All!
>>>
>>> I want a directory to be present in the local working directory of the
>>> task for which I am using the following statements: 
>>>
>>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>>> conf);
>>> DistributedCache.createSymlink(conf);
>>>
>>>     
>>>>> Here Config is a directory which I have zipped and put at the given
>>>>> location in HDFS
>>>>>         
>>> I have zipped the directory because the API doc of DistributedCache
>>> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
>>> the
>>> archive files are unzipped in the local cache directory :
>>>
>>> DistributedCache can be used to distribute simple, read-only data/text
>>> files and/or more complex types such as archives, jars etc. Archives
>>> (zip,
>>> tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>>
>>> So, from my understanding of the API docs I expect that the Config.zip
>>> file will be unzipped to Config directory and since I have SymLinked
>>> them
>>> I can access the directory in the following manner from my map function:
>>>
>>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>>
>>> But I get the FileNotFoundException on the execution of this statement.
>>> Please let me know where I am going wrong.
>>>
>>> Thanks,
>>> Akhil
>>>
>>>     
>>
>>   
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24214657.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Using addCacheArchive

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Hi Akhil,

DistributedCache.addCacheArchive takes a path on HDFS. From your code, it looks like you are passing a local path.
Also, if you want to create a symlink, you should pass the URI as hdfs://<path>#<linkname>, besides calling
DistributedCache.createSymlink(conf);

Thanks
Amareshwari
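
A minimal sketch of the URI form described above (the path is the one used in
this thread; whether /home or /user is the right prefix depends on the cluster
layout, a point raised elsewhere in the thread):

// Explicit hdfs scheme plus a fragment naming the symlink, here "Config".
DistributedCache.addCacheArchive(new URI("hdfs:///home/akhil1988/Config.zip#Config"), conf);
DistributedCache.createSymlink(conf);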


akhil1988 wrote:
> Please ask any questions if I am not clear above about the problem I am
> facing.
>
> Thanks,
> Akhil
>
> akhil1988 wrote:
>   
>> Hi All!
>>
>> I want a directory to be present in the local working directory of the
>> task for which I am using the following statements: 
>>
>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>> conf);
>> DistributedCache.createSymlink(conf);
>>
>>     
>>>> Here Config is a directory which I have zipped and put at the given
>>>> location in HDFS
>>>>         
>> I have zipped the directory because the API doc of DistributedCache
>> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
>> archive files are unzipped in the local cache directory :
>>
>> DistributedCache can be used to distribute simple, read-only data/text
>> files and/or more complex types such as archives, jars etc. Archives (zip,
>> tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>
>> So, from my understanding of the API docs I expect that the Config.zip
>> file will be unzipped to Config directory and since I have SymLinked them
>> I can access the directory in the following manner from my map function:
>>
>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>
>> But I get the FileNotFoundException on the execution of this statement.
>> Please let me know where I am going wrong.
>>
>> Thanks,
>> Akhil
>>
>>     
>
>   


Re: Using addCacheArchive

Posted by akhil1988 <ak...@gmail.com>.
Please ask any questions if I am not clear above about the problem I am
facing.

Thanks,
Akhil

akhil1988 wrote:
> 
> Hi All!
> 
> I want a directory to be present in the local working directory of the
> task for which I am using the following statements: 
> 
> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
> conf);
> DistributedCache.createSymlink(conf);
> 
>>> Here Config is a directory which I have zipped and put at the given
>>> location in HDFS
> 
> I have zipped the directory because the API doc of DistributedCache
> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
> archive files are unzipped in the local cache directory :
> 
> DistributedCache can be used to distribute simple, read-only data/text
> files and/or more complex types such as archives, jars etc. Archives (zip,
> tar and tgz/tar.gz files) are un-archived at the slave nodes.
> 
> So, from my understanding of the API docs I expect that the Config.zip
> file will be unzipped to Config directory and since I have SymLinked them
> I can access the directory in the following manner from my map function:
> 
> FileInputStream fin = new FileInputStream("Config/file1.config");
> 
> But I get the FileNotFoundException on the execution of this statement.
> Please let me know where I am going wrong.
> 
> Thanks,
> Akhil
> 

-- 
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.