Posted to common-user@hadoop.apache.org by abc xyz <fa...@yahoo.com> on 2010/07/08 21:04:05 UTC
reading distributed cache returns null pointer
Hello all,
As a new user of Hadoop, I am having trouble understanding a few things. I am
writing a program that loads a file into the distributed cache and reads that
file in each mapper. In my driver program, I have added the file to the
distributed cache using:
Path p=new
Path("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt");
DistributedCache.addCacheFile(p.toUri(), conf);
In the configure method of the mapper, I am reading the file from cache using:
Path[] cacheFiles=DistributedCache.getFileClassPaths(conf);
BufferedReader joinReader=new BufferedReader(new
FileReader(cacheFiles[0].toString()));
However, the cacheFiles variable is null.
There is something mentioned in the Yahoo tutorial for Hadoop about the
distributed cache which I do not understand:
"As a cautionary note: If you use the local JobRunner in Hadoop (i.e., what
happens if you call JobClient.runJob() in a program with no or an empty
hadoop-conf.xml accessible), then no local data directory is created; the
getLocalCacheFiles() call will return an empty set of results. Unit test code
should take this into account."
What does this mean? I am executing my program in pseudo-distributed mode on
Windows using Eclipse.
Any suggestion in this regard is highly valued.
Thanks in advance.
Re: reading distributed cache returns null pointer
Posted by abc xyz <fa...@yahoo.com>.
Thanks Rahul... that worked: DistributedCache.getCacheFiles() in distributed
mode and DistributedCache.getLocalCacheFiles() in pseudo-distributed mode.
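That resolution can be sketched roughly as follows, using the old org.apache.hadoop.mapred API that the rest of the thread is written against. This is a minimal illustration, not the original program: the class name OrdersJoinMapper and the fallback structure are made up for the example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Illustrative mapper base class; only the cache-handling part is shown.
public class OrdersJoinMapper extends MapReduceBase {
    private BufferedReader joinReader;

    @Override
    public void configure(JobConf conf) {
        try {
            // Pseudo-/fully-distributed mode: the cache files are localized
            // onto the task node's disk, so plain java.io can read them.
            Path[] local = DistributedCache.getLocalCacheFiles(conf);
            if (local != null && local.length > 0) {
                joinReader = new BufferedReader(new FileReader(local[0].toString()));
            } else {
                // Local JobRunner (or anywhere localization did not happen):
                // fall back to the original cache URIs and read them through
                // the Hadoop FileSystem API instead.
                URI[] uris = DistributedCache.getCacheFiles(conf);
                Path p = new Path(uris[0].getPath());
                FileSystem fs = FileSystem.get(conf);
                joinReader = new BufferedReader(new InputStreamReader(fs.open(p)));
            }
        } catch (IOException e) {
            throw new RuntimeException("could not open distributed cache file", e);
        }
    }
}
```

Reading via getLocalCacheFiles() is the normal case; the getCacheFiles() branch only matters when no local data directory was created, as the Yahoo tutorial note quoted earlier warns.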
Re: reading distributed cache returns null pointer
Posted by Rahul Jain <rj...@gmail.com>.
The DistributedCache behavior is not symmetrical in local mode vs
distributed mode.
As I replied earlier, you need to use
DistributedCache.getCacheFiles() in distributed mode.
In your code, you can add a check: if getLocalCacheFiles() returns null, use
getCacheFiles() instead. Or use the right API depending upon the mode you are
executing in.
-Rahul
Re: reading distributed cache returns null pointer
Posted by abc xyz <fa...@yahoo.com>.
Hi,
Thanks. Ok
Path[] ps=DistributedCache.getLocalCacheFiles(cnf);
retrieves the correct path for me in pseudo-distributed mode. But when I run my
program in fully-distributed mode with 5 nodes, I get a null pointer.
Theoretically, if it worked in pseudo-distributed mode, it should work in
fully-distributed mode as well. What could explain this behavior?
Cheers
Re: reading distributed cache returns null pointer
Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,
> Thanks for the information. I got your point. What I specifically want to ask is
> that if I use the following method to read my file now in each mapper:
>
> FileSystem hdfs=FileSystem.get(conf);
> URI[] uris=DistributedCache.getCacheFiles(conf);
> Path my_path=new Path(uris[0].getPath());
> String str;
>
> if(hdfs.exists(my_path))
> {
>     FSDataInputStream fs=hdfs.open(my_path);
>     while((str=fs.readLine())!=null)
>         System.out.println(str);
> }
> would this method retrieve the file from HDFS, since I am using the Hadoop
> API and not the local file API?
>
It would be instructive to look at the test code in
src/test/mapred/org/apache/hadoop/mapred/TestMRWithDistributedCache.java.
This gives a fair idea of how to access the files of DistributedCache
from within the mapper. Specifically see how the LocalFileSystem is
used to access the files. You could look at the same class in the
branch-20 source code if you are using an older version of Hadoop.
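The approach described above, accessing the localized files through the LocalFileSystem rather than the HDFS-bound FileSystem, can be sketched like this. It is a hedged illustration against the 0.20-era API; the class name CacheReadSketch and the helper method are made up, and in real task code this logic would live in the mapper's configure() method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheReadSketch {
    // Hypothetical helper: prints each line of the first localized cache file.
    static void printFirstCachedFile(Configuration conf) throws IOException {
        // getLocalCacheFiles() returns paths on the task node's local disk,
        // so open them with the local file system, not the FileSystem that
        // FileSystem.get(conf) binds to HDFS in a distributed setup.
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        FileSystem localFs = FileSystem.getLocal(conf);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(localFs.open(cached[0])));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            reader.close();
        }
    }
}
```

This is the distinction the question above is circling: the same relative path names an HDFS file when opened through the HDFS FileSystem and a localized copy when opened through the LocalFileSystem.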
Re: reading distributed cache returns null pointer
Posted by Denim Live <de...@yahoo.com>.
Hi Rahul,
Thanks for the information. I got your point. What I specifically want to ask is
this: if I use the following method to read my file in each mapper:
FileSystem hdfs=FileSystem.get(conf);
URI[] uris=DistributedCache.getCacheFiles(conf);
Path my_path=new Path(uris[0].getPath());
String str;
if(hdfs.exists(my_path))
{
    FSDataInputStream fs=hdfs.open(my_path);
    while((str=fs.readLine())!=null)
        System.out.println(str);
}
would this method retrieve the file from HDFS, since I am using the Hadoop API
and not the local file API?
I may be understanding something horribly wrong. The situation is that
my_path contains DCache/Orders.txt and if I am reading from there, this is the
path of the file on HDFS as well. How does it know to pick the file from the
local file system, not HDFS?
Thanks again
Re: reading distributed cache returns null pointer
Posted by Rahul Jain <rj...@gmail.com>.
Yes, distributed cache writes files to the local file system for each mapper
/ reducer. So you should be able to access the file(s) using local file
system APIs.
If the files were staying in HDFS there would be no point to using
distributed cache since all mappers already have access to the global HDFS
directories :).
-Rahul
On Thu, Jul 8, 2010 at 3:03 PM, abc xyz <fa...@yahoo.com> wrote:
> Hi Rahul,
> Thanks. It worked. I was using getFileClassPaths() to get the paths to the
> files in the cache and then using those paths to access the files. It should
> have worked, but I don't know why it doesn't produce the required result.
>
> I added the HDFS file DCache/Orders.txt to my distributed cache. After
> calling DistributedCache.getCacheFiles(conf); in the configure method of the
> mapper node, if I read the file now from the returned path (which happens to
> be DCache/Orders.txt) using the Hadoop API, would the file be read from the
> local directory of the mapper node? More specifically, I am doing this:
>
> FileSystem hdfs=FileSystem.get(conf);
> URI[] uris=DistributedCache.getCacheFiles(conf);
> Path my_path=new Path(uris[0].getPath());
> String str;
>
> if(hdfs.exists(my_path))
> {
>     FSDataInputStream fs=hdfs.open(my_path);
>     while((str=fs.readLine())!=null)
>         System.out.println(str);
> }
>
> Thanks
>
>
> ________________________________
> From: Rahul Jain <rj...@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: Thu, July 8, 2010 8:15:58 PM
> Subject: Re: reading distributed cache returns null pointer
>
> I am not sure why you are using the getFileClassPaths() API to access
> files... here is what works for us:
>
> Add the file(s) to the distributed cache using:
> DistributedCache.addCacheFile(p.toUri(), conf);
>
> Read the files in the mapper using:
>
> URI[] uris = DistributedCache.getCacheFiles(conf);
> // access one of the files:
> Path path = new Path(uris[0].getPath());
> // now follow Hadoop or local file APIs to access the file...
>
> Did you try the above, and did it not work?
>
> -Rahul
>
> On Thu, Jul 8, 2010 at 12:04 PM, abc xyz <fa...@yahoo.com> wrote:
>
> > Hello all,
> >
> > As a new user of hadoop, I am having some problems with understanding
> > some things. I am writing a program to load a file to the distributed
> > cache and read this file in each mapper. In my driver program, I have
> > added the file to my distributed cache using:
> >
> > Path p = new
> > Path("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt");
> > DistributedCache.addCacheFile(p.toUri(), conf);
> >
> > In the configure method of the mapper, I am reading the file from cache
> > using:
> > Path[] cacheFiles = DistributedCache.getFileClassPaths(conf);
> > BufferedReader joinReader = new BufferedReader(new
> > FileReader(cacheFiles[0].toString()));
> >
> > However, the cacheFiles variable is null.
> >
> > There is something mentioned on the Yahoo tutorial for hadoop about
> > distributed
> > cache which I do not understand:
> >
> > As a cautionary note: "If you use the local JobRunner in Hadoop (i.e.,
> > what happens if you call JobClient.runJob() in a program with no or an
> > empty hadoop-conf.xml accessible), then no local data directory is
> > created; the getLocalCacheFiles() call will return an empty set of
> > results. Unit test code should take this into account."
> >
> > What does this mean? I am executing my program in pseudo-distributed
> > mode on Windows using Eclipse.
> >
> > Any suggestion in this regard is highly valued.
> >
> > Thanks in advance.
> >
Re: reading distributed cache returns null pointer
Posted by abc xyz <fa...@yahoo.com>.
Hi Rahul,
Thanks. It worked. I was using getFileClassPaths() to get the paths to the files
in the cache and then using those paths to access the files. I would have
expected that to work, but I don't know why it doesn't produce the required
result.
I added the HDFS file DCache/Orders.txt to my distributed cache. After calling
DistributedCache.getCacheFiles(conf) in the configure method on the mapper
node, if I now read the file from the returned path (which happens to be
DCache/Orders.txt) using the Hadoop API, will the file be read from the local
directory of the mapper node? More specifically, I am doing this:
FileSystem hdfs = FileSystem.get(conf);
URI[] uris = DistributedCache.getCacheFiles(conf);
Path my_path = new Path(uris[0].getPath());
if (hdfs.exists(my_path)) {
    FSDataInputStream fs = hdfs.open(my_path);
    String str;
    while ((str = fs.readLine()) != null)
        System.out.println(str);
}
Thanks
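[Editor's note] To the question above: getCacheFiles() returns the URIs exactly as they were added in the driver, so uris[0].getPath() is still the HDFS path. Opening it through FileSystem.get(conf) therefore reads from the default file system (HDFS in pseudo-distributed mode), not from the mapper's local copy; the localized on-disk path is what getLocalCacheFiles() returns. A small stdlib demonstration of what getPath() yields for the URI used in this thread:

```java
import java.net.URI;

public class CacheUriDemo {
    public static void main(String[] args) throws Exception {
        // The URI as added in the driver with DistributedCache.addCacheFile(p.toUri(), conf)
        URI added = new URI("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt");

        // getPath() keeps only the path component, dropping the scheme and
        // authority, so a Path built from it resolves against whichever
        // FileSystem instance you open it with.
        System.out.println(added.getScheme()); // hdfs
        System.out.println(added.getPath());   // /user/denimLive/denim/DCache/Orders.txt
    }
}
```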
________________________________
From: Rahul Jain <rj...@gmail.com>
To: common-user@hadoop.apache.org
Sent: Thu, July 8, 2010 8:15:58 PM
Subject: Re: reading distributed cache returns null pointer
I am not sure why you are using the getFileClassPaths() API to access files...
here is what works for us:
Add the file(s) to the distributed cache using:
DistributedCache.addCacheFile(p.toUri(), conf);
Read the files in the mapper using:
URI[] uris = DistributedCache.getCacheFiles(conf);
// access one of the files:
Path path = new Path(uris[0].getPath());
// now follow Hadoop or local file APIs to access the file...
Did you try the above, and did it not work?
-Rahul
On Thu, Jul 8, 2010 at 12:04 PM, abc xyz <fa...@yahoo.com> wrote:
> Hello all,
>
> As a new user of hadoop, I am having some problems with understanding some
> things. I am writing a program to load a file to the distributed cache and
> read this file in each mapper. In my driver program, I have added the file
> to my distributed cache using:
>
> Path p = new
> Path("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt");
> DistributedCache.addCacheFile(p.toUri(), conf);
>
> In the configure method of the mapper, I am reading the file from cache
> using:
> Path[] cacheFiles = DistributedCache.getFileClassPaths(conf);
> BufferedReader joinReader = new BufferedReader(new
> FileReader(cacheFiles[0].toString()));
>
> However, the cacheFiles variable is null.
>
> There is something mentioned on the Yahoo tutorial for hadoop about
> distributed
> cache which I do not understand:
>
> As a cautionary note: "If you use the local JobRunner in Hadoop (i.e., what
> happens if you call JobClient.runJob() in a program with no or an empty
> hadoop-conf.xml accessible), then no local data directory is created; the
> getLocalCacheFiles() call will return an empty set of results. Unit test
> code should take this into account."
>
> What does this mean? I am executing my program in pseudo-distributed mode
> on Windows using Eclipse.
>
> Any suggestion in this regard is highly valued.
>
> Thanks in advance.
>
Re: reading distributed cache returns null pointer
Posted by Rahul Jain <rj...@gmail.com>.
I am not sure why you are using the getFileClassPaths() API to access files...
here is what works for us:
Add the file(s) to the distributed cache using:
DistributedCache.addCacheFile(p.toUri(), conf);
Read the files in the mapper using:
URI[] uris = DistributedCache.getCacheFiles(conf);
// access one of the files:
Path path = new Path(uris[0].getPath());
// now follow Hadoop or local file APIs to access the file...
Did you try the above, and did it not work?
-Rahul
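[Editor's note] For intuition about why these getters can come back null: addCacheFile() and getCacheFiles() essentially round-trip URI strings through a string property of the job Configuration (in old Hadoop versions this was the mapred.cache.files key; treat that name as an assumption about internals rather than public API). A toy stand-in, with a plain Map playing the role of the Configuration:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CacheConfRoundTrip {

    // Toy stand-in for the job Configuration's string properties.
    static final Map<String, String> conf = new HashMap<>();

    // Mimics DistributedCache.addCacheFile: append the URI to a
    // comma-separated property (the key name is an assumption about
    // old Hadoop internals, shown only for intuition).
    static void addCacheFile(URI uri) {
        conf.merge("mapred.cache.files", uri.toString(), (a, b) -> a + "," + b);
    }

    // Mimics DistributedCache.getCacheFiles: parse the property back,
    // or return null when nothing was ever added.
    static URI[] getCacheFiles() {
        String v = conf.get("mapred.cache.files");
        if (v == null) return null; // the null the original poster observed
        return Arrays.stream(v.split(",")).map(URI::create).toArray(URI[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(getCacheFiles())); // null
        addCacheFile(URI.create("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt"));
        System.out.println(getCacheFiles()[0].getPath()); // /user/denimLive/denim/DCache/Orders.txt
    }
}
```

getFileClassPaths(), by contrast, only reports cache files that were also added to the classpath (e.g. via addFileToClassPath), which would explain it returning null in the original post.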
On Thu, Jul 8, 2010 at 12:04 PM, abc xyz <fa...@yahoo.com> wrote:
> Hello all,
>
> As a new user of hadoop, I am having some problems with understanding some
> things. I am writing a program to load a file to the distributed cache and
> read this file in each mapper. In my driver program, I have added the file
> to my distributed cache using:
>
> Path p = new
> Path("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt");
> DistributedCache.addCacheFile(p.toUri(), conf);
>
> In the configure method of the mapper, I am reading the file from cache
> using:
> Path[] cacheFiles = DistributedCache.getFileClassPaths(conf);
> BufferedReader joinReader = new BufferedReader(new
> FileReader(cacheFiles[0].toString()));
>
> However, the cacheFiles variable is null.
>
> There is something mentioned on the Yahoo tutorial for hadoop about
> distributed
> cache which I do not understand:
>
> As a cautionary note: "If you use the local JobRunner in Hadoop (i.e., what
> happens if you call JobClient.runJob() in a program with no or an empty
> hadoop-conf.xml accessible), then no local data directory is created; the
> getLocalCacheFiles() call will return an empty set of results. Unit test
> code should take this into account."
>
> What does this mean? I am executing my program in pseudo-distributed mode
> on Windows using Eclipse.
>
> Any suggestion in this regard is highly valued.
>
> Thanks in advance.
>