Posted to common-user@hadoop.apache.org by Sean Shanny <ss...@tripadvisor.com> on 2008/12/25 06:10:21 UTC
Having trouble accessing MapFiles in the DistributedCache
To all,
Version: hadoop-0.17.2.1-core.jar
I created a MapFile on a local node.
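(For reference: a MapFile is really a directory holding a sorted 'data' SequenceFile plus a smaller 'index' file, which is why two files show up below. A minimal sketch of writing one against the 0.17 API; the class name and key/value types here are illustrative, not the thread's actual code:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class WriteUrlMapFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        // Creates the directory /tmp/ur containing 'data' and 'index'.
        // Keys must be appended in sorted order.
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, "/tmp/ur", Text.class, Text.class);
        writer.append(new Text("http://example.com/a"), new Text("1"));
        writer.append(new Text("http://example.com/b"), new Text("2"));
        writer.close();
    }
}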
I put the files into the HDFS using the following commands:
$ bin/hadoop fs -copyFromLocal /tmp/ur/data /2008-12-19/url/data
$ bin/hadoop fs -copyFromLocal /tmp/ur/index /2008-12-19/url/index
and placed them in the DistributedCache using the following calls in
the JobConf class:
DistributedCache.addCacheFile(new URI("/2008-12-19/url/data"), conf);
DistributedCache.addCacheFile(new URI("/2008-12-19/url/index"), conf);
What I cannot figure out is how to actually access the MapFile from within my Map code. I tried the following, but I get file-not-found errors when I run the job.
private FileSystem fs;
private MapFile.Reader myReader;
private Path[] localFiles;

....

public void configure(JobConf conf)
{
    String[] s = conf.getStrings("map.input.file");
    m_sFileName = s[0];

    try
    {
        // Paths to the localized copies of the cache files on this node.
        localFiles = DistributedCache.getLocalCacheFiles(conf);

        for (Path localFile : localFiles)
        {
            String sFileName = localFile.getName();

            if (sFileName.equalsIgnoreCase("data"))
            {
                System.out.println("Full Path: " + localFile.toString());
                System.out.println("Parent: " + localFile.getParent().toString());

                // MapFile.Reader expects the directory that holds both
                // the 'data' and 'index' files.
                fs = FileSystem.get(localFile.toUri(), conf);
                myReader = new MapFile.Reader(fs, localFile.getParent().toString(), conf);
            }
        }
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
}
The following exception is thrown, and I cannot figure out why an extra 'data' element is being added at the end of the path. The data is actually at:
Task Logs: 'task_200812250002_0001_m_000000_0'

stdout logs

Full Path: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
Parent: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data

stderr logs

java.io.FileNotFoundException: File does not exist: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
    at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:628)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1426)
    at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:301)
    at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:283)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:272)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:259)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:252)
    at com.TripResearch.warehouse.etl.EtlTestUrlMapLookup.configure(EtlTestUrlMapLookup.java:84)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
The files do exist, but I don't understand why they were placed in their own directories. I would have expected both files to exist at /2008-12-19/url/, not /2008-12-19/url/data/ and /2008-12-19/url/index/.
ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data
total 740640
drwxr-xr-x 2 root root      4096 Dec 24 23:49 .
drwxr-xr-x 4 root root      4096 Dec 24 23:49 ..
-rwxr-xr-x 1 root root 751776245 Dec 24 23:49 data
-rw-r--r-- 1 root root   5873260 Dec 24 23:49 .data.crc

[root@hdp01n warehouse]# ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/index
total 2148
drwxr-xr-x 2 root root    4096 Dec 25 00:04 .
drwxr-xr-x 4 root root    4096 Dec 25 00:04 ..
-rwxr-xr-x 1 root root 2165220 Dec 25 00:04 index
-rw-r--r-- 1 root root   16924 Dec 25 00:04 .index.crc
....
I know I must be doing something really stupid here, as I am sure this has been done by lots of folks prior to my feeble attempt. I did a Google search but really could not come up with any examples of using a MapFile with the DistributedCache.
Thanks.
--sean
Re: Having trouble accessing MapFiles in the DistributedCache
Posted by Sean Shanny <ss...@tripadvisor.com>.
Thanks for your suggestion, but unfortunately it did not fix the issue.
Thanks.
--sean
Sean Shanny
sshanny@tripadvisor.com
On Dec 25, 2008, at 8:19 AM, Devaraj Das wrote:
> IIRC, enabling symlink creation for your files should solve the problem.
> Call DistributedCache.createSymlink(conf); before submitting your job.
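
Another route that keeps the two halves together after localization is to ship the whole MapFile directory as a zipped cache archive; a sketch against the 0.17 API (the zip name, paths, and helper class are illustrative, not code from this thread):

# build and upload the archive (contents: data and index at the top level)
$ cd /tmp/ur && zip url.zip data index
$ bin/hadoop fs -copyFromLocal /tmp/ur/url.zip /2008-12-19/url.zip

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.mapred.JobConf;

public class UrlArchiveCache {

    // Job setup: register the zip; it is unpacked into one local directory.
    public static void register(JobConf conf) throws Exception {
        DistributedCache.addCacheArchive(new URI("/2008-12-19/url.zip"), conf);
    }

    // Mapper side: the unpacked directory holds data and index side by side,
    // so it can be handed straight to MapFile.Reader.
    public static MapFile.Reader open(JobConf conf) throws IOException {
        Path[] archives = DistributedCache.getLocalCacheArchives(conf);
        FileSystem localFs = FileSystem.getLocal(conf);
        return new MapFile.Reader(localFs, archives[0].toString(), conf);
    }
}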
Re: Having trouble accessing MapFiles in the DistributedCache
Posted by Devaraj Das <dd...@yahoo-inc.com>.
IIRC, enabling symlink creation for your files should solve the problem.
Call DistributedCache.createSymlink(conf); before submitting your job.
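
For reference, a minimal sketch of that approach against the 0.17 API (the fragment names and the helper class below are illustrative glue, not code from this thread). The #fragment on each cache URI names a symlink created in the task's working directory; with both halves linked as 'data' and 'index', the working directory itself can be opened as the MapFile:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.mapred.JobConf;

public class UrlLookupCache {

    // Job setup: enable symlink creation and name the links via URI fragments.
    public static void register(JobConf conf) throws Exception {
        DistributedCache.createSymlink(conf);
        DistributedCache.addCacheFile(new URI("/2008-12-19/url/data#data"), conf);
        DistributedCache.addCacheFile(new URI("/2008-12-19/url/index#index"), conf);
    }

    // Mapper side (e.g. in configure()): with 'data' and 'index' symlinked
    // into the working directory, "." is a valid MapFile directory.
    public static MapFile.Reader open(JobConf conf) throws IOException {
        FileSystem localFs = FileSystem.getLocal(conf);
        return new MapFile.Reader(localFs, ".", conf);
    }
}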