Posted to user@nutch.apache.org by rishi pathak <ma...@gmail.com> on 2011/01/17 09:21:26 UTC

Nutch on a shared filesystem

Hi,
          Our setup has 2 data nodes with 16 cores each. We are trying to
set up Nutch to use a shared local filesystem instead of HDFS. With a single
tasktracker it works fine, but with more than one tasktracker it errors out.
The error is related to the temp data dir for map/reduce tasks.


# mapred conf:

<configuration>

 <property>
    <name>mapred.job.tracker</name>
    <value>yc1.cn:9001</value>
 </property>

 <property>
    <name>mapred.system.dir</name>

<value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredSystemDir/</value>
 </property>

 <property>
    <name>mapred.local.dir</name>

<value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredLocalDir/</value>
    <!--<value>/tmp/</value> -->
 </property>

 <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>16</value>
 </property>

 <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>16</value>
 </property>

 <property>
    <name>mapreduce.cluster.local.dir</name>

<value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredClusterLocalDir/</value>
 </property>

</configuration>
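
For reference, since we are running against the local (shared) filesystem
rather than HDFS, the core-site.xml side of the setup is essentially just
the default filesystem setting shown below. This is only a sketch;
hadoop.tmp.dir is not overridden, so it stays at its default of
/tmp/hadoop-${user.name}:

<configuration>

 <property>
    <name>fs.default.name</name>
    <!-- local filesystem instead of HDFS -->
    <value>file:///</value>
 </property>

</configuration>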



# Error ########

java.io.IOException: The temporary job-output directory
file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary
doesn't exist!
        at
org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:204)
        at
org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:234)
        at
org.apache.hadoop.mapred.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:48)
        at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:433)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Injector: Merging injected urls into crawl db.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist:
file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199
        at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)



-- 
---
Rishi Pathak
National PARAM Supercomputing Facility
C-DAC, Pune, India

Re: Nutch on a shared filesystem

Posted by rishi pathak <ma...@gmail.com>.
Hello Alex,
                 We have tried the setup with HDFS and it worked fine. The
shared filesystem discussed here is a Lustre parallel filesystem and is
mounted on all the compute nodes (tasktrackers).
The problem, as it seems to me, is not different nodes messing up each
other's temp areas, but temp data written by one tasktracker on one node
being accessed from another. The dir
/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary does exist
on the second node.
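
If that is right, the path presumably comes from mapred.temp.dir, which (if
I read the defaults correctly) falls back to ${hadoop.tmp.dir}/mapred/temp,
and hadoop.tmp.dir itself defaults to /tmp/hadoop-${user.name}, hence
/tmp/hadoop-nazgul on each node's local /tmp. An untested sketch of the kind
of override that should move that temp data onto the shared filesystem (the
value below is only a placeholder on the Lustre mount):

 <property>
    <name>mapred.temp.dir</name>
    <!-- placeholder: a shared Lustre path visible to all tasktrackers -->
    <value>/home/internal/sysadmin/nazgul/hadoop/mapred/temp</value>
 </property>

(Pointing hadoop.tmp.dir itself at the shared mount should, I think, have
the same effect.)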


On Mon, Jan 17, 2011 at 1:59 PM, Alex McLintock <al...@gmail.com> wrote:

> I'm not sure if you can do this (I would recommend HDFS instead of a shared
> area) but can you insert the hostname of the node into the temp dir? That
> might stop separate nodes from messing up each other's temp areas.
>
> (However I am guessing here)
>
>
> On 17 January 2011 08:21, rishi pathak <ma...@gmail.com> wrote:
>
> > # Error ########
> >
> > java.io.IOException: The temporary job-output directory
> > file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary
> > doesn't exist!
> >
> >
>



-- 
---
Rishi Pathak
National PARAM Supercomputing Facility
C-DAC, Pune, India

Re: Nutch on a shared filesystem

Posted by Alex McLintock <al...@gmail.com>.
I'm not sure if you can do this (I would recommend HDFS instead of a shared
area) but can you insert the hostname of the node into the temp dir? That
might stop separate nodes from messing up each other's temp areas.

(However I am guessing here)
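
By "insert the hostname" I mean something along these lines, i.e. overriding
hadoop.tmp.dir in each node's own config so that every tasktracker keeps its
temp data in its own directory (completely untested, and the value is only
an illustration; substitute whatever each node is actually called):

 <property>
    <name>hadoop.tmp.dir</name>
    <!-- illustrative only: e.g. ...-yc1 on node yc1, ...-yc2 on the other -->
    <value>/tmp/hadoop-${user.name}-yc1</value>
 </property>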


On 17 January 2011 08:21, rishi pathak <ma...@gmail.com> wrote:

> # Error ########
>
> java.io.IOException: The temporary job-output directory
> file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary
> doesn't exist!
>
>