Posted to user@nutch.apache.org by Peter Thygesen <th...@infopaq.dk> on 2008/01/04 18:30:12 UTC

crawling and writing to hdfs

If I want to use nutch and hdfs, does each nutch crawler then have to be
a datanode, or can the crawler just write to hdfs without being a
member?

 

I have no problems connecting to the hdfs from a plain nutch crawler
server I have set up, but when I start crawling I get an exception.

 

Injector: starting

Injector: crawlDb: crawl-20080104174550/crawldb

Injector: urlDir: crawled

Injector: Converting injected urls to crawl db entries.

Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
java.io.IOException:
/mnt/data/hadoop-datastore/hadoop-hadoop/mapred/system/job_200712171708_0002/job.xml: No such file or directory

Kind regards,

Peter Thygesen 
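
A minimal sketch of what "just write to hdfs without being a member" looks
like on the client side, assuming the Hadoop FileSystem API of that era; the
namenode address and seed-file path are placeholder values, not taken from
this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientWrite {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's namenode; host and port here are
        // placeholders, not values from this thread.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");

        // FileSystem.get() only needs network access to the namenode and
        // datanodes; the machine running this code does not itself have to
        // be a datanode.
        FileSystem fs = FileSystem.get(conf);

        // Write a seed URL list into HDFS (hypothetical path).
        Path seeds = new Path("/user/nutch/urls/seeds.txt");
        FSDataOutputStream out = fs.create(seeds);
        out.writeBytes("http://example.com/\n");
        out.close();

        fs.close();
    }
}

The machine running this only needs the Hadoop jars on its classpath and
network access to the namenode and datanodes; it does not have to appear in
the cluster's slaves file.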

 


RE: crawling and writing to hdfs

Posted by Peter Thygesen <th...@infopaq.dk>.
Glad to hear that. But as I wrote, I can't get it to work unless the
crawler is a datanode. :(

\Peter

-----Original Message-----
From: Dennis Kubes [mailto:kubes@apache.org] 
Sent: 6. januar 2008 02:21
To: nutch-user@lucene.apache.org
Subject: Re: crawling and writing to hdfs

The crawlers do NOT have to be datanodes.  It is possible to have the
MapReduce tasktrackers and the datanodes separate although there is
reduced network overhead and optimized scheduling of tasks when they are
on the same machines.

Dennis Kubes

Peter Thygesen wrote:
> If I want to use nutch and hdfs, does each nutch crawler then have to be
> a datanode, or can the crawler just write to hdfs without being a
> member?
> 
>  
> 
> I have no problems connecting to the hdfs from a plain nutch crawler
> server I have set up, but when I start crawling I get an exception.
> 
>  
> 
> Injector: starting
> 
> Injector: crawlDb: crawl-20080104174550/crawldb
> 
> Injector: urlDir: crawled
> 
> Injector: Converting injected urls to crawl db entries.
> 
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
> java.io.IOException:
> /mnt/data/hadoop-datastore/hadoop-hadoop/mapred/system/job_200712171708_0002/job.xml: No such file or directory
> 
> Kind regards,
> 
> Peter Thygesen 
> 
>  
> 
> 


Re: crawling and writing to hdfs

Posted by Dennis Kubes <ku...@apache.org>.
The crawlers do NOT have to be datanodes.  It is possible to have the 
MapReduce tasktrackers and the datanodes separate although there is 
reduced network overhead and optimized scheduling of tasks when they are 
on the same machines.

Dennis Kubes

Peter Thygesen wrote:
> If I want to use nutch and hdfs, does each nutch crawler then have to be
> a datanode, or can the crawler just write to hdfs without being a
> member?
> 
>  
> 
> I have no problems connecting to the hdfs from a plain nutch crawler
> server I have set up, but when I start crawling I get an exception.
> 
>  
> 
> Injector: starting
> 
> Injector: crawlDb: crawl-20080104174550/crawldb
> 
> Injector: urlDir: crawled
> 
> Injector: Converting injected urls to crawl db entries.
> 
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
> java.io.IOException:
> /mnt/data/hadoop-datastore/hadoop-hadoop/mapred/system/job_200712171708_0002/job.xml: No such file or directory
> 
> Kind regards,
> 
> Peter Thygesen 
> 
>  
> 
>
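
To illustrate Dennis's point, a sketch of a MapReduce job (standing in for a
Nutch step such as the Injector) being submitted from a machine that runs
neither a datanode nor a tasktracker, using the old org.apache.hadoop.mapred
API. The host names, ports, and paths are placeholders; in practice the two
settings shown would normally live in hadoop-site.xml on the client's
classpath rather than be set in code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RemoteJobSubmit {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(RemoteJobSubmit.class);

        // The submitting machine only needs to know how to reach the cluster;
        // it runs neither a datanode nor a tasktracker. Hosts and ports are
        // placeholders.
        job.set("fs.default.name", "hdfs://namenode.example.com:9000");
        job.set("mapred.job.tracker", "jobtracker.example.com:9001");

        // Trivial identity job over hypothetical HDFS paths.
        job.setJobName("remote-submit-sketch");
        FileInputFormat.setInputPaths(job, new Path("/user/nutch/urls"));
        FileOutputFormat.setOutputPath(job, new Path("/user/nutch/out"));

        // At submit time the client copies job.xml and the job jar into the
        // mapred.system.dir on the shared filesystem for the JobTracker to
        // read, so client and cluster must agree on which filesystem that is.
        JobClient.runJob(job);
    }
}

In Hadoop of that vintage the JobTracker read job.xml back out of
mapred.system.dir, so the client's and the cluster's view of that directory
(and of fs.default.name) had to match for a submitted job to be found.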