Posted to user@nutch.apache.org by Peter Thygesen <th...@infopaq.dk> on 2008/01/04 18:30:12 UTC
crawling and writing to hdfs
If I want to use Nutch and HDFS, does each Nutch crawler then have to be
a datanode, or can the crawler just write to HDFS without being a
member?
I have no problems connecting to the hdfs from a plain nutch crawler
server I have set up, but when I start crawling I get an exception.
Injector: starting
Injector: crawlDb: crawl-20080104174550/crawldb
Injector: urlDir: crawled
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException:
/mnt/data/hadoop-datastore/hadoop-hadoop/mapred/system/job_200712171708_0002/job.xml: No such file or directory
Kind regards,
Peter Thygesen
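A client that is not a datanode can still submit crawl jobs, as long as its Hadoop configuration points at the cluster rather than at the local machine. Below is a minimal sketch of a client-side hadoop-site.xml from that era of Hadoop; the hostnames and ports are placeholders and must match the cluster's actual namenode and jobtracker settings:

```xml
<?xml version="1.0"?>
<!-- hadoop-site.xml on the crawler (client) machine.
     namenode.example.com / jobtracker.example.com are placeholders. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
</configuration>
```

A RemoteException complaining that job.xml is missing under mapred/system, like the one above, was commonly reported when the client and the jobtracker disagreed on where job files are staged (for example, a mismatched mapred.system.dir), so checking that both sides share the same configuration values is a reasonable first step.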
RE: crawling and writing to hdfs
Posted by Peter Thygesen <th...@infopaq.dk>.
Glad to hear that. But as I wrote, I can't get it to work unless the
crawler is a datanode. :(
\Peter
-----Original Message-----
From: Dennis Kubes [mailto:kubes@apache.org]
Sent: 6 January 2008 02:21
To: nutch-user@lucene.apache.org
Subject: Re: crawling and writing to hdfs
The crawlers do NOT have to be datanodes. It is possible to have the
MapReduce tasktrackers and the datanodes separate, although there is
reduced network overhead and optimized scheduling of tasks when they are
on the same machines.
Dennis Kubes
Peter Thygesen wrote:
> [...]
Re: crawling and writing to hdfs
Posted by Dennis Kubes <ku...@apache.org>.
The crawlers do NOT have to be datanodes. It is possible to have the
MapReduce tasktrackers and the datanodes separate, although there is
reduced network overhead and optimized scheduling of tasks when they are
on the same machines.
Dennis Kubes
Peter Thygesen wrote:
> [...]
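As a sketch of the point above, that a plain HDFS client need not be a datanode, the Hadoop FileSystem API can write to a remote HDFS from any machine that can reach the namenode. This is an illustrative example, not the fix for the job.xml error; the hostname, port, and path are placeholders, and it requires the Hadoop jars and a running cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a file into HDFS from a machine that is not a
// datanode. The namenode host/port below are placeholders and must
// match the cluster's fs.default.name.
public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000/");
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/nutch/hello.txt"));
        out.writeBytes("written from a non-datanode client\n");
        out.close();
        fs.close();
    }
}
```

Note that this only exercises HDFS; running the Injector additionally involves the MapReduce side (the jobtracker), which is where the exception above originates.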