Posted to dev@nutch.apache.org by Jérôme Charron <je...@gmail.com> on 2006/07/06 17:54:41 UTC

Error with Hadoop-0.4.0

Hi,

I encountered some problems with the Nutch trunk version.
In fact, it seems to be related to the changes made for Hadoop-0.4.0 and JDK
1.5 (more precisely, since HADOOP-129 and the replacement of File by Path).

In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273))
- Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory
/localpathcrawl/crawldb/current in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code and simply changing line 145 of Injector to
mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)),
everything works fine. After taking a closer look at the CrawlDb code, I
still don't understand why the following line is in the createJob method:
job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
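
For context, here is a minimal sketch of the difference between the two
JobConf calls; this is not the actual Injector source, and the temp path is a
hypothetical example:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class InputPathSketch {
  public static void main(String[] args) {
    JobConf mergeJob = new JobConf();

    // What CrawlDb.createJob() already did: register the crawldb as an input.
    mergeJob.addInputPath(new Path("/localpath/crawl/crawldb/current"));

    // addInputPath(tempDir) appends the temp dir, so the (possibly missing)
    // crawldb directory is still one of the job's inputs.
    mergeJob.addInputPath(new Path("/tmp/inject-temp"));   // hypothetical temp path

    // setInputPath(tempDir) would instead replace the whole input list with
    // the temp dir only, so the crawldb directory is no longer an input.
    // mergeJob.setInputPath(new Path("/tmp/inject-temp"));
  }
}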

Out of curiosity, could a Hadoop guru explain why there is such a
regression?

Does somebody have the same error?

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: Error with Hadoop-0.4.0

Posted by Doug Cutting <cu...@apache.org>.
Sami Siren wrote:
> Patch works for me.

OK.  I just committed it.

Thanks!

Doug

Re: Error with Hadoop-0.4.0

Posted by Sami Siren <ss...@gmail.com>.
Doug Cutting wrote:

>  Jérôme Charron wrote:
>
> > In my environment, the crawl command terminate with the following
> > error: 2006-07-06 17:41:49,735 ERROR mapred.JobClient
> > (JobClient.java:submitJob(273)) - Input directory
> > /localpath/crawl/crawldb/current in local is invalid. Exception in
> > thread "main" java.io.IOException: Input directory
> > /localpathcrawl/crawldb/current in local is invalid. at
> > org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274) at
> > org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327) at
> > org.apache.nutch.crawl.Injector.inject(Injector.java:146) at
> > org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
>
>  Hadoop 0.4.0 by default requires all input directories to exist,
>  where previous releases did not. So we need to either create an
>  empty "current" directory or change the InputFormat used in
>  CrawlDb.createJob() to be one that overrides
>  InputFormat.areValidInputDirectories(). The former is probably
>  easier. I've attached a patch. Does this fix things for folks?
>

Patch works for me.
--
 Sami Siren


Re: Error with Hadoop-0.4.0

Posted by Doug Cutting <cu...@apache.org>.
Jérôme Charron wrote:
> In my environment, the crawl command terminate with the following error:
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient 
> (JobClient.java:submitJob(273))
> - Input directory /localpath/crawl/crawldb/current in local is invalid.
> Exception in thread "main" java.io.IOException: Input directory
> /localpathcrawl/crawldb/current in local is invalid.
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>        at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

Hadoop 0.4.0 by default requires all input directories to exist, where 
previous releases did not.  So we need to either create an empty 
"current" directory or change the InputFormat used in 
CrawlDb.createJob() to be one that overrides 
InputFormat.areValidInputDirectories().  The former is probably easier. 
  I've attached a patch.  Does this fix things for folks?
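
A minimal sketch of the first option, not necessarily the attached patch
(the helper class and method name are hypothetical):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.crawl.CrawlDatum;

public class EnsureCurrentDir {
  // Make sure <crawlDb>/current exists before the job is submitted, so that
  // Hadoop 0.4.0's input-directory validation passes on a fresh crawldb.
  public static void ensureCurrent(JobConf job, Path crawlDb) throws IOException {
    Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
    FileSystem fs = FileSystem.get(job);
    if (!fs.exists(current)) {
      fs.mkdirs(current);   // an empty directory is a valid (empty) input
    }
  }
}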

Doug

Re: Error with Hadoop-0.4.0

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
>
> What I suggest, is simply to remove the line 75 in createJob method from
> CrawlDb :
> setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
> In fact, this method is only used by Injector.inject() and 
> CrawlDb.update()
> and
> the inputPath setted in createJob is not needed neither by 
> Injector.inject()
> nor
> CrawlDb.update() methods.

Hold your horses - it IS needed, otherwise you will lose the original 
information from CrawlDB.
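
Roughly why: the update job takes both the existing crawldb and the new
per-segment data as inputs and merges them per URL in the reduce phase. A
simplified sketch, not the actual CrawlDb source (the segment subdirectory
names and the helper are assumptions here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class UpdateInputsSketch {
  static JobConf updateInputs(Configuration conf, Path crawlDb, Path segment) {
    JobConf job = new JobConf(conf);
    // What CrawlDb.createJob() does (among other setup): register the old db.
    job.addInputPath(new Path(crawlDb, "current"));
    // What CrawlDb.update() adds on top: the per-segment crawl data.
    job.addInputPath(new Path(segment, "crawl_fetch"));   // new fetch statuses
    job.addInputPath(new Path(segment, "crawl_parse"));   // newly discovered outlinks
    // The reducer merges old and new CrawlDatum entries per URL; drop the
    // crawldb/current input and the existing entries are simply lost.
    return job;
  }
}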

>
> If no objection, I will commit this change tomorrow.

-1.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Error with Hadoop-0.4.0

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:
> We tried your suggested fix:
>> Injector
>> by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath
>> (tempDir))

I suspect that this is not the right solution - have you actually tested 
that the resulting db contains all entries from the input dirs?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Error with Hadoop-0.4.0

Posted by Stefan Groschupf <sg...@media-style.com>.
We tried your suggested fix:
> Injector
> by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath
> (tempDir))

and this worked without any problem.

Thanks for catching that, this saved us a lot of time.
Stefan

On 07.07.2006, at 16:08, Jérôme Charron wrote:

>> I have the same problem on a distribute environment! :-(
>> So I think can confirm this is a bug.
>
> Thanks for this feedback Stefan.
>
>
>> We should fix that.
>
> What I suggest, is simply to remove the line 75 in createJob method  
> from
> CrawlDb :
> setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
> In fact, this method is only used by Injector.inject() and  
> CrawlDb.update()
> and
> the inputPath setted in createJob is not needed neither by  
> Injector.inject()
> nor
> CrawlDb.update() methods.
>
> If no objection, I will commit this change tomorrow.
>
> Regards
>
> Jérôme
>
> -- 
> http://motrech.free.fr/
> http://www.frutch.org/


Re: Error with Hadoop-0.4.0

Posted by Jérôme Charron <je...@gmail.com>.
> I have the same problem on a distribute environment! :-(
> So I think can confirm this is a bug.

Thanks for this feedback Stefan.


> We should fix that.

What I suggest is simply to remove line 75 in the createJob method of
CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(),
and the input path set in createJob is needed by neither Injector.inject()
nor CrawlDb.update().

If there are no objections, I will commit this change tomorrow.

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: Error with Hadoop-0.4.0

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Jérôme,

I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
We should fix that.

Stefan

On 06.07.2006, at 08:54, Jérôme Charron wrote:

> Hi,
>
> I encountered some problems with Nutch trunk version.
> In fact it seems to be related to changes related to Hadoop-0.4.0  
> and JDK
> 1.5
> (more precisely since HADOOP-129 and File replacement by Path).
>
> In my environment, the crawl command terminate with the following  
> error:
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient  
> (JobClient.java:submitJob(273))
> - Input directory /localpath/crawl/crawldb/current in local is  
> invalid.
> Exception in thread "main" java.io.IOException: Input directory
> /localpathcrawl/crawldb/current in local is invalid.
>        at org.apache.hadoop.mapred.JobClient.submitJob 
> (JobClient.java:274)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java: 
> 327)
>        at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> By looking at the Nutch code, and simply changing the line 145 of  
> Injector
> by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath
> (tempDir))
> all is working fine. By taking a closer look at CrawlDb code, I  
> finaly don"t
> understand why there is the following line in the createJob method:
> job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
>
> For curiosity, if a hadoop guru can explain why there is such a
> regression...
>
> Does somebody have the same error?
>
> Regards
>
> Jérôme
>
> -- 
> http://motrech.free.fr/
> http://www.frutch.org/


Re: Error with Hadoop-0.4.0

Posted by Jérôme Charron <je...@gmail.com>.
> > I encountered some problems with Nutch trunk version.
> > In fact it seems to be related to changes related to Hadoop-0.4.0 and
> JDK
> > 1.5
> > (more precisely since HADOOP-129 and File replacement by Path).
> > Does somebody have the same error?
>
> I am not seeing this (just run inject on a single machine(linux)
> configuration, local fs without problems ).

Thanks for your feedback, Sami.
The strange thing is that I have exactly the same behavior on two different
boxes!

Jérôme

Re: Error with Hadoop-0.4.0

Posted by Sami Siren <ss...@gmail.com>.
Gal Nitzan wrote:

>To get the same behavior, just try to inject to a new crawldb that doesn't
>exist.
>
>The reason many doesn't get it is that crawldb already exists in their
>environment.
>
True, I was injecting into an existing crawldb.

--
 Sami Siren

RE: Error with Hadoop-0.4.0

Posted by Gal Nitzan <gn...@usa.net>.
To get the same behavior, just try to inject into a new crawldb that doesn't
exist yet.

The reason many don't hit it is that the crawldb already exists in their
environment.
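
A quick way to check the precondition described above, assuming the local
filesystem (the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckCrawlDb {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Injecting only fails when this directory does not exist yet.
    Path current = new Path("/localpath/crawl/crawldb/current");
    System.out.println("crawldb/current exists: " + fs.exists(current));
  }
}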



-----Original Message-----
From: Sami Siren [mailto:ssiren@gmail.com] 
Sent: Thursday, July 06, 2006 7:23 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Error with Hadoop-0.4.0

Jérôme Charron wrote:

> Hi,
>
> I encountered some problems with Nutch trunk version.
> In fact it seems to be related to changes related to Hadoop-0.4.0 and JDK
> 1.5
> (more precisely since HADOOP-129 and File replacement by Path).
> Does somebody have the same error?

I am not seeing this (just run inject on a single machine(linux) 
configuration, local fs without problems ).

--
 Sami Siren



Re: Error with Hadoop-0.4.0

Posted by Sami Siren <ss...@gmail.com>.
Jérôme Charron wrote:

> Hi,
>
> I encountered some problems with Nutch trunk version.
> In fact it seems to be related to changes related to Hadoop-0.4.0 and JDK
> 1.5
> (more precisely since HADOOP-129 and File replacement by Path).
> Does somebody have the same error?

I am not seeing this (I just ran inject on a single-machine (Linux)
configuration with the local fs, without problems).

--
 Sami Siren