Posted to user@nutch.apache.org by Bryan Woliner <br...@gmail.com> on 2006/07/25 16:00:11 UTC

Two Errors in Nutch 0.8 Tutorial?

I am certainly far from a Nutch expert, but it appears to me that there are
two errors in the current Nutch 0.8 tutorial.

First off, here is the version of Nutch 0.8 that I am using, in case there
have been changes made in newer versions that invalidate my comments:

-bash-2.05b$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 414318
Node Kind: directory
Schedule: normal
Last Changed Author: siren
Last Changed Rev: 414306
Last Changed Date: 2006-06-14 11:08:28 -0500 (Wed, 14 Jun 2006)
Properties Last Updated: 2006-06-14 12:00:57 -0500 (Wed, 14 Jun 2006)

Error #1:

Towards the end of the tutorial, the following command is found:

bin/nutch invertlinks crawl/linkdb crawl/segments


When I call this command verbatim, I get the following error:

2006-07-25 08:44:40,503 WARN  mapred.LocalJobRunner (LocalJobRunner.java:run(119)) - job_8ly5hf
java.io.IOException: No input directories specified in: Configuration: defaults: hadoop-default.xml , mapred-default.xml , /home/bryan/nutch-8d/hadoop/mapred/local/localRunner/job_8ly5hf.xmlfinal: hadoop-site.xml
        at org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:96)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:37)
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:106)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)

I think the correct syntax for the command should be:

bin/nutch invertlinks crawl/linkdb crawl/segments/*

(note the /* added to the end of the segments path).
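
To illustrate what the trailing /* changes, here is a small sketch that
does not call Nutch at all. It builds a throwaway directory with two
made-up segment names (both hypothetical) and counts the arguments the
shell would hand to the command in each case. My assumption, based on the
"No input directories specified" error, is that invertlinks wants the
individual segment directories rather than their parent.

```shell
#!/bin/sh
# Throwaway layout; the timestamped segment names are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/crawl/segments/20060725080000" \
         "$demo/crawl/segments/20060725090000"
cd "$demo" || exit 1

# Without the glob, a single argument is passed: the parent directory.
set -- crawl/segments
without_glob=$#

# With the glob, the shell expands to one argument per segment directory.
set -- crawl/segments/*
with_glob=$#

echo "without glob: $without_glob argument(s)"
echo "with glob:    $with_glob argument(s): $*"

# Clean up the throwaway directory.
cd / && rm -rf "$demo"
```

The expansion is done by the shell before bin/nutch ever starts, so the
same reasoning applies to any command that takes a list of segments.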

Error #2:

The tutorial says that to index, the following command should be called:

bin/nutch index indexes crawl/linkdb crawl/segments/*

However, when I call that command I get the following error:

Usage: <index> <crawldb> <linkdb> <segment> ...

I believe the correct syntax should be:

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
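
As a rough sketch, the argument order from the usage message can be spelled
out with one variable per directory, so the command is printed and checked
before it is run. CRAWL_DIR and index_cmd are names I made up for this
example; the paths are just the ones used in this message.

```shell
#!/bin/sh
# CRAWL_DIR is a hypothetical variable; "crawl" is the layout used above.
CRAWL_DIR=crawl

# Argument order taken from the usage message:
#   <index> <crawldb> <linkdb> <segment> ...
index_cmd="bin/nutch index $CRAWL_DIR/indexes $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb $CRAWL_DIR/segments/*"

# Print rather than run, so the order can be checked first.
# The quotes keep the shell from expanding the * at this point.
echo "$index_cmd"
```

Once the printed line looks right, it can be pasted at the prompt from the
Nutch directory, where the shell will expand the segments glob.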

If these are indeed errors in the tutorial, perhaps someone with the
authority to do so would be kind enough to make the necessary
changes.

My two cents,
Bryan

Re: Two Errors in Nutch 0.8 Tutorial?

Posted by Matthew Holt <mh...@redhat.com>.

Never mind, it's there now.
Matt

Matthew Holt wrote:
> If you download the latest trunk copy of 0.8, bin/nutch will not even 
> be available.. is this supposed to be this way?
> Matt

Re: Two Errors in Nutch 0.8 Tutorial?

Posted by Matthew Holt <mh...@redhat.com>.

If you download the latest trunk copy of 0.8, bin/nutch will not even be 
available.. is this supposed to be this way?
Matt
