Posted to user@nutch.apache.org by "Oleg V. Konovalov" <ko...@afterlogic.com> on 2007/02/21 15:42:02 UTC

Nutch 0.8.1 problems

Hello, colleagues!

I have some problems starting and using Nutch 0.8.1 with Hadoop and Tomcat.

System: FC3 (clone), basic configuration.
I've downloaded Apache Tomcat 5.5.20 (binary distribution), Apache Ant 1.7.0 (binary distribution), and JDK 1.5.0-05 from Sun (also a binary distribution).

Nutch built successfully (with some warnings), but I had to comment out one block in build.xml, otherwise Ant would not build Nutch (am I right?):

<touch datetime="01/25/1971 2:00 pm">
      <fileset dir="${conf.dir}" includes="**/*.template"/>
      <fileset dir="${contrib.dir}" includes="**/*.template"/>
</touch>

The next step is running Tomcat, which hosts Nutch in my case. I had already built the "nutch-*.war" file, so I renamed it to ROOT.war, placed it in the "webapps" directory, and restarted Tomcat, as described in the (thin) Nutch tutorials. The web part works with some problems, but that is for later; for now we need to run Hadoop, and Nutch must be able to work with it.

Hadoop built (with the same problems) and was configured according to the tutorial, and here the first problem appeared: with the recommended "local" values in hadoop-site.xml, Hadoop did not start at all. So I replaced these literal strings with "localhost:900x" (according to the instructions) and tried to start a Hadoop instance. It starts and possibly works; I see no errors in the console or logs, so I tried to run Nutch.

And here we have a set of problems of different types.

Nutch didn't work... :( 

First, we try to generate crawldb/segments.

Switch to the "nutch" user first:

bash-3.00# su - nutch

Second, start Hadoop:

-bash-3.00$ cd ../search/
-bash-3.00$ bin/start-all.sh

starting namenode, logging to /nutch/search/../var/logs/hadoop-nutch-namenode-workstation15.dom2.out
localhost: starting datanode, logging to /nutch/search/../var/logs/hadoop-nutch-datanode-workstation15.dom2.out
starting jobtracker, logging to /nutch/search/../var/logs/hadoop-nutch-jobtracker-workstation15.dom2.out
localhost: starting tasktracker, logging to /nutch/search/../var/logs/hadoop-nutch-tasktracker-workstation15.dom2.out

OK, next, "generate":

-bash-3.00$ bin/nutch generate crawl/crawldb crawl/segments
Generator: starting
Generator: segment: crawl/segments/20070221171048
Generator: Selecting best-scoring urls due for fetch.
Exception in thread "main" java.io.IOException: Input directory /user/nutch/crawl/crawldb/current in localhost:9000 is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
        at org.apache.nutch.crawl.Generator.main(Generator.java:395)

And so on...
From this point on, whatever I change, the results are the same. I've tried many ways, but with no success...

The main problem I see is the lack of usable examples, documentation, and tutorials that would answer my questions...

Any ideas? Somebody, help!

Thanks...

--
Oleg.

Re: Nutch 0.8.1 problems

Posted by "Oleg V. Konovalov" <ko...@afterlogic.com>.

On Wed, 21 Feb 2007 17:32:11 +0200
"Doğacan Güney" <do...@gmail.com> wrote:

[chainsaw]
 
> Very strange. I am not sure what the problem is then. Can you include
> the output of commands:
> 
> hadoop dfs -ls /nutch/filesystem/crawl/
> hadoop dfs -ls /nutch/filesystem/crawl/crawldb

[chainsaw]

With pleasure...

$ bin/hadoop dfs -ls /nutch/filesystem/crawl/
Found 0 items

$ bin/hadoop dfs -ls /nutch/filesystem/crawl/crawldb
Found 0 items


--
Oleg.

Re: Nutch 0.8.1 problems

Posted by Doğacan Güney <do...@gmail.com>.

On 2/21/07, Oleg V. Konovalov <ko...@afterlogic.com> wrote:
> Thanx, but... As I wrote earlier, - I've tried MANY WAYS, including recommended.
>
> For example:
>
> bin/nutch generate /nutch/filesystem/crawl/crawldb /nutch/filesystem/crawl/segments
> Generator: starting
> Generator: segment: /nutch/filesystem/crawl/segments/20070221175753
> Generator: Selecting best-scoring urls due for fetch.
> Exception in thread "main" java.io.IOException: Input directory /nutch/filesystem/crawl/crawldb/current in localhost:9000 is invalid.
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:395)
>
> /nutch/filesystem/crawl/crawldb/current EXISTS!

Very strange. I am not sure what the problem is then. Can you include
the output of commands:

hadoop dfs -ls /nutch/filesystem/crawl/
hadoop dfs -ls /nutch/filesystem/crawl/crawldb

>
> Any other ideas?
>
> --
> Oleg.
>
>


-- 
Doğacan Güney

Re: Nutch 0.8.1 problems

Posted by "Oleg V. Konovalov" <ko...@afterlogic.com>.

On Wed, 21 Feb 2007 16:45:39 +0200
"Doğacan Güney" <do...@gmail.com> wrote:

> Hi,
> [snip]
> > OK, next, "generate":
> You configured nutch to look for HDFS at localhost:9000. If default fs
> is configured to be HDFS and you give a relative path to any nutch
> command (like crawl/crawldb) then nutch (actually hadoop) will assume
> that you are accessing /user/<username>/<relative_path>. You either
> have to put your crawldb there or configure nutch to use local fs or
> change generate's arguments.
> [snip]

Thanks, but... As I wrote earlier, I've tried MANY WAYS, including the recommended one.

For example:

bin/nutch generate /nutch/filesystem/crawl/crawldb /nutch/filesystem/crawl/segments
Generator: starting
Generator: segment: /nutch/filesystem/crawl/segments/20070221175753
Generator: Selecting best-scoring urls due for fetch.
Exception in thread "main" java.io.IOException: Input directory /nutch/filesystem/crawl/crawldb/current in localhost:9000 is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
        at org.apache.nutch.crawl.Generator.main(Generator.java:395)

/nutch/filesystem/crawl/crawldb/current EXISTS!

Any other ideas?

--
Oleg.

Re: Nutch 0.8.1 problems

Posted by "Oleg V. Konovalov" <ko...@afterlogic.com>.

On Wed, 21 Feb 2007 16:45:39 +0200
"Doğacan Güney" <do...@gmail.com> wrote:

> [snip]
> 
> -- 
> Doğacan Güney

In addition to my previous message:

Second try, after restarting Hadoop.

bin/nutch generate /nutch/filesystem/crawl/crawldb /nutch/filesystem/crawl/segments
Generator: starting
Generator: segment: /nutch/filesystem/crawl/segments/20070221180735
Generator: Selecting best-scoring urls due for fetch.
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: failed to create file /nutch/filesystem/mapreduce/system/submit_duuc5v/.job.jar.crc on client workstation15.dom2 because target-length is 0, below MIN_REPLICATION (1)
        at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:388)
        at org.apache.hadoop.dfs.NameNode.create(NameNode.java:159)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:243)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:469)

        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:159)

Thanx for your help...
--
Oleg.

Re: Nutch 0.8.1 problems

Posted by Doğacan Güney <do...@gmail.com>.

Hi,

On 2/21/07, Oleg V. Konovalov <ko...@afterlogic.com> wrote:
[snip]
> OK, next, "generate":
>
> -bash-3.00$ bin/nutch generate crawl/crawldb crawl/segments
> Generator: starting
> Generator: segment: crawl/segments/20070221171048
> Generator: Selecting best-scoring urls due for fetch.
> Exception in thread "main" java.io.IOException: Input directory /user/nutch/crawl/crawldb/current in localhost:9000 is invalid.
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:395)
>

You configured nutch to look for HDFS at localhost:9000. If default fs
is configured to be HDFS and you give a relative path to any nutch
command (like crawl/crawldb) then nutch (actually hadoop) will assume
that you are accessing /user/<username>/<relative_path>. You either
have to put your crawldb there or configure nutch to use local fs or
change generate's arguments.
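
To make the path rule above concrete, here is a minimal shell sketch of how a path argument would be interpreted when the default filesystem is HDFS. The function name resolve_dfs_path is hypothetical and is not part of Nutch or Hadoop; it only mimics the behaviour described in this message (relative paths land under /user/<username>/, absolute paths are used as given).

```shell
#!/bin/sh
# Hypothetical helper, NOT a Nutch or Hadoop command: it only imitates
# the resolution rule described above for illustration.
#   $1 = user name running the command
#   $2 = path argument as passed to bin/nutch
resolve_dfs_path() {
    user="$1"
    path="$2"
    case "$path" in
        /*) printf '%s\n' "$path" ;;                   # absolute: used as given
        *)  printf '/user/%s/%s\n' "$user" "$path" ;;  # relative: under the user's DFS home
    esac
}

# The relative argument from the failing command in this thread:
resolve_dfs_path nutch crawl/crawldb
# An absolute argument is left unchanged:
resolve_dfs_path nutch /nutch/filesystem/crawl/crawldb
```

Under this rule the first call yields /user/nutch/crawl/crawldb, which matches the directory named in the exception. In both cases the path is looked up in DFS rather than on the local disk, so listing it with bin/hadoop dfs -ls shows whether the crawldb is really there.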

[snip]

-- 
Doğacan Güney