Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/20 20:24:30 UTC

[jira] Commented: (NUTCH-175) No input directories specified in: while crawling in nightly build from the 14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir

    [ http://issues.apache.org/jira/browse/NUTCH-175?page=comments#action_12412644 ] 

Stefan Neufeind commented on NUTCH-175:
---------------------------------------

My bad: I didn't pay close attention when moving from 0.7 to 0.8. But I'd like to stress in this bug entry that "urls" in the example call to "nutch crawl" is no longer a file. It is now a directory containing files with URLs in them.

RTFM - and now it works :-)
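
For anyone hitting the same error, here is a rough sketch of the change, assuming the single seed URL from the report below. The directory name "urls" is taken from the example call; the file name inside it is only an illustration, and the crawl call itself is left commented out because it needs a Nutch checkout to run.

```shell
# Create a seed directory; 0.8 expects a directory here, not a flat file.
mkdir -p urls

# Put the seed list inside it as an ordinary text file, one URL per line.
# (The file name "urllist.txt" is an arbitrary choice for this sketch.)
printf 'http://www.mentor.ch\n' > urls/urllist.txt

# Then pass the directory instead of the file:
# sh ./nutch crawl urls -dir tmpdir
```

Passing urllist.txt directly, as in the original report, is presumably what leads to the "No input directories specified" exception.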

> No input directories specified in: while crawling in nightly build from the 14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir
> ------------------------------------------------------------------------------------------------------------------------------
>
>          Key: NUTCH-175
>          URL: http://issues.apache.org/jira/browse/NUTCH-175
>      Project: Nutch
>         Type: Bug

>  Environment: SUSE Linux 9.3
>     Reporter: Matthias Günter
>     Priority: Trivial

>
> guenter@deimos:~/workspace/lucene/nutch-nightly/bin> sh ./nutch crawl urllist.txt -dir tmpdir
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 crawl started in: tmpdir
> 060114 205612 rootUrlDir = urllist.txt
> 060114 205612 threads = 10
> 060114 205612 depth = 5
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 Injector: starting
> 060114 205612 Injector: crawlDb: tmpdir/crawldb
> 060114 205612 Injector: urlDir: urllist.txt
> 060114 205612 Injector: Converting injected urls to crawl db entries.
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 Running job: job_n0o7ps
> 060114 205612 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205613 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205613 parsing /tmp/nutch/mapred/local/localRunner/job_n0o7ps.xml
> 060114 205613 parsing file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /tmp/nutch/mapred/local/localRunner/job_n0o7ps.xml , nutch-site.xml
>         at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
>         at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
>         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
> 060114 205613  map 0%
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
> urllist.txt contains
>   http://www.mentor.ch
> PS: Is there a committer or developer (near Switzerland) who can support (paid support) with a mixed index for intranet, some internet sites and scanning of local drives (P:\ , S:\ etc)
