Posted to user@nutch.apache.org by Alexander E Genaud <lx...@pobox.com> on 2006/07/29 16:42:13 UTC
nutch 0.8: invertlinks IOException segments/parse_data
Hello, I am receiving an IOException when running a whole-web crawl
via Cygwin. Interestingly (to me at least), the error reads:
..../crawl/segments/parse_data
rather than
..../crawl/segments/20060729123456/parse_data
$ nutch-0.8/bin/nutch invertlinks crawl/linkdb crawl/segments
Exception in thread "main" java.io.IOException: Input directory
c:/alex/vicaya-root/trunk/dist/vicaya-0.2.0/vicaya/crawl
/segments/parse_data in local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:212)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:316)
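The path in the error hints at the cause: when invertlinks is given plain paths after the linkdb, it treats each one as a single segment, so handing it the segments parent directory makes it look for parse_data directly inside segments/ instead of inside each timestamped segment. A minimal sketch of the expected layout, using a scratch directory and the segment name from the post (everything here is illustrative, not part of Nutch itself):

```shell
# Recreate the expected layout in a scratch directory (names hypothetical).
tmp=$(mktemp -d)
mkdir -p "$tmp/crawl/segments/20060729123456/parse_data"

# parse_data lives inside each timestamped segment...
[ -d "$tmp/crawl/segments/20060729123456/parse_data" ] && echo "segment parse_data: exists"

# ...not directly under segments/, which is the path the failing call looked for:
[ -d "$tmp/crawl/segments/parse_data" ] || echo "segments/parse_data: missing"

rm -rf "$tmp"
```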
My crawl deviates from the tutorial in that I am hitting localhost, I
created the URL seeds manually, my crawl/crawldb etc. directories are
in a different location, and my regex-urlfilter.txt looks like this:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^http://([a-z0-9]*\.)*localhost:8108/
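For what it's worth, the final accept rule can be sanity-checked outside Nutch; for a pattern this simple, grep -E's syntax is close enough to the Java regex syntax Nutch uses (the URLs below are made up for illustration):

```shell
# The accept pattern from regex-urlfilter.txt above:
pat='^http://([a-z0-9]*\.)*localhost:8108/'

# A localhost URL on port 8108 matches the accept rule:
echo 'http://localhost:8108/index.html' | grep -Eq "$pat" && echo accepted

# Anything else fails to match and would fall through:
echo 'http://example.com/page.html' | grep -Eq "$pat" || echo rejected
```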
Does anything seem immediately/obviously wrong to anyone?
Re: nutch 0.8: invertlinks IOException segments/parse_data
Posted by Sami Siren <ss...@gmail.com>.
please try
bin/nutch invertlinks crawl/linkdb -dir crawl/segments/
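As I understand it, with -dir LinkDb enumerates the timestamped segment subdirectories itself, whereas without the flag each argument must name one segment. A shell glob achieves the same effect as -dir; a sketch with a made-up layout (the segment names and temp directory are hypothetical):

```shell
# Sketch: the glob expands to one path per timestamped segment,
# which is the form invertlinks expects when -dir is not used.
tmp=$(mktemp -d)
mkdir -p "$tmp/crawl/segments/20060729123456" "$tmp/crawl/segments/20060729140000"

for seg in "$tmp"/crawl/segments/*; do
  echo "would pass segment: $seg"
done

rm -rf "$tmp"
```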
--
Sami Siren
Alexander E Genaud wrote:
> $ nutch-0.8/bin/nutch invertlinks crawl/linkdb crawl/segments
> Exception in thread "main" java.io.IOException: Input directory
> c:/alex/vicaya-root/trunk/dist/vicaya-0.2.0/vicaya/crawl
> /segments/parse_data in local is invalid.
> [...]