Posted to user@nutch.apache.org by Alexander E Genaud <lx...@pobox.com> on 2006/07/29 16:42:13 UTC
nutch 0.8: invertlinks IOException segments/parse_data
Hello, I am receiving an IOException when running a whole-web crawl
via Cygwin. Interestingly (to me at least), the error reads:
..../crawl/segments/parse_data
rather than
..../crawl/segments/20060729123456/parse_data
$ nutch-0.8/bin/nutch invertlinks crawl/linkdb crawl/segments
Exception in thread "main" java.io.IOException: Input directory
c:/alex/vicaya-root/trunk/dist/vicaya-0.2.0/vicaya/crawl
/segments/parse_data in local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:212)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:316)
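The path in the error hints at the cause: when invertlinks is given plain paths after the linkdb, it treats each one as a single segment, so handing it the segments parent directory makes it look for parse_data directly inside segments/ instead of inside each timestamped segment. A minimal sketch of the expected layout, using a scratch directory and the segment name from the post (everything here is illustrative, not part of Nutch itself):

```shell
# Recreate the expected layout in a scratch directory (names hypothetical).
tmp=$(mktemp -d)
mkdir -p "$tmp/crawl/segments/20060729123456/parse_data"

# parse_data lives inside each timestamped segment...
[ -d "$tmp/crawl/segments/20060729123456/parse_data" ] && echo "segment parse_data: exists"

# ...not directly under segments/, which is the path the failing call looked for:
[ -d "$tmp/crawl/segments/parse_data" ] || echo "segments/parse_data: missing"

rm -rf "$tmp"
```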
My crawl deviates from the tutorial in that I am hitting localhost, I
created the URL seeds manually, my crawl/crawldb etc. directories are
in a different location, and my regex-urlfilter.txt looks like this:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^http://([a-z0-9]*\.)*localhost:8108/
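For what it's worth, the final accept rule can be sanity-checked outside Nutch; for a pattern this simple, grep -E's syntax is close enough to the Java regex syntax Nutch uses (the URLs below are made up for illustration):

```shell
# The accept pattern from regex-urlfilter.txt above:
pat='^http://([a-z0-9]*\.)*localhost:8108/'

# A localhost URL on port 8108 matches the accept rule:
echo 'http://localhost:8108/index.html' | grep -Eq "$pat" && echo accepted

# Anything else fails to match and would fall through:
echo 'http://example.com/page.html' | grep -Eq "$pat" || echo rejected
```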
Does anything seem immediately/obviously wrong to anyone?
Re: nutch 0.8: invertlinks IOException segments/parse_data
Posted by Sami Siren <ss...@gmail.com>.
please try
bin/nutch invertlinks crawl/linkdb -dir crawl/segments/
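As I understand it, with -dir LinkDb enumerates the timestamped segment subdirectories itself, whereas without the flag each argument must name one segment. A shell glob achieves the same effect as -dir; a sketch with a made-up layout (the segment names and temp directory are hypothetical):

```shell
# Sketch: the glob expands to one path per timestamped segment,
# which is the form invertlinks expects when -dir is not used.
tmp=$(mktemp -d)
mkdir -p "$tmp/crawl/segments/20060729123456" "$tmp/crawl/segments/20060729140000"

for seg in "$tmp"/crawl/segments/*; do
  echo "would pass segment: $seg"
done

rm -rf "$tmp"
```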
--
Sami Siren
Alexander E Genaud wrote:
> $ nutch-0.8/bin/nutch invertlinks crawl/linkdb crawl/segments
> Exception in thread "main" java.io.IOException: Input directory
> c:/alex/vicaya-root/trunk/dist/vicaya-0.2.0/vicaya/crawl
> /segments/parse_data in local is invalid.
> [...]