Posted to user@nutch.apache.org by oh...@cox.net on 2009/07/14 02:58:15 UTC

Just getting started w/tutorial- errors in crawl.log

Hi,

I've just gotten nutch installed, and am stepping through the tutorial at:

http://lucene.apache.org/nutch/tutorial8.html

It seems to be working, but I get a number of messages in crawl.log, like:

Error parsing: http://lucene.apache.org/skin/getMenu.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/skin/getMenu.js
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

Then, at the end of the log, I get:

LinkDb: adding segment: file:/opt/nutch-1.0/crawl.test/segments/20090713171413
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch-1.0/crawl.test/segments/20090713171413/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)

I must have missed something, but being new, I can't figure out what is causing that problem?

Thanks,
Jim


Re: Just getting started w/tutorial- errors in crawl.log

Posted by oh...@cox.net.
Alex (et al),

There was/is plenty of space on the drive (>3GB).

I was trying the command line from the tutorial:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

I'm re-running now, to see what happens. If I get that error again, I'll delete the directories, as you and Xiao Yang suggested.
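
I assume that just means removing the old output directory and re-running the same command from the tutorial, i.e. something like:

rm -rf crawl.test
bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log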

Jim

---- Alex McLintock <al...@gmail.com> wrote: 
> > but I get a number of messages in crawl.log, like:
> >
> > Error parsing: http://lucene.apache.org/skin/getMenu.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/skin/getMenu.js
> >        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
> >        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
> >        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
> 
> I don't see this as an error to worry about. It is just saying that it
> has been directed to fetch a ".js" file but it doesn't know
> how to parse it for values to index or links to crawl. I don't
> see the need to do that with JavaScript, so I would treat this "Error"
> as a warning.
> 
> 
> > Then, at the end of the log, I get:
> >
> > LinkDb: adding segment: file:/opt/nutch-1.0/crawl.test/segments/20090713171413
> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch-1.0/crawl.test/segments/20090713171413/parse_data
> >        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >
> > I must have missed something, but being new, I can't figure out what is causing that problem?
> >
> > Thanks,
> > Jim
> 
> Have you told us what commands you ran? Is the hard disk full? What is
> actually in that segment? Does it perhaps contain an aborted run?
> 
> Can you simply delete that segment directory, if there isn't much data
> in there that you would mind losing?
> 
> Good luck.
> 
> Alex


Re: Just getting started w/tutorial- errors in crawl.log

Posted by Alex McLintock <al...@gmail.com>.
> but I get a number of messages in crawl.log, like:
>
> Error parsing: http://lucene.apache.org/skin/getMenu.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/skin/getMenu.js
>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
>        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

I don't see this as an error to worry about. It is just saying that it
has been directed to fetch a ".js" file but it doesn't know
how to parse it for values to index or links to crawl. I don't
see the need to do that with JavaScript, so I would treat this "Error"
as a warning.
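
If you would rather not fetch .js files at all, one option (assuming you are
using the regex URL filter the tutorial has you set up in
conf/crawl-urlfilter.txt) is to add js to the suffix exclusion there,
something like:

# skip suffixes we don't want to fetch at all
-\.(gif|jpg|png|css|js)$

But that is optional; the parse message on its own shouldn't break the crawl.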


> Then, at the end of the log, I get:
>
> LinkDb: adding segment: file:/opt/nutch-1.0/crawl.test/segments/20090713171413
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch-1.0/crawl.test/segments/20090713171413/parse_data
>        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>
> I must have missed something, but being new, I can't figure out what is causing that problem?
>
> Thanks,
> Jim

Have you told us what commands you ran? Is the hard disk full? What is
actually in that segment? Does it perhaps contain an aborted run?

Can you simply delete that segment directory, if there isn't much data
in there that you would mind losing?

Good luck.

Alex

Re: Just getting started w/tutorial- errors in crawl.log

Posted by Beats <ta...@yahoo.com>.
Hi Jim,

What I think your error message says is that Nutch couldn't find a plugin
for parsing a particular content type.

Go to parse-plugins.xml in the conf directory; there you will find the
plugin IDs defined for the different content types.

Add the particular plugin ID to the plugin.includes property in your
nutch-site.xml file.

In your case, try adding parse-js.
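
A rough sketch of the override in conf/nutch-site.xml (the value below is
only an example; copy the plugin.includes value from your nutch-default.xml
and add js to the parse-(...) group):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>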

Good luck,

Beats


ohaya wrote:
> 
> Hi,
> 
> I've just gotten nutch installed, and am stepping through the tutorial at:
> 
> http://lucene.apache.org/nutch/tutorial8.html
> 
> It seems to be working, but I get a number of messages in crawl.log, like:
> 
> Error parsing: http://lucene.apache.org/skin/getMenu.js:
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/javascript
> url=http://lucene.apache.org/skin/getMenu.js
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
> 
> Then, at the end of the log, I get:
> 
> LinkDb: adding segment:
> file:/opt/nutch-1.0/crawl.test/segments/20090713171413
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> file:/opt/nutch-1.0/crawl.test/segments/20090713171413/parse_data
>         at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>         at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>         at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>         at
> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
> 
> I must have missed something, but being new, I can't figure out what is
> causing that problem?
> 
> Thanks,
> Jim
> 
> 
> 



Re: Just getting started w/tutorial- errors in crawl.log

Posted by xiao yang <ya...@gmail.com>.
Hi, Jim

I got the second error too. It's because the previous crawl failed
abnormally.
There should be the following sub-directories in /segments/20090713171413:
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

My solution is to delete the corrupted directory and re-crawl.
If the error still occurs, see logs/hadoop.log for details.
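
For example (run from the nutch directory; the segment name is the one from
your log):

ls crawl.test/segments/20090713171413
rm -rf crawl.test/segments/20090713171413
bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

If the crawl command complains that crawl.test already exists, remove the
whole crawl.test directory instead and start over.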

Xiao

On Tue, Jul 14, 2009 at 8:58 AM, <oh...@cox.net> wrote:

> Hi,
>
> I've just gotten nutch installed, and am stepping through the tutorial at:
>
> http://lucene.apache.org/nutch/tutorial8.html
>
> It seems to be working, but I get a number of messages in crawl.log, like:
>
> Error parsing: http://lucene.apache.org/skin/getMenu.js:
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/javascript url=
> http://lucene.apache.org/skin/getMenu.js
>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>        at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
>        at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
> Then, at the end of the log, I get:
>
> LinkDb: adding segment:
> file:/opt/nutch-1.0/crawl.test/segments/20090713171413
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> file:/opt/nutch-1.0/crawl.test/segments/20090713171413/parse_data
>        at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>        at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>        at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>
> I must have missed something, but being new, I can't figure out what is
> causing that problem?
>
> Thanks,
> Jim
>
>