Posted to dev@nutch.apache.org by Alex McLintock <al...@gmail.com> on 2010/06/13 16:13:18 UTC

What are the ParseStatus major codes?

I'm setting up Nutch by following various tutorials, and I just tried
to separate fetching from parsing.

Unfortunately I got a confusing ArrayIndexOutOfBoundsException when
trying to parse. I couldn't figure out what it was complaining about
(line 96 of ParseSegment.java).

Adding this try/catch block helped me out a bit, but still didn't
clear things up.



Index: ParseSegment.java
===================================================================
--- ParseSegment.java	(revision 953602)
+++ ParseSegment.java	(working copy)
@@ -92,9 +92,13 @@
       Text url = entry.getKey();
       Parse parse = entry.getValue();
       ParseStatus parseStatus = parse.getData().getStatus();
-
+
+      try {
       reporter.incrCounter("ParserStatus", ParseStatus.majorCodes[parseStatus.getMajorCode()], 1);
-
+      } catch (ArrayIndexOutOfBoundsException e) {
+          LOG.error("Ununsual ParserStatus - possibly misconfiguration : " + parseStatus.getMajorCode() );
+      }
+


It shouldn't abort parsing the whole "part" just because it can't
parse one type of file.
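
A bounds check before the array lookup would avoid relying on the
exception at all. What follows is only a minimal sketch under stated
assumptions: MajorCodeLabels, MAJOR_CODES and labelFor are hypothetical
names, and the three labels are assumed stand-ins for whatever
ParseStatus.majorCodes really contains.

```java
// Hypothetical stand-in for the lookup in ParseSegment; not Nutch code.
final class MajorCodeLabels {
    // Assumed labels, for illustration only.
    private static final String[] MAJOR_CODES = {"notparsed", "success", "failed"};

    /** Returns the label for a major code, or a fallback when the code is out of range. */
    static String labelFor(int majorCode) {
        if (majorCode >= 0 && majorCode < MAJOR_CODES.length) {
            return MAJOR_CODES[majorCode];
        }
        // Out-of-range codes get a visible label instead of throwing.
        return "UNKNOWN(" + majorCode + ")";
    }

    public static void main(String[] args) {
        System.out.println(labelFor(1));   // in range
        System.out.println(labelFor(-56)); // out of range, falls back
    }
}
```

With a helper like this, the counter could be incremented under the
fallback label, so unexpected codes would still show up in the job
counters rather than aborting the task.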



Now majorCodes looks like an Enumeration, but isn't one. Apparently
the unusual major code is -56, according to the log below. I can't see
where that is coming from. For me it seems to be a problem with files
of MIME types application/atom+xml and application/rss+xml.



2010-06-13 14:57:09,089 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2010-06-13 14:57:09,286 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2010-06-13 14:57:10,932 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser - org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes system property, and all claim to support the content type text/html, but they are not mapped to it  in the parse-plugins.xml file
2010-06-13 14:57:11,404 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes system property, and all claim to support the content type application/atom+xml, but they are not mapped to it  in the parse-plugins.xml file
2010-06-13 14:57:11,405 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/atom+xml
2010-06-13 14:57:11,405 ERROR parse.Parser - Ununsual ParserStatus - possibly misconfiguration : -56
2010-06-13 14:57:11,405 WARN  parse.Parser - Error parsing: http://www.mytestsite.com/blog/atom.xml: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type application/atom+xml
2010-06-13 14:57:11,405 WARN  parse.ParserFactory - ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType application/rss+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml


Now presumably my configuration is wrong and I can't parse those MIME
types. Should I care? I don't currently care about XML.

I am using the code packaged as the 1.1 release candidate, but I think
that trivial try/catch should be added to ParseSegment.java anyway.

Does anyone know how a ParseStatus got a major code of -56, and whether
that should be possible?
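
One possible source of -56, offered as a guess rather than a diagnosis:
if the major code is held in a Java byte, an int value of 200 narrows
to -56 (200 - 256). The sketch below only demonstrates that narrowing
arithmetic. ByteNarrowing and narrow are hypothetical names, and
whether any parser actually passes 200 where a major code is expected
is an assumption that the log does not confirm.

```java
// Demonstrates Java's int-to-byte narrowing; not Nutch code.
final class ByteNarrowing {
    /** Narrows an int the same way a (byte) cast does. */
    static byte narrow(int value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println(narrow(200)); // prints -56
        System.out.println(narrow(2));   // prints 2
    }
}
```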

Thanks

Alex

Re: What are the ParseStatus major codes?

Posted by Julien Nioche <li...@gmail.com>.
Alex,

This issue has been fixed in
https://issues.apache.org/jira/browse/NUTCH-818 and should be part of
the latest RC (http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/).

HTH

Julien
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
