Posted to user@nutch.apache.org by Chip Calhoun <cc...@aip.org> on 2011/10/04 23:01:26 UTC

Unable to parse large XML files.

Hi everyone,

I've found that I'm unable to parse very large XML files. This doesn't seem to happen with other file formats. When I run any of the offending files through ParserChecker, I get something along the lines of:

# bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml
---------
Url
---------------
http://www.aip.org/history/ead/19990074.xml
---------
ParseData
---------
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

One thing which may or may not be relevant: when I look up XML files in a browser, the http:// at the beginning tends to disappear. That seems relevant because it looks like it might defeat my file.content.limit, http.content.limit, and ftp.content.limit properties. Is there a way around this?

Thanks,
Chip

RE: Unable to parse large XML files.

Posted by Chip Calhoun <cc...@aip.org>.
Hrm. No, it turns out I was wrong; I'd misread an error message. I've got the following in my nutch-site.xml:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be 
  truncated; otherwise, no truncation at all.
  </description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: the classic FTP RFCs never define partial transfer and, in fact,
  some FTP servers out there do not handle client-side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
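
For what it's worth, ParserChecker reads conf/nutch-site.xml on each run and fetches the URL itself, so re-running the original check (no re-crawl needed) should show whether these overrides are being picked up. A minimal sanity check, assuming a stock 1.x runtime layout:

# Re-run the failing check after editing conf/nutch-site.xml;
# ParserChecker fetches and parses the URL in one step, so the
# Status line reflects whatever limits are now in effect.
bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml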


RE: Unable to parse large XML files.

Posted by Chip Calhoun <cc...@aip.org>.
Huh. It turns out my http.content.limit was fine, but I also needed a file.content.limit statement in nutch-site.xml to make this work. Thanks!
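
For the archives: the minimal form of that fix is a single override in conf/nutch-site.xml. A sketch; per nutch-default.xml, -1 disables truncation entirely, while any nonnegative value truncates at that many bytes:

<property>
  <name>file.content.limit</name>
  <!-- -1 = no length limit for content fetched over file:// -->
  <value>-1</value>
</property>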


Re: Unable to parse large XML files.

Posted by Markus Jelsma <ma...@openindex.io>.
> One thing which may or may not be relevant: when I look up XML files in
> a browser, the http:// at the beginning tends to disappear. That seems
> relevant because it looks like it might defeat my file.content.limit,
> http.content.limit, and ftp.content.limit properties. Is there a way
> around this?

You're using some fancy new browser? Some seem to do that. Check your
http.content.limit.
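
A truncated fetch leaves the XML cut off mid-element, which is the usual cause of that ParseException. A quick way to see which limit is actually in effect is to compare the shipped default (65536 bytes in nutch-default.xml, if memory serves) against any local override. A sketch, assuming a stock runtime layout:

# The value in conf/nutch-site.xml overrides conf/nutch-default.xml
# when both files define the same property.
grep -A 2 'content.limit' conf/nutch-default.xml conf/nutch-site.xml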
