Posted to user@nutch.apache.org by Chip Calhoun <cc...@aip.org> on 2011/10/26 16:45:33 UTC

Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files the parse takes much too long.

Is there a good way to optimize this that I'm missing? Is lengthy parsing of XML a known problem? I recognize that part of my problem is that I'm doing my testing from my aging desktop PC, and it will run faster when I move things to the server, but it's still slow.

I do get the following weird message in my log when I run ParserChecker or the crawler:

2011-10-26 09:51:47,729 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it  in the parse-plugins.xml file
2011-10-26 10:06:40,639 WARN  parse.ParseUtil - TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with org.apache.nutch.parse.tika.TikaParser@18355aa
2011-10-26 10:06:40,639 WARN  parse.ParseUtil - Unable to successfully parse content http://www.aip.org/history/ead/19990074.xml of type application/xml
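
As an aside, the first INFO line looks like just a mapping warning rather than the cause of the timeout. If I'm reading the stock parse-plugins.xml format correctly (a sketch only; I haven't verified it silences the warning), an explicit mapping for application/xml would look roughly like this:

<!-- sketch only: assumes the <mimeType>/<plugin id> structure used in parse-plugins.xml -->
<mimeType name="application/xml">
  <plugin id="parse-tika" />
</mimeType>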

My ParserChecker results look like this:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.aip.org/history/ead/19990074.xml
---------
Url
---------------
http://www.aip.org/history/ead/19990074.xml---------
ParseData
---------
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
---------
ParseText
---------

And here's everything that might be relevant in my nutch-site.xml; I've tried it both with and without the urlmeta plugin, and that doesn't make a difference:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
 </property> 
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be 
  truncated; otherwise, no truncation at all.
  </description>
 </property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value> 
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
<property>
  <name>http.timeout</name>
  <value>4294967290</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>ftp.timeout</name>
  <value>4294967290</value>
  <description>Default timeout for ftp client socket, in millisec.
  Please also see ftp.keep.connection below.</description>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>4294967290</value>
  <description>An estimation of ftp server idle time, in millisec.
  Typically it is 120000 millisec for many ftp servers out there.
  Better be conservative here. Together with ftp.timeout, it is used to
  decide if we need to delete (annihilate) current ftp.client instance and
  force to start another ftp.client instance anew. This is necessary because
  a fetcher thread may not be able to obtain next request from queue in time
  (due to idleness) before our ftp client times out or remote server
  disconnects. Used only when ftp.keep.connection is true (please see below).
  </description>
</property>
<property>
  <name>parser.timeout</name>
  <value>900</value>
  <description>Timeout in seconds for the parsing of a document, otherwise treats it as an exception and 
  moves on to the following documents. This parameter is applied to any Parser implementation. 
  Set to -1 to deactivate, bearing in mind that this could cause
  the parsing to crash because of a very long or corrupted document.
  </description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are 
    made at once (each FetcherThread handles one connection).</description>
</property>
 <property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
 </property>
 <property>
  <name>urlmeta.tags</name>
  <value>humanurl</value>
 </property>




-----Original Message-----
From: Chip Calhoun [mailto:ccalhoun@aip.org] 
Sent: Thursday, October 20, 2011 10:23 AM
To: 'markus.jelsma@openindex.io'; user@nutch.apache.org
Subject: RE: Good workaround for timeout?

Good to know! I was definitely exceeding that, so I've changed my properties.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Thursday, October 20, 2011 10:00 AM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout?



On Thursday 20 October 2011 15:56:01 Chip Calhoun wrote:
> I started out with a pretty high number in http.timeout, and I've 
> increased it to the fairly ridiculous 99999999999. Is there an upper 
> limit at which it would stop working properly?

It's interpreted as an Integer, so don't exceed Integer.MAX_VALUE. I don't know for sure how Hadoop will handle anything larger.
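
For reference, Integer.MAX_VALUE is 2147483647, so something like the following is about as high as I'd expect to work safely (a sketch; I haven't tested where exactly larger values start to misbehave):

<property>
  <name>http.timeout</name>
  <!-- 2147483647 == Integer.MAX_VALUE; larger values may overflow when parsed as an int -->
  <value>2147483647</value>
  <description>The default network timeout, in milliseconds.</description>
</property>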

> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: Wednesday, October 19, 2011 4:57 PM
> To: user@nutch.apache.org
> Cc: Chip Calhoun
> Subject: Re: Good workaround for timeout?
> 
> > I'm using protocol-http, but I removed protocol-httpclient after you 
> > pointed out in another thread that it's broken. Unfortunately I'm 
> > not sure which properties are used by what, and I'm not sure how to 
> > find out. I added some more stuff to nutch-site.xml (I'll paste it 
> > at the end), and it seems to be working so far; but since this has 
> > been an intermittent problem, I can't be sure whether I've really 
> > fixed it or whether I'm getting lucky.
> 
> http.timeout is used in lib-http so it should work unless there's a 
> bug around. Does the problem persist for that one URL if you increase 
> this value to a more reasonable number, say 300?
> 
> > <property>
> >   <name>http.timeout</name>
> >   <value>99999999999</value>
> >   <description>The default network timeout, in milliseconds.</description>
> > </property>
> > <property>
> >   <name>ftp.timeout</name>
> >   <value>9999999999</value>
> >   <description>Default timeout for ftp client socket, in millisec.
> >   Please also see ftp.keep.connection below.</description>
> > </property>
> > <property>
> >   <name>ftp.server.timeout</name>
> >   <value>99999999999999999</value>
> >   <description>An estimation of ftp server idle time, in millisec.
> >   Typically it is 120000 millisec for many ftp servers out there.
> >   Better be conservative here. Together with ftp.timeout, it is used to
> >   decide if we need to delete (annihilate) current ftp.client instance
> >   and force to start another ftp.client instance anew. This is necessary
> >   because a fetcher thread may not be able to obtain next request from
> >   queue in time (due to idleness) before our ftp client times out or
> >   remote server disconnects. Used only when ftp.keep.connection is
> >   true (please see below).
> >   </description>
> > </property>
> > <property>
> >   <name>parser.timeout</name>
> >   <value>300</value>
> >   <description>Timeout in seconds for the parsing of a document, otherwise
> >   treats it as an exception and moves on to the following documents. This
> >   parameter is applied to any Parser implementation. Set to -1 to
> >   deactivate, bearing in mind that this could cause the parsing to crash
> >   because of a very long or corrupted document.
> >   </description>
> > </property>
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> > Sent: Wednesday, October 19, 2011 11:28 AM
> > To: user@nutch.apache.org
> > Subject: Re: Good workaround for timeout?
> > 
> > It is indeed. Tricky.
> > 
> > Are you going through some proxy? Are you using protocol-http or 
> > httpclient? Are you sure the http.timeout value is actually used in 
> > lib-http?
> > 
> > > If I'm reading the log correctly, it's the fetch:
> > > 
> > > 2011-10-19 11:18:11,405 INFO  fetcher.Fetcher - fetch of
> > > http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
> > > failed with: java.net.SocketTimeoutException: Read timed out
> > > 
> > > 
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> > > Sent: Wednesday, October 19, 2011 11:08 AM
> > > To: user@nutch.apache.org
> > > Subject: Re: Good workaround for timeout?
> > > 
> > > What is timing out, the fetch or the parse?
> > > 
> > > > I'm getting a fairly persistent  timeout on a particular page.
> > > > Other, smaller pages in this folder do fine, but this one times 
> > > > out most of the time. When it fails, my ParserChecker results 
> > > > look
> > > > like:
> > > > 
> > > > # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
> > > > http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
> > > > Exception in thread "main" java.lang.NullPointerException
> > > > 
> > > >         at
> > > > 
> > > > org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
> > > > 
> > > > I've stuck with the default value of "10" in my 
> > > > nutch-default.xml's fetcher.threads.fetch value, and I've added 
> > > > the following to
> > > > nutch-site.xml:
> > > > 
> > > > <property>
> > > > 
> > > >   <name>db.max.outlinks.per.page</name>
> > > >   <value>-1</value>
> > > >   <description>The maximum number of outlinks that we'll process for a
> > > >   page. If this value is nonnegative (>=0), at most
> > > >   db.max.outlinks.per.page outlinks will be processed for a page;
> > > >   otherwise, all outlinks will be processed.
> > > >   </description>
> > > > </property>
> > > > <property>
> > > > 
> > > >   <name>file.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content using the
> > > >   file:// protocol, in bytes. If this value is nonnegative (>=0),
> > > >   content longer than it will be truncated; otherwise, no truncation
> > > >   at all. Do not confuse this setting with the http.content.limit
> > > >   setting.
> > > >   </description>
> > > > 
> > > > </property>
> > > > <property>
> > > > 
> > > >   <name>http.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content, in bytes.
> > > >   If this value is nonnegative (>=0), content longer than it will be
> > > >   truncated; otherwise, no truncation at all.
> > > >   </description>
> > > > 
> > > > </property>
> > > > <property>
> > > > 
> > > >   <name>ftp.content.limit</name>
> > > >   <value>-1</value>
> > > >   <description>The length limit for downloaded content, in bytes.
> > > >   If this value is nonnegative (>=0), content longer than it will be
> > > >   truncated; otherwise, no truncation at all.
> > > >   Caution: classical ftp RFCs never defines partial transfer and, in
> > > >   fact, some ftp servers out there do not handle client side forced
> > > >   close-down very well. Our implementation tries its best to handle
> > > >   such situations smoothly.
> > > >   </description>
> > > > </property>
> > > > <property>
> > > >   <name>http.timeout</name>
> > > >   <value>99999999999</value>
> > > >   <description>The default network timeout, in milliseconds.</description>
> > > > </property>
> > > > 
> > > > What else can I do? Thanks.
> > > > 
> > > > Chip

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

Posted by Chip Calhoun <cc...@aip.org>.
Increasing parser.timeout to 3600 got me what I needed. I only have a few files this huge, so I can live with that.
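
In case it helps anyone else reading this later, that change amounts to the following in my nutch-site.xml (parser.timeout is in seconds):

<property>
  <name>parser.timeout</name>
  <!-- 3600 seconds = 1 hour; generous, but only a few of these very large EAD XML files exist -->
  <value>3600</value>
  <description>Timeout in seconds for the parsing of a document, otherwise treats it as an exception and
  moves on to the following documents. This parameter is applied to any Parser implementation.
  </description>
</property>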

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Wednesday, October 26, 2011 10:55 AM
To: user@nutch.apache.org
Subject: Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

The actual parse which is producing the timeouts happens early in the process.
There are, to my knowledge, no Nutch settings to make this faster or change its behaviour; it's all about the parser implementation.

Try increasing your parser.timeout setting.

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
> I've got a few very large (upwards of 3 MB) XML files I'm trying to 
> index, and I'm having trouble. Previously I'd had trouble with the 
> fetch; now that seems to be okay, but due to the size of the files the 
> parse takes much too long.
> 
> Is there a good way to optimize this that I'm missing? Is lengthy 
> parsing of XML a known problem? I recognize that part of my problem is 
> that I'm doing my testing from my aging desktop PC, and it will run 
> faster when I move things to the server, but it's still slow.
> 
> I do get the following weird message in my log when I run 
> ParserChecker or the crawler:
> 
> 2011-10-26 09:51:47,729 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the 
> plugin.includes system property, and all claim to support the content 
> type application/xml, but they are not mapped to it  in the 
> parse-plugins.xml file 2011-10-26 10:06:40,639 WARN  parse.ParseUtil - 
> TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with 
> org.apache.nutch.parse.tika.TikaParser@18355aa 2011-10-26 10:06:40,639 
> WARN  parse.ParseUtil - Unable to successfully parse content 
> http://www.aip.org/history/ead/19990074.xml of type application/xml
> 
> My ParserChecker results look like this:
> 
> # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
> http://www.aip.org/history/ead/19990074.xml --------- Url
> ---------------
> http://www.aip.org/history/ead/19990074.xml---------
> ParseData
> ---------
> Version: 5
> Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable 
> to successfully parse content Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
> ---------
> ParseText
> ---------
> 
> And here's everything that might be relevant in my nutch-site.xml; 
> I've tried it both with and without the urlmeta plugin, and that 
> doesn't make a
> difference:
>

Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

Posted by Markus Jelsma <ma...@openindex.io>.
The actual parse which is producing the timeouts happens early in the process.
There are, to my knowledge, no Nutch settings to make this faster or change
its behaviour; it's all about the parser implementation.

Try increasing your parser.timeout setting.

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
> I've got a few very large (upwards of 3 MB) XML files I'm trying to index,
> and I'm having trouble. Previously I'd had trouble with the fetch; now
> that seems to be okay, but due to the size of the files the parse takes
> much too long.
> 
> Is there a good way to optimize this that I'm missing? Is lengthy parsing
> of XML a known problem? I recognize that part of my problem is that I'm
> doing my testing from my aging desktop PC, and it will run faster when I
> move things to the server, but it's still slow.
> 
> I do get the following weird message in my log when I run ParserChecker or
> the crawler:
> 
> 2011-10-26 09:51:47,729 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/xml, but they are not mapped to it  in the parse-plugins.xml
> file 2011-10-26 10:06:40,639 WARN  parse.ParseUtil - TIMEOUT parsing
> http://www.aip.org/history/ead/19990074.xml with
> org.apache.nutch.parse.tika.TikaParser@18355aa 2011-10-26 10:06:40,639
> WARN  parse.ParseUtil - Unable to successfully parse content
> http://www.aip.org/history/ead/19990074.xml of type application/xml
> 
> My ParserChecker results look like this:
> 
> # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
> http://www.aip.org/history/ead/19990074.xml ---------
> Url
> ---------------
> http://www.aip.org/history/ead/19990074.xml---------
> ParseData
> ---------
> Version: 5
> Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> successfully parse content Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
> ---------
> ParseText
> ---------
> 
> And here's everything that might be relevant in my nutch-site.xml; I've
> tried it both with and without the urlmeta plugin, and that doesn't make a
> difference:
>