Posted to user@nutch.apache.org by Karol Rybak <ka...@gmail.com> on 2007/06/28 11:33:54 UTC

Problem with ooParser

Hello, while crawling a large batch of documents I encountered a problem
with ooParser. It wouldn't be a big deal, but after that Fetcher2
stopped fetching completely, so it looks like I'll have to kill it, which
wastes 800 000 fetched documents... I guess I'll have to fetch in smaller
batches. If you have any idea how to resume a hung fetcher, let me know...

The exception text:

2007-06-28 12:45:32,775 WARN  oo.OOParser - org.jdom.JDOMException: Error in building: /nutch/search/office.dtd (No such file or directory)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.jdom.input.SAXBuilder.build(SAXBuilder.java:373)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.jdom.input.SAXBuilder.build(SAXBuilder.java:673)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.parse.oo.OOParser.parseContent(OOParser.java:113)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.parse.oo.OOParser.getParse(OOParser.java:82)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.output(Fetcher2.java:669)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:511)
2007-06-28 12:45:32,775 WARN  oo.OOParser - Caused by: java.io.FileNotFoundException: /nutch/search/office.dtd (No such file or directory)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at java.io.FileInputStream.open(Native Method)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at java.io.FileInputStream.<init>(FileInputStream.java:106)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at java.io.FileInputStream.<init>(FileInputStream.java:66)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.jdom.input.SAXBuilder.build(SAXBuilder.java:354)
2007-06-28 12:45:32,776 WARN  oo.OOParser - ... 6 more