Posted to dev@tika.apache.org by "Rob Tulloh (JIRA)" <ji...@apache.org> on 2012/05/24 23:26:10 UTC
[jira] [Created] (TIKA-934) Tika in server mode stops responding and reports NPE over and over in logs
Rob Tulloh created TIKA-934:
-------------------------------
Summary: Tika in server mode stops responding and reports NPE over and over in logs
Key: TIKA-934
URL: https://issues.apache.org/jira/browse/TIKA-934
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.1
Environment: CentOS 5.x
Reporter: Rob Tulloh
Priority: Critical
We run tika in server mode via:
/usr/java/jdk/bin/java -Dlog4j.app.name=-server -Djavax.xml.soap.MessageFactory=com.sun.xml.messaging.saaj.soap.ver1_1.SOAPMessageFactory1_1Impl -Dfile.encoding=UTF-8 -Djava.net.preferIPv4Stack=true -server -Xms256M -Xmx768M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/oom/content-extractor-8983.dump.1 -server -Xms500M -Xmx500M -jar /opt/tika/tika-app-1.1.jar --text --encoding=UTF-8 --server 8983
Our client talks to this over port 8983. We pass data via the socket and get the responses back. However, sometimes, tika will get into a bad state and stop responding.
When this happens, we see this in the logs over and over.
2012-05-24_20:12:33.88573 Caused by: java.lang.NullPointerException
2012-05-24_20:12:33.88576 at org.apache.tika.sax.XHTMLContentHandler.lazyEndHead(XHTMLContentHandler.java:157)
2012-05-24_20:12:33.88580 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
2012-05-24_20:12:33.88584 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:274)
2012-05-24_20:12:33.88589 at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:186)
2012-05-24_20:12:33.88593 at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:97)
2012-05-24_20:12:33.88597 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
2012-05-24_20:12:33.88602 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
2012-05-24_20:12:33.88606 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
2012-05-24_20:12:33.88611 ... 4 more
2012-05-24_20:12:49.28441 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6906daba
2012-05-24_20:12:49.28458 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
2012-05-24_20:12:49.28466 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
2012-05-24_20:12:49.28477 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
2012-05-24_20:12:49.28489 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
2012-05-24_20:12:49.28497 at org.apache.tika.cli.TikaCLI$TikaServer$1.run(TikaCLI.java:735)
2012-05-24_20:12:49.28509 Caused by: java.lang.NullPointerException
2012-05-24_20:12:49.28516 at org.apache.tika.sax.XHTMLContentHandler.lazyEndHead(XHTMLContentHandler.java:157)
2012-05-24_20:12:49.28524 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
2012-05-24_20:12:49.28532 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:274)
2012-05-24_20:12:49.28541 at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:186)
2012-05-24_20:12:49.28550 at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:97)
2012-05-24_20:12:49.28558 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
2012-05-24_20:12:49.28565 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
2012-05-24_20:12:49.28577 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
2012-05-24_20:12:49.28585 ... 4 more
We have tried to figure out what causes this with no success. We only know that once the server gets into this state, there is no recourse but to restart the tika service.
Other instances of tika we have running in the test environment continue to work. There is some combination of content or work that causes tika to destabilize. Our working theory is that perhaps tika server is not thread safe and that may be causing this behavior.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-934) Tika in server mode stops responding and reports NPE over and over in logs
Posted by "Rob Tulloh (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285068#comment-13285068 ]
Rob Tulloh commented on TIKA-934:
---------------------------------
Additional evidence of re-entrancy issues:
2012-05-22_19:10:39.31249 Caused by: java.util.ConcurrentModificationException
2012-05-22_19:10:39.31253 at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
2012-05-22_19:10:39.31257 at java.util.HashMap$KeyIterator.next(HashMap.java:828)
2012-05-22_19:10:39.31262 at java.util.AbstractCollection.toArray(AbstractCollection.java:171)
2012-05-22_19:10:39.31266 at org.apache.tika.metadata.Metadata.names(Metadata.java:171)
2012-05-22_19:10:39.31270 at org.apache.tika.sax.XHTMLContentHandler.lazyEndHead(XHTMLContentHandler.java:156)
2012-05-22_19:10:39.31275 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
2012-05-22_19:10:39.31280 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:281)
2012-05-22_19:10:39.31285 at org.apache.tika.parser.pdf.PDF2XHTML.startPage(PDF2XHTML.java:128)
2012-05-22_19:10:39.31289 at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:420)
2012-05-22_19:10:39.31293 at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
2012-05-22_19:10:39.31296 at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
2012-05-22_19:10:39.31300 at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:63)
2012-05-22_19:10:39.31304 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:140)
2012-05-22_19:10:39.31308 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
2012-05-22_19:10:39.31312 ... 4 more
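For readers unfamiliar with this exception, here is a minimal sketch of the mechanism, not of Tika's own code. The trace shows Metadata.names() iterating a HashMap when the failure occurs; HashMap iterators are fail-fast, so any structural change made to the map during iteration makes the next step throw. The class name FailFastDemo and the map contents are made up for illustration, and the sketch uses a single thread so that it is deterministic. In the situation reported here the conflicting write would presumably come from another request thread, which can also corrupt the map in less predictable ways.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {

    /**
     * Iterates over the keys of a map while adding a new key to it.
     * Returns true if the iteration fails with ConcurrentModificationException.
     */
    static boolean iterationFailsAfterModification() {
        // Hypothetical stand-in for a metadata store shared by several requests.
        Map<String, String> shared = new HashMap<>();
        shared.put("Content-Type", "application/pdf");
        shared.put("title", "first document");
        try {
            for (String name : shared.keySet()) {
                // Stand-in for a second request writing to the same map
                // while the first request is still listing the names.
                shared.put("extra-" + name, "value");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + iterationFailsAfterModification());
    }
}
```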
[jira] [Resolved] (TIKA-934) Tika in server mode stops responding and reports NPE over and over in logs
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-934.
--------------------------------
Resolution: Fixed
Fix Version/s: 1.2
Assignee: Jukka Zitting
Thanks for the report! This was indeed the case: the CLI kept just a single global Metadata instance, which ended up causing trouble when running in server mode. I fixed that in revision 1355719 by making the CLI use per-document Metadata instances.
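The general pattern behind the fix can be sketched as follows. This is an illustration under stated assumptions, not the code from revision 1355719: a plain HashMap stands in for Tika's Metadata class, and the names PerDocumentState, handle and runConcurrently are hypothetical. The point is only that each document gets its own freshly created state object, so concurrent requests never touch the same map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerDocumentState {

    /** Hypothetical stand-in for parsing one document. */
    static Map<String, String> handle(String documentName) {
        // Created per call, so no other thread can see or modify it.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("resourceName", documentName);
        for (int i = 0; i < 100; i++) {
            metadata.put("key" + i, documentName);
        }
        return metadata;
    }

    /** Handles the given number of documents on a small thread pool and checks each result. */
    static boolean runConcurrently(int documents) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Map<String, String>>> results = new ArrayList<>();
            for (int i = 0; i < documents; i++) {
                final String name = "doc-" + i;
                Callable<Map<String, String>> task = () -> handle(name);
                results.add(pool.submit(task));
            }
            for (int i = 0; i < documents; i++) {
                Map<String, String> metadata = results.get(i).get();
                // Each result must contain only what its own document wrote.
                if (!("doc-" + i).equals(metadata.get("resourceName")) || metadata.size() != 101) {
                    return false;
                }
            }
            return true;
        } catch (Exception e) {
            throw new IllegalStateException("document handling failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("all documents isolated: " + runConcurrently(50));
    }
}
```

Creating the state inside the per-request method avoids the need for locking altogether, which is usually simpler than trying to synchronize access to one shared instance.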