Posted to dev@tika.apache.org by "Sébastien Michel (JIRA)" <ji...@apache.org> on 2008/12/07 20:47:44 UTC
[jira] Created: (TIKA-180) XHTMLContentHandler unable to extract
text from MSWord file
XHTMLContentHandler unable to extract text from MSWord file
-----------------------------------------------------------
Key: TIKA-180
URL: https://issues.apache.org/jira/browse/TIKA-180
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.2, 0.3
Environment: linux. SUN JVM 1.5.0_16-b02
Binary file indexing with Solr and Tika
Reporter: Sébastien Michel
The issue is reproducible with Solr svn / ExtractingRequestHandler + patch SOLR.284 and all Tika versions.
I tried several MSWord files but did not try xls or ppt files.
Below is an example of indexing an MSWord file with curl that returns an exception:
seb@gueuze:~$ curl http://localhost:8983/solr/update/extract?ext.idx.attr=false\&ext.def.fl=text\&ext.extract.only=true -F "myfile=@/tmp/TMB.doc"
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>java.io.IOException: The character '' is an invalid XML character
org.apache.solr.common.SolrException: java.io.IOException: The character '' is an invalid XML character
at org.apache.solr.handler.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:160)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: java.io.IOException: The character '' is an invalid XML character
at org.apache.xml.serialize.BaseMarkupSerializer.characters(Unknown Source)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:85)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:130)
at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:136)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:78)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
at org.apache.solr.handler.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:146)
... 22 more
</pre>
<p>RequestURI=/solr/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>
After investigation, it seems that OfficeParser returns ISO control characters along with the text.
I don't know where the best place to fix the issue is (POI, Tika OfficeParser, etc.).
Below is a lazy patch that removes the ISO control characters and tries again when an exception occurs:
--- src/main/java/org/apache/tika/sax/XHTMLContentHandler.java (revision 723972)
+++ src/main/java/org/apache/tika/sax/XHTMLContentHandler.java (working copy)
@@ -132,7 +132,19 @@
     public void element(String name, String value) throws SAXException {
         startElement(name);
-        characters(value);
+        try {
+            characters(value);
+        } catch (Exception e) {
+            int len = value.length();
+            StringBuffer buffer = new StringBuffer();
+
+            while (len > 0) {
+                if (!Character.isISOControl(value.charAt(len-1)))
+                    buffer.append(value.charAt(len-1));
+                len--;
+            }
+            characters(buffer.toString());
+        }
         endElement(name);
     }
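Note that the loop in the patch reads value from its last character to its first, so the text passed to characters() in the fallback comes out in reverse order. A minimal standalone sketch of the same filtering done front to back follows. The class name ControlCharFilter and the method strip are hypothetical and not part of Tika. Like the patch, it relies on Character.isISOControl, which also drops tab, line feed and carriage return even though those are valid in XML.

```java
// Hypothetical helper, not part of Tika: drops ISO control characters
// while keeping the remaining characters in their original order.
final class ControlCharFilter {

    private ControlCharFilter() {
    }

    static String strip(String value) {
        StringBuilder buffer = new StringBuilder(value.length());
        // Walk forward so the surviving characters keep their order.
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (!Character.isISOControl(c)) {
                buffer.append(c);
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // The BEL character between "Hello" and the space is removed.
        System.out.println(strip("Hello\u0007 World"));
    }
}
```

Inside element(), the fallback branch could then call characters(ControlCharFilter.strip(value)), assuming such a helper were added.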
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-180) XHTMLContentHandler unable to extract
text from MSWord file
Posted by "Sébastien Michel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sébastien Michel updated TIKA-180:
----------------------------------
Attachment: TMB.doc
[jira] Commented: (TIKA-180) XHTMLContentHandler unable to extract
text from MSWord file
Posted by "Sébastien Michel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654347#action_12654347 ]
Sébastien Michel commented on TIKA-180:
---------------------------------------
However, parsing works fine with the Tika CLI.
[jira] Resolved: (TIKA-180) XHTMLContentHandler unable to extract
text from MSWord file
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-180.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.3
Assignee: Jukka Zitting
I added a SafeContentHandler decorator class that prevents invalid XML characters (currently just the <0x20 control characters) from being output. This is important for any downstream applications that expect strict XML output from Tika.
I also made XHTMLContentHandler extend SafeContentHandler so all XHTML produced by Tika will automatically be "safe" XML.
Using the SafeContentHandler class is lossy (all invalid XML characters are replaced with spaces), but this shouldn't be a problem as the purpose of Tika is to extract text instead of binary data from input documents.
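To illustrate the idea of replacing invalid characters with spaces, here is a rough sketch of such a decorator. It is not the actual SafeContentHandler code: the class name SpaceReplacingHandler, the helper clean, and the decision to keep tab, line feed and carriage return (which XML 1.0 allows) are assumptions made for this example. A full decorator would also forward the other SAX events to the wrapped handler.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a hypothetical stand-in, not Tika's SafeContentHandler.
final class SpaceReplacingHandler extends DefaultHandler {

    private final ContentHandler delegate;

    SpaceReplacingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    // Assumption: tab (0x09), LF (0x0A) and CR (0x0D) are kept because
    // XML 1.0 allows them; every other character below 0x20 is invalid.
    static boolean isInvalid(char c) {
        return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Copy the range so the caller's buffer is left untouched.
        char[] safe = new char[length];
        for (int i = 0; i < length; i++) {
            char c = ch[start + i];
            safe[i] = isInvalid(c) ? ' ' : c;
        }
        delegate.characters(safe, 0, length);
    }

    // Demo helper: runs a string through the filter and returns what the
    // wrapped handler received.
    static String clean(String text) {
        final StringBuilder out = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            new SpaceReplacingHandler(sink).characters(text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The form feed between the two words is replaced with a space.
        System.out.println(clean("page\fbreak"));
    }
}
```

Replacing rather than deleting keeps the character count unchanged, so the start and length values passed downstream stay meaningful.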