Posted to dev@tika.apache.org by "Sébastien Michel (JIRA)" <ji...@apache.org> on 2008/12/07 20:47:44 UTC

[jira] Created: (TIKA-180) XHTMLContentHandler unable to extract text from MSWord file

XHTMLContentHandler unable to extract text from MSWord file
-----------------------------------------------------------

                 Key: TIKA-180
                 URL: https://issues.apache.org/jira/browse/TIKA-180
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.2, 0.3
         Environment: Linux, Sun JVM 1.5.0_16-b02;
                      binary file indexing with Solr and Tika
            Reporter: Sébastien Michel


The issue is reproducible with Solr from svn using ExtractingRequestHandler plus the SOLR-284 patch, and with all Tika versions.
I tested several MS Word files but have not tried xls or ppt files.

Below is an example of indexing an MS Word file with curl that returns an exception:

  seb@gueuze:~$ curl http://localhost:8983/solr/update/extract?ext.idx.attr=false\&ext.def.fl=text\&ext.extract.only=true -F "myfile=@/tmp/TMB.doc"
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>java.io.IOException: The character '' is an invalid XML character

org.apache.solr.common.SolrException: java.io.IOException: The character '' is an invalid XML character
        at org.apache.solr.handler.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:160)    
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)               
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313)                                           
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)                     
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)                    
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)             
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)                            
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)                         
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)                            
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)                            
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)                               
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)        
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)                      
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)                            
        at org.mortbay.jetty.Server.handle(Server.java:285)                                                    
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)                             
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)                    
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: java.io.IOException: The character '' is an invalid XML character
        at org.apache.xml.serialize.BaseMarkupSerializer.characters(Unknown Source)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:85)
        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:130)
        at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:136)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:78)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
        at org.apache.solr.handler.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:146)
        ... 22 more
</pre>
<p>RequestURI=/solr/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>

After investigation, it seems that OfficeParser returns ISO control characters along with the text.
I don't know where the best place to fix the issue is (POI, Tika's OfficeParser, etc.).
Below is a lazy patch that removes the ISO control characters and retries when an exception occurs:

--- src/main/java/org/apache/tika/sax/XHTMLContentHandler.java  (revision 723972)
+++ src/main/java/org/apache/tika/sax/XHTMLContentHandler.java  (working copy)
@@ -132,7 +132,19 @@

     public void element(String name, String value) throws SAXException {
         startElement(name);
-        characters(value);
+        try {
+            characters(value);
+        } catch (Exception e) {
+            // Retry with ISO control characters removed, keeping the original order
+            int len = value.length();
+            StringBuffer buffer = new StringBuffer(len);
+            for (int i = 0; i < len; i++) {
+                char c = value.charAt(i);
+                if (!Character.isISOControl(c))
+                    buffer.append(c);
+            }
+            characters(buffer.toString());
+        }
         endElement(name);
     }

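One caveat with Character.isISOControl is that it also matches tab, line feed and carriage return, which XML 1.0 allows. The sketch below is not part of the patch: it is a hypothetical standalone helper (the class and method names are made up for illustration) that drops only the code points outside the XML 1.0 "Char" production and keeps the remaining text in order.

```java
public class XmlCharFilter {

    /** True if the code point matches the XML 1.0 "Char" production. */
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with XML-illegal code points dropped, order preserved. */
    public static String stripInvalidXmlChars(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point width
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // BEL (0x07) and SOH (0x01) are removed; the newline is kept
        String raw = "Title\u0007 line1\nline2\u0001";
        System.out.println(stripInvalidXmlChars(raw));
    }
}
```

Filtering up front like this avoids relying on an exception to detect bad input, at the cost of scanning every string once.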




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-180) XHTMLContentHandler unable to extract text from MSWord file

Posted by "Sébastien Michel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sébastien Michel updated TIKA-180:
----------------------------------

    Attachment: TMB.doc



[jira] Commented: (TIKA-180) XHTMLContentHandler unable to extract text from MSWord file

Posted by "Sébastien Michel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654347#action_12654347 ] 

Sébastien Michel commented on TIKA-180:
---------------------------------------

However, parsing works fine with the Tika CLI.



[jira] Resolved: (TIKA-180) XHTMLContentHandler unable to extract text from MSWord file

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-180.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.3
         Assignee: Jukka Zitting

I added a SafeContentHandler decorator class that prevents invalid XML characters (currently just the <0x20 control characters) from being output. This is important for any downstream applications that expect strict XML output from Tika.

I also made XHTMLContentHandler extend SafeContentHandler, so all XHTML produced by Tika will automatically be "safe" XML.

Using the SafeContentHandler class is lossy (all invalid XML characters are replaced with spaces), but this shouldn't be a problem, as the purpose of Tika is to extract text rather than binary data from input documents.
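
For readers who want to see the idea in isolation, here is a minimal sketch of such a decorator using only the JDK's SAX classes. It is not the actual SafeContentHandler source: the class name SpaceReplacingHandler, the sanitize helper, and the choice to leave tab, LF and CR untouched are all assumptions made for illustration.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class SpaceReplacingHandler extends XMLFilterImpl {

    public SpaceReplacingHandler(ContentHandler delegate) {
        // XMLFilterImpl forwards every SAX event to this handler
        setContentHandler(delegate);
    }

    /** Assumption: control characters below 0x20 are invalid, except tab, LF and CR. */
    static boolean isInvalid(char c) {
        return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Copy so the caller's buffer is not modified
        char[] safe = new char[length];
        for (int i = 0; i < length; i++) {
            char c = ch[start + i];
            safe[i] = isInvalid(c) ? ' ' : c;
        }
        super.characters(safe, 0, length);
    }

    /** Hypothetical helper: runs a string through the handler and collects the result. */
    public static String sanitize(String text) {
        final StringBuilder sink = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sink.append(ch, start, length);
            }
        };
        try {
            char[] chars = text.toCharArray();
            new SpaceReplacingHandler(collector).characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Hello\u0007World")); // prints "Hello World"
    }
}
```

Replacing each invalid character with a space, rather than dropping it, keeps the character count and offsets of the text stable, which is one plausible reason for the lossy-but-simple approach described above.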
