Posted to dev@tika.apache.org by "Harsh Fatepuria (JIRA)" <ji...@apache.org> on 2016/03/14 20:15:33 UTC

[jira] [Created] (TIKA-1902) Error while parsing a file using ContentHandler object (initialized using the BodyContentHandler object) for some files

Harsh Fatepuria created TIKA-1902:
-------------------------------------

             Summary: Error while parsing a file using ContentHandler object (initialized using the BodyContentHandler object) for some files
                 Key: TIKA-1902
                 URL: https://issues.apache.org/jira/browse/TIKA-1902
             Project: Tika
          Issue Type: Bug
          Components: handler, parser
    Affects Versions: 1.12
         Environment: Java
            Reporter: Harsh Fatepuria


Java Code:

public static String parseBodyToHTML(String filePath) throws IOException, SAXException, TikaException
{
    // Serialize to XML, but only the content matched by BodyContentHandler (the document body).
    ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());

    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (FileInputStream stream = new FileInputStream(new File(filePath))) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}


For some files, calling this function fails with the following exception (the trace below comes from parsing a PDF):

Exception in thread "main" org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
	at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
	at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
	at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:291)
	at org.apache.tika.parser.pdf.PDF2XHTML.startPage(PDF2XHTML.java:225)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:437)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
	at TTR.TTRAnalysis.parseBodyToHTML(TTRAnalysis.java:39)
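
The top frames suggest that a startElement event in the XHTML namespace reached the serializing handler without a prefix mapping for that namespace having been declared to it first. Why that happens only for some files is not established here. The sketch below reproduces that condition with the JDK's SAX classes alone. StrictSerializer, emitDiv and NamespaceCheckDemo are hypothetical names made up for illustration and are not Tika code; the handler is a simplified stand-in, assumed to behave like a serializer that refuses undeclared namespaces.

```java
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class NamespaceCheckDemo {

    public static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical stand-in for a serializing handler; NOT Tika's implementation. */
    public static class StrictSerializer extends DefaultHandler {
        // Maps namespace URI -> prefix, filled in by startPrefixMapping events.
        private final Map<String, String> prefixes = new HashMap<>();
        private final StringBuilder out = new StringBuilder();

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            prefixes.put(uri, prefix);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            boolean hasNamespace = uri != null && !uri.isEmpty();
            // Refuse elements whose namespace was never declared to this handler.
            if (hasNamespace && !prefixes.containsKey(uri)) {
                throw new SAXException("Namespace " + uri + " not declared");
            }
            String prefix = hasNamespace ? prefixes.get(uri) : "";
            out.append('<');
            if (!prefix.isEmpty()) {
                out.append(prefix).append(':');
            }
            out.append(localName).append('>');
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Sends one XHTML div element, optionally preceded by its prefix mapping. */
    public static String emitDiv(boolean declareNamespace) {
        StrictSerializer handler = new StrictSerializer();
        try {
            if (declareNamespace) {
                handler.startPrefixMapping("", XHTML);
            }
            handler.startElement(XHTML, "div", "div", new AttributesImpl());
            return handler.toString();
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(emitDiv(true));
        System.out.println(emitDiv(false));
    }
}
```

With the mapping declared, the element is written normally. Without it, the handler fails with the same message as in the trace above. If this reading is right, the thing to check is whether the prefix mapping event is being dropped somewhere in the handler chain between the parser and ToXMLContentHandler for the affected files.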



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)