Posted to dev@tika.apache.org by andrewtr <an...@compvue.com> on 2012/06/04 14:21:14 UTC

HTML styles and <li> tags are ignored

    Hi:
    
    When I parse a PDF or Word document using AutoDetectParser, the <li> and
    <ul> tags are converted to <p> tags. I need the exact HTML content that
    corresponds to the PDF or Word document.
    
    I tried several approaches, shown below:
    
    // "in" is an InputStream for the PDF or Word document (opened elsewhere)
    ToHTMLContentHandler textHandler = new ToHTMLContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    // Ask for HTML elements to be passed through unmapped
    context.set(HtmlMapper.class, new IdentityHtmlMapper());
    parser.parse(in, textHandler, metadata, context);
    
    ---------------------------------------------------------
    
    // Build a SAX handler that serializes incoming events as HTML to "writer"
    // (a java.io.Writer declared elsewhere)
    SAXTransformerFactory factory =
        (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
    handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
    handler.setResult(new StreamResult(writer));
    // Debug output: prints the handler object's string form, not the HTML
    System.out.println(handler.toString());
    return handler;
    
    But the <li> tags are still replaced with <p> tags that carry a class
    attribute, and the CSS styles do not appear in the parsed HTML output.
    
    Any help is appreciated.
    
    --
    View this message in context: http://lucene.472066.n3.nabble.com/HTML-styles-and-li-tags-are-ignored-tp3987550.html
    Sent from the Apache Tika - Development mailing list archive at Nabble.com.
    

    Re: HTML styles and <li> tags are ignored
    Posted by Jukka Zitting <ju...@gmail.com>.
    Hi,
    
    On Mon, Jun 4, 2012 at 2:21 PM, andrewtr <an...@compvue.com> wrote:
    > While I am parsing the PDF or Word document using AutoDetectParser the <li>,
    > <ul> tags are converted as <p> tags. I need the exact HTML content what is
    > been there for PDF or Word Document.
    
    <li> and <ul> tags in PDF or Word? I assume you rather mean the native
    list formatting of those document types?
    
    The Tika parsers for PDF and Office documents could/should
    automatically map such formatting to equivalent XHTML constructs, but
    I don't think they currently do. You'll need to look into the source
    code to see how to make that happen.
    
    BR,
    
    Jukka Zitting
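
As a rough illustration of the point above, here is a minimal sketch that uses only the JDK (no Tika classes). It reuses the TransformerHandler setup from the original post and feeds it hand-written SAX events for a <ul>/<li> list. The class name ListSaxDemo, the method render(), and the sample items are hypothetical, made up for this example. It suggests that the serializer writes out whatever elements it receives, so list markup would have to come from the parser's SAX events rather than from the output handler.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; not part of Tika.
public class ListSaxDemo {

    // Serializes a hand-written stream of SAX events to an HTML string.
    public static String render() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");

        StringWriter writer = new StringWriter();
        handler.setResult(new StreamResult(writer));

        // Emit the events a list-aware parser might produce (an assumption;
        // these are not events taken from any real Tika parser).
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "ul", "ul", noAttrs);
        for (String item : new String[] {"first", "second"}) {
            handler.startElement("", "li", "li", noAttrs);
            handler.characters(item.toCharArray(), 0, item.length());
            handler.endElement("", "li", "li");
        }
        handler.endElement("", "ul", "ul");
        handler.endDocument();
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Print the serialized markup so it can be inspected by hand.
        System.out.println(render());
    }
}
```

If the output of a real parse contains <p> elements instead, that would point at the events the PDF or Office parser emits, which is where a change would be needed.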