Posted to dev@pdfbox.apache.org by "Cheng Leong (JIRA)" <ji...@apache.org> on 2014/01/22 22:48:19 UTC

[jira] [Created] (PDFBOX-1860) HTML converter escapes formatting close tags

Cheng Leong created PDFBOX-1860:
-----------------------------------

             Summary: HTML converter escapes formatting close tags
                 Key: PDFBOX-1860
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1860
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.3
            Reporter: Cheng Leong
            Priority: Minor
         Attachments: pdftest.pdf

This bug was introduced in 1.8.3 by the PDFBOX-1213 change for HTML style information.
Opening bold tags are written correctly, but the closing tags are HTML-escaped, so </b> is emitted as &lt;/b&gt;.

{noformat}
~/work/pdfbox ((1.8.3))$ java -jar app/target/pdfbox-app-1.8.3.jar ExtractText -html -nonSeq -console pdftest.pdf 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><head><title>1725.PDF</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div style="page-break-before:always; page-break-after:always"><div><p>E:\M55\!\1725.fm 2003-01-01 18:15 P Tagg, IPM, University of Liverpool
</p>
<p><b>A VERY SMALL PDF FILE
&lt;/b&gt;</p>
<p><b>A VERY SMALL PDF FILE
&lt;/b&gt;</p>
<p><b>A VERY SMALL PDF FILE
&lt;/b&gt;</p>
<p><b>A VERY SMALL PDF FILE
&lt;/b&gt;</p>
<p><b>A VERY SMALL PDF FILE
&lt;/b&gt;</p>
<p><b>A VERY SMALL PDF FILE&lt;/b&gt;</p>

</div></div>
</body></html>
{noformat}
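
The output above is what one would expect if the closing tag is joined to the text before the text is HTML-escaped. The following is only a minimal sketch of that failure mode, not the actual PDFBox code; the class and method names (EscapeSketch, escape, boldBuggy, boldFixed) are hypothetical and chosen for illustration.

```java
public class EscapeSketch {
    // Hypothetical helper: escape the characters that are significant in HTML text.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Assumed buggy ordering: the close tag is appended to the text first,
    // so it passes through escape() along with the text.
    static String boldBuggy(String text) {
        return "<b>" + escape(text + "</b>");
    }

    // Assumed intended ordering: only the text is escaped, the markup is written raw.
    static String boldFixed(String text) {
        return "<b>" + escape(text) + "</b>";
    }

    public static void main(String[] args) {
        System.out.println(boldBuggy("A VERY SMALL PDF FILE"));
        System.out.println(boldFixed("A VERY SMALL PDF FILE"));
    }
}
```

In this sketch, boldBuggy reproduces the &lt;/b&gt; pattern seen in the report, while boldFixed emits a well-formed </b>. Whether the real defect has this exact shape would need to be confirmed against the PDFBOX-1213 change.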