Posted to dev@pdfbox.apache.org by "Cheng Leong (JIRA)" <ji...@apache.org> on 2014/01/22 22:48:19 UTC
[jira] [Created] (PDFBOX-1860) HTML converter escapes formatting close tags
Cheng Leong created PDFBOX-1860:
-----------------------------------
Summary: HTML converter escapes formatting close tags
Key: PDFBOX-1860
URL: https://issues.apache.org/jira/browse/PDFBOX-1860
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.3
Reporter: Cheng Leong
Priority: Minor
Attachments: pdftest.pdf
This bug was introduced in 1.8.3 by PDFBOX-1213 (HTML style information).
Bold style tags are opened correctly, but the close tags are HTML-escaped.
{noformat}
~/work/pdfbox ((1.8.3))$ java -jar app/target/pdfbox-app-1.8.3.jar ExtractText -html -nonSeq -console pdftest.pdf
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><head><title>1725.PDF</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div style="page-break-before:always; page-break-after:always"><div><p>E:\M55\!\1725.fm 2003-01-01 18:15 P Tagg, IPM, University of Liverpool
</p>
<p><b>A VERY SMALL PDF FILE
</b></p>
<p><b>A VERY SMALL PDF FILE
</b></p>
<p><b>A VERY SMALL PDF FILE
</b></p>
<p><b>A VERY SMALL PDF FILE
</b></p>
<p><b>A VERY SMALL PDF FILE
</b></p>
<p><b>A VERY SMALL PDF FILE</b></p>
</div></div>
</body></html>
{noformat}
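The symptom described above can be illustrated with a minimal, self-contained sketch. It is not PDFBox code: the class and method names (CloseTagEscapeSketch, escape, renderBuggy, renderFixed) are hypothetical, and the assumed cause (the close tag being passed through the same escaping path as the extracted text) is a guess for illustration, not a confirmed diagnosis.

```java
// Hypothetical illustration only; none of these names exist in PDFBox.
class CloseTagEscapeSketch {

    // Escape the characters that are special in HTML text content.
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Assumed failure mode: the open tag is written raw, but the close tag
    // is routed through escape() together with the text.
    public static String renderBuggy(String text) {
        return "<p><b>" + escape(text) + escape("</b>") + "</p>";
    }

    // Expected behaviour: only the extracted text is escaped; markup is written raw.
    public static String renderFixed(String text) {
        return "<p><b>" + escape(text) + "</b></p>";
    }

    public static void main(String[] args) {
        System.out.println(renderBuggy("A VERY SMALL PDF FILE"));
        System.out.println(renderFixed("A VERY SMALL PDF FILE"));
    }
}
```

In this sketch the buggy variant emits the close tag as &lt;/b&gt;, so a browser would display it as literal text and leave the bold element unclosed, while the fixed variant emits a well-formed bold element.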