Posted to dev@tika.apache.org by "Dmytro Sadovnychyi (JIRA)" <ji...@apache.org> on 2018/08/16 19:51:00 UTC
[jira] [Created] (TIKA-2708) Bold text is omitted from PDF documents
Dmytro Sadovnychyi created TIKA-2708:
----------------------------------------
Summary: Bold text is omitted from PDF documents
Key: TIKA-2708
URL: https://issues.apache.org/jira/browse/TIKA-2708
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.18
Reporter: Dmytro Sadovnychyi
Attachments: 10fu7MbhtFooYKpV2M9XBW.pdf
When I run `java -jar pdfbox-app-2.0.9.jar ExtractText -html 10fu7MbhtFooYKpV2M9XBW.pdf result.html`, the bold text appears inside "<b>" tags, whereas the HTML produced by Tika Server 1.18 omits those tags. Is this expected, and is there any way to make Tika's output match the PDFBox result?
A sample PDF is attached; the question concerns its first line, which contains "Exhibit 10.2".
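To make the difference easy to reproduce, here is a minimal sketch of how the two HTML outputs could be compared. It lists the text found inside "<b>" tags in each file. The class name `BoldRuns` and the file name `tika.html` are hypothetical and are not part of Tika or PDFBox; the sketch assumes the Tika Server response has already been saved to a file, and it uses a simple regular expression rather than a real HTML parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this report; not part of Tika or PDFBox.
class BoldRuns {
    // Matches <b ...>...</b> non-greedily, ignoring case and spanning line breaks.
    // The \b after "b" keeps tags such as <body> and <br> from matching.
    private static final Pattern BOLD =
        Pattern.compile("<b\\b[^>]*>(.*?)</b>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of every non-empty <b> element, in document order.
    static List<String> extract(String html) {
        List<String> runs = new ArrayList<>();
        Matcher m = BOLD.matcher(html);
        while (m.find()) {
            // Drop any nested tags, then collapse whitespace.
            String text = m.group(1)
                .replaceAll("<[^>]+>", "")
                .replaceAll("\\s+", " ")
                .trim();
            if (!text.isEmpty()) {
                runs.add(text);
            }
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BoldRuns.java <pdfbox-html> <tika-html>");
            return;
        }
        for (String path : args) {
            String html = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
            System.out.println(path + ": " + extract(html));
        }
    }
}
```

Running it as `java BoldRuns.java result.html tika.html` (Java 11 or later) prints one list per file. An empty list for the Tika file alongside a non-empty list for the PDFBox file would illustrate the behavior described above.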
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)