Posted to dev@tika.apache.org by "Dmytro Sadovnychyi (JIRA)" <ji...@apache.org> on 2018/08/16 19:51:00 UTC

[jira] [Created] (TIKA-2708) Bold text is omitted from PDF documents

Dmytro Sadovnychyi created TIKA-2708:
----------------------------------------

             Summary: Bold text is omitted from PDF documents
                 Key: TIKA-2708
                 URL: https://issues.apache.org/jira/browse/TIKA-2708
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.18
            Reporter: Dmytro Sadovnychyi
         Attachments: 10fu7MbhtFooYKpV2M9XBW.pdf

When I run `java -jar pdfbox-app-2.0.9.jar ExtractText -html 10fu7MbhtFooYKpV2M9XBW.pdf result.html`, the bold text appears inside "<b>" tags, whereas the HTML produced by Tika Server 1.18 omits those tags. Is this expected, and is there any way to make Tika's output match PDFBox's?

A sample PDF is attached; the question concerns the first line, which contains "Exhibit 10.2".
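
One rough way to confirm the difference is to list the text that each HTML output wraps in "<b>" tags. The sketch below uses only the Python standard library; the helper name `bold_runs` and the two inline HTML strings are hypothetical stand-ins, not the actual PDFBox or Tika output.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Collects the text found inside <b> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # nesting depth of currently open <b> tags
        self._buffer = []     # text pieces seen inside the current <b> run
        self.bold_runs = []   # completed bold runs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.bold_runs.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def bold_runs(html):
    """Return the list of text runs wrapped in <b> tags (hypothetical helper)."""
    collector = BoldTextCollector()
    collector.feed(html)
    collector.close()
    return collector.bold_runs


# Illustrative fragments only; the real outputs would be read from files.
pdfbox_like = "<p><b>Exhibit 10.2</b></p>"
tika_like = "<p>Exhibit 10.2</p>"

print("PDFBox-style bold runs:", bold_runs(pdfbox_like))
print("Tika-style bold runs:", bold_runs(tika_like))
```

In practice the two strings would be replaced with the contents of result.html and of the HTML saved from Tika Server, so the bold runs missing from the Tika output can be seen directly.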



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)