Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/06/17 21:47:10 UTC

[jira] [Closed] (PDFBOX-1919) Span tags are not implemented

     [ https://issues.apache.org/jira/browse/PDFBOX-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-1919.
-------------------------------

    Resolution: Won't Fix

Closing as "won't fix", as it doesn't look like there's anything to fix here. If anyone is interested in extracting marked content, don't forget about PDFMarkedContentExtractor.

> Span tags are not implemented
> -----------------------------
>
>                 Key: PDFBOX-1919
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1919
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>            Reporter: Corentin Regal
>         Attachments: PDFBOX-1919.AdobeReader.txt, PDFBOX-1919.pdf, PDFBOX-1919.txt
>
>
> The font descriptor flags are not set.
> They are described in the "PDF Reference 1.7", section 5.7.1, Font Descriptor Flags.
> The methods in PDFontDescriptor are ready but never called:
> setFlags()
> setSerif()
> setAllCap(), which is used in a lot of PDFs
> ...
> I saw some TODOs in the code that relate to this issue. Is it planned to be implemented soon?
> ---
> UPDATE: This issue turned out to be caused by span tags and how Adobe Acrobat handles copy & paste. Ultimately the PDF contains poor quality ToUnicode mappings and poor quality span tags, so there's no real fix.
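
For readers unfamiliar with the font descriptor flags mentioned in the original report, here is a minimal sketch of how such a bit field can be read and written. It is only an illustration: the class FontDescriptorFlags and its set()/isSet() helpers are hypothetical names and are not PDFBox's API. The bit positions follow section 5.7.1 of the PDF Reference 1.7, where bits are numbered from 1.

```java
// Hypothetical helper, not part of PDFBox: models the /Flags entry of a
// font descriptor as described in PDF Reference 1.7, section 5.7.1.
final class FontDescriptorFlags {
    // The reference numbers bits from 1, so bit n has the mask 1 << (n - 1).
    static final int FIXED_PITCH = 1 << 0;   // bit 1
    static final int SERIF       = 1 << 1;   // bit 2
    static final int SYMBOLIC    = 1 << 2;   // bit 3
    static final int SCRIPT      = 1 << 3;   // bit 4
    static final int NONSYMBOLIC = 1 << 5;   // bit 6
    static final int ITALIC      = 1 << 6;   // bit 7
    static final int ALL_CAP     = 1 << 16;  // bit 17
    static final int SMALL_CAP   = 1 << 17;  // bit 18
    static final int FORCE_BOLD  = 1 << 18;  // bit 19

    private FontDescriptorFlags() {
    }

    // Returns a copy of flags with the given mask switched on or off.
    static int set(int flags, int mask, boolean on) {
        return on ? (flags | mask) : (flags & ~mask);
    }

    // Reports whether every bit in mask is set in flags.
    static boolean isSet(int flags, int mask) {
        return (flags & mask) == mask;
    }

    public static void main(String[] args) {
        int flags = 0;
        flags = set(flags, SERIF, true);
        flags = set(flags, NONSYMBOLIC, true);
        flags = set(flags, ALL_CAP, true);
        System.out.println("Flags value: " + flags);
        System.out.println("AllCap set: " + isSet(flags, ALL_CAP));
        System.out.println("Italic set: " + isSet(flags, ITALIC));
    }
}
```

As the update above notes, the flags turned out not to be the cause of this particular issue, so the sketch is background only.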



--
This message was sent by Atlassian JIRA
(v6.2#6252)