Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2014/10/13 19:43:34 UTC
[jira] [Closed] (PDFBOX-662) PDFTextStripper character suppression

     [ https://issues.apache.org/jira/browse/PDFBOX-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-662.
-------------------------------------
    Resolution: Fixed
      Assignee: Andreas Lehmkühler

Text extraction for the PDF in question has worked correctly since 1.4.0.

> PDFTextStripper character suppression
> -------------------------------------
>
>                 Key: PDFBOX-662
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-662
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.0.0
>         Environment: any
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.4.0
>
>
> When parsing the file posted as an example for PDFBOX-659, I noticed that numerous characters were missing from the extracted text.
> They are getting 'suppressed' in the PDFTextStripper.processTextPosition(TextPosition) method, in a section that is meant to filter out duplicate chars found in some MS Word-generated documents.
> The problem is that the filter is over-zealous (in the case of this document) and matches real characters against other real characters in the text.  Example:
>    This is some text that has the letter 'e' in it multiple times.
> The filter might match one of the later 'e's to an earlier 'e' incorrectly (for example, the one at the end of 'some'), resulting in the extracted text:
>    This is some text that has the letter 'e' in it multiple tims.
>
> From what I can tell this is because it is using the raw, padded coordinates rather than resolved coordinates.
> The example PDF document (see PDFBOX-659) has pages that use both positive and negative raw coordinates that, upon my cursory inspection, don't always resolve against the same offset point.
> The suppression test logic compares TextPosition elements that seem to have different offsets, possibly due to different amounts of padding.  Thus the 'overlap' that it detects is wrong.  It's not comparing apples to apples.
> The document renders perfectly in Acrobat, so I believe we are not handling the coordinates correctly.
> A workaround is to disable the filtering by calling
> PDFTextStripper.setSuppressDuplicateOverlappingText(boolean)
> with false.  But that just hides the fact that the logic is wrong.
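
The failure mode described in the quoted report can be illustrated with a small, self-contained Java sketch. It is a simplified model and not PDFBox's actual code: the class DuplicateFilterSketch, the Glyph type, the per-run offset fields, the coordinates, and the tolerance value are all hypothetical and chosen only for illustration. The sketch drops a glyph when an earlier glyph with the same character lies within a tolerance of it, and it can compare either raw coordinates or coordinates resolved against each run's offset. The suppressDuplicates flag plays the role of the workaround mentioned in the report.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of a duplicate-glyph filter; NOT the PDFBox implementation.
class DuplicateFilterSketch {

    // A glyph with raw coordinates plus the (hypothetical) offset of its text run.
    static final class Glyph {
        final char ch;
        final float rawX, rawY, offsetX, offsetY;

        Glyph(char ch, float rawX, float rawY, float offsetX, float offsetY) {
            this.ch = ch;
            this.rawX = rawX;
            this.rawY = rawY;
            this.offsetX = offsetX;
            this.offsetY = offsetY;
        }

        float x(boolean resolved) { return resolved ? rawX + offsetX : rawX; }
        float y(boolean resolved) { return resolved ? rawY + offsetY : rawY; }
    }

    // Returns the extracted text. A glyph is dropped when filtering is enabled
    // and an earlier kept glyph with the same character lies within 'tolerance'.
    static String extract(List<Glyph> glyphs, boolean suppressDuplicates,
                          boolean resolved, float tolerance) {
        Map<Character, List<Glyph>> seen = new HashMap<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            List<Glyph> sameChar = seen.computeIfAbsent(g.ch, k -> new ArrayList<>());
            boolean duplicate = false;
            if (suppressDuplicates) {
                for (Glyph prev : sameChar) {
                    if (Math.abs(prev.x(resolved) - g.x(resolved)) < tolerance
                            && Math.abs(prev.y(resolved) - g.y(resolved)) < tolerance) {
                        duplicate = true;
                        break;
                    }
                }
            }
            if (!duplicate) {
                sameChar.add(g);
                out.append(g.ch);
            }
        }
        return out.toString();
    }

    // Two runs, "some" and "times", whose raw coordinates use different offsets.
    // The 'e' of each run shares the same raw position but not the same resolved one.
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        String first = "some";
        float[] firstX = {0f, 10f, 20f, 30f};
        for (int i = 0; i < first.length(); i++) {
            glyphs.add(new Glyph(first.charAt(i), firstX[i], 0f, 0f, 0f));
        }
        String second = "times";
        float[] secondX = {8f, 14f, 17f, 30f, 40f};
        for (int i = 0; i < second.length(); i++) {
            glyphs.add(new Glyph(second.charAt(i), secondX[i], 0f, 50f, 0f));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = sample();
        System.out.println(extract(glyphs, true, false, 1.0f));  // raw coordinates: sometims
        System.out.println(extract(glyphs, true, true, 1.0f));   // resolved coordinates: sometimes
        System.out.println(extract(glyphs, false, false, 1.0f)); // filter disabled: sometimes
    }
}
```

With raw coordinates the second 'e' is wrongly treated as a duplicate of the first, which mirrors the "tims" example in the report. Resolving both runs against their offsets before comparing, or disabling the filter entirely, keeps the character.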



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)