Posted to dev@pdfbox.apache.org by "Mel Martinez (JIRA)" <ji...@apache.org> on 2010/03/15 23:28:28 UTC

[jira] Created: (PDFBOX-662) PDFTextStripper character suppression

PDFTextStripper character suppression
-------------------------------------

Key: PDFBOX-662
URL: https://issues.apache.org/jira/browse/PDFBOX-662
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.0.0
Environment: any
Reporter: Mel Martinez

When parsing the file posted as an example for PDFBOX-659, I noticed that numerous characters were missing from the extracted text.

They are getting 'suppressed' in the PDFTextStripper.processTextPosition(TextPosition) method, in a section that is meant to filter out duplicate characters found in some MS Word-generated documents.

The problem is that the filter is over-zealous (in the case of this document) and matches real characters against other real characters in the text. For example, given the text:

This is some text that has the letter 'e' in it multiple times.

The filter might match one of the later 'e's to an earlier 'e' incorrectly (for example, the one at the end of 'some'), resulting in the extracted text:

This is some text that has the letter 'e' in it multiple tims.

From what I can tell, this is because it is using the raw, padded coordinates rather than resolved coordinates.

The example PDF document (see PDFBOX-659) has pages that use both positive and negative raw coordinates that upon my cursory inspection don't always resolve on the same offset point.

The suppression test logic compares TextPosition elements that seem to have different offsets, possibly due to different amounts of padding. Thus the 'overlap' that it detects is wrong. It's not comparing apples to apples.
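
To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is not PDFBox code: Glyph, layout, resolve and filter are hypothetical names, and the 10-unit advance, the 1-unit tolerance and the 100-unit offset are invented for illustration. It only models the idea of a "same character at nearly the same position" filter, and shows how two runs whose raw coordinates are relative to different offsets can produce a false overlap that disappears once both runs are resolved to a common origin.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned character; NOT PDFBox's TextPosition. */
final class Glyph {
    final char ch;
    final float x;
    final float y;

    Glyph(char ch, float x, float y) {
        this.ch = ch;
        this.x = x;
        this.y = y;
    }
}

final class DuplicateFilterSketch {

    /** Places each character of text on one line, advancing x by a fixed step. */
    static List<Glyph> layout(String text, float originX, float y, float advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), originX + i * advance, y));
        }
        return glyphs;
    }

    /** Returns a copy of the glyphs shifted by the run's offset (assumed to be known). */
    static List<Glyph> resolve(List<Glyph> glyphs, float offsetX) {
        List<Glyph> shifted = new ArrayList<>();
        for (Glyph g : glyphs) {
            shifted.add(new Glyph(g.ch, g.x + offsetX, g.y));
        }
        return shifted;
    }

    /** Drops a glyph when an already kept glyph has the same character within tolerance. */
    static String filter(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.ch == g.ch
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> first = layout("letter", 0f, 0f, 10f);
        List<Glyph> second = layout("times", -20f, 0f, 10f); // raw x, negative origin

        // Raw coordinates: the 'e' of "times" lands on the first 'e' of "letter".
        List<Glyph> raw = new ArrayList<>(first);
        raw.addAll(second);
        System.out.println(filter(raw, 1f)); // prints "lettertims"

        // Resolved coordinates: the second run is shifted by its (assumed) offset.
        List<Glyph> resolved = new ArrayList<>(first);
        resolved.addAll(resolve(second, 100f));
        System.out.println(filter(resolved, 1f)); // prints "lettertimes"
    }
}
```

In this toy model the filter itself is unchanged between the two calls; only the coordinate space of its input differs, which is the distinction I suspect matters here.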

The document renders perfectly in Acrobat, so I believe we are not handling the coordinates correctly.

A workaround is to disable the filtering by calling

PDFTextStripper.setSuppressDuplicateOverlappingText(boolean)

with false. But that just hides the fact that the logic is wrong.
