Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2018/09/27 05:56:00 UTC
[jira] [Assigned] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler reassigned PDFBOX-4313:
------------------------------------------
Assignee: Andreas Lehmkühler
> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
> Key: PDFBOX-4313
> URL: https://issues.apache.org/jira/browse/PDFBOX-4313
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.11
> Reporter: Emilian Bold
> Assignee: Andreas Lehmkühler
> Priority: Major
> Attachments: 1536938716546.pdf, PDFBOX-4313-Test.pdf, PDFBOX-4313-Test_sorted.txt, PDFBOX-4313-Test_unsorted.txt, PDFBOX-4313.pdf, PDFBOX4313Test.java, PDFBOX4313Test.java, crop-fisa-sintetica.png, pdfbox-words.png
>
>
> I have the text "10" and "11", and they get merged into the single word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
> // test if our TextPosition starts after a new word would be expected to start
> if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
> && expectedStartOfNextWordX < positionX &&
> // only bother adding a space if the last character was not a space
> lastPosition.getTextPosition().getUnicode() != null
> && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
> {
> line.add(LineItem.getWordSeparator());
> }
> }}
> which seems to add a word separator only if the next char is "after" the current word. It never expects that the next char might be "before" the current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain PDF; it just seems that Oracle Reports generates these chunks in reverse order.
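
To illustrate the behaviour described above, here is a minimal, self-contained sketch that groups glyphs into words and optionally also breaks a word when the next glyph starts "before" the current word. It is not PDFBox code and not the project's fix: the class WordBreakSketch, the Glyph type, groupIntoWords and the fixed gapTolerance are all hypothetical names introduced for illustration, and the real PDFTextStripper derives its expected word start differently. It also assumes that in the coordinate listing the first number after each character is its x position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordBreakSketch {

    /** Hypothetical stand-in for a positioned glyph; not PDFBox's TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into words in the order they are given.
     * A word ends when the next glyph starts further right than expected
     * (the "after" case) or, if detectBackwardJump is set, when it starts
     * to the left of the current word's first glyph (the "before" case).
     * gapTolerance is an assumed fixed slack, in the same units as x.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, float gapTolerance,
                                       boolean detectBackwardJump) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float wordStartX = 0f;
        float expectedNextX = 0f;

        for (Glyph g : glyphs) {
            if (current.length() > 0) {
                boolean startsAfter = g.x > expectedNextX + gapTolerance;
                boolean startsBefore = detectBackwardJump && g.x < wordStartX;
                if (startsAfter || startsBefore) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() == 0) {
                wordStartX = g.x; // first glyph of a new word
            }
            current.append(g.text);
            expectedNextX = g.x + g.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Glyphs in the order and at the x positions listed in the report.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("1", 575.36f, 4.447998f),
                new Glyph("1", 579.752f, 4.447998f),
                new Glyph("1", 526.2f, 4.447998f),
                new Glyph("0", 530.59204f, 4.447998f));

        System.out.println(groupIntoWords(glyphs, 1.0f, false)); // forward-only check
        System.out.println(groupIntoWords(glyphs, 1.0f, true));  // also checks "before"
    }
}
```

With the forward-only check the four glyphs stay in one word, which matches the reported "1110"; with the backward check enabled they split into "11" and "10". Whether a backward jump should always end a word (for example in genuinely right-to-left text) is a design question this sketch does not try to answer.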
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)