Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/10/27 19:45:00 UTC

[jira] [Comment Edited] (PDFBOX-5002) PDFTextStripper sometimes fuses two words on different lines

    [ https://issues.apache.org/jira/browse/PDFBOX-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221702#comment-17221702 ] 

Tilman Hausherr edited comment on PDFBOX-5002 at 10/27/20, 7:44 PM:
--------------------------------------------------------------------

Seems nice. I need to review the results (differences) of the tests with my own, bigger test set.

The different extraction in the "EU" file could be problematic (although the result looks better). This is a test file from the Tabula project (there are many, but I kept that one as an early indicator of trouble). They don't want any extraction differences.

The good thing is that the {{testTabula()}} test passes (it uses a different algorithm to get font heights). But I'd need to test the Tabula build too, which has more tests.
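
For illustration only, here is a minimal sketch of how "extraction differences" between an expected text and a newly extracted text could be listed line by line. It uses only the Java standard library and is not part of PDFBox or of the Tabula test suite; the class name {{ExtractionDiff}} and the method {{diffLines}} are hypothetical, and the sample strings are the ones quoted in the issue description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not PDFBox or Tabula code: reports which lines of an
// extracted text differ from the expected text.
public class ExtractionDiff {

    /** Returns one message per line position where the two texts differ. */
    public static List<String> diffLines(String expected, String actual) {
        // "\\R" matches any line break; limit -1 keeps trailing empty lines.
        String[] exp = expected.split("\\R", -1);
        String[] act = actual.split("\\R", -1);
        List<String> diffs = new ArrayList<>();
        int max = Math.max(exp.length, act.length);
        for (int i = 0; i < max; i++) {
            // A line missing on one side is represented by null.
            String e = i < exp.length ? exp[i] : null;
            String a = i < act.length ? act[i] : null;
            if (!Objects.equals(e, a)) {
                diffs.add("line " + (i + 1) + ": expected [" + e + "] but got [" + a + "]");
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Sample strings taken from the issue description.
        String expected = "(...) some text with a smaller font (...)";
        String actual = "(...) some text awith smaller font (...)";
        diffLines(expected, actual).forEach(System.out::println);
    }
}
```

An empty result would mean the two extractions are identical line for line, which is the condition a "no extraction differences" policy asks for.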



> PDFTextStripper sometimes fuses two words on different lines
> ------------------------------------------------------------
>
>                 Key: PDFBOX-5002
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5002
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.21
>            Reporter: Thierry Guérin
>            Priority: Minor
>             Fix For: 2.0.22
>
>         Attachments: small&Big.pdf
>
>
> This happens when a text in a big font is followed by at least two lines of text in a smaller font: the last word of the first line is merged with the first word of the second line.
> On the attached PDF, the extracted text is:
> {noformat}
> (...) some text awith smaller font (...)
> {noformat}
> instead of:
> {noformat}
> (...) some text with a smaller font (...)
> {noformat}
> I often encounter this kind of problem on invoices, where the company address (small text at the top right) is next to the company name & logo (big centered text at the top).
>  


