Posted to dev@pdfbox.apache.org by "Thierry Guérin (Jira)" <ji...@apache.org> on 2020/10/26 15:26:00 UTC
[jira] [Created] (PDFBOX-5002) PDFTextStripper sometimes fuses two words on different lines
Thierry Guérin created PDFBOX-5002:
--------------------------------------
Summary: PDFTextStripper sometimes fuses two words on different lines
Key: PDFBOX-5002
URL: https://issues.apache.org/jira/browse/PDFBOX-5002
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.21
Reporter: Thierry Guérin
Fix For: 2.0.22
Attachments: small&Big.pdf
This happens when text in a large font is followed by at least two lines of text in a smaller font: the last word of the first line is merged with the first word of the second line.
On the attached PDF, the extracted text is:
{noformat}
(...) some text awith smaller font (...)
{noformat}
instead of:
{noformat}
(...) some text with a smaller font (...)
{noformat}
I often encounter this kind of problem on invoices, where the company address (small text at the top right) is next to the company name & logo (big centered text at the top).
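To make the symptom easy to check automatically, here is a minimal sketch of a helper that flags fused words by comparing the extracted text against the expected text. The class name FusedWordCheck and the method findFusedTokens are hypothetical and are not part of PDFBox; the two sample strings are the ones quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: it only compares two strings.
public class FusedWordCheck {

    // Returns the tokens of `actual` that are not words of `expected` but can
    // be split into two pieces that are both words of `expected`.
    static List<String> findFusedTokens(String expected, String actual) {
        Set<String> words =
                new LinkedHashSet<>(Arrays.asList(expected.trim().split("\\s+")));
        List<String> fused = new ArrayList<>();
        for (String token : actual.trim().split("\\s+")) {
            if (words.contains(token)) {
                continue; // token is a legitimate word of the expected text
            }
            // Try every split point; stop at the first one that yields two known words.
            for (int i = 1; i < token.length(); i++) {
                if (words.contains(token.substring(0, i))
                        && words.contains(token.substring(i))) {
                    fused.add(token);
                    break;
                }
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        String expected = "some text with a smaller font";
        String actual = "some text awith smaller font";
        System.out.println(findFusedTokens(expected, actual)); // prints [awith]
    }
}
```

In a real regression test, the `actual` string would presumably come from calling `getText` on a `PDFTextStripper` for the attached PDF, and the test would assert that the returned list is empty.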