Posted to users@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2023/08/12 14:11:08 UTC

Re: PDFbox & soft hyphens

On 12.08.2023 16:03, tika@cid.is wrote:
> Hi all,
>
> [PDFBOX-371] was about how PDFBox treats soft hyphens when
> extracting text from a PDF.
> It looks like PDFBox does _no_ treatment of soft hyphens; at
> least I did not find any information about it.
> Please prove me wrong or give me a hint on how to get soft hyphens out of
> a PDF as soft hyphens (that is, as an "eccentric" Unicode character or an
> "eccentric" string).
> Thanks
> Walter Claassen 


There have been some issues about this over the years; see

https://issues.apache.org/jira/browse/TIKA-3314 (I just resolved it,
but it was fixed long ago)

and

https://issues.apache.org/jira/browse/PDFBOX-5115

Please test with the file attached there or with your own; if you're not
satisfied with the result, upload your file to a file-sharing site and
post the URL.

Tilman
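
When testing as suggested above, it can help to check whether the extracted
text actually contains the soft hyphen character (U+00AD), since it is
usually invisible when printed. The following is a minimal sketch, not part
of the original message: the class and method names (SoftHyphenCheck,
softHyphenOffsets, makeVisible) are hypothetical, and it assumes the text
has already been extracted into a String, for example with
PDFTextStripper.getText(document).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting extracted text; not part of PDFBox.
public class SoftHyphenCheck {
    // U+00AD SOFT HYPHEN
    static final char SOFT_HYPHEN = '\u00AD';

    // Returns the char offsets of every soft hyphen in the given text.
    static List<Integer> softHyphenOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == SOFT_HYPHEN) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    // Replaces each soft hyphen with a visible marker for debugging output.
    static String makeVisible(String text) {
        return text.replace(String.valueOf(SOFT_HYPHEN), "<SHY>");
    }

    public static void main(String[] args) {
        // Assumption: in real use this string would come from the
        // text extraction step instead of a literal.
        String sample = "hy\u00ADphen\u00ADation";
        System.out.println(softHyphenOffsets(sample)); // [2, 7]
        System.out.println(makeVisible(sample));       // hy<SHY>phen<SHY>ation
    }
}
```

If the offsets list is empty for a file that is known to contain soft
hyphens, that file is a good candidate to upload and report.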