Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/05/24 19:05:00 UTC

[jira] [Commented] (TIKA-2650) Soft-hyphen is not extracted properly

    [ https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489606#comment-16489606 ] 

Tim Allison commented on TIKA-2650:
-----------------------------------

Can you share with us exactly where the soft-hyphen isn't working?  I see it working sometimes.  Note that there is often a difference between the text as displayed and the text that is electronically stored (OCR'd?) within the PDF.
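One way to pin down where the soft hyphen goes missing is to scan the extracted text for U+00AD and print a little context around each hit. The following is only a minimal sketch: it assumes the extracted text is already available as a String (however it was obtained), and the class and method names are hypothetical, not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: locates soft hyphens (U+00AD)
// in already-extracted text so the exact failing spot can be reported.
public class SoftHyphenProbe {
    static final char SOFT_HYPHEN = '\u00AD';

    /** Returns the character offset of every soft hyphen in the text. */
    public static List<Integer> softHyphenOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == SOFT_HYPHEN) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    /** Returns the text around an offset, with invisible characters made visible. */
    public static String context(String text, int offset, int radius) {
        int start = Math.max(0, offset - radius);
        int end = Math.min(text.length(), offset + radius + 1);
        return text.substring(start, end)
                .replace("\u00AD", "<U+00AD>")
                .replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Made-up sample: first break keeps the soft hyphen, second has a space instead.
        String extracted = "won\u00AD\nderful and won \nderful";
        for (int offset : softHyphenOffsets(extracted)) {
            System.out.println(offset + ": " + context(extracted, offset, 5));
        }
    }
}
```

Running the same check on the text copied out of a PDF viewer and on the extracted output would show whether the soft hyphen is present in the stored text at all or is being lost during extraction.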

> Soft-hyphen is not extracted properly
> -------------------------------------
>
>                 Key: TIKA-2650
>                 URL: https://issues.apache.org/jira/browse/TIKA-2650
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Saurabh Patil
>            Priority: Blocker
>         Attachments: Peter Rabbit.pdf
>
>
> We are trying to extract text from a PDF. If the PDF has a long word at the end of a line, the word is split: a soft hyphen follows the first half and the rest of the word continues on the next line. But while extracting this text, Tika automatically replaces the hyphen with a space.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)