Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/06/19 08:39:00 UTC

[jira] [Closed] (PDFBOX-4881) Is it possible to properly extract text from this pdf?

     [ https://issues.apache.org/jira/browse/PDFBOX-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-4881.
-----------------------------------
    Resolution: Won't Do

Closing this one, as proper text extraction from this file is almost impossible. An ideal solution would be to reconstruct the /ToUnicode mapping by running OCR on the individual glyphs, or better, by using the OCR results to build a large database of glyph outlines. This would be a nice project for companies offering OCR as a service, because the result would be better text extraction than "pure" OCR gives.
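
To make the glyph outline database idea a bit more concrete, here is a minimal sketch in plain Java. It is not PDFBox code and nothing like it exists in the project: the class GlyphOutlineDb, its methods, the grid size and the sample outlines are all made up for illustration. It assumes the outline of each glyph is already available as a list of points, and that some earlier step (for example OCR on the rendered glyph) has supplied the Unicode text for outlines that were seen before.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are PDFBox API.
final class GlyphOutlineDb {
    // Outlines are scaled into a GRID x GRID box before they are compared.
    private static final int GRID = 64;

    // Fingerprint of a normalized outline -> Unicode text learned earlier (e.g. from OCR).
    private final Map<String, String> known = new HashMap<>();

    /** Builds a position- and scale-independent key from {x, y} points in drawing order. */
    static String fingerprint(double[][] outline) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] p : outline) {
            minX = Math.min(minX, p[0]);
            maxX = Math.max(maxX, p[0]);
            minY = Math.min(minY, p[1]);
            maxY = Math.max(maxY, p[1]);
        }
        // Scale by the larger bounding-box side so the aspect ratio is preserved.
        double size = Math.max(maxX - minX, maxY - minY);
        if (size == 0) {
            size = 1; // degenerate outline consisting of a single point
        }
        StringBuilder key = new StringBuilder();
        for (double[] p : outline) {
            long qx = Math.round((p[0] - minX) / size * GRID);
            long qy = Math.round((p[1] - minY) / size * GRID);
            key.append(qx).append(',').append(qy).append(';');
        }
        return key.toString();
    }

    /** Records the Unicode text for an outline that has been identified. */
    void learn(double[][] outline, String unicode) {
        known.put(fingerprint(outline), unicode);
    }

    /** Maps each character code of a font to Unicode where the outline is known. */
    Map<Integer, String> reconstruct(Map<Integer, double[][]> fontGlyphs) {
        Map<Integer, String> toUnicode = new LinkedHashMap<>();
        for (Map.Entry<Integer, double[][]> e : fontGlyphs.entrySet()) {
            String text = known.get(fingerprint(e.getValue()));
            if (text != null) {
                toUnicode.put(e.getKey(), text);
            }
            // Otherwise the glyph is unknown and would have to go through OCR first.
        }
        return toUnicode;
    }

    public static void main(String[] args) {
        GlyphOutlineDb db = new GlyphOutlineDb();
        // Made-up triangle standing in for a real glyph outline.
        db.learn(new double[][] {{0, 0}, {10, 0}, {5, 10}}, "\u0627");

        Map<Integer, double[][]> font = new HashMap<>();
        // Same shape, moved and scaled, stored under an arbitrary character code.
        font.put(0x21, new double[][] {{100, 50}, {120, 50}, {110, 70}});
        // A square that the database has never seen.
        font.put(0x22, new double[][] {{0, 0}, {1, 0}, {1, 1}, {0, 1}});

        System.out.println(db.reconstruct(font));
    }
}
```

Exact matching on a quantized point list is the simplest thing that could work and is certainly too strict for real fonts, where the same letter is drawn with different point counts and curve segments; a real database would need a fuzzier shape comparison. Glyphs with empty outlines (such as spaces) all get the same key here, and the sketch also ignores that Farsi letters change shape depending on their position in a word.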

> Is it possible to properly extract text from this pdf?
> ------------------------------------------------------
>
>                 Key: PDFBOX-4881
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4881
>             Project: PDFBox
>          Issue Type: Wish
>            Reporter: Alfred
>            Priority: Trivial
>         Attachments: Farsi.pdf
>
>
> This PDF has Farsi characters, but the char codes are probably wrong and there is probably no mapping table.
> If there's any work to be done to support Farsi, I would be happy to do it myself; I just need a pointer in the right direction.
>  
> Thank you!


