Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/06/19 08:39:00 UTC
[jira] [Closed] (PDFBOX-4881) Is it possible to properly extract text from this pdf?
[ https://issues.apache.org/jira/browse/PDFBOX-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-4881.
-----------------------------------
Resolution: Won't Do
Closing this one, as proper text extraction from this file is almost impossible. An ideal solution would be to reconstruct /ToUnicode by running OCR on the individual glyphs or, better, by using the OCR results to build a large database of glyph outlines. This would be a nice project for companies offering OCR as a service, because it would give better text extraction than "pure" OCR.
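To make the glyph-outline database idea more concrete, here is a minimal sketch in plain Java. It is not PDFBox code and not an existing API: the class ToUnicodeSketch, the outline "fingerprint" strings, and the methods learn, rebuild and decode are all hypothetical names chosen for illustration. It assumes that some earlier step (not shown) has already run OCR on rendered glyphs and reduced each outline to a stable fingerprint. Given that, a per-font table from character code to Unicode, i.e. a stand-in for the missing /ToUnicode map, might be rebuilt roughly like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rebuilds a code -> Unicode table for a font from a
// database of glyph outlines that OCR has already labelled.
final class ToUnicodeSketch {

    // Assumed database: outline fingerprint -> Unicode text for that glyph.
    private final Map<String, String> outlineToUnicode = new LinkedHashMap<>();

    // Record one OCR result: this outline was recognised as this text.
    void learn(String outlineFingerprint, String unicode) {
        outlineToUnicode.put(outlineFingerprint, unicode);
    }

    // For one font, map each character code to Unicode via its outline.
    // Codes whose outline is not in the database are left out.
    Map<Integer, String> rebuild(Map<Integer, String> codeToOutline) {
        Map<Integer, String> toUnicode = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : codeToOutline.entrySet()) {
            String unicode = outlineToUnicode.get(e.getValue());
            if (unicode != null) {
                toUnicode.put(e.getKey(), unicode);
            }
        }
        return toUnicode;
    }

    // Decode a run of character codes; unmapped codes become U+FFFD.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToUnicodeSketch db = new ToUnicodeSketch();
        db.learn("fp-seen", "\u0633"); // ARABIC LETTER SEEN
        db.learn("fp-lam", "\u0644");  // ARABIC LETTER LAM

        // Made-up font: codes 1 and 2 have known outlines, code 3 does not.
        Map<Integer, String> font = new LinkedHashMap<>();
        font.put(1, "fp-seen");
        font.put(2, "fp-lam");
        font.put(3, "fp-unknown");

        Map<Integer, String> toUnicode = db.rebuild(font);
        System.out.println(toUnicode.keySet());
        System.out.println(decode(new int[] {1, 2, 3}, toUnicode).length());
    }
}
```

The hard parts are exactly the ones this sketch assumes away: computing a fingerprint that is stable across fonts and subsets, and handling the contextual forms and ligatures that Farsi text uses.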
> Is it possible to properly extract text from this pdf?
> ------------------------------------------------------
>
> Key: PDFBOX-4881
> URL: https://issues.apache.org/jira/browse/PDFBOX-4881
> Project: PDFBox
> Issue Type: Wish
> Reporter: Alfred
> Priority: Trivial
> Attachments: Farsi.pdf
>
>
> This PDF has Farsi characters, but the character codes are probably wrong and there is probably no mapping table.
> If there is any work to be done to support Farsi, I would be happy to do it myself; I just need a pointer in the right direction.
>
> Thank you!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)