Posted to dev@tika.apache.org by "Ali Majdzadeh Kohbanani (JIRA)" <ji...@apache.org> on 2012/10/17 22:44:03 UTC
[jira] [Commented] (TIKA-713) Tika can not parse all of the persian
pdf files
[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478353#comment-13478353 ]
Ali Majdzadeh Kohbanani commented on TIKA-713:
----------------------------------------------
Ahmad,
Could you please explain how Complex.pdf was generated? Which tool was used to create the file, which fonts, and was there any specific configuration? I have tested PDFBox on Complex.pdf and it extracts the text very well. By contrast, any other PDF file that I test for text extraction using PDFBox has lots of errors. I have tested PDF files created with PDFCreator and with the "Save as PDF" plugin in MS Word. In the first case, the extracted text contains only junk characters; in the second, some glyphs and ligatures are extracted incorrectly. I have filed a bug report for PDFBox, but in order to test PDFBox further, I would like to know more about the method used to create Complex.pdf. Thanks a lot.
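For comparing extraction results across test files, a rough check like the following may help separate "junk characters" from usable Persian output. It is only a sketch and is not part of PDFBox or Tika: the function names and the 0.5 threshold are assumptions made for illustration. It measures what fraction of the letters in the extracted text belong to the Arabic script, which Persian uses.

```python
import unicodedata


def arabic_script_ratio(text: str) -> float:
    """Return the fraction of letters in `text` whose Unicode name starts with ARABIC."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # No letters at all (only digits, punctuation, spaces) counts as zero.
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)


def looks_like_failed_extraction(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: flag text from a Persian PDF that is mostly non-Arabic-script letters.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return arabic_script_ratio(text) < threshold


if __name__ == "__main__":
    # Hypothetical usage: pass in text already extracted by some other tool.
    for sample in ["خوردن عسل", "92 @A 8 * B"]:
        print(repr(sample), looks_like_failed_extraction(sample))
```

A check like this only detects output that has lost the script entirely; it would not notice the subtler case where individual glyphs or ligatures are mapped to the wrong Arabic-script characters.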
> Tika can not parse all of the persian pdf files
> -----------------------------------------------
>
> Key: TIKA-713
> URL: https://issues.apache.org/jira/browse/TIKA-713
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Ahmad Ajiloo
> Attachments: Complex.pdf, ebrat.pdf, Simple2.pdf, Simple3.pdf
>
>
> Hello
> I used Tika (within Nutch) to parse some Persian PDF files. Some of the files were converted to plain text correctly, but for others the output was corrupted. I used the ICU4J v4 library and the text changed to right-to-left mode, but that did not resolve the problem: Tika cannot recognize any character of the input Persian PDF file!
> {quote}
> I copied this text from my PDF file via Document Viewer in Linux; it is clearly Persian text:
> --------------------------
> هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان" بخواند.
> ) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه است. (
> همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت حافظه مفيد است:
> 1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي
> 4- خوردن عسل 5- خوردن عدس 6- خوردن گوشت نزديک گردن
> --------------------------
> Tika returns this output:
> --------------------------
> 92 @A 8 * B
> C9D !D ) (?) =/
> >
>
> (<) , 8 ;
> 8 #
> + 9!:
> L
> #) 4 M() * 0>
> * -3 IA J
> - 2 (+ G
> H -1
> (+ J 5#+C 0T J (+ O - 6 R . (+ O - 5 PH. (+ O -4
> --------------------------
> {quote}
> thanks a lot