Posted to dev@tika.apache.org by "Ahmad Ajiloo (JIRA)" <ji...@apache.org> on 2011/09/13 08:00:11 UTC

[jira] [Created] (TIKA-713) Tika can not parse all of the persian pdf files

Tika can not parse all of the persian pdf files
-----------------------------------------------

                 Key: TIKA-713
                 URL: https://issues.apache.org/jira/browse/TIKA-713
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
            Reporter: Ahmad Ajiloo
             Fix For: 0.9


Hello
I used Tika (within Nutch, of course) to parse some Persian PDF files. Some of the files were converted to plain text correctly, but for others the output was corrupted. I used the ICU4J v4 library and the text changed to right-to-left mode, but that did not resolve the problem: Tika cannot recognize any character of the input Persian PDF file!

{quote}
I copied this text from my PDF file via Document Viewer in Linux; it is clearly Persian text:
--------------------------
‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا "فباي آلاء ربكما تكذبان" بخواند.‬
‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط "عثمانطه" تقريبا يك نصف صفحه است. (‬
‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت حافظه مفيد است:‬
‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي‬
‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک گردن
--------------------------
Tika returns this output:
--------------------------
 92   @A   8 * B
   C9D  !D       ) (?)   =/
   >
 
 (<) ,    8 ;  
 8 #

   +  9!: 
     L
  #)    4   M() * 0>
 * -3    IA J  
  - 2   (+   G
 H  -1
 (+ J 5#+C     0T J (+  O - 6    R . (+  O - 5     PH. (+  O -4
--------------------------
{quote}
Thanks a lot

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-713:
-------------------------------

    Fix Version/s:     (was: 0.9)

[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140148#comment-13140148 ] 

Robert Muir commented on TIKA-713:
----------------------------------

Thanks for uploading another test file, Ahmad. We'll take a look!
                
[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119060#comment-13119060 ] 

Robert Muir commented on TIKA-713:
----------------------------------

I created PDFBOX-1127 for this, with some screenshots and a description of what is going on.
                
[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140479#comment-13140479 ] 

Robert Muir commented on TIKA-713:
----------------------------------

Thanks Ahmad, I took a quick glance (not a thorough inspection yet):
* Complex.pdf should work; I am able to copy/paste the text from Acrobat.
* Simple3.pdf: Acrobat copy/paste yields the wrong Persian characters. Could be a bug in the font.
* Simple2.pdf: This one might be hopeless. Acrobat copy/paste yields trash; I think it is a totally custom font encoding.

I will look in more depth later.
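
For anyone triaging files like these in bulk, a rough check along the following lines might help separate "extracted as Persian" from "extracted as trash". This is only a sketch: the class name ScriptShareSketch and the method arabicLetterShare are made up for illustration and are not part of Tika or PDFBox. It cannot tell correct Persian from wrong Persian characters (the Simple3.pdf case); it only measures what share of the extracted letters fall in the Arabic script.

```java
public class ScriptShareSketch {

    // Hypothetical helper: fraction of the letters in the text that belong to
    // the Arabic script (the script Persian is written in).
    // Returns 0.0 when the text contains no letters at all.
    static double arabicLetterShare(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, spaces, control characters
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    public static void main(String[] args) {
        // First two words of the expected text, written with escapes.
        String expected = "\u0647\u0631 \u0631\u0648\u0632";
        // A fragment of the garbled output quoted in the report.
        String garbled = "92 @A 8 * B";
        System.out.println("expected text: " + arabicLetterShare(expected));
        System.out.println("garbled text:  " + arabicLetterShare(garbled));
    }
}
```

A low share on a document known to be Persian suggests the extraction went wrong somewhere; any threshold would have to be chosen by experiment.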
                
[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Ahmad Ajiloo (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121376#comment-13121376 ] 

Ahmad Ajiloo commented on TIKA-713:
-----------------------------------

Thanks a lot
                
[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119371#comment-13119371 ] 

Robert Muir commented on TIKA-713:
----------------------------------

This is now fixed in PDFBox's trunk. When Tika upgrades to 1.7.0, I can attach a test.
                
[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Ahmad Ajiloo (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140134#comment-13140134 ] 

Ahmad Ajiloo commented on TIKA-713:
-----------------------------------

I'm testing the new Encoding.java file with other Persian PDF files. There is a new file, Simple2.pdf, that PDFBox cannot parse. Please find it attached.
Thanks
                
[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Ali Majdzadeh Kohbanani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478353#comment-13478353 ] 

Ali Majdzadeh Kohbanani commented on TIKA-713:
----------------------------------------------

Ahmad,
Could you please explain how Complex.pdf was generated? Which tool was used to create the file? Which fonts? Was there any specific configuration? I have tested PDFBox on Complex.pdf and it extracts the text very well. By contrast, every other PDF file I have tested with PDFBox produces lots of errors. I tried creating PDF files with PDFCreator and with the "Save as PDF" plugin in MS-Word. In the first case, the extracted text contains only junk characters; in the second, some glyphs and ligatures are extracted incorrectly. I have filed a bug report for PDFBox, but in order to test PDFBox further, I would like to know more about the method used to create Complex.pdf. Thanks a lot.
                
[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103394#comment-13103394 ] 

Robert Muir commented on TIKA-713:
----------------------------------

Thanks Ahmad... I took a look at this PDF and I suspect this is the problem:

The fonts contained in the document have custom font encodings. I opened them up in FontForge and, for example, Arabic alef maps to U+0006.
So that's why you see the garbage; it's actually unrelated to ICU or the bidirectional algorithm.

I think the reason copy/paste works fine in this document is that it probably has Unicode PDF metadata... maybe PDFBox doesn't support this?

Disclaimer: I didn't look at any pdfbox code yet or really try to debug it.
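
To make the suspicion above concrete, here is a small standalone sketch (not PDFBox code). It assumes a font whose internal code 0x06 draws alef, as seen in FontForge; the second table entry (0x07 for beh) is invented for illustration, and the lookup table is a hypothetical stand-in for whatever Unicode mapping the PDF may carry. Reading the glyph codes directly as code points gives control characters, while going through the table recovers the letters.

```java
import java.util.HashMap;
import java.util.Map;

public class CustomEncodingSketch {

    // Naive decoding: treat each glyph code as if it were a Unicode code point.
    // With a custom font encoding this yields control characters and junk.
    static String decodeNaively(int[] glyphCodes) {
        StringBuilder out = new StringBuilder();
        for (int code : glyphCodes) {
            out.appendCodePoint(code);
        }
        return out.toString();
    }

    // Decoding through a code-to-Unicode table (a hypothetical stand-in for
    // the document's own Unicode mapping). Unmapped codes become U+FFFD.
    static String decodeWithTable(int[] glyphCodes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : glyphCodes) {
            String mapped = toUnicode.get(code);
            out.append(mapped != null ? mapped : "\uFFFD");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x06, "\u0627"); // alef: the mapping mentioned above
        toUnicode.put(0x07, "\u0628"); // beh: invented for illustration only

        int[] glyphCodes = {0x06, 0x07};
        String naive = decodeNaively(glyphCodes);
        String mapped = decodeWithTable(glyphCodes, toUnicode);

        System.out.println("naive first code point:  U+"
                + String.format("%04X", naive.codePointAt(0)));
        System.out.println("mapped first code point: U+"
                + String.format("%04X", mapped.codePointAt(0)));
    }
}
```

If this guess is right, no amount of bidi reordering after extraction can help, because the characters are already wrong before any reordering happens.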

[jira] [Updated] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Ahmad Ajiloo (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmad Ajiloo updated TIKA-713:
------------------------------

    Attachment: Simple2.pdf
    
[jira] [Updated] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Ahmad Ajiloo (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmad Ajiloo updated TIKA-713:
------------------------------

    Attachment: Simple3.pdf
                Complex.pdf

I attached these two files for further investigation. Thanks for your attention.
                
[jira] [Updated] (TIKA-713) Tika can not parse all of the persian pdf files

Posted by "Ahmad Ajiloo (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmad Ajiloo updated TIKA-713:
------------------------------

    Attachment: ebrat.pdf

This is a Persian PDF file that Tika can't parse.
