Posted to dev@tika.apache.org by "Joseph Vychtrle (JIRA)" <ji...@apache.org> on 2011/04/01 02:02:07 UTC

[jira] [Created] (TIKA-630) Dealing with PDF documents produced from scanning programs

Dealing with PDF documents produced from scanning programs
----------------------------------------------------------

                 Key: TIKA-630
                 URL: https://issues.apache.org/jira/browse/TIKA-630
             Project: Tika
          Issue Type: Improvement
          Components: general
    Affects Versions: 1.0
            Reporter: Joseph Vychtrle
            Priority: Minor


Hey,

Sorry I didn't post this to the mailing list; I never got the confirmation.

The issue is that people often don't realize there are two kinds of PDF documents: those exported from OpenOffice or MS Office, and those produced by scanner software. When Tika processes a scanned document, it detects the PDF content type, but there are only images in it, and I don't know how to deal with that. There should be a function that determines which kind of PDF a document is, so that I can pass the scanned ones to OCR software.

If there is a way to do that, could anybody please explain how?
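
To make the request concrete, here is a rough sketch of the kind of check I mean. It is not an existing Tika function: the class name, method names, and the characters-per-page threshold are all made up for illustration. It assumes the text has already been extracted (for example with Tika's `parseToString`) and that the page count is known from somewhere else; if a document yields almost no letters or digits per page, it is probably a scan and a candidate for OCR.

```java
/**
 * Hypothetical heuristic, not part of Tika: guesses whether a PDF is
 * image-only (scanned) from the amount of text that extraction produced.
 */
public final class ScannedPdfHeuristic {

    private ScannedPdfHeuristic() {}

    /** Counts letters and digits, ignoring whitespace and punctuation. */
    static int countTextChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true when the extracted text averages fewer than
     * minCharsPerPage letters/digits per page. The threshold is an
     * arbitrary tuning knob, not a value taken from any specification.
     */
    static boolean looksScanned(String extractedText, int pageCount, int minCharsPerPage) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        String text = (extractedText == null) ? "" : extractedText;
        return countTextChars(text) < (long) pageCount * minCharsPerPage;
    }

    public static void main(String[] args) {
        // Extraction returned only whitespace for a 3-page file: likely a scan.
        System.out.println(looksScanned(" \n\n ", 3, 50));

        // One page with a full sentence of real text: likely an exported document.
        String exported = "Quarterly report: revenue grew by 12 percent compared with the previous year.";
        System.out.println(looksScanned(exported, 1, 20));
    }
}
```

A check like this would misjudge mixed documents (a few typed pages plus scanned attachments) and scans that already carry an OCR text layer, so it is only a first filter.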

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-630) Dealing with PDF documents from scanning programs

Posted by "Joseph Vychtrle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Vychtrle updated TIKA-630:
---------------------------------

    Summary: Dealing with PDF documents from scanning programs  (was: Dealing with PDF documents produced from scanning programs)

> Dealing with PDF documents from scanning programs
> -------------------------------------------------
>
>                 Key: TIKA-630
>                 URL: https://issues.apache.org/jira/browse/TIKA-630
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 1.0
>            Reporter: Joseph Vychtrle
>            Priority: Minor
>              Labels: ocr, pdf
