Posted to dev@tika.apache.org by "Konstantin Avdeev (JIRA)" <ji...@apache.org> on 2016/05/01 10:00:19 UTC

[jira] [Commented] (TIKA-1963) Configuring Parsers: "high degree of control over which parsers are or aren't used" does not work

    [ https://issues.apache.org/jira/browse/TIKA-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265657#comment-15265657 ] 

Konstantin Avdeev commented on TIKA-1963:
-----------------------------------------

> you can't OCR a PDF
I believe I can understand that :)

Guys, a simple question again: is it possible with the current implementation to configure the toolkit to enable Tesseract for PDF only? If not, are there any plans to make the "high degree of control" even higher?
Thanks a lot!


> Configuring Parsers: "high degree of control over which parsers are or aren't used" does not work
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1963
>                 URL: https://issues.apache.org/jira/browse/TIKA-1963
>             Project: Tika
>          Issue Type: Bug
>          Components: config
>    Affects Versions: 1.12
>         Environment: windows, java version "1.8.0_73", 64 bit
>            Reporter: Konstantin Avdeev
>
> Hi everybody!
> I'm trying to white-list a particular mime-type for OCR with the following config:
> {code}
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>application/pdf</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>     </parser>
>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>       <mime>application/pdf</mime>
>     </parser>
>   </parsers>
> </properties>
> {code}
> So the idea is to enable the Tesseract parser for the PDF format only.
> But this configuration disables Tesseract completely.
> Is this the expected behaviour or a bug?
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)