Posted to dev@tika.apache.org by "Hudson (Jira)" <ji...@apache.org> on 2021/03/06 20:56:00 UTC

[jira] [Commented] (TIKA-3306) Clean up ocr routing in 2.0.0

    [ https://issues.apache.org/jira/browse/TIKA-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296654#comment-17296654 ] 

Hudson commented on TIKA-3306:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #166 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/166/])
TIKA-3306 add a parser-override content type (tallison: [https://github.com/apache/tika/commit/846cbc11a3eb3680a72db8992a1e7abf305b4e21])
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/csv/TextAndCSVParserTest.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/csv/TextAndCSVParser.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/pst/OutlookPSTParser.java
* (edit) tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-mail-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-package/src/test/java/org/apache/tika/parser/microsoft/rtf/RTFParserTest.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-package/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-image-module/src/main/java/org/apache/tika/parser/image/AbstractImageParser.java
* (edit) tika-core/src/main/java/org/apache/tika/detect/OverrideDetector.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-mail-module/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* (edit) tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java
TIKA-3306 add a parser-override content type -- fix automatic refactoring in favor of parser override (tallison: [https://github.com/apache/tika/commit/8d95ec4b6764ae12212b1eab8987cc50964bdde7])
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-mail-module/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-mail-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java


> Clean up ocr routing in 2.0.0
> -----------------------------
>
>                 Key: TIKA-3306
>                 URL: https://issues.apache.org/jira/browse/TIKA-3306
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>
> I partially cleaned up OCR routing in 2.0.0 in an earlier issue.  What I didn't like about that change is that we temporarily overwrote the content-type.  Let's add a "parser-override" content type, distinct from the user-override content type, and stop overwriting the content-type when a parser overrides it.
>  
> Besides keeping the content-type from being muddled, this fix will also prevent "ocr-" content types from being written into the XHTML metadata during OCR.
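
As a rough illustration of the routing idea in the description above (not the code from the linked commits), the sketch below keeps a parser-level override under its own metadata key and consults it when choosing a parser, while Content-Type itself is never rewritten. The key names, the class OverrideRoutingSketch, the routingType helper, the example value image/ocr-png, and the precedence of the parser override over the user override are all assumptions made for this example; the real property names live in TikaCoreProperties and may differ.

```java
import java.util.HashMap;
import java.util.Map;

class OverrideRoutingSketch {
    // Hypothetical metadata keys for illustration only; not Tika's actual property names.
    static final String CONTENT_TYPE = "Content-Type";
    static final String USER_OVERRIDE = "content-type-user-override";
    static final String PARSER_OVERRIDE = "content-type-parser-override";

    /**
     * Returns the type used to select a parser. The metadata map is only read,
     * so Content-Type keeps whatever value detection gave it.
     * Assumed precedence: parser override, then user override, then detected type.
     */
    static String routingType(Map<String, String> metadata, String detectedType) {
        String parserOverride = metadata.get(PARSER_OVERRIDE);
        if (parserOverride != null) {
            return parserOverride;
        }
        String userOverride = metadata.get(USER_OVERRIDE);
        if (userOverride != null) {
            return userOverride;
        }
        return detectedType;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, "image/png");
        // A parent parser asks for OCR on an embedded image without touching Content-Type.
        metadata.put(PARSER_OVERRIDE, "image/ocr-png");

        System.out.println(routingType(metadata, "image/png")); // prints image/ocr-png
        System.out.println(metadata.get(CONTENT_TYPE));         // prints image/png
    }
}
```

Because the override sits under a separate key, anything that later serializes Content-Type (for example into XHTML metadata) sees the real type rather than the routing hint.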



--
This message was sent by Atlassian Jira
(v8.3.4#803005)