Posted to dev@tika.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/03/29 18:59:00 UTC

[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

    [ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419580#comment-16419580 ] 

ASF GitHub Bot commented on TIKA-2582:
--------------------------------------

tballison closed pull request #222: Fix for TIKA-2582 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/222
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
index c8c8bc93e..d8723dd34 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
@@ -91,6 +91,9 @@
     // factor by which image is to be scaled.
     private int resize = 900;
 
+    // See setPageSeparator.
+    private String pageSeparator = "";
+
     // whether or not to preserve interword spacing
     private boolean preserveInterwordSpacing = false;
 
@@ -255,6 +258,25 @@ public void setPageSegMode(String pageSegMode) {
         this.pageSegMode = pageSegMode;
     }
 
+    /**
+     * @see #setPageSeparator(String pageSeparator)
+     */
+    public String getPageSeparator() {
+        return pageSeparator;
+    }
+
+    /**
+     * The page separator to use in plain text output.  This corresponds to Tesseract's page_separator config option.
+     * The default here is the empty string (i.e. no page separators).  Note that this is also the default in
+     * Tesseract 3.x, but in Tesseract 4.0 the default is to use the form feed control character.  We are overriding
+     * Tesseract 4.0's default here.
+     *
+     * @param pageSeparator
+     */
+    public void setPageSeparator(String pageSeparator) {
+        this.pageSeparator = pageSeparator;
+    }
+
     /**
      * Whether or not to maintain interword spacing.  Default is <code>false</code>.
      *
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
index 08847fd74..3e15c4495 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
@@ -468,6 +468,7 @@ private void doOCR(File input, File output, TesseractOCRConfig config) throws IO
         String[] cmd = { config.getTesseractPath() + getTesseractProg(), input.getPath(), output.getPath(), "-l",
                 config.getLanguage(), "-psm", config.getPageSegMode(),
                 config.getOutputType().name().toLowerCase(Locale.US),
+                "-c", "page_separator=" + config.getPageSeparator(),
                 "-c",
                 (config.getPreserveInterwordSpacing())? "preserve_interword_spaces=1" : "preserve_interword_spaces=0"};
         ProcessBuilder pb = new ProcessBuilder(cmd);
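
For readers who want to see the effect of the one-line change to doOCR in isolation, the following is a minimal sketch of how the argument list is assembled once the page separator is included. The class TesseractArgsSketch and its buildArgs method are hypothetical names used only for illustration; they are not part of Tika, and the real code reads these values from TesseractOCRConfig rather than from method parameters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the argument assembly in doOCR; not Tika code. */
final class TesseractArgsSketch {

    /** Builds the Tesseract command line in the same order as the patched cmd array. */
    static List<String> buildArgs(String tesseract, String input, String output,
                                  String language, String pageSegMode, String outputType,
                                  String pageSeparator, boolean preserveInterwordSpacing) {
        List<String> args = new ArrayList<>();
        args.add(tesseract);
        args.add(input);
        args.add(output);
        args.add("-l");
        args.add(language);
        args.add("-psm");
        args.add(pageSegMode);
        args.add(outputType.toLowerCase(Locale.US));
        // Always passed: an empty value yields "page_separator=", which
        // suppresses the form feed that Tesseract 4.0 emits by default.
        args.add("-c");
        args.add("page_separator=" + pageSeparator);
        args.add("-c");
        args.add(preserveInterwordSpacing
                ? "preserve_interword_spaces=1"
                : "preserve_interword_spaces=0");
        return args;
    }

    public static void main(String[] unused) {
        // Example values only; the paths and language are placeholders.
        System.out.println(buildArgs("tesseract", "in.png", "out", "eng", "1", "TXT", "", false));
    }
}
```

Because the option is always passed, the default empty string in TesseractOCRConfig keeps the text output free of page separators under both Tesseract 3.x and 4.0, and callers who do want a separator can opt in through setPageSeparator.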


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Tesseract 4.0 includes a FF character by default, breaking parsers
> ------------------------------------------------------------------
>
>                 Key: TIKA-2582
>                 URL: https://issues.apache.org/jira/browse/TIKA-2582
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Major
>
> Tesseract 4.0 includes a change to use form feed characters to separate pages by default in its text output. Previous versions used no separator unless you specified the include_page_breaks option.
> This confuses any parser that is not expecting the FF. ODFParserTest.testOO2Metadata fails, because it is expecting the output of a blank image to be the empty string, but now the FF is there.
> I haven't seen any other failures, but I expect that user code will now see either FF or U+FFFD where it is not expecting it (SafeContentHandler replaces the FF with U+FFFD when converting text to XML).
> We should set the appropriate Tesseract options to disable this behavior unless explicitly requested by user code, to avoid the change in behavior.
> For reference, the Tesseract change is as follows:
> {quote}commit 2cc531e6bf0288fc8a9ad1c123a252395f00bf56
>  Merge: 3bb573ae aa6eb6bd
>  Author: zdenop <zd...@gmail.com>
>  Date: Tue Sep 19 08:41:08 2017 +0200
> Merge pull request #1140 from stweil/pagebreak
> Remove Tesseract parameter "include_page_breaks" and use FF by default
> commit aa6eb6bd466101a3b89880f87580471a7694359d
>  Author: Stefan Weil <sw...@weilnetz.de>
>  Date: Mon Jun 12 19:42:45 2017 +0200
> Remove Tesseract parameter "include_page_breaks" and use FF by default
> Now Tesseract adds a page break (normally form feed) by default.
> It is still possible to suppress page breaks by setting an empty
>  page_separator.
> Signed-off-by: Stefan Weil <sw...@weilnetz.de>
> {quote}
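
The U+FFFD symptom described in the issue follows from the fact that the form feed (U+000C) is not a legal character in XML 1.0. The sketch below illustrates that replacement under stated assumptions: FormFeedSketch, isValidXmlChar and replaceInvalid are hypothetical names, and this is not SafeContentHandler's actual implementation, only the XML 1.0 Char rule applied to a string.

```java
/** Hypothetical illustration of replacing characters that XML 1.0 does not allow. */
final class FormFeedSketch {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 disallows with U+FFFD. */
    static String replaceInvalid(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
                sb.appendCodePoint(isValidXmlChar(cp) ? cp : 0xFFFD));
        return sb.toString();
    }

    public static void main(String[] unused) {
        // OCR text for a blank image as Tesseract 4.0 would emit it by default: a lone form feed.
        String ocrOutput = "\f";
        System.out.println(replaceInvalid(ocrOutput).equals("\uFFFD"));
    }
}
```

A blank image that used to produce the empty string therefore produces a single replacement character once the form feed passes through such a filter, which is consistent with the ODFParserTest.testOO2Metadata failure reported above.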



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)