Posted to dev@tika.apache.org by GitBox <gi...@apache.org> on 2021/02/03 19:09:00 UTC

[GitHub] [tika] peterkronenberg opened a new pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

peterkronenberg opened a new pull request #402:
URL: https://github.com/apache/tika/pull/402


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tika] peterkronenberg commented on a change in pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

peterkronenberg commented on a change in pull request #402:
URL: https://github.com/apache/tika/pull/402#discussion_r570351698



##########
File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
##########
@@ -249,17 +255,71 @@ public String getLanguage() {
 
     /**
      * Set tesseract language dictionary to be used. Default is "eng".
+     * languages are either:
+     * <ol>
+     *   <li>Nominally an ISO-639-2 code but compound codes are allowed separated by underscore: e.g., chi_tra_vert, aze_cyrl</li>
+     *   <li>A file path in the script directory.  The name starts with upper-case letter.
+     *       Some of them have underscores and other upper-case letters: e.g., script/Arabic, script/HanS_vert, script/Japanese_vert, script/Canadian_Aboriginal</li>
+     * </ol>
      * Multiple languages may be specified, separated by plus characters.
-     * e.g. "chi_tra+chi_sim"
+     * e.g. "chi_tra+chi_sim+script/Arabic"
      */
     public void setLanguage(String language) {
-        if (!language.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2}(\\+?))+")
-                || language.endsWith("+")) {
-            throw new IllegalArgumentException("Invalid language code: "+language);
+        // Get rid of embedded spaces
+        language = language.replaceAll("\\s", "");
+        // Test for leading or trailing +
+        if (language.matches("\\+.*|.*\\+")) {
+            throw new IllegalArgumentException("Invalid syntax - Can't start or end with +" + language);
+        }
+        // Split on the + sign
+        final String[] langs = language.split("\\+");
+        List<String> invalidCodes = new ArrayList<>();
+        for (String lang : langs) {
+            // First, make sure it conforms to the correct syntax
+            if (!lang.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2})|script(/|\\\\)[A-Z][a-zA-Z_]+")) {
+                invalidCodes.add(lang + " (invalid syntax)");
+            } else if (!langExists(lang)) {
+                invalidCodes.add(lang + " (not found)");
+            }
+        }
+        if (!invalidCodes.isEmpty()) {
+            throw new IllegalArgumentException("Invalid language code(s): " + invalidCodes);
         }
         this.language = language;
     }
 
+
+    /**
+     * Check if tessdata language model exists
+     */
+    private boolean langExists(String lang) {

Review comment:
       And you're not specifying the path?  I guess you're right.  The user can put it anywhere and just make sure the PATH points to it.  So perhaps I should search the PATH on both Unix and Windows?  For Windows, I search for Tesseract and assume tessdata is under it.  For Unix, I just search for tessdata on the PATH?
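
For reference, the syntax rules described in the Javadoc of the hunk above (three-letter codes with optional underscore suffixes, `script/` names starting with an upper-case letter, `+` as separator) can be exercised in isolation. The following is a minimal sketch, not the PR's code: the class name `LangSyntax` and the method `invalidCodes` are hypothetical, and the check covers syntax only, with no lookup on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: syntax-only validation of a Tesseract language string.
final class LangSyntax {

    // Same pattern as in the hunk: a 3-letter code with up to two
    // underscore-separated suffixes, or a name under script/ (or script\).
    private static final Pattern CODE = Pattern.compile(
            "([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2})|script(/|\\\\)[A-Z][a-zA-Z_]+");

    /** Returns the codes that fail the syntax check, in input order. */
    static List<String> invalidCodes(String language) {
        // Drop embedded whitespace before validating.
        String cleaned = language.replaceAll("\\s", "");
        if (cleaned.startsWith("+") || cleaned.endsWith("+")) {
            throw new IllegalArgumentException(
                    "Invalid syntax - can't start or end with +: " + language);
        }
        List<String> invalid = new ArrayList<>();
        for (String lang : cleaned.split("\\+")) {
            if (!CODE.matcher(lang).matches()) {
                invalid.add(lang);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        System.out.println(invalidCodes("chi_tra+chi_sim+script/Arabic"));
        System.out.println(invalidCodes("eng+english+script/arabic"));
    }
}
```

Collecting every failing code before reporting, rather than stopping at the first one, mirrors the `invalidCodes` list in the hunk and lets a caller see all problems in one message.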


Review comment:
       So you're saying that if the language directory is not specified by the user, for any OS, then we just pass it to Tesseract and let Tesseract handle it?  (i.e., what I'm currently doing for Windows)
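
The behavior being asked about could look roughly like the sketch below: check for the model file only when a tessdata directory has been configured, and otherwise report nothing and leave it to Tesseract. This is an illustration under assumptions, not the PR's implementation: the class `TessdataCheck`, the two-argument `langExists`, and the `tessdataPath` parameter are hypothetical names. The `.traineddata` file extension is the one Tesseract uses for its language models.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: existence check that defers to Tesseract
// whenever the tessdata location is unknown.
final class TessdataCheck {

    /**
     * Returns false only when a tessdata directory was configured and the
     * model file is missing from it. With no directory configured, returns
     * true so that Tesseract itself reports any problem at run time.
     */
    static boolean langExists(String tessdataPath, String lang) {
        if (tessdataPath == null || tessdataPath.trim().isEmpty()) {
            return true; // location unknown: make no assumptions
        }
        // Accept script\Arabic as well as script/Arabic.
        String relative = lang.replace('\\', '/') + ".traineddata";
        Path model = Paths.get(tessdataPath, relative);
        return Files.isRegularFile(model);
    }

    public static void main(String[] args) {
        // No directory given, so the check is skipped.
        System.out.println(langExists(null, "eng"));
    }
}
```

The design choice here is that an unknown location is treated as "cannot verify" rather than "not found", so no operating-system-specific default directory is ever guessed.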


Review comment:
       done





[GitHub] [tika] tballison commented on a change in pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

tballison commented on a change in pull request #402:
URL: https://github.com/apache/tika/pull/402#discussion_r570318082




Review comment:
       If the user doesn't specify a lang dir, I don't think we should make assumptions or go looking for it.
   
   On my Ubuntu laptop, tessdata is under /usr/share/tesseract-ocr/4.00/tessdata, _not_ where the Linux default is.  On my Mac, tessdata isn't under /usr/share/tessdata, and as you've pointed out, on my Windows laptop, who knows where tessdata is.





[GitHub] [tika] peterkronenberg commented on a change in pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

peterkronenberg commented on a change in pull request #402:
URL: https://github.com/apache/tika/pull/402#discussion_r570427888




Review comment:
       done





[GitHub] [tika] tballison commented on a change in pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

tballison commented on a change in pull request #402:
URL: https://github.com/apache/tika/pull/402#discussion_r570380074




Review comment:
       Yes.





[GitHub] [tika] peterkronenberg commented on a change in pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

peterkronenberg commented on a change in pull request #402:
URL: https://github.com/apache/tika/pull/402#discussion_r570365375




Review comment:
       So you're saying that if the language directory is not specified by the user, for any OS, then we just pass it to Tesseract and let Tesseract handle it?  (i.e., what I'm currently doing for Windows)





[GitHub] [tika] tballison commented on a change in pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

tballison commented on a change in pull request #402:
URL: https://github.com/apache/tika/pull/402#discussion_r570318082




Review comment:
       If the user doesn't specify a lang dir, I don't think we should make assumptions or go looking for it.
   
   On my Ubuntu laptop, tessdata is under /usr/share/tesseract-ocr/4.00/tessdata, _not_ where the Linux default is.  On my Mac, tessdata is under /usr/local/Cellar/tesseract/4.1.1/share, and as you've pointed out, on my Windows laptop, who knows where tessdata is.

##########
File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
##########
@@ -249,17 +255,71 @@ public String getLanguage() {
 
     /**
      * Set tesseract language dictionary to be used. Default is "eng".
+     * languages are either:
+     * <ol>
+     *   <li>Nominally an ISO-639-2 code but compound codes are allowed separated by underscore: e.g., chi_tra_vert, aze_cyrl</li>
+     *   <li>A file path in the script directory.  The name starts with upper-case letter.
+     *       Some of them have underscores and other upper-case letters: e.g., script/Arabic, script/HanS_vert, script/Japanese_vert, script/Canadian_Aboriginal</li>
+     * </ol>
      * Multiple languages may be specified, separated by plus characters.
-     * e.g. "chi_tra+chi_sim"
+     * e.g. "chi_tra+chi_sim+script/Arabic"
      */
     public void setLanguage(String language) {
-        if (!language.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2}(\\+?))+")
-                || language.endsWith("+")) {
-            throw new IllegalArgumentException("Invalid language code: "+language);
+        // Get rid of embedded spaces
+        language = language.replaceAll("\\s", "");
+        // Test for leading or trailing +
+        if (language.matches("\\+.*|.*\\+")) {
+            throw new IllegalArgumentException("Invalid syntax - Can't start or end with +: " + language);
+        }
+        // Split on the + sign
+        final String[] langs = language.split("\\+");
+        List<String> invalidCodes = new ArrayList<>();
+        for (String lang : langs) {
+            // First, make sure it conforms to the correct syntax
+            if (!lang.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2})|script(/|\\\\)[A-Z][a-zA-Z_]+")) {
+                invalidCodes.add(lang + " (invalid syntax)");
+            } else if (!langExists(lang)) {
+                invalidCodes.add(lang + " (not found)");
+            }
+        }
+        if (!invalidCodes.isEmpty()) {
+            throw new IllegalArgumentException("Invalid language code(s): " + invalidCodes);
         }
         this.language = language;
     }
 
+
+    /**
+     * Check if tessdata language model exists
+     */
+    private boolean langExists(String lang) {

Review comment:
       I don't think this feature offers enough to justify the code (admittedly small) and the maintenance.  I'm extremely grateful for the updates to the script checking, and I'm happy that we now throw an exception that includes stderr if tesseract exits with an exitValue > 0.
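
   For reference, a minimal sketch of what syntax-only validation could look like if the `langExists` lookup were dropped. It reuses the regular expression from the diff above; the class name `LanguageSyntax`, the method `validate`, and the extra null/empty guard are assumptions made for illustration, not what was merged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the syntax check, without any file lookup.
class LanguageSyntax {

    // Pattern copied from the diff: a three-letter code with up to two
    // _xxx/_xxxx suffixes, or script/<Name> with either slash direction.
    private static final Pattern LANG = Pattern.compile(
            "([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2})|script(/|\\\\)[A-Z][a-zA-Z_]+");

    /** Returns the whitespace-stripped language string, or throws if any part is malformed. */
    static String validate(String language) {
        if (language == null) {
            throw new IllegalArgumentException("language must not be null");
        }
        // Get rid of embedded spaces
        String cleaned = language.replaceAll("\\s", "");
        if (cleaned.isEmpty() || cleaned.startsWith("+") || cleaned.endsWith("+")) {
            throw new IllegalArgumentException(
                    "Invalid syntax - can't be empty or start or end with +: " + language);
        }
        // Collect every malformed code so the caller sees all problems at once
        List<String> invalidCodes = new ArrayList<>();
        for (String lang : cleaned.split("\\+")) {
            if (!LANG.matcher(lang).matches()) {
                invalidCodes.add(lang + " (invalid syntax)");
            }
        }
        if (!invalidCodes.isEmpty()) {
            throw new IllegalArgumentException("Invalid language code(s): " + invalidCodes);
        }
        return cleaned;
    }
}
```

   Under this sketch, `validate("chi_tra+chi_sim+script/Arabic")` is accepted, while `validate("eng+")` and `validate("en")` throw; whether a model is actually installed is left for tesseract to report through stderr.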

##########
File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
##########
@@ -249,17 +255,71 @@ public String getLanguage() {
 
     /**
      * Set tesseract language dictionary to be used. Default is "eng".
+     * languages are either:
+     * <ol>
+     *   <li>Nominally an ISO-639-2 code but compound codes are allowed separated by underscore: e.g., chi_tra_vert, aze_cyrl</li>
+     *   <li>A file path in the script directory.  The name starts with upper-case letter.
+     *       Some of them have underscores and other upper-case letters: e.g., script/Arabic, script/HanS_vert, script/Japanese_vert, script/Canadian_Aboriginal</li>
+     * </ol>
      * Multiple languages may be specified, separated by plus characters.
-     * e.g. "chi_tra+chi_sim"
+     * e.g. "chi_tra+chi_sim+script/Arabic"
      */
     public void setLanguage(String language) {
-        if (!language.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2}(\\+?))+")
-                || language.endsWith("+")) {
-            throw new IllegalArgumentException("Invalid language code: "+language);
+        // Get rid of embedded spaces
+        language = language.replaceAll("\\s", "");
+        // Test for leading or trailing +
+        if (language.matches("\\+.*|.*\\+")) {
+            throw new IllegalArgumentException("Invalid syntax - Can't start or end with +: " + language);
+        }
+        // Split on the + sign
+        final String[] langs = language.split("\\+");
+        List<String> invalidCodes = new ArrayList<>();
+        for (String lang : langs) {
+            // First, make sure it conforms to the correct syntax
+            if (!lang.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2})|script(/|\\\\)[A-Z][a-zA-Z_]+")) {
+                invalidCodes.add(lang + " (invalid syntax)");
+            } else if (!langExists(lang)) {
+                invalidCodes.add(lang + " (not found)");
+            }
+        }
+        if (!invalidCodes.isEmpty()) {
+            throw new IllegalArgumentException("Invalid language code(s): " + invalidCodes);
         }
         this.language = language;
     }
 
+
+    /**
+     * Check if tessdata language model exists
+     */
+    private boolean langExists(String lang) {

Review comment:
       Yes.





[GitHub] [tika] tballison merged pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

tballison merged pull request #402:
URL: https://github.com/apache/tika/pull/402


   



[GitHub] [tika] tballison commented on a change in pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

tballison commented on a change in pull request #402:
URL: https://github.com/apache/tika/pull/402#discussion_r570354618



##########
File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
##########
@@ -249,17 +255,71 @@ public String getLanguage() {
 
     /**
      * Set tesseract language dictionary to be used. Default is "eng".
+     * languages are either:
+     * <ol>
+     *   <li>Nominally an ISO-639-2 code but compound codes are allowed separated by underscore: e.g., chi_tra_vert, aze_cyrl</li>
+     *   <li>A file path in the script directory.  The name starts with upper-case letter.
+     *       Some of them have underscores and other upper-case letters: e.g., script/Arabic, script/HanS_vert, script/Japanese_vert, script/Canadian_Aboriginal</li>
+     * </ol>
      * Multiple languages may be specified, separated by plus characters.
-     * e.g. "chi_tra+chi_sim"
+     * e.g. "chi_tra+chi_sim+script/Arabic"
      */
     public void setLanguage(String language) {
-        if (!language.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2}(\\+?))+")
-                || language.endsWith("+")) {
-            throw new IllegalArgumentException("Invalid language code: "+language);
+        // Get rid of embedded spaces
+        language = language.replaceAll("\\s", "");
+        // Test for leading or trailing +
+        if (language.matches("\\+.*|.*\\+")) {
+            throw new IllegalArgumentException("Invalid syntax - Can't start or end with +: " + language);
+        }
+        // Split on the + sign
+        final String[] langs = language.split("\\+");
+        List<String> invalidCodes = new ArrayList<>();
+        for (String lang : langs) {
+            // First, make sure it conforms to the correct syntax
+            if (!lang.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2})|script(/|\\\\)[A-Z][a-zA-Z_]+")) {
+                invalidCodes.add(lang + " (invalid syntax)");
+            } else if (!langExists(lang)) {
+                invalidCodes.add(lang + " (not found)");
+            }
+        }
+        if (!invalidCodes.isEmpty()) {
+            throw new IllegalArgumentException("Invalid language code(s): " + invalidCodes);
         }
         this.language = language;
     }
 
+
+    /**
+     * Check if tessdata language model exists
+     */
+    private boolean langExists(String lang) {

Review comment:
       I don't think this feature offers enough to justify the code (admittedly small) and the maintenance.  I'm extremely grateful for the updates to the script checking, and I'm happy that we now throw an exception that includes stderr if tesseract exits with an exitValue > 0.





[GitHub] [tika] tballison commented on pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

tballison commented on pull request #402:
URL: https://github.com/apache/tika/pull/402#issuecomment-773498945


   Thank you!



[GitHub] [tika] tballison commented on a change in pull request #402: Tika 3286 - Check if lanaguage files exists and provide better error msg; Support script lanaguge files in script directory

Posted by GitBox <gi...@apache.org>.

tballison commented on a change in pull request #402:
URL: https://github.com/apache/tika/pull/402#discussion_r570318082



##########
File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
##########
@@ -249,17 +255,71 @@ public String getLanguage() {
 
     /**
      * Set tesseract language dictionary to be used. Default is "eng".
+     * languages are either:
+     * <ol>
+     *   <li>Nominally an ISO-639-2 code but compound codes are allowed separated by underscore: e.g., chi_tra_vert, aze_cyrl</li>
+     *   <li>A file path in the script directory.  The name starts with upper-case letter.
+     *       Some of them have underscores and other upper-case letters: e.g., script/Arabic, script/HanS_vert, script/Japanese_vert, script/Canadian_Aboriginal</li>
+     * </ol>
      * Multiple languages may be specified, separated by plus characters.
-     * e.g. "chi_tra+chi_sim"
+     * e.g. "chi_tra+chi_sim+script/Arabic"
      */
     public void setLanguage(String language) {
-        if (!language.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2}(\\+?))+")
-                || language.endsWith("+")) {
-            throw new IllegalArgumentException("Invalid language code: "+language);
+        // Get rid of embedded spaces
+        language = language.replaceAll("\\s", "");
+        // Test for leading or trailing +
+        if (language.matches("\\+.*|.*\\+")) {
+            throw new IllegalArgumentException("Invalid syntax - Can't start or end with +: " + language);
+        }
+        // Split on the + sign
+        final String[] langs = language.split("\\+");
+        List<String> invalidCodes = new ArrayList<>();
+        for (String lang : langs) {
+            // First, make sure it conforms to the correct syntax
+            if (!lang.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2})|script(/|\\\\)[A-Z][a-zA-Z_]+")) {
+                invalidCodes.add(lang + " (invalid syntax)");
+            } else if (!langExists(lang)) {
+                invalidCodes.add(lang + " (not found)");
+            }
+        }
+        if (!invalidCodes.isEmpty()) {
+            throw new IllegalArgumentException("Invalid language code(s): " + invalidCodes);
         }
         this.language = language;
     }
 
+
+    /**
+     * Check if tessdata language model exists
+     */
+    private boolean langExists(String lang) {

Review comment:
       If the user doesn't specify a lang dir, I don't think we should make assumptions or go looking for it.
   
   On my Ubuntu laptop, tessdata is under /usr/share/tesseract-ocr/4.00/tessdata, which is _not_ where the Linux default says it is.  On my Mac, tessdata is under /usr/local/Cellar/tesseract/4.1.1/share, and as you've pointed out, on my Windows laptop, who knows where tessdata is.



