Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/01/22 18:36:35 UTC
[jira] [Comment Edited] (TIKA-1526) ExternalParser should
trap/ignore/workaround JDK-8047340 & JDK-8055301 so Turkish Tika users can
still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287818#comment-14287818 ]
Tim Allison edited comment on TIKA-1526 at 1/22/15 5:35 PM:
------------------------------------------------------------
Not having luck reproducing with openjdk-1.7.0.71 or 1.8.0_31 on RHEL.
To confirm, this is a BSD/Mac issue only as suggested [here|https://issues.apache.org/jira/browse/SOLR-6387?focusedCommentId=14100067&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14100067]?
If so, [~tpalsulich], would you have time to try something like this:
{noformat}
public void testPDFOCR() throws Exception {
    Locale defaultLocale = Locale.getDefault();
    try {
        // Force the Turkish default locale to try to trigger the JDK bug.
        Locale.setDefault(new Locale("tr", ""));
        String resource = "/test-documents/testOCR.pdf";
        String[] nonOCRContains = new String[0];
        testBasicOCR(resource, nonOCRContains, 2);
    } finally {
        // Always restore the original default locale for later tests.
        Locale.setDefault(defaultLocale);
    }
}
{noformat}
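For anyone trying the test above, here is a minimal standalone sketch of why the Turkish locale matters. It assumes the failing lookup is an enum-style match on a locale-sensitive upper-cased name, which is my reading of the linked JDK reports and not something confirmed in this thread. The class `TurkishCaseDemo` and the enum `Mechanism` are hypothetical names used only for illustration.

```java
import java.util.Locale;

// Hypothetical demo class; not part of Tika or the JDK.
final class TurkishCaseDemo {

    // Hypothetical stand-in for a set of named launch mechanisms.
    enum Mechanism { FORK, POSIX_SPAWN }

    // Returns true if the name, upper-cased under the given locale,
    // matches one of the enum constants.
    static boolean resolves(String name, Locale locale) {
        try {
            Mechanism.valueOf(name.toUpperCase(locale));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "");
        // Under "tr", 'i' upper-cases to U+0130 (dotted capital I),
        // so the result no longer equals the ASCII constant name.
        System.out.println(resolves("posix_spawn", turkish));
        // Locale.ROOT gives locale-independent case mapping.
        System.out.println(resolves("posix_spawn", Locale.ROOT));
    }
}
```

Names without an "i" (such as "fork") are unaffected, which is why the problem only shows up for some values.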
> ExternalParser should trap/ignore/workaround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-1526
> URL: https://issues.apache.org/jira/browse/TIKA-1526
> Project: Tika
> Issue Type: Wish
> Reporter: Hoss Man
>
> the JDK has numerous pain points regarding the Turkish locale, "posix_spawn" lowercasing being one of them...
> https://bugs.openjdk.java.net/browse/JDK-8047340
> https://bugs.openjdk.java.net/browse/JDK-8055301
> As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled & configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so...
> {noformat}
> [junit4] > Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
> [junit4] > at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
> [junit4] > at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
> [junit4] > at java.security.AccessController.doPrivileged(Native Method)
> [junit4] > at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
> [junit4] > at java.lang.ProcessImpl.start(ProcessImpl.java:130)
> [junit4] > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> [junit4] > at java.lang.Runtime.exec(Runtime.java:620)
> [junit4] > at java.lang.Runtime.exec(Runtime.java:485)
> [junit4] > at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
> [junit4] > at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
> [junit4] > at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
> [junit4] > at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
> [junit4] > at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
> [junit4] > at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
> [junit4] > at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
> [junit4] > at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
> [junit4] > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> [junit4] > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> {noformat}
> ...unless they go out of their way to white list only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed.
> It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...
> {code}
> } catch (Error err) {
>   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
>     log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
>     return "(error executing: " + cmd + ")";
>   }
> }
> {code}
> ...but with Tika, it might be better for all ExternalParsers to just "opt out" as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture)
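To make the "opt out" idea in the description concrete, here is a rough sketch of an availability check that treats the locale-related Error as "tool not available" and rethrows anything else. This is not Tika's actual `ExternalParser.check` code or signature; `SafeExternalCheck`, `isLocaleSpawnBug`, and `check` are hypothetical names, and the message matching simply mirrors the Solr snippet quoted above.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper; an illustration only, not Tika's implementation.
final class SafeExternalCheck {

    // Same message test as the quoted Solr workaround.
    static boolean isLocaleSpawnBug(Throwable t) {
        String msg = t.getMessage();
        return msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Returns true only if the command starts and exits with status 0.
    static boolean check(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            // Drain output so the child cannot block on a full pipe.
            try (InputStream out = process.getInputStream()) {
                byte[] buf = new byte[4096];
                while (out.read(buf) != -1) {
                    // discard
                }
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            // Command missing or not executable: opt out.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleSpawnBug(err)) {
                // JVM locale bug: opt out instead of failing fast.
                return false;
            }
            throw err; // unrelated Errors should still propagate
        }
    }
}
```

A parser could then report no supported types when `check` returns false, so the remaining parsers stay usable. Whether that belongs in ExternalParser or higher up is the open question raised in the description.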