Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2014/11/16 19:14:19 UTC
Re: svn commit: r1640017 - /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
Thanks, Dave. I think you forgot the default config file?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: "dmeikle@apache.org" <dm...@apache.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Sunday, November 16, 2014 at 6:37 PM
To: "commits@tika.apache.org" <co...@tika.apache.org>
Subject: svn commit: r1640017 - /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
>Author: dmeikle
>Date: Sun Nov 16 17:37:30 2014
>New Revision: 1640017
>
>URL: http://svn.apache.org/r1640017
>Log:
>TIKA-1476 - Updated TesseractOCRConfig to read from property file if
>present on classpath
>
>Modified:
>
>tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
>
>Modified:
>tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
>URL:
>http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java?rev=1640017&r1=1640016&r2=1640017&view=diff
>==============================================================================
>--- tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java (original)
>+++ tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java Sun Nov 16 17:37:30 2014
>@@ -17,7 +17,10 @@
> package org.apache.tika.parser.ocr;
>
> import java.io.File;
>+import java.io.IOException;
>+import java.io.InputStream;
> import java.io.Serializable;
>+import java.util.Properties;
>
> /**
> * Configuration for TesseractOCRParser.
>@@ -28,7 +31,11 @@ import java.io.Serializable;
> * config.setTesseractPath(tesseractFolder);<br>
> * parseContext.set(TesseractOCRConfig.class, config);<br>
> * </p>
>- *
>+ *
>+ * Parameters can also be set by creating the TesseractOCRConfig.properties file
>+ * and placing it in the package org/apache/tika/parser/ocr on the classpath. An
>+ * example file can be found in the test resources folder:
>+ * <code>tika-parsers/src/test/resources/test-properties/TesseractOCRConfig-full.properties</code>.
> *
> */
> public class TesseractOCRConfig implements Serializable{
>@@ -52,7 +59,58 @@ public class TesseractOCRConfig implemen
>
> // Maximum time (seconds) to wait for the ocring process termination
> private int timeout = 120;
>-
>+
>+ /**
>+ * Default constructor.
>+ */
>+ public TesseractOCRConfig() {
>+ init(this.getClass().getResourceAsStream("TesseractOCRConfig.properties"));
>+ }
>+
>+ /**
>+ * Loads properties from InputStream and then tries to close InputStream.
>+ * If there is an IOException, this silently swallows the exception
>+ * and goes back to the default.
>+ *
>+ * @param is
>+ */
>+ public TesseractOCRConfig(InputStream is) {
>+ init(is);
>+ }
>+
>+ private void init(InputStream is) {
>+ if (is == null) {
>+ return;
>+ }
>+ Properties props = new Properties();
>+ try {
>+ props.load(is);
>+ } catch (IOException e) {
>+ } finally {
>+ if (is != null) {
>+ try {
>+ is.close();
>+ } catch (IOException e) {
>+ //swallow
>+ }
>+ }
>+ }
>+
>+ setTesseractPath(
>+ getProp(props, "tesseractPath", getTesseractPath()));
>+ setLanguage(
>+ getProp(props, "language", getLanguage()));
>+ setPageSegMode(
>+ getProp(props, "pageSegMode", getPageSegMode()));
>+ setMinFileSizeToOcr(
>+ getProp(props, "minFileSizeToOcr", getMinFileSizeToOcr()));
>+ setMaxFileSizeToOcr(
>+ getProp(props, "maxFileSizeToOcr", getMaxFileSizeToOcr()));
>+ setTimeout(
>+ getProp(props, "timeout", getTimeout()));
>+
>+ }
>+
> /** @see #setTesseractPath(String tesseractPath)*/
> public String getTesseractPath() {
> return tesseractPath;
>@@ -62,7 +120,7 @@ public class TesseractOCRConfig implemen
> * Set tesseract installation folder, needed if it is not on system path.
> */
> public void setTesseractPath(String tesseractPath) {
>- if(!tesseractPath.endsWith(File.separator))
>+ if(!tesseractPath.isEmpty() && !tesseractPath.endsWith(File.separator))
> tesseractPath += File.separator;
>
> this.tesseractPath = tesseractPath;
>@@ -132,5 +190,34 @@ public class TesseractOCRConfig implemen
> public int getTimeout() {
> return timeout;
> }
>-
>+
>+ /**
>+ * Get property from the properties file passed in.
>+ * @param properties properties file to read from.
>+ * @param property the property to fetch.
>+ * @param defaultMissing default parameter to use.
>+ * @return the value.
>+ */
>+ private int getProp(Properties properties, String property, int defaultMissing) {
>+ String p = properties.getProperty(property);
>+ if (p == null || p.isEmpty()){
>+ return defaultMissing;
>+ }
>+ try {
>+ return Integer.parseInt(p);
>+ } catch (Throwable ex) {
>+ throw new RuntimeException(String.format("Cannot parse TesseractOCRConfig variable %s, invalid integer value", property), ex);
>+ }
>+ }
>+
>+ /**
>+ * Get property from the properties file passed in.
>+ * @param properties properties file to read from.
>+ * @param property the property to fetch.
>+ * @param defaultMissing default parameter to use.
>+ * @return the value.
>+ */
>+ private String getProp(Properties properties, String property, String defaultMissing) {
>+ return properties.getProperty(property, defaultMissing);
>+ }
> }
>
>
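For readers who want to see the fallback behaviour described in the commit log in isolation, here is a minimal sketch; it is not the committed code. `OcrPropsSketch`, its two fields, and the `"eng"` default are hypothetical stand-ins. Only the property keys `language` and `timeout` and the 120-second default are taken from the diff above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/** Hypothetical stand-in for the loading logic in the commit; not Tika's class. */
final class OcrPropsSketch {
    // Defaults: 120 matches the timeout visible in the diff, "eng" is an assumption.
    String language = "eng";
    int timeout = 120;

    /** Applies overrides from the stream, if there is one, and always closes it. */
    static OcrPropsSketch load(InputStream is) {
        OcrPropsSketch cfg = new OcrPropsSketch();
        if (is == null) {
            return cfg; // no properties file found: keep the defaults
        }
        Properties props = new Properties();
        try (InputStream in = is) {
            props.load(in);
        } catch (IOException e) {
            // Same policy as the commit: ignore the failure and use what was read.
        }
        cfg.language = props.getProperty("language", cfg.language);
        String rawTimeout = props.getProperty("timeout");
        if (rawTimeout != null && !rawTimeout.isEmpty()) {
            // A non-numeric value throws NumberFormatException here.
            cfg.timeout = Integer.parseInt(rawTimeout.trim());
        }
        return cfg;
    }
}
```

Calling `OcrPropsSketch.load(null)` leaves every field at its default, which corresponds to the case where no properties file is present on the classpath. A stream containing `timeout=300` would override only that one value.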
Re: svn commit: r1640017 - /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
Posted by David Meikle <lo...@gmail.com>.
> On 17 Nov 2014, at 16:32, Hong-Thai Nguyen <th...@gmail.com> wrote:
>
> I've pushed a minor fix to pass this test on Windows.
Thanks Hong-Thai, sorry about that!
Cheers,
Dave
Re: svn commit: r1640017 - /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
Posted by Hong-Thai Nguyen <th...@gmail.com>.
Hi,
I've pushed a minor fix to pass this test on Windows.
Thanks,
On Mon, Nov 17, 2014 at 4:28 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:
> +1, agreed, Dave. It would be nice to have one as a default.
--
--------------
Hong-Thai
Re: svn commit: r1640017 - /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
+1, agreed, Dave. It would be nice to have one as a default.
-----Original Message-----
From: David Meikle <lo...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, November 17, 2014 at 8:54 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: svn commit: r1640017 - /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
>Hi Chris,
>
>> On 16 Nov 2014, at 19:14, Mattmann, Chris A (3980)
>><ch...@jpl.nasa.gov> wrote:
>>
>> Thanks, Dave. I think you forgot the default config file?
>
>Yup, forgot the tests and example config from my change! Just committed
>them.
>
>I wasn't initially planning on including a default config, thinking that if
>you dropped a properties file on the classpath it would use that and
>otherwise fall back to the defaults, but I should probably add one to be
>consistent with PDFParserConfig.
>
>Cheers,
>Dave
Re: svn commit: r1640017 - /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
Posted by David Meikle <lo...@gmail.com>.
Hi Chris,
> On 16 Nov 2014, at 19:14, Mattmann, Chris A (3980) <ch...@jpl.nasa.gov> wrote:
>
> Thanks, Dave. I think you forgot the default config file?
Yup, forgot the tests and example config from my change! Just committed them.
I wasn't initially planning on including a default config, thinking that if you dropped a properties file on the classpath it would use that and otherwise fall back to the defaults, but I should probably add one to be consistent with PDFParserConfig.
Cheers,
Dave
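
A note on where such a file has to live: the no-argument constructor in the commit passes a relative name to `getResourceAsStream`, and `Class.getResourceAsStream` resolves relative names against the class's own package. The sketch below only illustrates that rule; `ClasspathLookupSketch` and `resolvedName` are hypothetical helpers, not part of Tika.

```java
import java.io.InputStream;

/** Hypothetical helper illustrating package-relative resource lookup. */
final class ClasspathLookupSketch {

    /** Returns the classpath location that a resource name maps to for the given class. */
    static String resolvedName(Class<?> anchor, String name) {
        if (name.startsWith("/")) {
            return name.substring(1); // a leading slash means the name is already absolute
        }
        String className = anchor.getName();
        int lastDot = className.lastIndexOf('.');
        // Classes in the default package have no prefix at all.
        String packagePath = lastDot < 0
                ? ""
                : className.substring(0, lastDot).replace('.', '/') + "/";
        return packagePath + name;
    }

    /** Opens the resource; the result is null when nothing by that name is on the classpath. */
    static InputStream open(Class<?> anchor, String name) {
        return anchor.getResourceAsStream(name);
    }
}
```

For a class in the package `org.apache.tika.parser.ocr`, `resolvedName` with the name `TesseractOCRConfig.properties` yields `org/apache/tika/parser/ocr/TesseractOCRConfig.properties`, which matches the location given in the commit's Javadoc. When `open` returns null, the loading code in the commit simply keeps the built-in defaults.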