Posted to user@tika.apache.org by David Pilato <da...@pilato.fr> on 2022/09/01 08:38:24 UTC

.TesseractOCRParser does not extract text although Tesseract does

Hey team


I'm wondering what's wrong with my config.
I'm running this very basic piece of code:
@Test
public void testTika() throws TikaException, IOException, SAXException {
   BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(1000));
   new AutoDetectParser().parse(getBinaryContent("test-ocr.png"), handler, new Metadata(), new ParseContext());
   System.out.println("handler = " + handler);
}

Here are my logs:

16:31:13,089 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
16:31:13,560 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
16:31:13,564 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
16:31:13,591 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
16:31:13,595 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
handler =


The content is not extracted although Tesseract is detected.

When I run Tesseract manually:

tesseract test-ocr.png tess.out
cat tess.out.txt

I'm getting:

This file contains some words.

tesseract --version gives

tesseract 5.2.0
 leptonica-1.82.0
 libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
 Found libcurl/7.79.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.45.1


What am I missing here?


David

Re: .TesseractOCRParser does not extract text although Tesseract does

Posted by David Pilato <da...@pilato.fr>.
Great to know.

Thanks Tim!

David

Re: .TesseractOCRParser does not extract text although Tesseract does

Posted by Tim Allison <ta...@apache.org>.
Ugh.  I think you just ran into:
https://issues.apache.org/jira/browse/TIKA-3812

This will be fixed in the next release, hopefully out next week.

The problem is that gdal is taking precedence over the ImageParser, and the
gdal parser doesn't know about OCR.
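
To illustrate the precedence problem described above, here is a minimal, self-contained sketch. It does not use any Tika classes: `ParserPrecedenceSketch`, `ToyParser` and the parser labels are hypothetical stand-ins, and the "last registered parser wins" rule is an assumption made for the illustration, not a description of Tika's actual resolution logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParserPrecedenceSketch {

    /** Hypothetical stand-in for a parser: a label, the types it claims, and whether it triggers OCR. */
    public static final class ToyParser {
        public final String name;
        public final List<String> supportedTypes;
        public final boolean runsOcr;

        public ToyParser(String name, List<String> supportedTypes, boolean runsOcr) {
            this.name = name;
            this.supportedTypes = supportedTypes;
            this.runsOcr = runsOcr;
        }
    }

    /** Builds a media-type to parser table; a parser later in the list overwrites an earlier one. */
    public static Map<String, ToyParser> resolve(List<ToyParser> parsers) {
        Map<String, ToyParser> table = new LinkedHashMap<>();
        for (ToyParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                table.put(type, parser);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        ToyParser image = new ToyParser("ImageParser (OCR-aware)",
                Arrays.asList("image/png", "image/jpeg"), true);
        ToyParser gdal = new ToyParser("GDALParser",
                Arrays.asList("image/png"), false);

        // Both parsers claim image/png; the one registered last ends up handling it.
        List<ToyParser> parsers = new ArrayList<>(Arrays.asList(image, gdal));
        ToyParser chosen = resolve(parsers).get("image/png");
        System.out.println("with gdal:    " + chosen.name + ", OCR=" + chosen.runsOcr);

        // Once the second parser is excluded, the OCR-aware one is selected again.
        parsers.remove(gdal);
        chosen = resolve(parsers).get("image/png");
        System.out.println("without gdal: " + chosen.name + ", OCR=" + chosen.runsOcr);
    }
}
```

In this toy model the PNG is routed to the parser that has no OCR step, which would match the empty handler output reported in the thread even though Tesseract itself is detected.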

On Thu, Sep 1, 2022 at 7:43 AM David Pilato <da...@pilato.fr> wrote:

> Here is the content of the metadata object:
>
> X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser
> X-TIKA:Parsed-By=org.apache.tika.parser.gdal.GDALParser
> X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser
> X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.gdal.GDALParser
> Content-Type=image/png
>
> And here is the dependency tree:
>
> [INFO] fr.pilato.elasticsearch.crawler:fscrawler-tika:jar:2.10-SNAPSHOT
> [INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-framework:jar:2.10-SNAPSHOT:compile
> [INFO] | +- commons-io:commons-io:jar:2.11.0:compile
> [INFO] | +- com.fasterxml.jackson.core:jackson-core:jar:2.13.3:compile
> [INFO] | +- com.fasterxml.jackson.core:jackson-databind:jar:2.13.3:compile
> [INFO] | +- com.fasterxml.jackson.datatype:jackson-datatype-jsr310:jar:2.13.3:compile
> [INFO] | +- com.fasterxml.jackson.dataformat:jackson-dataformat-xml:jar:2.13.3:compile
> [INFO] | | +- org.codehaus.woodstox:stax2-api:jar:4.2.1:compile
> [INFO] | | \- com.fasterxml.woodstox:woodstox-core:jar:6.3.1:compile
> [INFO] | +- com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:jar:2.13.3:compile
> [INFO] | | \- org.yaml:snakeyaml:jar:1.30:compile
> [INFO] | +- com.fasterxml.jackson.core:jackson-annotations:jar:2.13.3:compile
> [INFO] | +- com.jayway.jsonpath:json-path:jar:2.7.0:compile
> [INFO] | | \- net.minidev:json-smart:jar:2.4.7:compile
> [INFO] | | \- net.minidev:accessors-smart:jar:2.4.7:compile
> [INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.18.0:compile
> [INFO] | | \- org.apache.logging.log4j:log4j-api:jar:2.18.0:compile
> [INFO] | +- org.apache.logging.log4j:log4j-1.2-api:jar:2.18.0:compile
> [INFO] | +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.18.0:compile
> [INFO] | +- org.apache.logging.log4j:log4j-jcl:jar:2.18.0:compile
> [INFO] | | \- commons-logging:commons-logging:jar:1.2:compile
> [INFO] | +- org.apache.logging.log4j:log4j-jul:jar:2.18.0:compile
> [INFO] | \- org.fusesource.jansi:jansi:jar:2.4.0:compile
> [INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-beans:jar:2.10-SNAPSHOT:compile
> [INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-settings:jar:2.10-SNAPSHOT:compile
> [INFO] +- org.apache.tika:tika-core:jar:2.4.1:compile
> [INFO] | \- org.slf4j:slf4j-api:jar:1.7.36:compile
> [INFO] +- org.apache.tika:tika-parsers-standard-package:jar:2.4.1:compile
> [INFO] | +- org.apache.tika:tika-parser-apple-module:jar:2.4.1:compile
> [INFO] | | +- org.apache.tika:tika-parser-zip-commons:jar:2.4.1:compile
> [INFO] | | \- com.googlecode.plist:dd-plist:jar:1.23:compile
> [INFO] | +- org.apache.tika:tika-parser-audiovideo-module:jar:2.4.1:compile
> [INFO] | | \- com.drewnoakes:metadata-extractor:jar:2.18.0:compile
> [INFO] | | \- com.adobe.xmp:xmpcore:jar:6.1.11:compile
> [INFO] | +- org.apache.tika:tika-parser-cad-module:jar:2.4.1:compile
> [INFO] | +- org.apache.tika:tika-parser-code-module:jar:2.4.1:compile
> [INFO] | | +- org.codelibs:jhighlight:jar:1.1.0:compile
> [INFO] | | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
> [INFO] | | +- org.ow2.asm:asm:jar:9.3:compile
> [INFO] | | +- com.epam:parso:jar:2.0.14:compile
> [INFO] | | \- org.tallison:jmatio:jar:1.5:compile
> [INFO] | +- org.apache.tika:tika-parser-crypto-module:jar:2.4.1:compile
> [INFO] | | +- org.bouncycastle:bcmail-jdk15on:jar:1.70:compile
> [INFO] | | | +- org.bouncycastle:bcutil-jdk15on:jar:1.70:compile
> [INFO] | | | \- org.bouncycastle:bcpkix-jdk15on:jar:1.70:compile
> [INFO] | | \- org.bouncycastle:bcprov-jdk15on:jar:1.70:compile
> [INFO] | +- org.apache.tika:tika-parser-digest-commons:jar:2.4.1:compile
> [INFO] | | \- commons-codec:commons-codec:jar:1.15:compile
> [INFO] | +- org.apache.tika:tika-parser-font-module:jar:2.4.1:compile
> [INFO] | | \- org.apache.pdfbox:fontbox:jar:2.0.26:compile
> [INFO] | +- org.apache.tika:tika-parser-html-module:jar:2.4.1:compile
> [INFO] | | \- org.apache.tika:tika-parser-html-commons:jar:2.4.1:compile
> [INFO] | | \- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
> [INFO] | +- org.apache.tika:tika-parser-image-module:jar:2.4.1:compile
> [INFO] | | +- com.github.jai-imageio:jai-imageio-core:jar:1.4.0:compile
> [INFO] | | \- org.apache.pdfbox:jbig2-imageio:jar:3.0.4:compile
> [INFO] | +- org.apache.tika:tika-parser-mail-module:jar:2.4.1:compile
> [INFO] | | \- org.apache.tika:tika-parser-mail-commons:jar:2.4.1:compile
> [INFO] | | +- org.apache.james:apache-mime4j-core:jar:0.8.4:compile
> [INFO] | | \- org.apache.james:apache-mime4j-dom:jar:0.8.4:compile
> [INFO] | +- org.apache.tika:tika-parser-microsoft-module:jar:2.4.1:compile
> [INFO] | | +- com.pff:java-libpst:jar:0.9.3:compile
> [INFO] | | +- org.apache.commons:commons-lang3:jar:3.12.0:compile
> [INFO] | | +- org.apache.poi:poi:jar:5.2.2:compile
> [INFO] | | | +- org.apache.commons:commons-math3:jar:3.6.1:compile
> [INFO] | | | \- com.zaxxer:SparseBitSet:jar:1.2:compile
> [INFO] | | +- org.apache.poi:poi-scratchpad:jar:5.2.2:compile
> [INFO] | | +- org.apache.poi:poi-ooxml:jar:5.2.2:compile
> [INFO] | | | +- org.apache.poi:poi-ooxml-lite:jar:5.2.2:compile
> [INFO] | | | +- org.apache.xmlbeans:xmlbeans:jar:5.0.3:compile
> [INFO] | | | \- com.github.virtuald:curvesapi:jar:1.07:compile
> [INFO] | | +- com.healthmarketscience.jackcess:jackcess:jar:4.0.1:compile
> [INFO] | | \- com.healthmarketscience.jackcess:jackcess-encrypt:jar:4.0.1:compile
> [INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.36:compile
> [INFO] | +- org.apache.tika:tika-parser-miscoffice-module:jar:2.4.1:compile
> [INFO] | | \- org.apache.commons:commons-collections4:jar:4.4:compile
> [INFO] | +- org.apache.tika:tika-parser-news-module:jar:2.4.1:compile
> [INFO] | | \- com.rometools:rome:jar:1.18.0:compile
> [INFO] | | \- com.rometools:rome-utils:jar:1.18.0:compile
> [INFO] | +- org.apache.tika:tika-parser-ocr-module:jar:2.4.1:compile
> [INFO] | | \- org.apache.commons:commons-exec:jar:1.3:compile
> [INFO] | +- org.apache.tika:tika-parser-pdf-module:jar:2.4.1:compile
> [INFO] | | +- org.apache.pdfbox:pdfbox:jar:2.0.26:compile
> [INFO] | | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.26:compile
> [INFO] | | | \- org.apache.pdfbox:pdfbox-debugger:jar:2.0.26:compile
> [INFO] | | \- org.apache.pdfbox:jempbox:jar:1.8.16:compile
> [INFO] | +- org.apache.tika:tika-parser-pkg-module:jar:2.4.1:compile
> [INFO] | | +- org.tukaani:xz:jar:1.9:compile
> [INFO] | | +- org.brotli:dec:jar:0.1.2:compile
> [INFO] | | \- com.github.junrar:junrar:jar:7.5.2:compile
> [INFO] | +- org.apache.tika:tika-parser-text-module:jar:2.4.1:compile
> [INFO] | | \- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
> [INFO] | +- org.apache.tika:tika-parser-webarchive-module:jar:2.4.1:compile
> [INFO] | | +- org.netpreserve:jwarc:jar:0.18.1:compile
> [INFO] | | \- org.apache.commons:commons-compress:jar:1.21:compile
> [INFO] | +- org.apache.tika:tika-parser-xml-module:jar:2.4.1:compile
> [INFO] | | \- xerces:xercesImpl:jar:2.12.2:compile
> [INFO] | | \- xml-apis:xml-apis:jar:1.4.01:compile
> [INFO] | +- org.apache.tika:tika-parser-xmp-commons:jar:2.4.1:compile
> [INFO] | | \- org.apache.pdfbox:xmpbox:jar:2.0.26:compile
> [INFO] | +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
> [INFO] | \- org.gagravarr:vorbis-java-core:jar:0.8:compile
> [INFO] +- org.apache.tika:tika-parser-scientific-module:jar:2.4.1:compile
> [INFO] | +- org.apache.sis.core:sis-utility:jar:1.2:compile
> [INFO] | | \- javax.measure:unit-api:jar:1.0:compile
> [INFO] | +- org.apache.sis.storage:sis-netcdf:jar:1.2:compile
> [INFO] | | +- org.apache.sis.storage:sis-storage:jar:1.2:compile
> [INFO] | | | \- org.apache.sis.core:sis-feature:jar:1.2:compile
> [INFO] | | \- org.apache.sis.core:sis-referencing:jar:1.2:compile
> [INFO] | +- org.apache.sis.core:sis-metadata:jar:1.2:compile
> [INFO] | | \- jakarta.xml.bind:jakarta.xml.bind-api:jar:3.0.1:compile
> [INFO] | +- org.opengis:geoapi:jar:3.0.1:compile
> [INFO] | +- edu.ucar:netcdf4:jar:4.5.5:compile
> [INFO] | | +- edu.ucar:cdm:jar:4.5.5:compile
> [INFO] | | | +- edu.ucar:udunits:jar:4.5.5:compile
> [INFO] | | | +- edu.ucar:httpservices:jar:4.5.5:compile
> [INFO] | | | | +- org.apache.httpcomponents:httpclient:jar:4.5.13:compile
> [INFO] | | | | \- org.apache.httpcomponents:httpmime:jar:4.5.13:compile
> [INFO] | | | +- org.apache.httpcomponents:httpcore:jar:4.4.15:compile
> [INFO] | | | +- joda-time:joda-time:jar:2.11.1:compile
> [INFO] | | | +- org.quartz-scheduler:quartz:jar:2.3.2:compile
> [INFO] | | | | +- com.mchange:c3p0:jar:0.9.5.4:compile
> [INFO] | | | | +- com.mchange:mchange-commons-java:jar:0.2.15:compile
> [INFO] | | | | \- com.zaxxer:HikariCP-java7:jar:2.4.13:compile
> [INFO] | | | \- com.beust:jcommander:jar:1.82:compile
> [INFO] | | \- net.java.dev.jna:jna:jar:5.12.1:compile
> [INFO] | +- edu.ucar:grib:jar:4.5.5:compile
> [INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> [INFO] | | +- org.jdom:jdom2:jar:2.0.6.1:compile
> [INFO] | | +- edu.ucar:jj2000:jar:5.2:compile
> [INFO] | | \- org.itadaki:bzip2:jar:0.9.1:compile
> [INFO] | +- net.jcip:jcip-annotations:jar:1.0:compile
> [INFO] | +- org.apache.commons:commons-csv:jar:1.9.0:compile
> [INFO] | \- org.glassfish.jaxb:jaxb-runtime:jar:2.3.6:compile
> [INFO] | +- org.glassfish.jaxb:txw2:jar:2.3.6:compile
> [INFO] | +- com.sun.istack:istack-commons-runtime:jar:3.0.12:compile
> [INFO] | \- com.sun.activation:jakarta.activation:jar:2.0.1:compile
> [INFO] +- org.apache.tika:tika-parser-sqlite3-module:jar:2.4.1:compile
> [INFO] | +- org.apache.tika:tika-parser-jdbc-commons:jar:2.4.1:compile
> [INFO] | \- org.xerial:sqlite-jdbc:jar:3.36.0.3:compile
> [INFO] +- org.apache.tika:tika-langdetect-optimaize:jar:2.4.1:compile
> [INFO] | \- com.optimaize.languagedetector:language-detector:jar:0.6:compile
> [INFO] | +- net.arnx:jsonic:jar:1.2.11:compile
> [INFO] | +- com.intellij:annotations:jar:12.0:compile
> [INFO] | \- com.google.guava:guava:jar:31.1-jre:compile
> [INFO] | +- com.google.guava:failureaccess:jar:1.0.1:compile
> [INFO] | +- com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile
> [INFO] | +- com.google.code.findbugs:jsr305:jar:3.0.2:compile
> [INFO] | +- org.checkerframework:checker-qual:jar:3.12.0:compile
> [INFO] | +- com.google.errorprone:error_prone_annotations:jar:2.11.0:compile
> [INFO] | \- com.google.j2objc:j2objc-annotations:jar:1.3:compile
> [INFO] +- com.jcraft:jsch:jar:0.1.55:compile
> [INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-test-framework:jar:2.10-SNAPSHOT:test
> [INFO] | +- org.hamcrest:hamcrest-all:jar:1.3:test
> [INFO] | +- junit:junit:jar:4.13.2:test
> [INFO] | | \- org.hamcrest:hamcrest-core:jar:1.3:test
> [INFO] | \- com.carrotsearch.randomizedtesting:randomizedtesting-runner:jar:2.8.1:test
> [INFO] \- fr.pilato.elasticsearch.crawler:fscrawler-test-documents:jar:2.10-SNAPSHOT:test
>
> David
> On 1 Sept 2022 at 11:40 +0200, Tim Allison <ta...@apache.org> wrote:
>
> And, what is recorded in the X-Tika-ParsedBy value in the metadata object?
>
> On Thu, Sep 1, 2022 at 5:36 AM Tim Allison <ta...@apache.org> wrote:
>
>> What are your dependencies? Which parsers are in AutoDetectParser?

Re: .TesseractOCRParser does not extract text although Tesseract does

Posted by David Pilato <da...@pilato.fr>.
Here is the content of the metadata object:

X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By=org.apache.tika.parser.gdal.GDALParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.gdal.GDALParser
Content-Type=image/png
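
For anyone running the same diagnosis, the check on the metadata dump can be sketched as below. This is only an illustration that works on the printed `key=value` lines from this message; `ParsedByCheck` and its helper methods are hypothetical names, and with a live `Metadata` object one would presumably read the values directly rather than parse text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParsedByCheck {

    /** Collects every value recorded under the given key in "key=value" lines. */
    public static List<String> valuesFor(String key, List<String> lines) {
        List<String> values = new ArrayList<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq > 0 && line.substring(0, eq).trim().equals(key)) {
                values.add(line.substring(eq + 1).trim());
            }
        }
        return values;
    }

    /** True if any recorded class name ends with the given simple class name. */
    public static boolean ranParser(List<String> parsedBy, String simpleName) {
        for (String className : parsedBy) {
            if (className.endsWith("." + simpleName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Lines copied from the metadata dump in this message.
        List<String> dump = Arrays.asList(
                "X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser",
                "X-TIKA:Parsed-By=org.apache.tika.parser.gdal.GDALParser",
                "X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser",
                "X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.gdal.GDALParser",
                "Content-Type=image/png");

        List<String> parsedBy = valuesFor("X-TIKA:Parsed-By", dump);
        System.out.println("GDALParser ran:         " + ranParser(parsedBy, "GDALParser"));
        System.out.println("TesseractOCRParser ran: " + ranParser(parsedBy, "TesseractOCRParser"));
    }
}
```

The point of the check is that no OCR parser appears among the recorded parsers, only `DefaultParser` and `GDALParser`.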

And here is the dependency tree:

[INFO] fr.pilato.elasticsearch.crawler:fscrawler-tika:jar:2.10-SNAPSHOT
[INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-framework:jar:2.10-SNAPSHOT:compile
[INFO] |  +- commons-io:commons-io:jar:2.11.0:compile
[INFO] |  +- com.fasterxml.jackson.core:jackson-core:jar:2.13.3:compile
[INFO] |  +- com.fasterxml.jackson.core:jackson-databind:jar:2.13.3:compile
[INFO] |  +- com.fasterxml.jackson.datatype:jackson-datatype-jsr310:jar:2.13.3:compile
[INFO] |  +- com.fasterxml.jackson.dataformat:jackson-dataformat-xml:jar:2.13.3:compile
[INFO] |  |  +- org.codehaus.woodstox:stax2-api:jar:4.2.1:compile
[INFO] |  |  \- com.fasterxml.woodstox:woodstox-core:jar:6.3.1:compile
[INFO] |  +- com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:jar:2.13.3:compile
[INFO] |  |  \- org.yaml:snakeyaml:jar:1.30:compile
[INFO] |  +- com.fasterxml.jackson.core:jackson-annotations:jar:2.13.3:compile
[INFO] |  +- com.jayway.jsonpath:json-path:jar:2.7.0:compile
[INFO] |  |  \- net.minidev:json-smart:jar:2.4.7:compile
[INFO] |  |     \- net.minidev:accessors-smart:jar:2.4.7:compile
[INFO] |  +- org.apache.logging.log4j:log4j-core:jar:2.18.0:compile
[INFO] |  |  \- org.apache.logging.log4j:log4j-api:jar:2.18.0:compile
[INFO] |  +- org.apache.logging.log4j:log4j-1.2-api:jar:2.18.0:compile
[INFO] |  +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.18.0:compile
[INFO] |  +- org.apache.logging.log4j:log4j-jcl:jar:2.18.0:compile
[INFO] |  |  \- commons-logging:commons-logging:jar:1.2:compile
[INFO] |  +- org.apache.logging.log4j:log4j-jul:jar:2.18.0:compile
[INFO] |  \- org.fusesource.jansi:jansi:jar:2.4.0:compile
[INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-beans:jar:2.10-SNAPSHOT:compile
[INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-settings:jar:2.10-SNAPSHOT:compile
[INFO] +- org.apache.tika:tika-core:jar:2.4.1:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.7.36:compile
[INFO] +- org.apache.tika:tika-parsers-standard-package:jar:2.4.1:compile
[INFO] |  +- org.apache.tika:tika-parser-apple-module:jar:2.4.1:compile
[INFO] |  |  +- org.apache.tika:tika-parser-zip-commons:jar:2.4.1:compile
[INFO] |  |  \- com.googlecode.plist:dd-plist:jar:1.23:compile
[INFO] |  +- org.apache.tika:tika-parser-audiovideo-module:jar:2.4.1:compile
[INFO] |  |  \- com.drewnoakes:metadata-extractor:jar:2.18.0:compile
[INFO] |  |     \- com.adobe.xmp:xmpcore:jar:6.1.11:compile
[INFO] |  +- org.apache.tika:tika-parser-cad-module:jar:2.4.1:compile
[INFO] |  +- org.apache.tika:tika-parser-code-module:jar:2.4.1:compile
[INFO] |  |  +- org.codelibs:jhighlight:jar:1.1.0:compile
[INFO] |  |  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] |  |  +- org.ow2.asm:asm:jar:9.3:compile
[INFO] |  |  +- com.epam:parso:jar:2.0.14:compile
[INFO] |  |  \- org.tallison:jmatio:jar:1.5:compile
[INFO] |  +- org.apache.tika:tika-parser-crypto-module:jar:2.4.1:compile
[INFO] |  |  +- org.bouncycastle:bcmail-jdk15on:jar:1.70:compile
[INFO] |  |  |  +- org.bouncycastle:bcutil-jdk15on:jar:1.70:compile
[INFO] |  |  |  \- org.bouncycastle:bcpkix-jdk15on:jar:1.70:compile
[INFO] |  |  \- org.bouncycastle:bcprov-jdk15on:jar:1.70:compile
[INFO] |  +- org.apache.tika:tika-parser-digest-commons:jar:2.4.1:compile
[INFO] |  |  \- commons-codec:commons-codec:jar:1.15:compile
[INFO] |  +- org.apache.tika:tika-parser-font-module:jar:2.4.1:compile
[INFO] |  |  \- org.apache.pdfbox:fontbox:jar:2.0.26:compile
[INFO] |  +- org.apache.tika:tika-parser-html-module:jar:2.4.1:compile
[INFO] |  |  \- org.apache.tika:tika-parser-html-commons:jar:2.4.1:compile
[INFO] |  |     \- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] |  +- org.apache.tika:tika-parser-image-module:jar:2.4.1:compile
[INFO] |  |  +- com.github.jai-imageio:jai-imageio-core:jar:1.4.0:compile
[INFO] |  |  \- org.apache.pdfbox:jbig2-imageio:jar:3.0.4:compile
[INFO] |  +- org.apache.tika:tika-parser-mail-module:jar:2.4.1:compile
[INFO] |  |  \- org.apache.tika:tika-parser-mail-commons:jar:2.4.1:compile
[INFO] |  |     +- org.apache.james:apache-mime4j-core:jar:0.8.4:compile
[INFO] |  |     \- org.apache.james:apache-mime4j-dom:jar:0.8.4:compile
[INFO] |  +- org.apache.tika:tika-parser-microsoft-module:jar:2.4.1:compile
[INFO] |  |  +- com.pff:java-libpst:jar:0.9.3:compile
[INFO] |  |  +- org.apache.commons:commons-lang3:jar:3.12.0:compile
[INFO] |  |  +- org.apache.poi:poi:jar:5.2.2:compile
[INFO] |  |  |  +- org.apache.commons:commons-math3:jar:3.6.1:compile
[INFO] |  |  |  \- com.zaxxer:SparseBitSet:jar:1.2:compile
[INFO] |  |  +- org.apache.poi:poi-scratchpad:jar:5.2.2:compile
[INFO] |  |  +- org.apache.poi:poi-ooxml:jar:5.2.2:compile
[INFO] |  |  |  +- org.apache.poi:poi-ooxml-lite:jar:5.2.2:compile
[INFO] |  |  |  +- org.apache.xmlbeans:xmlbeans:jar:5.0.3:compile
[INFO] |  |  |  \- com.github.virtuald:curvesapi:jar:1.07:compile
[INFO] |  |  +- com.healthmarketscience.jackcess:jackcess:jar:4.0.1:compile
[INFO] |  |  \- com.healthmarketscience.jackcess:jackcess-encrypt:jar:4.0.1:compile
[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.36:compile
[INFO] |  +- org.apache.tika:tika-parser-miscoffice-module:jar:2.4.1:compile
[INFO] |  |  \- org.apache.commons:commons-collections4:jar:4.4:compile
[INFO] |  +- org.apache.tika:tika-parser-news-module:jar:2.4.1:compile
[INFO] |  |  \- com.rometools:rome:jar:1.18.0:compile
[INFO] |  |     \- com.rometools:rome-utils:jar:1.18.0:compile
[INFO] |  +- org.apache.tika:tika-parser-ocr-module:jar:2.4.1:compile
[INFO] |  |  \- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] |  +- org.apache.tika:tika-parser-pdf-module:jar:2.4.1:compile
[INFO] |  |  +- org.apache.pdfbox:pdfbox:jar:2.0.26:compile
[INFO] |  |  +- org.apache.pdfbox:pdfbox-tools:jar:2.0.26:compile
[INFO] |  |  |  \- org.apache.pdfbox:pdfbox-debugger:jar:2.0.26:compile
[INFO] |  |  \- org.apache.pdfbox:jempbox:jar:1.8.16:compile
[INFO] |  +- org.apache.tika:tika-parser-pkg-module:jar:2.4.1:compile
[INFO] |  |  +- org.tukaani:xz:jar:1.9:compile
[INFO] |  |  +- org.brotli:dec:jar:0.1.2:compile
[INFO] |  |  \- com.github.junrar:junrar:jar:7.5.2:compile
[INFO] |  +- org.apache.tika:tika-parser-text-module:jar:2.4.1:compile
[INFO] |  |  \- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] |  +- org.apache.tika:tika-parser-webarchive-module:jar:2.4.1:compile
[INFO] |  |  +- org.netpreserve:jwarc:jar:0.18.1:compile
[INFO] |  |  \- org.apache.commons:commons-compress:jar:1.21:compile
[INFO] |  +- org.apache.tika:tika-parser-xml-module:jar:2.4.1:compile
[INFO] |  |  \- xerces:xercesImpl:jar:2.12.2:compile
[INFO] |  |     \- xml-apis:xml-apis:jar:1.4.01:compile
[INFO] |  +- org.apache.tika:tika-parser-xmp-commons:jar:2.4.1:compile
[INFO] |  |  \- org.apache.pdfbox:xmpbox:jar:2.0.26:compile
[INFO] |  +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
[INFO] |  \- org.gagravarr:vorbis-java-core:jar:0.8:compile
[INFO] +- org.apache.tika:tika-parser-scientific-module:jar:2.4.1:compile
[INFO] |  +- org.apache.sis.core:sis-utility:jar:1.2:compile
[INFO] |  |  \- javax.measure:unit-api:jar:1.0:compile
[INFO] |  +- org.apache.sis.storage:sis-netcdf:jar:1.2:compile
[INFO] |  |  +- org.apache.sis.storage:sis-storage:jar:1.2:compile
[INFO] |  |  |  \- org.apache.sis.core:sis-feature:jar:1.2:compile
[INFO] |  |  \- org.apache.sis.core:sis-referencing:jar:1.2:compile
[INFO] |  +- org.apache.sis.core:sis-metadata:jar:1.2:compile
[INFO] |  |  \- jakarta.xml.bind:jakarta.xml.bind-api:jar:3.0.1:compile
[INFO] |  +- org.opengis:geoapi:jar:3.0.1:compile
[INFO] |  +- edu.ucar:netcdf4:jar:4.5.5:compile
[INFO] |  |  +- edu.ucar:cdm:jar:4.5.5:compile
[INFO] |  |  |  +- edu.ucar:udunits:jar:4.5.5:compile
[INFO] |  |  |  +- edu.ucar:httpservices:jar:4.5.5:compile
[INFO] |  |  |  |  +- org.apache.httpcomponents:httpclient:jar:4.5.13:compile
[INFO] |  |  |  |  \- org.apache.httpcomponents:httpmime:jar:4.5.13:compile
[INFO] |  |  |  +- org.apache.httpcomponents:httpcore:jar:4.4.15:compile
[INFO] |  |  |  +- joda-time:joda-time:jar:2.11.1:compile
[INFO] |  |  |  +- org.quartz-scheduler:quartz:jar:2.3.2:compile
[INFO] |  |  |  |  +- com.mchange:c3p0:jar:0.9.5.4:compile
[INFO] |  |  |  |  +- com.mchange:mchange-commons-java:jar:0.2.15:compile
[INFO] |  |  |  |  \- com.zaxxer:HikariCP-java7:jar:2.4.13:compile
[INFO] |  |  |  \- com.beust:jcommander:jar:1.82:compile
[INFO] |  |  \- net.java.dev.jna:jna:jar:5.12.1:compile
[INFO] |  +- edu.ucar:grib:jar:4.5.5:compile
[INFO] |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] |  |  +- org.jdom:jdom2:jar:2.0.6.1:compile
[INFO] |  |  +- edu.ucar:jj2000:jar:5.2:compile
[INFO] |  |  \- org.itadaki:bzip2:jar:0.9.1:compile
[INFO] |  +- net.jcip:jcip-annotations:jar:1.0:compile
[INFO] |  +- org.apache.commons:commons-csv:jar:1.9.0:compile
[INFO] |  \- org.glassfish.jaxb:jaxb-runtime:jar:2.3.6:compile
[INFO] |     +- org.glassfish.jaxb:txw2:jar:2.3.6:compile
[INFO] |     +- com.sun.istack:istack-commons-runtime:jar:3.0.12:compile
[INFO] |     \- com.sun.activation:jakarta.activation:jar:2.0.1:compile
[INFO] +- org.apache.tika:tika-parser-sqlite3-module:jar:2.4.1:compile
[INFO] |  +- org.apache.tika:tika-parser-jdbc-commons:jar:2.4.1:compile
[INFO] |  \- org.xerial:sqlite-jdbc:jar:3.36.0.3:compile
[INFO] +- org.apache.tika:tika-langdetect-optimaize:jar:2.4.1:compile
[INFO] |  \- com.optimaize.languagedetector:language-detector:jar:0.6:compile
[INFO] |     +- net.arnx:jsonic:jar:1.2.11:compile
[INFO] |     +- com.intellij:annotations:jar:12.0:compile
[INFO] |     \- com.google.guava:guava:jar:31.1-jre:compile
[INFO] |        +- com.google.guava:failureaccess:jar:1.0.1:compile
[INFO] |        +- com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile
[INFO] |        +- com.google.code.findbugs:jsr305:jar:3.0.2:compile
[INFO] |        +- org.checkerframework:checker-qual:jar:3.12.0:compile
[INFO] |        +- com.google.errorprone:error_prone_annotations:jar:2.11.0:compile
[INFO] |        \- com.google.j2objc:j2objc-annotations:jar:1.3:compile
[INFO] +- com.jcraft:jsch:jar:0.1.55:compile
[INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-test-framework:jar:2.10-SNAPSHOT:test
[INFO] |  +- org.hamcrest:hamcrest-all:jar:1.3:test
[INFO] |  +- junit:junit:jar:4.13.2:test
[INFO] |  |  \- org.hamcrest:hamcrest-core:jar:1.3:test
[INFO] |  \- com.carrotsearch.randomizedtesting:randomizedtesting-runner:jar:2.8.1:test
[INFO] \- fr.pilato.elasticsearch.crawler:fscrawler-test-documents:jar:2.10-SNAPSHOT:test

David
On 1 Sept. 2022 at 11:40 +0200, Tim Allison <ta...@apache.org> wrote:
> And, what is recorded in the X-Tika-ParsedBy value in the metadata object?
>
> > On Thu, Sep 1, 2022 at 5:36 AM Tim Allison <ta...@apache.org> wrote:
> > > What are your dependencies? Which parsers are in AutoDetectParser?

Re: .TesseractOCRParser does not extract text although Tesseract does

Posted by Tim Allison <ta...@apache.org>.
And, what is recorded in the X-Tika-ParsedBy value in the metadata object?

On Thu, Sep 1, 2022 at 5:36 AM Tim Allison <ta...@apache.org> wrote:

> What are your dependencies? Which parsers are in AutoDetectParser?

Re: .TesseractOCRParser does not extract text although Tesseract does

Posted by Tim Allison <ta...@apache.org>.
What are your dependencies? Which parsers are in AutoDetectParser?
