Posted to user@tika.apache.org by PGNet Dev <pg...@gmail.com> on 2022/07/23 16:58:48 UTC
adding explicit OCR parser config to tika-server-config-custom.xml disables working OCR image processing?
I'm running tika 2.4.2/snap + tesseract 5 for OCR. ImageMagick 7 is installed for image processing.
it's serving as the backend to a dovecot/fts-tika setup.
If I exec tika with custom config
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<server>
<params>
<logLevel>debug</logLevel>
<javaPath>/usr/bin/java</javaPath>
<noFork>false</noFork>
<forkedJvmArgs>
<arg>-Xms1g</arg>
<arg>-Xmx1g</arg>
<arg>-Dpdfbox.fontcache=/var/tika</arg>
</forkedJvmArgs>
<digest>sha256</digest>
<enableUnsecureFeatures>false</enableUnsecureFeatures>
<id></id>
<maxFiles>100000</maxFiles>
<maxForkedStartupMillis>120000</maxForkedStartupMillis>
<maxRestarts>-1</maxRestarts>
<minimumTimeoutMillis>30000</minimumTimeoutMillis>
<returnStackTrace>false</returnStackTrace>
<taskPulseMillis>10000</taskPulseMillis>
<taskTimeoutMillis>300000</taskTimeoutMillis>
<endpoints>
<endpoint>tika</endpoint>
<endpoint>status</endpoint>
<endpoint>rmeta</endpoint>
</endpoints>
</params>
</server>
</properties>
and pass a jpg as an email attachment, all's good
i see tesseract invoked, and after receipt & indexing by dovecot, i can exec a body search on OCR'd text from the image, and it's found as expected
but, if i just add a specific parser config to the above,
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
+ <parsers>
+ <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
+ <params>
+ <param name="applyRotation" type="bool">true</param>
+ <param name="enableImagePreprocessing" type="bool">true</param>
+ <param name="maxFileSizeToOcr" type="long">2147483647</param>
+ <param name="minFileSizeToOcr" type="long">0</param>
+ <param name="preserveInterwordSpacing" type="bool">true</param>
+ <param name="timeoutSeconds" type="int">180</param>
+ </params>
+ </parser>
+ </parsers>
<server>
<params>
...
relaunch tika, and resend the attachment, i see _no_ errors, and the attachment/email _is_ delivered,
but i never see tesseract invoked in top, and a search after delivery on image-text returns empty.
it's not in the index.
what in that additional parser config is causing the problem?
Re: bug: adding to tika 2.4.2 config.xml truncates metadata return
Posted by PGNet Dev <pg...@gmail.com>.
I'd stared repeatedly at
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
in the docs. it seemed reasonable that, since TesseractOCRParser *is* the default parser, excluding it made no sense.
guess not!
with config,
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
+ <parser class="org.apache.tika.parser.DefaultParser">
+ <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
+ </parser>
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
<params>
<param name="skipOcr" type="bool">false</param>
<param name="tessdataPath" type="string">/usr/share/tesseract/tessdata</param>
<param name="tesseractPath" type="string">/usr/bin</param>
<param name="maxFileSizeToOcr" type="long">2147483647</param>
<param name="minFileSizeToOcr" type="long">0</param>
<param name="applyRotation" type="bool">true</param>
<param name="enableImagePreprocessing" type="bool">true</param>
<param name="preserveInterwordSpacing" type="bool">true</param>
<param name="timeoutSeconds" type="int">180</param>
</params>
</parser>
</parsers>
<server>
curl correctly returns
curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core Test.SNAPSHOT">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/"
pdf:PDFVersion="1.7"
pdf:hasXFA="false"
pdf:hasCollection="false"
pdf:encrypted="false"
pdf:hasMarkedContent="false"
pdf:producer="Adobe PDF Library 15.0"
pdf:hasXMP="true"
xmp:CreatorTool="Adobe InDesign 15.1 (Macintosh)"
xmp:CreateDate="2020-10-14T17:08:10Z"
xmp:ModifyDate="2020-10-14T17:08:10Z"
xmp:MetadataDate="2020-10-14T17:08:10Z"
dc:format="application/pdf; version=1.7"
dc:language="en-US"
xmpMM:DocumentID="xmp.id:7a865d84-8dbf-4015-96b7-fdae89a9603b"
xmpTPg:NPages="1">
<pdf:unmappedUnicodeCharsPerPage>
<rdf:Seq>
<rdf:li>0</rdf:li>
</rdf:Seq>
</pdf:unmappedUnicodeCharsPerPage>
<pdf:charsPerPage>
<rdf:Seq>
<rdf:li>794</rdf:li>
</rdf:Seq>
</pdf:charsPerPage>
<pdf:annotationTypes>
<rdf:Bag>
<rdf:li>95e8dd6e9b4c5a3d-3d44cd989a3a348c</rdf:li>
<rdf:li>95e8dd6f9b4c5a3e-3d44cd979a3a348b</rdf:li>
<rdf:li>95e8dd709b4c5a3f-3d44cd969a3a348a</rdf:li>
<rdf:li>95e8dd719b4c5a40-3d44cd959a3a3489</rdf:li>
<rdf:li>95e8dd729b4c5a41-3d44cd949a3a3488</rdf:li>
</rdf:Bag>
</pdf:annotationTypes>
<pdf:annotationSubtypes>
<rdf:Bag>
<rdf:li>Link</rdf:li>
</rdf:Bag>
</pdf:annotationSubtypes>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
and
journalctl -f -u tika
Jul 26 10:50:05 mx-test.example.net tika[14096]: INFO [qtp641030345-33] 10:50:05,573 org.apache.tika.server.core.resource.MetadataResource /meta (autodetecting type)
finally, with same config, on receipt of email, submission to tika backend via dovecot,
journalctl -f -u tika
Jul 26 11:16:04 mx-test.example.net tika[14096]: INFO [qtp641030345-31] 11:16:04,013 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
dovecot logs
==> /var/log/dovecot/dovecot-debug.log <==
2022-07-26 11:16:03 indexer-worker(postmaster@example.com)<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve: Xapian library version: 1.4.19
2022-07-26 11:16:03 indexer-worker(postmaster@example.com)<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Opened DB (RO) messages=0 version=1 shards=1
2022-07-26 11:16:03 indexer-worker(postmaster@example.com)<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Last UID uid=0
2022-07-26 11:16:03 indexer-worker(postmaster@example.com)<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Last UID uid=0
2022-07-26 11:16:03 indexer-worker(postmaster@example.com)<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Opened DB (RW; current.1658584490654708) messages=0 version=1
2022-07-26 11:16:03 indexer-worker(postmaster@example.com)<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Indexing uid=93698
the pause here is the submit to tika, and the return ...
2022-07-26 11:16:04 indexer-worker(postmaster@example.com)<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Committed 1 changes to DB (RW; current.1658584490654708) in 0.074 secs
2022-07-26 11:16:04 indexer-worker(postmaster@example.com)<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve: Update transaction completed in 0.386 secs
... with the subsequently successfully updated index
yay. o/
Re: bug: adding to tika 2.4.2 config.xml truncates metadata return
Posted by Tim Allison <ta...@apache.org>.
Try something like this:
<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
</parser>
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
<params>
<param name="timeoutSeconds" type="int">180</param>
</params>
</parser>
</parsers>
<server stuff../>
</properties>
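For reference, a complete file along these lines might look like the sketch below. It only combines the <parsers> block above with the <server> block from the original post in this thread; nothing in it is new, and timeoutSeconds is the only OCR parameter set. Treat it as a starting point under those assumptions, not a verified configuration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Load the usual default parsers, minus the OCR parser ... -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- ... then add the OCR parser back with explicit settings. -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="timeoutSeconds" type="int">180</param>
      </params>
    </parser>
  </parsers>
  <!-- Server section copied from the original post in this thread. -->
  <server>
    <params>
      <logLevel>debug</logLevel>
      <javaPath>/usr/bin/java</javaPath>
      <noFork>false</noFork>
      <forkedJvmArgs>
        <arg>-Xms1g</arg>
        <arg>-Xmx1g</arg>
        <arg>-Dpdfbox.fontcache=/var/tika</arg>
      </forkedJvmArgs>
      <digest>sha256</digest>
      <enableUnsecureFeatures>false</enableUnsecureFeatures>
      <id></id>
      <maxFiles>100000</maxFiles>
      <maxForkedStartupMillis>120000</maxForkedStartupMillis>
      <maxRestarts>-1</maxRestarts>
      <minimumTimeoutMillis>30000</minimumTimeoutMillis>
      <returnStackTrace>false</returnStackTrace>
      <taskPulseMillis>10000</taskPulseMillis>
      <taskTimeoutMillis>300000</taskTimeoutMillis>
      <endpoints>
        <endpoint>tika</endpoint>
        <endpoint>status</endpoint>
        <endpoint>rmeta</endpoint>
      </endpoints>
    </params>
  </server>
</properties>
```

The exclude-then-redeclare pattern keeps the rest of the default parser set in place while letting the explicitly configured TesseractOCRParser be the only OCR instance.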
On Tue, Jul 26, 2022 at 6:52 AM PGNet Dev <pg...@gmail.com> wrote:
> removing dovecot from the equation, reduced this to just tika,
> reproducible here
>
> running
>
> ls -al /srv/tika/tika-server.jar
> lrwxrwxrwx 1 root root 50 Jul 26 05:42 /srv/tika/tika-server.jar -> tika-server-standard-2.4.2-20220725.215245-121.jar
>
> systemctl status tika -ln0
> ● tika.service - Apache Tika server
> Loaded: loaded (/etc/systemd/system/tika.service;
> enabled; vendor preset: disabled)
> Active: active (running) since Tue 2022-07-26
> 05:43:01 EDT; 29min ago
> Main PID: 10829 (java)
> Tasks: 53 (limit: 8812)
> Memory: 215.9M
> CPU: 37.667s
> CGroup: /system.slice/tika.service
> ├─ 10829 /usr/bin/java -Dpdfbox.fontcache=/var/tika -XX:ParallelGCThreads=1 -XX:CICompilerCount=2 -XX:-CICompilerCountPerCPU -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml --host 127.0.0.1 --port 9998
> └─ 10863 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-12945021525641519393 -numRestarts 0
>
> on
>
> lsb_release -rd
> Description: Fedora release 36 (Thirty Six)
> Release: 36
>
> with
>
> tesseract --version
> tesseract 5.0.1
> leptonica-1.82.0
> libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng
> 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.3
> Found OpenMP 201511
>
> stream --version
> Version: ImageMagick 7.1.0-44 Q16-HDRI x86_64 20294
> https://imagemagick.org
> Copyright: (C) 1999 ImageMagick Studio LLC
> License: https://imagemagick.org/script/license.php
> Features: Cipher DPC HDRI Modules OpenMP(4.5)
> Delegates (built-in): bzlib cairo djvu fontconfig freetype
> gslib gvc heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png
> ps raqm raw rsvg tiff webp wmf x xml zip zlib
> Compiler: gcc (12.1)
>
> java -version
> Picked up JAVA_TOOL_OPTIONS: -Xmx512M
> openjdk version "18.0.1.1" 2022-04-22
> OpenJDK Runtime Environment 22.3 (build 18.0.1.1+2)
> OpenJDK 64-Bit Server VM 22.3 (build 18.0.1.1+2, mixed
> mode, sharing)
>
> & custom config
>
> cat /etc/tika/tika-server-config-custom.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
> <parsers>
> </parsers>
> <server>
> <params>
> <logLevel>debug</logLevel>
> <javaPath>/usr/bin/java</javaPath>
> <noFork>false</noFork>
> <forkedJvmArgs>
> <arg>-Xms1g</arg>
> <arg>-Xmx1g</arg>
> <arg>-Dpdfbox.fontcache=/var/tika</arg>
> <arg>-Dlog4j2.debug</arg>
> </forkedJvmArgs>
> <digest>sha256</digest>
> <enableUnsecureFeatures>false</enableUnsecureFeatures>
> <id></id>
> <maxFiles>100000</maxFiles>
> <maxForkedStartupMillis>120000</maxForkedStartupMillis>
> <maxRestarts>-1</maxRestarts>
> <minimumTimeoutMillis>30000</minimumTimeoutMillis>
> <returnStackTrace>false</returnStackTrace>
> <taskPulseMillis>10000</taskPulseMillis>
> <taskTimeoutMillis>300000</taskTimeoutMillis>
> <endpoints>
> <endpoint>tika</endpoint>
> <endpoint>status</endpoint>
> <endpoint>rmeta</endpoint>
> </endpoints>
> </params>
> </server>
> </properties>
>
> on exec, passing a test pdf,
>
> curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta
>
> complete metadata's returned
>
> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core Test.SNAPSHOT">
> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
> <rdf:Description rdf:about=""
> xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
> xmlns:xmp="http://ns.adobe.com/xap/1.0/"
> xmlns:dc="http://purl.org/dc/elements/1.1/"
> xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
> xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/"
> pdf:PDFVersion="1.7"
> pdf:hasXFA="false"
> pdf:hasCollection="false"
> pdf:encrypted="false"
> pdf:hasMarkedContent="false"
> pdf:producer="Adobe PDF Library 15.0"
> pdf:hasXMP="true"
> xmp:CreatorTool="Adobe InDesign 15.1 (Macintosh)"
> xmp:CreateDate="2020-10-14T17:08:10Z"
> xmp:ModifyDate="2020-10-14T17:08:10Z"
> xmp:MetadataDate="2020-10-14T17:08:10Z"
> dc:format="application/pdf; version=1.7"
> dc:language="en-US"
> xmpMM:DocumentID="xmp.id:7a865d84-8dbf-4015-96b7-fdae89a9603b"
> xmpTPg:NPages="1">
> <pdf:unmappedUnicodeCharsPerPage>
> <rdf:Seq>
> <rdf:li>0</rdf:li>
> </rdf:Seq>
> </pdf:unmappedUnicodeCharsPerPage>
> <pdf:charsPerPage>
> <rdf:Seq>
> <rdf:li>794</rdf:li>
> </rdf:Seq>
> </pdf:charsPerPage>
> <pdf:annotationTypes>
> <rdf:Bag>
> <rdf:li>95e8dd6e9b4c5a3d-3d44cd989a3a348c</rdf:li>
> <rdf:li>95e8dd6f9b4c5a3e-3d44cd979a3a348b</rdf:li>
> <rdf:li>95e8dd709b4c5a3f-3d44cd969a3a348a</rdf:li>
> <rdf:li>95e8dd719b4c5a40-3d44cd959a3a3489</rdf:li>
> <rdf:li>95e8dd729b4c5a41-3d44cd949a3a3488</rdf:li>
> </rdf:Bag>
> </pdf:annotationTypes>
> <pdf:annotationSubtypes>
> <rdf:Bag>
> <rdf:li>Link</rdf:li>
> </rdf:Bag>
> </pdf:annotationSubtypes>
> </rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
>
> if i add TesseractOCRParser class config to the above, for simple param
> override
>
> cat /etc/tika/tika-server-config-custom.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
> <parsers>
> + <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> + <params>
> + <param name="timeoutSeconds" type="int">180</param>
> + </params>
> + </parser>
> </parsers>
> ...
>
> exec
>
> systemctl restart tika
> curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta
>
> returns incomplete/truncated data
>
> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core Test.SNAPSHOT">
> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
> <rdf:Description rdf:about=""/>
> </rdf:RDF>
> </x:xmpmeta>
>
bug: adding to tika 2.4.2 config.xml truncates metadata return
Posted by PGNet Dev <pg...@gmail.com>.
removing dovecot from the equation, reduced this to just tika, reproducible here
running
ls -al /srv/tika/tika-server.jar
lrwxrwxrwx 1 root root 50 Jul 26 05:42 /srv/tika/tika-server.jar -> tika-server-standard-2.4.2-20220725.215245-121.jar
systemctl status tika -ln0
● tika.service - Apache Tika server
Loaded: loaded (/etc/systemd/system/tika.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2022-07-26 05:43:01 EDT; 29min ago
Main PID: 10829 (java)
Tasks: 53 (limit: 8812)
Memory: 215.9M
CPU: 37.667s
CGroup: /system.slice/tika.service
├─ 10829 /usr/bin/java -Dpdfbox.fontcache=/var/tika -XX:ParallelGCThreads=1 -XX:CICompilerCount=2 -XX:-CICompilerCountPerCPU -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml --host 127.0.0.1 --port 9998
└─ 10863 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-12945021525641519393 -numRestarts 0
on
lsb_release -rd
Description: Fedora release 36 (Thirty Six)
Release: 36
with
tesseract --version
tesseract 5.0.1
leptonica-1.82.0
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.3
Found OpenMP 201511
stream --version
Version: ImageMagick 7.1.0-44 Q16-HDRI x86_64 20294 https://imagemagick.org
Copyright: (C) 1999 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC HDRI Modules OpenMP(4.5)
Delegates (built-in): bzlib cairo djvu fontconfig freetype gslib gvc heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png ps raqm raw rsvg tiff webp wmf x xml zip zlib
Compiler: gcc (12.1)
java -version
Picked up JAVA_TOOL_OPTIONS: -Xmx512M
openjdk version "18.0.1.1" 2022-04-22
OpenJDK Runtime Environment 22.3 (build 18.0.1.1+2)
OpenJDK 64-Bit Server VM 22.3 (build 18.0.1.1+2, mixed mode, sharing)
& custom config
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
</parsers>
<server>
<params>
<logLevel>debug</logLevel>
<javaPath>/usr/bin/java</javaPath>
<noFork>false</noFork>
<forkedJvmArgs>
<arg>-Xms1g</arg>
<arg>-Xmx1g</arg>
<arg>-Dpdfbox.fontcache=/var/tika</arg>
<arg>-Dlog4j2.debug</arg>
</forkedJvmArgs>
<digest>sha256</digest>
<enableUnsecureFeatures>false</enableUnsecureFeatures>
<id></id>
<maxFiles>100000</maxFiles>
<maxForkedStartupMillis>120000</maxForkedStartupMillis>
<maxRestarts>-1</maxRestarts>
<minimumTimeoutMillis>30000</minimumTimeoutMillis>
<returnStackTrace>false</returnStackTrace>
<taskPulseMillis>10000</taskPulseMillis>
<taskTimeoutMillis>300000</taskTimeoutMillis>
<endpoints>
<endpoint>tika</endpoint>
<endpoint>status</endpoint>
<endpoint>rmeta</endpoint>
</endpoints>
</params>
</server>
</properties>
on exec, passing a test pdf,
curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta
complete metadata's returned
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core Test.SNAPSHOT">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/"
pdf:PDFVersion="1.7"
pdf:hasXFA="false"
pdf:hasCollection="false"
pdf:encrypted="false"
pdf:hasMarkedContent="false"
pdf:producer="Adobe PDF Library 15.0"
pdf:hasXMP="true"
xmp:CreatorTool="Adobe InDesign 15.1 (Macintosh)"
xmp:CreateDate="2020-10-14T17:08:10Z"
xmp:ModifyDate="2020-10-14T17:08:10Z"
xmp:MetadataDate="2020-10-14T17:08:10Z"
dc:format="application/pdf; version=1.7"
dc:language="en-US"
xmpMM:DocumentID="xmp.id:7a865d84-8dbf-4015-96b7-fdae89a9603b"
xmpTPg:NPages="1">
<pdf:unmappedUnicodeCharsPerPage>
<rdf:Seq>
<rdf:li>0</rdf:li>
</rdf:Seq>
</pdf:unmappedUnicodeCharsPerPage>
<pdf:charsPerPage>
<rdf:Seq>
<rdf:li>794</rdf:li>
</rdf:Seq>
</pdf:charsPerPage>
<pdf:annotationTypes>
<rdf:Bag>
<rdf:li>95e8dd6e9b4c5a3d-3d44cd989a3a348c</rdf:li>
<rdf:li>95e8dd6f9b4c5a3e-3d44cd979a3a348b</rdf:li>
<rdf:li>95e8dd709b4c5a3f-3d44cd969a3a348a</rdf:li>
<rdf:li>95e8dd719b4c5a40-3d44cd959a3a3489</rdf:li>
<rdf:li>95e8dd729b4c5a41-3d44cd949a3a3488</rdf:li>
</rdf:Bag>
</pdf:annotationTypes>
<pdf:annotationSubtypes>
<rdf:Bag>
<rdf:li>Link</rdf:li>
</rdf:Bag>
</pdf:annotationSubtypes>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
if i add TesseractOCRParser class config to the above, for simple param override
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
+ <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
+ <params>
+ <param name="timeoutSeconds" type="int">180</param>
+ </params>
+ </parser>
</parsers>
...
exec
systemctl restart tika
curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta
returns incomplete/truncated data
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core Test.SNAPSHOT">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""/>
</rdf:RDF>
</x:xmpmeta>
Re: adding explicit OCR parser config to tika-server-config-custom.xml disables working OCR image processing?
Posted by PGNet Dev <pg...@gmail.com>.
on startup of tika (tika-server-standard-2.4.2-20220723.145242-114.jar), mod'd to enable debug logs, with config
...
<properties>
<parsers>
<!--
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
</parser>
-->
...
systemctl restart tika
journalctl -f -u tika | grep -i tesseract
Jul 23 14:07:18 mx-test tika[40896]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.parser.ocr.TesseractOCRParser
Jul 23 14:07:18 mx-test tika[40896]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.parser.ocr.TesseractOCRConfig
Jul 23 14:07:18 mx-test tika[40896]: DEBUG [main] 14:07:18,257 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:07:18 mx-test tika[40896]: DEBUG [main] 14:07:18,259 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
Jul 23 14:07:19 mx-test tika[40896]: DEBUG [main] 14:07:19,979 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:07:19 mx-test tika[40896]: DEBUG [main] 14:07:19,980 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
Jul 23 14:07:20 mx-test tika[40896]: DEBUG [main] 14:07:20,140 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:07:20 mx-test tika[40896]: DEBUG [main] 14:07:20,141 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
on receipt of email+img attach, tesseract IS invoked
Jul 23 14:18:58 mx-test tika[41388]: INFO [qtp444127949-31] 14:18:58,527 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract is installed and is being invoked. This can add greatly to processing time. If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
Jul 23 14:18:58 mx-test tika[41388]: DEBUG [qtp444127949-31] 14:18:58,530 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract command: tesseract /tmp/apache-tika-5913626684069109310.tmp /tmp/apache-tika-17854463223950821902.tmp --psm 1 -l eng -c page_separator= -c preserve_interword_spaces=0 txt
Jul 23 14:19:04 mx-test tika[41388]: DEBUG [Thread-23] 14:19:04,973 org.apache.tika.parser.ocr.TesseractOCRParser
Jul 23 14:19:04 mx-test tika[41388]: DEBUG [Thread-24] 14:19:04,973 org.apache.tika.parser.ocr.TesseractOCRParser Estimating resolution as 304
and the parsed image result _is_ passed back to dovecot, where it's correctly indexed, and embedded terms are searchable
otoh, with config
...
<properties>
<parsers>
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
</parser>
...
on startup, same
Jul 23 14:15:32 mx-test tika[41205]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.parser.ocr.TesseractOCRParser
Jul 23 14:15:32 mx-test tika[41205]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.parser.ocr.TesseractOCRConfig
Jul 23 14:15:32 mx-test tika[41205]: DEBUG [main] 14:15:32,685 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:15:32 mx-test tika[41205]: DEBUG [main] 14:15:32,686 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,472 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,473 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,631 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,632 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
but, on receipt of email+img attach,
(empty)
Re: adding explicit OCR parser config to tika-server-config-custom.xml disables working OCR image processing?
Posted by PGNet Dev <pg...@gmail.com>.
narrowing it down, this works
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
</parsers>
<server>
<params>
...
but this doesn't,
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
+ <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
+ </parser>
</parsers>
<server>
<params>
...
checking tika source, that appears to be the right path,
find . | grep /TesseractOCRParser.class
./org/apache/tika/parser/ocr/TesseractOCRParser.class
:-/
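Note: going by the fix reported elsewhere in this thread, the class path itself is not the problem. The likely cause is that a <parsers> block naming only TesseractOCRParser leaves out DefaultParser, so only the listed parser is loaded and the rest of the default set (including whatever normally recognizes the attachment and passes it on to OCR) is missing. A minimal sketch of the non-working fragment with DefaultParser added back, assuming the bare TesseractOCRParser element simply falls back to its built-in defaults:

```xml
<parsers>
  <!-- Keep every default parser except the OCR one ... -->
  <parser class="org.apache.tika.parser.DefaultParser">
    <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
  </parser>
  <!-- ... and declare the OCR parser explicitly; per-parser params would go inside it. -->
  <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
  </parser>
</parsers>
```

This would also fit the observation that an empty <parsers></parsers> block works: with nothing listed, the default parser set appears to be used unchanged.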