Posted to user@tika.apache.org by PGNet Dev <pg...@gmail.com> on 2022/07/15 13:41:16 UTC
tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
i'm running tika-server 2.4.1 on a linux box,
lsb_release -rd
Description: Fedora release 36 (Thirty Six)
Release: 36
uname -rm
5.18.11-200.fc36.x86_64 x86_64
java -version
Picked up JAVA_TOOL_OPTIONS: -Xmx512M
openjdk version "18.0.1" 2022-04-19
OpenJDK Runtime Environment 22.3 (build 18.0.1+10)
OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10, mixed mode, sharing)
ps ax | grep tika-server
1003 ? Ssl 0:12 /usr/bin/java -jar /srv/webapps/tika/tika-server.jar -c /usr/local/etc/tika/tika-server-config-custom.xml
1143 ? Sl 0:37 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.info -Djava.awt.headless=true -cp /srv/webapps/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i -c /usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0
it's invoked from a dovecot imap server instance, for attachment parsing,
dovecot --version
2.3.19.1 (9b53102964)
cat dovecot/conf.d/10-master.conf
...
plugin {
  ...
  fts_tika = http://127.0.0.1:9998/tika/
}
...
on receipt of an email with a standard attachment/example -- e.g. the example pdf @
https://smallpdf.com/edit-pdf
, per journal logs, the message is submitted to tika, but fails due to a 'corrupt stream'
Jul 15 08:41:27 mx tika[1143]: INFO [qtp1837533591-27] 08:41:27,224 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 104315, length: 356, expected end position: 104671
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101699, length: 1472, expected end position: 103171
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101509, length: 66, expected end position: 101575
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 2011, length: 2482, expected end position: 4493
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,752 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (test.pdf)
Jul 15 08:41:27 mx tika[1143]: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@356fdbd7
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 15 08:41:27 mx tika[1143]: Caused by: java.io.IOException: Page tree root must be a dictionary
Jul 15 08:41:27 mx tika[1143]: at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: ... 37 more
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,767 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$337/0x0000000800eabbf8, ContentType: text/plain
Is this likely an issue with tika-server itself? &/or java/dovecot?
What additional diagnostics can help narrow it down?
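In each COSParser warning above, the expected end position equals the start position plus the length, so the reported numbers are internally consistent; the warning itself says the end of the stream was not where that offset pointed. A small sketch (Python, not part of the original post; the function names are made up for illustration) that pulls those numbers out of a saved journal excerpt and checks the arithmetic:

```python
import re

# Matches the numbers in the PDFBox COSParser warnings as they appear in the log above.
WARNING_RE = re.compile(
    r"stream start position: (\d+), length: (\d+), expected end position: (\d+)"
)

def stream_offsets(log_text):
    """Return (start, length, expected_end) tuples found in the log text."""
    return [tuple(int(n) for n in m.groups()) for m in WARNING_RE.finditer(log_text)]

def inconsistent(offsets):
    """Return the tuples where start + length does not equal the expected end."""
    return [t for t in offsets if t[0] + t[1] != t[2]]

if __name__ == "__main__":
    sample = """
    ... stream start position: 104315, length: 356, expected end position: 104671
    ... stream start position: 101699, length: 1472, expected end position: 103171
    ... stream start position: 101509, length: 66, expected end position: 101575
    ... stream start position: 2011, length: 2482, expected end position: 4493
    """
    offsets = stream_offsets(sample)
    print(len(offsets), "warnings,", len(inconsistent(offsets)), "inconsistent")
```

For the four warnings in this post it reports 4 warnings and 0 inconsistent, which only says the offsets follow from the declared lengths, not why the data at those offsets is wrong.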
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/15/22 11:22 AM, Tilman Hausherr wrote:
> likely invalid PDFs. Please upload them somewhere for inspection
I'm seeing this with all the PDFs I've tried ... so far.
Including the one I grabbed from the site I referenced in the OP, which I've re-uploaded to:
https://ufile.io/dkew7k0u
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
On 15.07.2022 at 15:41, PGNet Dev wrote:
> Jul 15 08:41:27 mx tika[1143]: INFO [qtp1837533591-27]
> 08:41:27,224 org.apache.tika.server.core.resource.TikaResource /tika
> (application/pdf)
> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
> 08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the
> stream doesn't point to the correct offset, using workaround to read
> the stream, stream start position: 104315, length: 356, expected end
> position: 104671
> Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
> 08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop
> reading corrupt stream due to a DataFormatException
> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
> 08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the
> stream doesn't point to the correct offset, using workaround to read
> the stream, stream start position: 101699, length: 1472, expected end
> position: 103171
> Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
> 08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop
> reading corrupt stream due to a DataFormatException
> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
> 08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the
> stream doesn't point to the correct offset, using workaround to read
> the stream, stream start position: 101509, length: 66, expected end
> position: 101575
> Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
> 08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop
> reading corrupt stream due to a DataFormatException
> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
> 08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the
> stream doesn't point to the correct offset, using workaround to read
> the stream, stream start position: 2011, length: 2482, expected end
> position: 4493
> Jul 15 08:41:27 mx tika[1143]: Caused by: java.io.IOException:
> Page tree root must be a dictionary
likely invalid PDFs. Please upload them somewhere for inspection
Tilman
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/15/22 10:43 PM, Tilman Hausherr wrote:
> That's what I also get.
>
> The next that could be done is to debug this, if possible. Tim suggested the file might be truncated.
>
> I don't know if it is possible, if you can run tika in a debugger, then stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse() where the exception "Page tree root must be a dictionary" happens. There try to access this.fileLen . Compare that number to your file length.
1st stab at debugging this, i launch tika with debug tooling,
/usr/bin/java \
-agentlib:jdwp=transport=dt_socket,address=127.0.0.1:8080,server=y,suspend=n \
-jar /srv/tika/tika-server.jar \
-c /etc/tika/tika-server-config-custom.xml
in another shell, attach the debugger
jdb -attach 127.0.0.1:8080
then set the bp
> stop in org.apache.pdfbox.pdfparser.PDFParser.initialParse
Deferring breakpoint org.apache.pdfbox.pdfparser.PDFParser.initialParse.
It will be set after the class is loaded.
i then send/receive the email with PDF attachment -- through dovecot>tika -- as above
i again see the scan-fail error in tika logs, but never see a
Breakpoint hit: ...
dumping at prompt anyway,
> dump this.fileLen
No current thread
this.fileLen = null
> threads
Group system:
(java.lang.ref.Reference$ReferenceHandler)2788 Reference Handler running
(java.lang.ref.Finalizer$FinalizerThread)2789 Finalizer cond. waiting
(java.lang.Thread)2790 Signal Dispatcher running
(java.lang.Thread)2791 Notification Thread running
(java.lang.Thread)2792 process reaper running
Group main:
(java.lang.Thread)1 main cond. waiting
(java.lang.Thread)2780 pool-2-thread-1 cond. waiting
(java.lang.Thread)2795 Thread-2 running
Group InnocuousThreadGroup:
(jdk.internal.misc.InnocuousThread)2796 Common-Cleaner cond. waiting
am i even setting the stop correctly, in order to get at the fail?
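One possible explanation, offered as a guess rather than a diagnosis: the ps output in the original post shows two JVMs, a launcher started with -jar and a second process running org.apache.tika.server.core.TikaServerProcess with -forkedStatusFile, and the journal lines there are tagged with the second one's PID (tika[1143]). The -agentlib:jdwp option above was given to the launcher command, so jdb may be attached to a JVM that never loads the PDFBox classes; a thread list with no qtp worker threads would fit that. A rough Python sketch for telling the two processes apart in saved `ps ax` output (the helper name is hypothetical):

```python
def split_tika_processes(ps_output):
    """Split `ps ax` lines into (launcher_pids, forked_pids) for tika-server JVMs."""
    launchers, forked = [], []
    for line in ps_output.splitlines():
        fields = line.split()
        # Expect: PID TTY STAT TIME COMMAND...
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        pid, command = int(fields[0]), " ".join(fields[4:])
        # Skip non-java commands (e.g. the grep itself) and unrelated JVMs.
        if "java" not in fields[4] or "tika-server" not in command:
            continue
        # The forked child is started with the TikaServerProcess main class.
        if "org.apache.tika.server.core.TikaServerProcess" in command:
            forked.append(pid)
        else:
            launchers.append(pid)
    return launchers, forked
```

If the forked PID is the one doing the parsing, the debugger would need to be attached to that process instead; how to pass the agent option to it depends on the tika-server configuration and is not shown here.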
> An alternative would be that 1) I add the file length in PDFBox exception 2) you create a Tika build with the PDFBox snapshot.
atm, i'm not building tika-server myself. rather, using just the DL'd runnable jar from
https://dlcdn.apache.org/tika/2.4.1/tika-server-standard-2.4.1.jar
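Without a debugger, the truncation idea quoted above can be checked crudely on a copy of the attachment saved to disk: report its size and see whether the %%EOF marker sits near the end. A minimal Python sketch (the function name and the 1024-byte tail window are assumptions for illustration; it says nothing about the bytes Tika actually received over HTTP):

```python
import os

def pdf_tail_check(path, tail_bytes=1024):
    """Return (size_in_bytes, has_eof_marker) for the file at `path`."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Read only the last `tail_bytes` bytes (or the whole file if it is smaller).
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read()
    return size, b"%%EOF" in tail

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        size, ok = pdf_tail_check(sys.argv[1])
        print(f"{sys.argv[1]}: {size} bytes, %%EOF near end: {ok}")
```

The size it prints is the number to compare against this.fileLen, if that value can be obtained from the parser.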
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/23/22 2:01 PM, Tilman Hausherr wrote:
> Weird... all the changes I made today are these in the parent pom and in the tika-pipes pom. They update some versions and remove version dependencies that were needed to avoid conflicts; those conflicts weren't happening anymore, so I removed them to simplify version maintenance.
I can't confirm that it started working WITH this update. only at/prior to it.
entirely possible that another very recent update -- in the last couple of days' snap builds -- fixed it, and it started working while i was 'looking elsewhere'.
in any case, it appears to be behaving itself for now ... fingers crossed
well enough that I can (mostly) reliably re-scan/index my local server's email collection.
still getting some timeouts in dovecot handoff if OCR is taking too long on a big source, and currently having some fun with figuring out why i can't manage to set/override tesseract OCR params in tika config.xml (new thread on this ML),
but iiuc, neither's related to this^ prior issue. i.e., it now *IS* scanning.
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
On 23.07.2022 at 19:46, PGNet Dev wrote:
> with update to
>
> tika-server-standard-2.4.2-20220723.145242-114.jar
>
> no more errors. so far.
>
> dunno what change fixed it ...
>
Weird... all the changes I made today are these in the parent pom and in the
tika-pipes pom. They update some versions and remove version
dependencies that were needed to avoid conflicts; those conflicts weren't
happening anymore, so I removed them to simplify version maintenance.
# This patch file was generated by NetBeans IDE
# It uses platform neutral UTF-8 encoding and \n newlines.
--- a/pom.xml (e8ff570)
+++ b/pom.xml (3d8e7ef)
@@ -288,7 +288,7 @@
<!-- dependency versions -->
- <aws.version>1.12.265</aws.version>
+ <aws.version>1.12.267</aws.version>
<google.cloud.version>2.10.0</google.cloud.version>
<asm.version>9.3</asm.version>
<boilerpipe.version>1.1.0</boilerpipe.version>
@@ -473,7 +473,7 @@
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
- <version>3.21.2</version>
+ <version>3.21.3</version>
</dependency>
<dependency>
<groupId>com.ibm.icu</groupId>
@@ -526,11 +526,6 @@
<version>${javax.annotation.version}</version>
</dependency>
<dependency>
- <groupId>javax.servlet</groupId>
- <artifactId>servlet-api</artifactId>
- <version>2.5</version>
- </dependency>
- <dependency>
<groupId>javax.xml.soap</groupId>
<artifactId>javax.xml.soap-api</artifactId>
<version>1.4.0</version>
# This patch file was generated by NetBeans IDE
# It uses platform neutral UTF-8 encoding and \n newlines.
--- a/pom.xml (1385efc)
+++ b/Current File
@@ -112,11 +112,6 @@
</dependency>
<dependency>
<groupId>io.netty</groupId>
- <artifactId>netty-tcnative-classes</artifactId>
- <version>2.0.53.Final</version>
- </dependency>
- <dependency>
- <groupId>io.netty</groupId>
<artifactId>netty-transport</artifactId>
<version>${netty.version}</version>
</dependency>
@@ -130,16 +125,6 @@
<artifactId>netty-transport-native-epoll</artifactId>
<version>${netty.version}</version>
</dependency>
- <dependency>
- <groupId>io.projectreactor.netty</groupId>
- <artifactId>reactor-netty-http</artifactId>
- <version>1.0.21</version>
- </dependency>
- <dependency>
- <groupId>io.projectreactor</groupId>
- <artifactId>reactor-core</artifactId>
- <version>3.4.21</version>
- </dependency>
</dependencies>
</dependencyManagement>
<build>
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
with update to
tika-server-standard-2.4.2-20220723.145242-114.jar
no more errors. so far.
dunno what change fixed it ...
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
On 20.07.2022 at 22:34, PGNet Dev wrote:
>
>
> curl -v --header "Accept: text/plain" -T
> ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika
It's mysterious, it works fine at work, but not at home. I don't think
that's the cause of your problem because you never got THAT bug.
Tilman
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/20/22 2:13 PM, Tilman Hausherr wrote:
> I noticed you have "Accept: text/plain"
>
> When I try this:
>
> curl -T Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Accept: text/plain"
>
> I get
>
> Caused by: java.util.NoSuchElementException: No value present
> at java.util.OptionalInt.getAsInt(OptionalInt.java:130) ~[?:?]
> at org.apache.tika.server.core.ProduceTypeResourceComparator.compareProduceTypes(ProduceTypeResourceComparator.java:136) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> at org.apache.tika.server.core.ProduceTypeResourceComparator.compare(ProduceTypeResourceComparator.java:97) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:69) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:31) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> at java.util.TreeMap.put(TreeMap.java:795) ~[?:?]
> at java.util.TreeMap.put(TreeMap.java:534) ~[?:?]
> at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:551) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>
> without the header, I get the html output.
i don't see your error with curl, with or without the header spec'd
here, *with* 'text/plain' header specified,
curl -v --header "Accept: text/plain" -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika
* Trying 127.0.0.1:9998...
* Connected to 127.0.0.1 (127.0.0.1) port 9998 (#0)
> PUT /tika HTTP/1.1
> Host: 127.0.0.1:9998
> User-Agent: curl/7.82.0
> Accept: text/plain
> Content-Length: 69451
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Wed, 20 Jul 2022 20:27:25 GMT
< Content-Type: text/plain
< Transfer-Encoding: chunked
< Server: Jetty(9.4.48.v20220622)
<
Welcome to Smallpdf
Digital Documents—All In One Place
Access Files Anytime, Anywhere
Enhance Documents in One Click
Collaborate With Others
With the new Smallpdf experience, you can
freely upload, organize, and share digital
documents. When you enable the ‘Storage’
option, we’ll also store all processed files here.
You can access files stored on Smallpdf from
your computer, phone, or tablet. We’ll also
sync files from the Smallpdf Mobile App to our
online portal
When you right-click on a file, we’ll present
you with an array of options to convert,
compress, or modify it.
Forget mundane administrative tasks. With
Smallpdf, you can request e-signatures, send
large files, or even enable the Smallpdf G Suite
App for your entire organization.
Ready to take document management to the next level?
https://bit.ly/smallpdf-preferences-en
https://bit.ly/smallpdf-preferences-en
https://bit.ly/smallpdf-download-en
https://bit.ly/smallpdf-chrome-extension
https://bit.ly/smallpdf-chrome-extension
* Connection #0 to host 127.0.0.1 left intact
it requests & returns text, no error.
and withOUT,
curl -v -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika
* Trying 127.0.0.1:9998...
* Connected to 127.0.0.1 (127.0.0.1) port 9998 (#0)
> PUT /tika HTTP/1.1
> Host: 127.0.0.1:9998
> User-Agent: curl/7.82.0
> Accept: */*
> Content-Length: 69451
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Wed, 20 Jul 2022 20:28:56 GMT
< Content-Type: text/xml
< Transfer-Encoding: chunked
< Server: Jetty(9.4.48.v20220622)
<
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.7"/>
<meta name="xmp:CreatorTool" content="Adobe InDesign 15.1 (Macintosh)"/>
<meta name="pdf:hasXFA" content="false"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2020-10-14T15:08:10Z"/>
<meta name="dcterms:modified" content="2020-10-14T15:08:10Z"/>
<meta name="dc:format" content="application/pdf; version=1.7"/>
<meta name="xmpMM:DocumentID" content="xmp.id:7a865d84-8dbf-4015-96b7-fdae89a9603b"/>
<meta name="pdf:docinfo:creator_tool" content="Adobe InDesign 15.1 (Macintosh)"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:docinfo:modified" content="2020-10-14T15:08:10Z"/>
<meta name="pdf:hasCollection" content="false"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="xmp:CreateDate" content="2020-10-14T17:08:10Z"/>
<meta name="Content-Length" content="69451"/>
<meta name="pdf:hasMarkedContent" content="false"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="xmp:ModifyDate" content="2020-10-14T17:08:10Z"/>
<meta name="xmp:MetadataDate" content="2020-10-14T17:08:10Z"/>
<meta name="dc:language" content="en-US"/>
<meta name="pdf:producer" content="Adobe PDF Library 15.0"/>
<meta name="X-TIKA:digest:SHA256" content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="pdf:hasXMP" content="true"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="xmpMM:DerivedFrom:DocumentID" content="xmp.did:b47e2f57-0029-45c5-8e1d-97f7c1535615"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="pdf:docinfo:trapped" content="False"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="xmpMM:DerivedFrom:InstanceID" content="xmp.iid:20710a9c-3691-41fa-bd81-adf858100386"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="Adobe PDF Library 15.0"/>
<meta name="pdf:docinfo:created" content="2020-10-14T15:08:10Z"/>
<title>�</title>
</head>
<body>
<div class="page">
<p/>
<p>Welcome to Smallpdf
</p>
<p>Digital Documents—All In One Place
</p>
<p>Access Files Anytime, Anywhere
</p>
<p>Enhance Documents in One Click
</p>
<p>Collaborate With Others
</p>
<p>With the new Smallpdf experience, you can
freely upload, organize, and share digital
documents. When you enable the ‘Storage’
option, we’ll also store all processed files here.
</p>
<p>You can access files stored on Smallpdf from
your computer, phone, or tablet. We’ll also
sync files from the Smallpdf Mobile App to our
online portal
</p>
<p>When you right-click on a file, we’ll present
you with an array of options to convert,
compress, or modify it.
</p>
<p>Forget mundane administrative tasks. With
Smallpdf, you can request e-signatures, send
large files, or even enable the Smallpdf G Suite
App for your entire organization.
</p>
<p>Ready to take document management to the next level? </p>
<p/>
<div class="annotation">
<a href="https://bit.ly/smallpdf-preferences-en">https://bit.ly/smallpdf-preferences-en</a>
</div>
<div class="annotation">
<a href="https://bit.ly/smallpdf-preferences-en">https://bit.ly/smallpdf-preferences-en</a>
</div>
<div class="annotation">
<a href="https://bit.ly/smallpdf-download-en">https://bit.ly/smallpdf-download-en</a>
</div>
<div class="annotation">
<a href="https://bit.ly/smallpdf-chrome-extension">https://bit.ly/smallpdf-chrome-extension</a>
</div>
<div class="annotation">
<a href="https://bit.ly/smallpdf-chrome-extension">https://bit.ly/smallpdf-chrome-extension</a>
</div>
</div>
</body>
</html>
* Connection #0 to host 127.0.0.1 left intact
this requests '*/*' and returns 'text/xml'
just to check, if I use your at-the-end header arg placement
curl -v -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika --header "Accept: text/plain"
i again see no error,
* Trying 127.0.0.1:9998...
* Connected to 127.0.0.1 (127.0.0.1) port 9998 (#0)
> PUT /tika HTTP/1.1
> Host: 127.0.0.1:9998
> User-Agent: curl/7.82.0
> Accept: text/plain
> Content-Length: 69451
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Wed, 20 Jul 2022 20:32:00 GMT
< Content-Type: text/plain
< Transfer-Encoding: chunked
< Server: Jetty(9.4.48.v20220622)
<
Welcome to Smallpdf
Digital Documents—All In One Place
Access Files Anytime, Anywhere
Enhance Documents in One Click
Collaborate With Others
With the new Smallpdf experience, you can
freely upload, organize, and share digital
documents. When you enable the ‘Storage’
option, we’ll also store all processed files here.
You can access files stored on Smallpdf from
your computer, phone, or tablet. We’ll also
sync files from the Smallpdf Mobile App to our
online portal
When you right-click on a file, we’ll present
you with an array of options to convert,
compress, or modify it.
Forget mundane administrative tasks. With
Smallpdf, you can request e-signatures, send
large files, or even enable the Smallpdf G Suite
App for your entire organization.
Ready to take document management to the next level?
https://bit.ly/smallpdf-preferences-en
https://bit.ly/smallpdf-preferences-en
https://bit.ly/smallpdf-download-en
https://bit.ly/smallpdf-chrome-extension
https://bit.ly/smallpdf-chrome-extension
* Connection #0 to host 127.0.0.1 left intact
this is with
curl -V
curl 7.82.0 (x86_64-redhat-linux-gnu) libcurl/7.82.0 OpenSSL/3.0.5 zlib/1.2.11 brotli/1.0.9 libidn2/2.3.3 libpsl/0.21.1 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.46.0 OpenLDAP/2.6.2
Release-Date: 2022-03-05
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets
and
tika-server-standard-2.4.2-20220720.025305-98.jar
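The XHTML output above carries an X-TIKA:digest:SHA256 value, which gives a way to confirm that the bytes Tika parsed are the bytes on disk. A small Python sketch (the helper name is an assumption) that hashes the local file so the result can be compared with that meta value:

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    import sys
    # Usage: script.py <file> <digest reported by Tika>
    if len(sys.argv) > 2:
        local, reported = sha256_of_file(sys.argv[1]), sys.argv[2].lower()
        print("match" if local == reported else f"MISMATCH: local file is {local}")
```

If the same attachment yields a different digest when it arrives through dovecot, that would point at the data being altered before it reaches Tika rather than at the PDF itself.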
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/20/22 4:10 PM, PGNet Dev wrote:
> what _should_ it be for programmatic submission (e.g. via dovecot fts-tika) to tika? text or html?
it *appears* to me that the flow is
email+attachment -> dovecot/fts-tika ------[ pdf attachment ]-----> tika-backend -----[ text result ]-----> dovecot/fts-flatcurve
where flatcurve is the indexer. it's expecting data in text format.
from the curl results, seems that for tika-backend to return the text/plain result, it needs the "Accept: text/plain"
so, not unexpectedly "Accept: text/plain" is passed in the PUT
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/20/22 2:13 PM, Tilman Hausherr wrote:
> I noticed you have "Accept: text/plain"
>
> When I try this:
>
> curl -T Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Accept: text/plain"
>
> I get
>
> Caused by: java.util.NoSuchElementException: No value present
> at java.util.OptionalInt.getAsInt(OptionalInt.java:130) ~[?:?]
> at org.apache.tika.server.core.ProduceTypeResourceComparator.compareProduceTypes(ProduceTypeResourceComparator.java:136) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> at org.apache.tika.server.core.ProduceTypeResourceComparator.compare(ProduceTypeResourceComparator.java:97) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:69) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:31) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> at java.util.TreeMap.put(TreeMap.java:795) ~[?:?]
> at java.util.TreeMap.put(TreeMap.java:534) ~[?:?]
> at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:551) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>
> without the header, I get the html output.
interesting catch. what _should_ it be for programmatic submission (e.g. via dovecot fts-tika) to tika? text or html?
it's reported here in the tika logs I posted, earliest at
...
DEBUG [qtp485047320-28] 11:01:15,794 org.eclipse.jetty.server.HttpChannel REQUEST for //127.0.0.1:9998/tika/ on HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=1}
PUT //127.0.0.1:9998/tika/ HTTP/1.1
Host: 127.0.0.1:9998
Date: Wed, 20 Jul 2022 15:01:15 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Content-Type: application/pdf
Content-Disposition: attachment; filename="Get_Started_With_Smallpdf.pdf"
!! Accept: text/plain
...
which appears to be the PUT, I assume, pushed by the dovecot-end of the handshake.
checking dovecot source, it hails from here,
https://github.com/dovecot/core/blob/main/src/plugins/fts/fts-parser-tika.c#L170
if (parser_context->content_disposition != NULL)
http_client_request_add_header(http_req, "Content-Disposition",
parser_context->content_disposition);
!! 170 http_client_request_add_header(http_req, "Accept", "text/plain");
parser->http_req = http_req;
return &parser->parser;
}
The '"Accept", "text/plain"' has been there awhile; e.g., quick-checking old release source for v2.3.8, from Oct 8, 2019,
https://github.com/dovecot/core/blob/release-2.3.8/src/plugins/fts/fts-parser-tika.c#L163
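To take dovecot out of the picture, the request shape shown in the Jetty log above (a chunked PUT of application/pdf with "Accept: text/plain") can be replayed with curl. This is only a sketch of that shape, not of what fts-tika does internally: the function name replay_fts_tika_put and the CURL override are hypothetical, the default URL is the fts_tika value used in this thread, and the Content-Disposition header is left out for brevity.

```shell
# Hypothetical helper: send FILE to tika-server with the headers seen in the
# dovecot-originated PUT. Set CURL=echo to print the arguments instead of
# contacting a server.
replay_fts_tika_put() {
    file="$1"
    url="${2:-http://127.0.0.1:9998/tika/}"
    # -T uploads with PUT; the explicit Transfer-Encoding header asks curl
    # to send the body chunked, as in the logged request.
    ${CURL:-curl} -T "$file" \
        -H "Accept: text/plain" \
        -H "Content-Type: application/pdf" \
        -H "Transfer-Encoding: chunked" \
        "$url"
}
```

Running `replay_fts_tika_put ~/Get_Started_With_Smallpdf.pdf` against the same server and comparing the result with the plain `curl -T` case should show whether the Accept header alone changes the outcome.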
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
I noticed you have "Accept: text/plain"
When I try this:
curl -T Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Accept: text/plain"
I get
Caused by: java.util.NoSuchElementException: No value present
at java.util.OptionalInt.getAsInt(OptionalInt.java:130) ~[?:?]
at org.apache.tika.server.core.ProduceTypeResourceComparator.compareProduceTypes(ProduceTypeResourceComparator.java:136) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.server.core.ProduceTypeResourceComparator.compare(ProduceTypeResourceComparator.java:97) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:69) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:31) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at java.util.TreeMap.put(TreeMap.java:795) ~[?:?]
at java.util.TreeMap.put(TreeMap.java:534) ~[?:?]
at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:551) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
without the header, I get the html output.
Tilman
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
hi,
On 7/20/22 10:07 AM, Tim Allison wrote:
> Sorry...just catching up on this. If you want the digest of the incoming bytes and you can configure tika-server via a config file, try this as the config (e.g. tika-config-digest.xml)
>
> <properties>
> <server>
> <params>
> <digest>sha256</digest>
> </params>
> </server>
> </properties>
>
> then start the server: java -jar tika-server-standard-xyz.jar -c tika-config-digest.xml
>
> Then send the file: curl -T ~/Downloads/Get_Started_With_Smallpdf.pdf http://localhost:9998/tika <http://localhost:9998/tika>
i'm already normally launching tika service as,
cat /etc/systemd/system/tika.service
[Unit]
Description=Apache Tika server
After=network-online.target
Requires=network-online.target
[Service]
SyslogIdentifier=tika
User=tika
Group=tika
ExecStart=/usr/bin/java \
-jar /srv/tika/tika-server.jar \
!! -c /etc/tika/tika-server-config-custom.xml
[Install]
WantedBy=multi-user.target
where
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<server>
<params>
<logLevel>debug</logLevel>
<port>9998</port>
<host>127.0.0.1</host>
<javaPath>/usr/bin/java</javaPath>
<noFork>false</noFork>
<forkedJvmArgs>
<arg>-Xms1g</arg>
<arg>-Xmx1g</arg>
<arg>-Dpdfbox.fontcache=/var/tika</arg>
<arg>-Dlog4j2.debug</arg>
</forkedJvmArgs>
!! <digest>sha256</digest>
<enableUnsecureFeatures>false</enableUnsecureFeatures>
<id></id>
<maxFiles>100000</maxFiles>
<maxForkedStartupMillis>120000</maxForkedStartupMillis>
<maxRestarts>-1</maxRestarts>
<minimumTimeoutMillis>30000</minimumTimeoutMillis>
<returnStackTrace>false</returnStackTrace>
<taskPulseMillis>10000</taskPulseMillis>
<taskTimeoutMillis>300000</taskTimeoutMillis>
<endpoints>
<endpoint>tika</endpoint>
<endpoint>status</endpoint>
<endpoint>rmeta</endpoint>
</endpoints>
</params>
</server>
</properties>
DL'ing the _latest_ build
F="tika-server-standard-2.4.2-20220720.025305-98.jar"
D="/srv/tika"
cd ${D}
rm -rf TMP
mkdir -p TMP/mod
cd TMP
rm -f ${F}*
wget https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server-standard/2.4.2-SNAPSHOT/${F}
cd mod
extract
jar -xfv ../${F}
mod logging
perl -pi -e 's|Root level="info"|Root level="debug"|g' log4j2.xml
repack
jar -cvmf META-INF/MANIFEST.MF ../mod.jar *
my usual target symlink
cd ${D}
ln -sf TMP/mod.jar tika-server.jar
stop tika service, if any
systemctl stop tika
systemctl disable tika
systemctl status tika -ln0
○ tika.service - Apache Tika server
Loaded: loaded (/etc/systemd/system/tika.service; disabled; vendor preset: disabled)
Active: inactive (dead)
ps ax | grep tika
(empty)
start manually
/usr/bin/java \
-jar /srv/tika/tika-server.jar \
-c /etc/tika/tika-server-config-custom.xml
...
INFO [main] 10:49:37,925 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server at http://127.0.0.1:9998/
the console stays attached to this active process
ps ax | grep tika
29181 pts/0 Sl+ 0:07 /usr/bin/java -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
29202 pts/0 Sl+ 0:16 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-9024552766199524298 -numRestarts 0
exec in other shell window
curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika
@ console for the *curl* command, I see
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.7"/>
...
<meta name="X-TIKA:digest:SHA256" content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>
but nothing seemingly relevant/informative in the java/tika console session; lots of DEBUG etc, but no sha256sum info
in any case, for this scenario, checking original
sha256sum ~/Get_Started_With_Smallpdf.pdf
91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2 /root/Get_Started_With_Smallpdf.pdf
it's a match.
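The manual comparison above (the digest in Tika's XHTML output against sha256sum of the original) can be scripted. A minimal sketch, assuming the meta tag appears on a single line exactly as in the output above and that GNU coreutils sha256sum is available; the function names tika_digest and digest_matches are hypothetical.

```shell
# Hypothetical helper: read Tika's XHTML response on stdin and print the
# value of the X-TIKA:digest:SHA256 meta tag.
tika_digest() {
    sed -n 's/.*name="X-TIKA:digest:SHA256" content="\([0-9a-f]*\)".*/\1/p'
}

# Hypothetical helper: succeed only if sha256sum of FILE equals DIGEST.
digest_matches() {
    file="$1"
    reported="$2"
    local_sum=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$local_sum" = "$reported" ]
}

# usage:
# reported=$(curl -s -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika | tika_digest)
# digest_matches ~/Get_Started_With_Smallpdf.pdf "$reported" && echo match
```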
but that's NOT testing the fail scenario.
THAT scenario is email send/receive -> dovecot -> dovecot fts-tika plugin -> tika-server.
config'ing dovecot to use fts-tika scanning
fts_tika = http://127.0.0.1:9998/tika/
& generate verbose debug logs
mail_debug = yes
when I exec that send/receive -- from, e.g., an external gmail account to my server
I see the attachment handoff. 1st, sent from dovecot fts-tika
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: queue http://127.0.0.1:9998: Connection to peer 127.0.0.1:9998 claimed request [Req1: PUT http://127.0.0.1:9998/tika/]
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Claimed request [Req1: PUT http://127.0.0.1:9998/tika/]
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Sent header
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 5562, buffered=5570)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: peer 127.0.0.1:9998: No more requests to service for this peer (1 connections exist, 0 pending)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 3409, buffered=3416)
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Finished sending payload
2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
==> /var/log/dovecot/dovecot-info.log <==
2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Info: sieve: msgid=<8e...@fastmail.fm>: stored mail into mailbox 'INBOX'
==> /var/log/dovecot/dovecot-debug.log <==
2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<8e...@fastmail.fm>: Finish implicit keep action
2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<8e...@fastmail.fm>: Finishing actions
2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<8e...@fastmail.fm>: Finished executing result (final, status=ok, keep=yes)
2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: multi-script: Sequence finished (status=ok, keep=yes)
2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: multi-script: Destroy
2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt user01@example.com: duplicate db: Cleanup
2022-07-20 11:07:02 lmtp(39463): Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt user01@example.com: User session is finished
2022-07-20 11:07:02 lmtp(39463): Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt user01@example.com: dict(file): dict destroyed
==> /var/log/dovecot/dovecot-info.log <==
2022-07-20 11:07:02 lmtp(39463): Info: Disconnect from local: Logged out (state=READY)
==> /var/log/dovecot/dovecot-debug.log <==
2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Got 200 response for request [Req1: PUT http://127.0.0.1:9998/tika/]: OK (took 3327 ms + 217 ms in queue)
2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Response payload stream destroyed (20 ms after initial response)
2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Finished
2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: queue http://127.0.0.1:9998: Dropping request [Req1: PUT http://127.0.0.1:9998/tika/]
2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: host 127.0.0.1: Host is idle (timeout = 100 msecs)
2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Free (requests left=1)
at this point, dovecot's 'done' with the attachment as far as tika is involved, and it's 'in' tika-backend's control; dovecot DOES of course continue to process, and ultimately deliver, the email+attachment to my inbox. where, as reported earlier, I can verify that the RECEIVED attachment is identical in size/sha256sum to the original.
i do see the handoff to tika-backend,
...
DEBUG [qtp485047320-28] 11:01:15,794 org.eclipse.jetty.server.HttpChannel REQUEST for //127.0.0.1:9998/tika/ on HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=1}
PUT //127.0.0.1:9998/tika/ HTTP/1.1
Host: 127.0.0.1:9998
Date: Wed, 20 Jul 2022 15:01:15 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Content-Type: application/pdf
Content-Disposition: attachment; filename="Get_Started_With_Smallpdf.pdf"
Accept: text/plain
DEBUG [qtp485047320-28] 11:01:15,799 org.eclipse.jetty.server.HttpConnection HttpConnection@7d858986::SocketChannelEndPoint@7f055fae{l=/127.0.0.1:9998,r=/127.0.0.1:59150,OPEN,fill=-,flush=-,to=43/200000}{io=0/0,kio=0,kro=1}->HttpConnection@7d858986[p=HttpParser{s=CHUNKED_CONTENT,0 of -1},g=HttpGenerator@127a4f1e{s=START}]=>HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=6} parsed true HttpParser{s=CHUNKED_CONTENT,0 of -1}
...
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
DEBUG [qtp485047320-31] 11:07:03,442 org.apache.cxf.transport.http.Headers Request Headers: {Accept=[text/plain], Authorization=[***], connection=[keep-alive], Content-Disposition=[attachment; filename="Get_Started_With_Smallpdf.pdf"], content-type=[application/pdf], Date=[Wed, 20 Jul 2022 15:07:02 GMT], Host=[127.0.0.1:9998], Proxy-Authorization=[***], transfer-encoding=[chunked]}
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
...
but no trace, that I can find in any log, of sha256sum generated by tika, as in the curl case above.
THAT is the necessary bit here -- getting at, and confirming, the size/sha256sum of what Tika has received -- from dovecot's fts-tika handoff.
how/where to get tika to spit out THAT info?
either as loggable/logged response to dovecot's http-client connection, on successful handoff,
in its own logs,
or, just trapping the file and checking manually?
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tim Allison <ta...@apache.org>.
Sorry...just catching up on this. If you want the digest of the incoming
bytes and you can configure tika-server via a config file, try this as the
config (e.g. tika-config-digest.xml)
<properties>
<server>
<params>
<digest>sha256</digest>
</params>
</server>
</properties>
then start the server: java -jar tika-server-standard-xyz.jar -c tika-config-digest.xml
Then send the file: curl -T ~/Downloads/Get_Started_With_Smallpdf.pdf http://localhost:9998/tika
This should be in the output: <meta name="X-TIKA:digest:SHA256" content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>
I confirmed this value with shasum -a 256.
On Tue, Jul 19, 2022 at 1:11 PM PGNet Dev <pg...@gmail.com> wrote:
> On 7/19/22 12:24 PM, Tilman Hausherr wrote:
> > The checkstyle violation is about the coding style. You can delete that
> part in the tika-parent/pom.xml if you want, or add <skip>true</skip> below
> "<configuration>" in that plugin. Same for the ossindex-maven-plugin and
> the forbiddenapis plugin.
>
> > If the debugger didn't stop, then the breakpoint was at the wrong place.
> Or it's not possible to debug.
>
> I'll give the pom mod a try in a bit.
>
> As to which breakpoint, I certainly don't know the tika/java internals
> well enough to say what is/isn't correct, yet.
>
> > Re "is there anything informative in that now-more-verbose DEBUG output?
> " well yes, the MD5 output. This proves that the file is different. (ok,
> the different length showed that too)
>
> I've asked over at Dovecot ML what, specifically, dovecot 'sends' to the
> tika backend via their fts-tika plugin:
>
> the original/complete/unmodified attachment, suggesting that the file
> size / MD5 hash should be the same as what tika's trapping
>
> or,
>
> some modification to the file is made (trimmed, or add'l headers, etc
> etc), and that the size/hash are not _expected_ to be the same
>
> we'll see what i hear
>
>
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/19/22 12:24 PM, Tilman Hausherr wrote:
> The checkstyle violation is about the coding style. You can delete that part in the tika-parent/pom.xml if you want, or add <skip>true</skip> below "<configuration>" in that plugin. Same for the ossindex-maven-plugin and the forbiddenapis plugin.
> If the debugger didn't stop, then the breakpoint was at the wrong place. Or it's not possible to debug.
I'll give the pom mod a try in a bit.
As to which breakpoint, I certainly don't know the tika/java internals well enough to say what is/isn't correct, yet.
> Re "is there anything informative in that now-more-verbose DEBUG output? " well yes, the MD5 output. This proves that the file is different. (ok, the different length showed that too)
I've asked over at Dovecot ML what, specifically, dovecot 'sends' to the tika backend via their fts-tika plugin:
the original/complete/unmodified attachment, suggesting that the file size / MD5 hash should be the same as what tika's trapping
or,
some modification to the file is made (trimmed, or add'l headers, etc etc), and that the size/hash are not _expected_ to be the same
we'll see what i hear
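Whichever answer comes back, it helps to line up the two fingerprints in the same form. A small sketch, assuming the PDFParser DEBUG line keeps the "File: ..., length: ..., md5: ..." layout seen in this thread and sits on one line in the log; the function names tika_received and local_fingerprint are hypothetical, and md5sum is assumed to be the GNU coreutils one.

```shell
# Hypothetical helper: read Tika DEBUG output on stdin and print
# "<length> <md5>" for each PDFParser "File: ..., length: ..., md5: ..." line.
tika_received() {
    sed -n 's/.*PDFParser File: .*, length: \([0-9]*\), md5: \([0-9a-f]*\).*/\1 \2/p'
}

# Hypothetical helper: print "<size in bytes> <md5>" for a local file.
local_fingerprint() {
    printf '%s %s\n' "$(wc -c < "$1" | tr -d ' ')" "$(md5sum "$1" | cut -d' ' -f1)"
}

# usage:
# journalctl -u tika | tika_received
# local_fingerprint Get_Started_With_Smallpdf.pdf
```

If the two printed lines differ, what Tika parsed was not byte-identical to the file on disk.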
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
The checkstyle violation is about the coding style. You can delete that
part in the tika-parent/pom.xml if you want, or add <skip>true</skip>
below "<configuration>" in that plugin. Same for the
ossindex-maven-plugin and the forbiddenapis plugin.
If the debugger didn't stop, then the breakpoint was at the wrong place.
Or it's not possible to debug.
Re "is there anything informative in that now-more-verbose DEBUG output?
" well yes, the MD5 output. This proves that the file is different. (ok,
the different length showed that too)
Tilman
On 19.07.2022 at 11:37, PGNet Dev wrote:
> On 7/18/22 11:05 PM, Tilman Hausherr wrote:
>> Yes the file is deleted...
>
>>
>> Alternatively, grab the source code from the trunk, and add this line
>> in the file
>> tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java
>>
>>
>> Files.write(Paths.get("/tmp/yourfile.pdf"),
>> Files.readAllBytes(tstream.getPath()));
>>
>> after the line that has ", md5: ".
>>
>> Then build the parser module, and then the standard server subproject
>> with "mvn -DskipTests install".
>
> 1st, attempting the build, FAILs
>
> cd src/tika
> EDIT
> tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
>
> ...
> 168 if (LOG.isDebugEnabled() && tstream != null) {
> LOG.debug("File: " + tstream.getPath() + ", length: "
> + tstream.getLength() +
> ", md5: " + calcMD5(tstream.getPath()));
> + Files.write(Paths.get("/tmp/yourfile.pdf"),
> Files.readAllBytes(tstream.getPath()));
> }
> ...
>
>
> mvn install -pl tika-parsers -am
> mvn -DskipTests install
> ...
> [INFO] BUILD FAILURE
> [INFO]
> ------------------------------------------------------------------------
> [INFO] Total time: 31.493 s
> [INFO] Finished at: 2022-07-19T04:48:43-04:00
> [INFO]
> ------------------------------------------------------------------------
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-checkstyle-plugin:3.1.2:check
> (validate) on project tika-parser-pdf-module: You have 1 Checkstyle
> violation. -> [Help 1]
>
>
>> try setting a breakpoint in org.apache.tika.parser.pdf.PDFParser so
>> that you get that file.
>
> next, run in debugger instead,
>
> sudo -u tika /usr/bin/jdb \
> -classpath /srv/tika/tika-server.jar \
> org.apache.tika.server.core.TikaServerCli \
> -c /etc/tika/tika-server-config-custom.xml
>
> Initializing jdb ...
>
> set breakpoint
>
> > stop in org.apache.tika.parser.pdf.PDFParser
> Deferring breakpoint org.apache.tika.parser.pdf.PDFParser.
> It will be set after the class is loaded.
>
> run it
>
> > run
> run org.apache.tika.server.core.TikaServerCli -c
> /etc/tika/tika-server-config-custom.xml
> Set uncaught java.lang.Throwable
> Set deferred uncaught java.lang.Throwable
> >
> VM Started: DEBUG [pool-2-thread-1] 05:21:37,469
> org.apache.tika.server.core.TikaServerWatchDog forked process
> commandline: [/usr/bin/java, -Xms1g, -Xmx1g,
> -Dpdfbox.fontcache=/var/tika, -Dlog4j2.debug,
> -Djava.awt.headless=true, -cp, /srv/tika/tika-server.jar,
> -Dtika.server.id=, org.apache.tika.server.core.TikaServerProcess, -h,
> 127.0.0.1, -p, 9998, -i, , -c,
> /etc/tika/tika-server-config-custom.xml, -forkedStatusFile,
> /tmp/apache-tika-server-forked-tmp-11335114907490900739, -numRestarts, 0]
> ...
> DEBUG [main] 05:21:50,871 org.apache.cxf.endpoint.ServerImpl
> register the server to serverRegistry
> TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor
> class org.apache.tika.server.core.ServerStatusWatcher
> INFO [main] 05:21:50,906
> org.apache.tika.server.core.TikaServerProcess Started Apache Tika
> server at http://127.0.0.1:9998/
>
> receive email+attachment
>
> *lots* of debug logs @ jdb console,
>
> -> https://pastebin.com/HDtR9RKP
>
> NOTE, there,
>
> ...
> DEBUG [qtp485047320-31] 05:22:58,423
> org.apache.tika.parser.pdf.PDFParser File:
> /tmp/apache-tika-11251774738482156793.tmp, length: 104932, md5:
> 092bf24b2cac33fac27965549c99613a
> ...
>
> but, no file captured
>
> ls -al /tmp/apache-tika*tmp
> ls: cannot access '/tmp/apache-tika*tmp': No such file or
> directory
>
> is there anything informative in that now-more-verbose DEBUG output?
>
>
>
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/18/22 11:05 PM, Tilman Hausherr wrote:
> Yes the file is deleted...
>
> Alternatively, grab the source code from the trunk, and add this line in the file
> tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java
>
> Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));
>
> after the line that has ", md5: ".
>
> Then build the parser module, and then the standard server subproject with "mvn -DskipTests install".
1st, attempting the build, FAILs
cd src/tika
EDIT tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
...
168 if (LOG.isDebugEnabled() && tstream != null) {
LOG.debug("File: " + tstream.getPath() + ", length: " + tstream.getLength() +
", md5: " + calcMD5(tstream.getPath()));
+ Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));
}
...
mvn install -pl tika-parsers -am
mvn -DskipTests install
...
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 31.493 s
[INFO] Finished at: 2022-07-19T04:48:43-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:3.1.2:check (validate) on project tika-parser-pdf-module: You have 1 Checkstyle violation. -> [Help 1]
> try setting a breakpoint in org.apache.tika.parser.pdf.PDFParser so that you get that file.
next, run in debugger instead,
sudo -u tika /usr/bin/jdb \
-classpath /srv/tika/tika-server.jar \
org.apache.tika.server.core.TikaServerCli \
-c /etc/tika/tika-server-config-custom.xml
Initializing jdb ...
set breakpoint
> stop in org.apache.tika.parser.pdf.PDFParser
Deferring breakpoint org.apache.tika.parser.pdf.PDFParser.
It will be set after the class is loaded.
run it
> run
run org.apache.tika.server.core.TikaServerCli -c /etc/tika/tika-server-config-custom.xml
Set uncaught java.lang.Throwable
Set deferred uncaught java.lang.Throwable
>
VM Started: DEBUG [pool-2-thread-1] 05:21:37,469 org.apache.tika.server.core.TikaServerWatchDog forked process commandline: [/usr/bin/java, -Xms1g, -Xmx1g, -Dpdfbox.fontcache=/var/tika, -Dlog4j2.debug, -Djava.awt.headless=true, -cp, /srv/tika/tika-server.jar, -Dtika.server.id=, org.apache.tika.server.core.TikaServerProcess, -h, 127.0.0.1, -p, 9998, -i, , -c, /etc/tika/tika-server-config-custom.xml, -forkedStatusFile, /tmp/apache-tika-server-forked-tmp-11335114907490900739, -numRestarts, 0]
...
DEBUG [main] 05:21:50,871 org.apache.cxf.endpoint.ServerImpl register the server to serverRegistry
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.server.core.ServerStatusWatcher
INFO [main] 05:21:50,906 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server at http://127.0.0.1:9998/
receive email+attachment
*lots* of debug logs @ jdb console,
-> https://pastebin.com/HDtR9RKP
NOTE, there,
...
DEBUG [qtp485047320-31] 05:22:58,423 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-11251774738482156793.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a
...
but, no file captured
ls -al /tmp/apache-tika*tmp
ls: cannot access '/tmp/apache-tika*tmp': No such file or directory
is there anything informative in that now-more-verbose DEBUG output?
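Since the temp file is gone by the time ls runs, a cruder alternative to rebuilding Tika is to poll /tmp and copy the file while it still exists. This is a racy sketch, not a reliable capture: it assumes the file lives long enough to be seen by the loop and is readable by the user running it, and a copy taken mid-write may be incomplete, so its size should be checked against the length in the DEBUG line. The function name snapshot_tika_tmp and the capture directory are hypothetical; the fractional sleep assumes GNU sleep.

```shell
# Hypothetical helper: copy any Tika temp files found in SRC into DEST,
# refreshing earlier copies so the latest state is kept.
snapshot_tika_tmp() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    for f in "$src"/apache-tika-*.tmp; do
        # The glob stays literal when nothing matches; skip that case.
        [ -f "$f" ] || continue
        # The file may vanish between the test and the copy; ignore that.
        cp -p "$f" "$dest/" 2>/dev/null
    done
    return 0
}

# Poll while the test mail is delivered; stop with Ctrl-C.
# while :; do snapshot_tika_tmp /tmp /var/tmp/tika-capture; sleep 0.05; done
```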
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
Yes the file is deleted... try setting a breakpoint in
org.apache.tika.parser.pdf.PDFParser so that you get that file.
Alternatively, grab the source code from the trunk, and add this line in
the file
tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java
Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));
after the line that has ", md5: ".
Then build the parser module, and then the standard server subproject
with "mvn -DskipTests install".
The file tika-server-standard-2.4.2-SNAPSHOT.jar will be in
tika-main\tika-server\tika-server-standard\target
I can also do it for you and upload the jar file somewhere, but
obviously that's risky.
Tilman
On 19.07.2022 at 03:53, PGNet Dev wrote:
>
>>
>> I've just improved the output, I'm adding an MD5 checksum. This would
>> be another indicator that something is wrong (or not).
>
> indeed.
>
> i now see in the logs
>
> Jul 18 21:28:23 mx-test tika[18970]: DEBUG [qtp977522995-24]
> 21:28:23,264 org.apache.tika.parser.pdf.PDFParser File:
> /tmp/apache-tika-9115808773791090696.tmp, length: 104932, md5:
> 092bf24b2cac33fac27965549c99613a
>
> checking the original attachment
>
> ls -al Get_Started_With_Smallpdf.pdf
> -rw-r--r-- 1 root root 68K Jul 15 12:16
> Get_Started_With_Smallpdf.pdf
>
> file Get_Started_With_Smallpdf.pdf
> Get_Started_With_Smallpdf.pdf: PDF document, version 1.7
>
> md5sum Get_Started_With_Smallpdf.pdf
> 14266e428c6a5f371c5abe164026c762 Get_Started_With_Smallpdf.pdf
>
> checking,
>
> ls -al /tmp/apache-tika-9115808773791090696.tmp
> ls: cannot access '/tmp/apache-tika-9115808773791090696.tmp':
> No such file or directory
>
> is not persisted.
>
> in any case, the /tmp file's NOT the same size as the orig pdf --
> oddly, LARGER than the original file.
> dunno what to make of that yet.
>
> fwiw, the received attachment is verified to be identical to the sent
> original.
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/18/22 11:56 AM, Tilman Hausherr wrote:
> Something doesn't work properly on your side, I get a lot of "DEBUG" lines. I opened tika-server-standard-2.4.2-SNAPSHOT.jar with 7zip, extracted it, changed it, and put it back. This is how it looks (comment removed):
>
> <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
> <Configuration status="WARN">
> <Appenders>
> <Console name="Console" target="SYSTEM_ERR">
> <PatternLayout pattern="%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n"/>
> </Console>
> </Appenders>
> <Loggers>
> <Root level="debug">
> <AppenderRef ref="Console"/>
> </Root>
> </Loggers>
> </Configuration>
editing log4j2.xml directly in the jar and repacking works. no idea why the other method (an external config file passed via -Dlog4j.configuration) doesn't.
# scratch dir next to the installed jar
D="/srv/tika"
F="tika-server-standard-2.4.2-20220718.165252-94.jar"
cd "${D}"
rm -rf TMP
mkdir -p TMP/mod
cd TMP
rm -f "${F}"*
# fetch the snapshot build
wget "https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server-standard/2.4.2-SNAPSHOT/${F}"
cd mod
# unpack, switch the root logger from info to debug, repack with the original manifest
jar -xfv "../${F}"
perl -pi -e 's|Root level="info"|Root level="debug"|g' log4j2.xml
jar -cvmf META-INF/MANIFEST.MF ../mod.jar *
launch tika using 'mod.jar'
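The same edit can also be done without unpacking the jar, by opening it through the JDK's zip file system. The following is only a sketch under a few assumptions: log4j2.xml sits at the root of the jar (as it does after the extraction above), the jar is not signed, and you work on a copy. Log4jLevelPatcher is a made-up name, not something shipped with Tika or Log4j.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;

// Hypothetical helper for illustration only; not part of Tika or Log4j.
class Log4jLevelPatcher {

    // Replace the level attribute of the <Root ...> element; other loggers are left untouched.
    static String setRootLevel(String xml, String level) {
        return xml.replaceAll("(<Root\\s+level=\")[^\"]*(\")",
                "$1" + Matcher.quoteReplacement(level) + "$2");
    }

    // Rewrite /log4j2.xml inside the jar in place; the jar is updated when the file system closes.
    static void patchJar(Path jar, String level) {
        try (FileSystem zip = FileSystems.newFileSystem(jar, (ClassLoader) null)) {
            Path config = zip.getPath("/log4j2.xml");
            String xml = new String(Files.readAllBytes(config), StandardCharsets.UTF_8);
            Files.write(config, setRootLevel(xml, level).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // args[0] = path to a copy of the tika-server jar
        patchJar(Paths.get(args[0]), "debug");
    }
}
```

The string rewrite is kept separate from the jar handling so it can be checked on its own before touching a real jar.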
verify
ls -al /srv/tika/tika-server.jar
lrwxrwxrwx 1 root root 11 Jul 18 14:46 /srv/tika/tika-server.jar -> TMP/mod.jar
systemctl status tika -ln0
● tika.service - Apache Tika server
Loaded: loaded (/etc/systemd/system/tika.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2022-07-18 21:24:40 EDT; 18s ago
Main PID: 18935 (java)
Tasks: 54 (limit: 8811)
Memory: 174.0M
CPU: 24.491s
CGroup: /system.slice/tika.service
├─ 18935 /usr/bin/java -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
└─ 18970 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-1104448251575803884 -numRestarts 0
re-send message with attachment ...
verbose/DEBUG logs
journalctl -f -u dovecot
-> https://pastebin.com/raw/sk5xevAM
> The output contains a line with "DEBUG" and "org.apache.tika.parser.pdf.PDFParser".
>
> I've just improved the output, I'm adding an MD5 checksum. This would be another indicator that something is wrong (or not).
indeed.
i now see in the logs
Jul 18 21:28:23 mx-test tika[18970]: DEBUG [qtp977522995-24] 21:28:23,264 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-9115808773791090696.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a
checking the original attachment
ls -al Get_Started_With_Smallpdf.pdf
-rw-r--r-- 1 root root 68K Jul 15 12:16 Get_Started_With_Smallpdf.pdf
file Get_Started_With_Smallpdf.pdf
Get_Started_With_Smallpdf.pdf: PDF document, version 1.7
md5sum Get_Started_With_Smallpdf.pdf
14266e428c6a5f371c5abe164026c762 Get_Started_With_Smallpdf.pdf
checking,
ls -al /tmp/apache-tika-9115808773791090696.tmp
ls: cannot access '/tmp/apache-tika-9115808773791090696.tmp': No such file or directory
is not persisted.
in any case, the /tmp file is NOT the same as the orig pdf -- at 104932 bytes it's oddly LARGER than the 68K original, and the md5 differs too.
dunno what to make of that yet.
fwiw, the received attachment is verified to be identical to the sent original.
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
Something doesn't work properly on your side, I get a lot of "DEBUG"
lines. I opened tika-server-standard-2.4.2-SNAPSHOT.jar with 7zip,
extracted it, changed it, and put it back. This is how it looks (comment
removed):
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Configuration status="WARN">
<Appenders>
<Console name="Console" target="SYSTEM_ERR">
<PatternLayout pattern="%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n"/>
</Console>
</Appenders>
<Loggers>
<Root level="debug">
<AppenderRef ref="Console"/>
</Root>
</Loggers>
</Configuration>
The output contains a line with "DEBUG" and
"org.apache.tika.parser.pdf.PDFParser".
I've just improved the output, I'm adding an MD5 checksum. This would be
another indicator that something is wrong (or not).
Tilman
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/17/22 11:10 PM, Tilman Hausherr wrote:
> Alternatively, make your own, and use it with
> -Dlog4j.configuration=file:./log4j.xml
> when starting the server.
per
https://logging.apache.org/log4j/2.x/manual/configuration.html
create
cat /etc/tika/log4j2.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Configuration status="WARN">
<Appenders>
<Console name="Console" target="SYSTEM_ERR">
<PatternLayout pattern="%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n"/>
</Console>
</Appenders>
<Loggers>
<Logger name="org.apache.pdfbox" level="debug">
<AppenderRef ref="Console"/>
</Logger>
<Root level="debug">
<AppenderRef ref="Console"/>
</Root>
</Loggers>
</Configuration>
launch
systemctl status tika -ln0
● tika.service - Apache Tika server
Loaded: loaded (/etc/systemd/system/tika.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2022-07-18 07:42:34 EDT; 2min 54s ago
Main PID: 52278 (java)
Tasks: 54 (limit: 8811)
Memory: 205.8M
CPU: 27.392s
CGroup: /system.slice/tika.service
! ├─ 52278 /usr/bin/java -Dlog4j.configuration=file:/etc/tika/log4j2.xml -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
└─ 52313 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-4132939610106343699 -numRestarts 0
logs,
journalctl -f
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
Jul 18 08:05:02 mx-test tika[53433]: INFO [qtp1401737458-25] 08:05:02,442 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.io.TemporaryResources
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.renderer.pdf.pdfbox.PDFBoxRenderer
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: WARN [qtp1401737458-25] 08:05:03,067 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 104319, length: 366, expected end position: 104685
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: ERROR [qtp1401737458-25] 08:05:03,134 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 18 08:05:03 mx-test tika[53433]: WARN [qtp1401737458-25] 08:05:03,431 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101704, length: 1475, expected end position: 103179
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
Jul 18 08:05:03 mx-test tika[53433]: ERROR [qtp1401737458-25] 08:05:03,445 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 18 08:05:03 mx-test tika[53433]: WARN [qtp1401737458-25] 08:05:03,447 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101514, length: 66, expected end position: 101580
Jul 18 08:05:03 mx-test tika[53433]: ERROR [qtp1401737458-25] 08:05:03,449 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 18 08:05:03 mx-test tika[53433]: WARN [qtp1401737458-25] 08:05:03,459 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 2011, length: 2482, expected end position: 4493
Jul 18 08:05:03 mx-test tika[53433]: WARN [qtp1401737458-25] 08:05:03,466 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (Get_Started_With_Smallpdf.pdf)
Jul 18 08:05:03 mx-test tika[53433]: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@4f3e230b
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 18 08:05:03 mx-test tika[53433]: Caused by: java.io.IOException: Page tree root must be a dictionary
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:291) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:178) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 18 08:05:03 mx-test tika[53433]: ... 37 more
Jul 18 08:05:03 mx-test tika[53433]: ERROR [qtp1401737458-25] 08:05:03,587 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$344/0x0000000800eb2e78, ContentType: text/plain
Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
There's a file log4j2.xml in the jar file, edit that one and put it back
into the jar.
Alternatively, make your own, and use it with
-Dlog4j.configuration=file:./log4j.xml
when starting the server.
Tilman
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/17/22 11:52 AM, Tilman Hausherr wrote:
> https://issues.apache.org/jira/browse/TIKA-3819
> This will show filename and length but only if logging is in DEBUG log level. The modified version will appear at
> https://repository.apache.org/content/groups/snapshots/org/apache/tika/
> in a few hours.
thx o/
checking
https://issues.apache.org/jira/browse/TIKA-3819
i see
Fix Version/s: 2.4.2
https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/697/
Build #697 (Jul 17, 2022, 3:47:56 PM)
i installed
tika-server-standard-2.4.2-20220717.154907-90.jar
set
cat tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<server>
<params>
! <logLevel>debug</logLevel>
...
<forkedJvmArgs>
...
! <arg>-Dlog4j2.debug</arg>
...
and launched,
systemctl status tika -l
● tika.service - Apache Tika server
Loaded: loaded (/etc/systemd/system/tika.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2022-07-17 20:51:36 EDT; 5min ago
Main PID: 25001 (java)
Tasks: 54 (limit: 8811)
Memory: 208.3M
CPU: 31.115s
CGroup: /system.slice/tika.service
├─ 25001 /usr/bin/java -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
└─ 25039 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-8013562591697588923 -numRestarts 0
Jul 17 20:52:15 mx-test tika[25039]: at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:291) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:178) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:52:15 mx-test tika[25039]: ... 37 more
Jul 17 20:52:15 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:52:15,597 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$344/0x0000000800eb2e78, ContentType: text/plain
Jul 17 20:52:15 mx-test tika[25039]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
on receipt of email + pdf attachment, FAIL as before,
journalctl -f -u tika
Jul 17 20:59:42 mx-test tika[25039]: INFO [qtp1401737458-25] 20:59:42,066 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25] 20:59:42,243 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 104319, length: 366, expected end position: 104685
Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:59:42,245 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25] 20:59:42,467 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101704, length: 1475, expected end position: 103179
Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:59:42,469 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25] 20:59:42,481 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101514, length: 66, expected end position: 101580
Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:59:42,482 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25] 20:59:42,493 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 2011, length: 2482, expected end position: 4493
Jul 17 20:59:42 mx-test tika[25039]: WARN [qtp1401737458-25] 20:59:42,495 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (Get_Started_With_Smallpdf.pdf)
Jul 17 20:59:42 mx-test tika[25039]: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@4f3e230b
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 17 20:59:42 mx-test tika[25039]: Caused by: java.io.IOException: Page tree root must be a dictionary
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:291) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:178) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
Jul 17 20:59:42 mx-test tika[25039]: ... 37 more
Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:59:42,499 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$344/0x0000000800eb2e78, ContentType: text/plain
where, the attachment is,
pdfinfo Get_Started_With_Smallpdf.pdf
Creator: Adobe InDesign 15.1 (Macintosh)
Producer: Adobe PDF Library 15.0
CreationDate: Wed Oct 14 11:08:10 2020 EDT
ModDate: Wed Oct 14 11:08:10 2020 EDT
Custom Metadata: no
Metadata Stream: yes
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 595.276 x 841.89 pts (A4)
Page rot: 0
File size: 69451 bytes
Optimized: no
PDF version: 1.7
i don't see any additional DEBUG info, or the file length targeted.
additional steps/config needed to enable the DEBUG output from the snapshot?
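As a sanity check on the numbers above, here is a minimal shell sketch that verifies the COSParser warning is self-consistent and compares its expected end offset with the file size pdfinfo reports for the local copy. The helper name `check_stream` is made up for illustration; the values are copied from this message.

```shell
#!/bin/sh
# Numbers copied from the COSParser warning and the pdfinfo output above.
stream_start=2011
stream_length=2482
expected_end=4493
local_file_size=69451

# check_stream START LENGTH FILE_SIZE (hypothetical helper):
# prints the computed end offset and whether it lies inside a file of FILE_SIZE bytes.
check_stream() {
    end=$(( $1 + $2 ))
    if [ "$end" -le "$3" ]; then
        echo "end=$end within file of $3 bytes"
    else
        echo "end=$end beyond file of $3 bytes"
    fi
}

# The warning's own numbers should add up: start + length = expected end.
[ $(( stream_start + stream_length )) -eq "$expected_end" ] && echo "offsets consistent"

check_stream "$stream_start" "$stream_length" "$local_file_size"
```

The stream ends well inside the 69451-byte local copy, so the local file alone does not explain the warning. This says nothing about the bytes tika-server actually received from dovecot; it only shows where those bytes would have to differ.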
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/17/22 11:52 AM, Tilman Hausherr wrote:
> I'll add some logging when in debug mode, maybe this will help in the future. I still believe the error is on your side, but debugging would help "proving" this.
since, per prior advice, I can curl the attachment to tika server with no error, I tend to agree.
as mentioned, I suspect dovecot's fts-tika plugin.
finding the issue to 'prove' it is the challenge. iiuc, to do that, i need to run debug on tika-server as fed by dovecot/fts-tika -- i.e., in the receive-a-mail use case.
i've asked on dovecot ML if anyone else can confirm, or not, the error.
i know that both in my case, and on ML, 'this' *was* previously working.
> https://issues.apache.org/jira/browse/TIKA-3819
> This will show filename and length but only if logging is in DEBUG log level. The modified version will appear at
> https://repository.apache.org/content/groups/snapshots/org/apache/tika/
> in a few hours.
> Tilman
i'll set up a test env where i can replicate the problem, and watch for the snap to give it a go
o/
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
On 17.07.2022 at 17:09, PGNet Dev wrote:
> On 7/17/22 10:24 AM, Tilman Hausherr wrote:
>> That is in pdfbox, not in tika.
>>
>> There's also a PDFParser.parse() in tika, which then calls
>> PDDocument.load(). However I don't know if this will use the
>> InputStream call, or the one with File. If it uses the one with the
>> file, then check the length and content of the file (tika does
>> sometimes store streams into a temporary file).
>
> i see the same results -- i.e., nada -- with explicit stop in
> PDFParser.parse
>
>> Re the failed build: remove the segment with ossindex-maven-plugin
>> from the parent pom.xml . That plugin (or rather, the company behind
>> it) has gone crazy, we've partly disabled it in the current trunk.
>
> no idea what specifically to do there.
>
> trying to build 'main' with those partial disables, rather than
> '2.4.1', also fails,
I'll add some logging when in debug mode, maybe this will help in the
future. I still believe the error is on your side, but debugging would
help "proving" this.
https://issues.apache.org/jira/browse/TIKA-3819
This will show filename and length but only if logging is in DEBUG log
level. The modified version will appear at
https://repository.apache.org/content/groups/snapshots/org/apache/tika/
in a few hours.
Tilman
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/17/22 10:24 AM, Tilman Hausherr wrote:
> That is in pdfbox, not in tika.
>
> There's also a PDFParser.parse() in tika, which then calls PDDocument.load(). However I don't know if this will use the InputStream call, or the one with File. If it uses the one with the file, then check the length and content of the file (tika does sometimes store streams into a temporary file).
i see the same results -- i.e., nada -- with explicit stop in PDFParser.parse
> Re the failed build: remove the segment with ossindex-maven-plugin from the parent pom.xml . That plugin (or rather, the company behind it) has gone crazy, we've partly disabled it in the current trunk.
no idea what specifically to do there.
trying to build 'main' with those partial disables, rather than '2.4.1', also fails,
INFO [pool-6-thread-1] 10:59:03,890 org.apache.tika.pipes.PipesClient pipesClientId=2 parse success: myId in 58 ms
ERROR [main] 10:59:03,907 org.apache.tika.pipes.PipesServer oom: myId
java.lang.OutOfMemoryError: oom message
at jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:67) ~[?:?]
at java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499) ~[?:?]
at java.lang.reflect.Constructor.newInstance(Constructor.java:483) ~[?:?]
at org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:428) ~[test-classes/:?]
at org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:374) ~[test-classes/:?]
at org.apache.tika.parser.mock.MockParser.executeAction(MockParser.java:155) ~[test-classes/:?]
at org.apache.tika.parser.mock.MockParser.parse(MockParser.java:134) ~[test-classes/:?]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[classes/:?]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[classes/:?]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[classes/:?]
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:163) ~[classes/:?]
at org.apache.tika.pipes.PipesServer.parseRecursive(PipesServer.java:540) ~[classes/:?]
at org.apache.tika.pipes.PipesServer.parse(PipesServer.java:473) ~[classes/:?]
at org.apache.tika.pipes.PipesServer.parseIt(PipesServer.java:420) ~[classes/:?]
at org.apache.tika.pipes.PipesServer.actuallyParse(PipesServer.java:340) ~[classes/:?]
at org.apache.tika.pipes.PipesServer.parseOne(PipesServer.java:311) ~[classes/:?]
at org.apache.tika.pipes.PipesServer.processRequests(PipesServer.java:232) ~[classes/:?]
at org.apache.tika.pipes.PipesServer.main(PipesServer.java:168) ~[classes/:?]
my 1st priority is a stable dovecot search env, so i've removed tika from use & its config.
for now, i'll have to pass this^ on to an admin here that works regularly in a full java env, and won't have to keep guessing at how to debug the app.
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
On 17.07.2022 at 15:58, PGNet Dev wrote:
> On 7/16/22 10:51 PM, Tilman Hausherr wrote:
>> You didn't get the exception I mentioned; then set the breakpoint at
>> parse() to get the fileLen. The current error messages suggest that
>> bytes have been changed or have been lost.
>>
>> IIRC tika saves the PDF in a file in the temp directory before
>> parsing, maybe look there at that time and compare the length and
>> content with your own.
>
>
> i haven't managed to stop at any *.parse bkpt i set after `jdb -attach`
That is in pdfbox, not in tika.
There's also a PDFParser.parse() in tika, which then calls
PDDocument.load(). However I don't know if this will use the InputStream
call, or the one with File. If it uses the one with the file, then check
the length and content of the file (tika does sometimes store streams
into a temporary file).
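A minimal sketch of that length-and-content comparison, assuming a copy of whatever tika-server received can be obtained (for example the temporary file mentioned above, if one exists). The helper names `file_fingerprint` and `same_file` and the paths in the usage comment are placeholders, not part of Tika or dovecot.

```shell
#!/bin/sh
# file_fingerprint FILE (hypothetical helper): prints "<size in bytes> <sha256>".
file_fingerprint() {
    size=$(wc -c < "$1" | tr -d ' ')
    sum=$(sha256sum "$1" | cut -d ' ' -f 1)
    echo "$size $sum"
}

# same_file A B (hypothetical helper): succeeds only if both length and
# checksum match, i.e. no bytes were lost or changed.
same_file() {
    [ "$(file_fingerprint "$1")" = "$(file_fingerprint "$2")" ]
}

# Usage (paths are placeholders):
#   file_fingerprint Get_Started_With_Smallpdf.pdf
#   same_file Get_Started_With_Smallpdf.pdf /tmp/copy-seen-by-tika.pdf \
#       && echo identical || echo differ
```

If the two files differ, `cmp` on the same pair reports the first differing byte, which can then be compared with the stream offsets in the COSParser warning.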
Re the failed build: remove the segment with ossindex-maven-plugin from
the parent pom.xml . That plugin (or rather, the company behind it) has
gone crazy, we've partly disabled it in the current trunk.
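An alternative to editing the pom.xml, sketched here under an assumption: the ossindex-maven-plugin is understood to honour an `ossindex.skip` user property, which would skip the audit for one invocation. Please check the plugin documentation for the version pinned in your checkout before relying on it. The wrapper name `build_tika_server` is made up; the Maven arguments are the ones used elsewhere in this thread (minus `-X`).

```shell
#!/bin/sh
# build_tika_server (hypothetical wrapper): same build as in the thread, with
# the OSS Index audit skipped.
# Assumption: the plugin reads the `ossindex.skip` user property.
build_tika_server() {
    mvn clean &&
    mvn compile -am -pl :tika-server-standard -Dossindex.skip=true
}

# Run from the root of a tika checkout:
#   build_tika_server
```

This only silences the audit step for a local debugging build; it does not change the jetty versions that the audit flagged.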
Tilman
>
> wondering if req'd debug info is included/complete in the runnable
> jar, i decided to try a clean mvn build
>
> git checkout 2.4.1
> mvn clean
> mvn -X compile -am -pl :tika-server-standard
>
> which fails
>
> ...
>
> checking @
>
> https://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>
>
> "Unlike many other errors, this exception is not generated by
> the Maven core itself but by a plugin. As a rule of thumb, plugins use
> this error to signal a failure of the build because there is something
> wrong with the dependencies or sources of a project, e.g. a
> compilation or a test failure."
>
> in /tmp
>
> immediately after tika-server start
>
> '/usr/bin/tree -Csup --timefmt "%F %R:%S %z"' /tmp | grep tika
> ├── [-rw------- tika 0 2022-07-17 09:54:08 -0400] apache-tika-server-forked-tmp-16337036696243797817
> ├── [drwxr-xr-x tika 80 2022-07-17 09:54:08 -0400] hsperfdata_tika
> │ ├── [-rw------- tika 32768 2022-07-17 09:54:04 -0400] 15865
> │ └── [-rw------- tika 32768 2022-07-17 09:54:08 -0400] 15902
>
> , and, same -- i.e. nothing added -- after receipt of email with
> failed tika scan/parse
>
> anyone have some explicit instructions for setting a catchable
> breakpoint in a jdb -attach to tika-server?
> or, error-free build instructions?
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/16/22 10:51 PM, Tilman Hausherr wrote:
> You didn't get the exception I mentioned; then set the breakpoint at parse() to get the fileLen. The current error messages suggest that bytes have been changed or have been lost.
>
> IIRC tika saves the PDF in a file in the temp directory before parsing, maybe look there at that time and compare the length and content with your own.
i haven't managed to stop at any *.parse bkpt i set after `jdb -attach`
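A hedged sketch of one way to get a breakpoint that actually triggers, under several assumptions that are not confirmed anywhere in this thread. First, the JVM that runs the parser must have been started with a JDWP agent; the `ps` output in the first message shows a launcher plus a forked `TikaServerProcess` child, and breakpoints would likely never fire if the debugger is attached to the launcher rather than the child. Second, the port 5005 and the helper name `attach_debugger` are placeholders. Third, the line number is taken from the 2.4.2-SNAPSHOT stack trace posted elsewhere in this thread and will differ for other builds.

```shell
#!/bin/sh
# Assumption: the parsing JVM was started with an agent option such as
#   -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=127.0.0.1:5005
# The address below is an example value, not something from the thread.
JDWP_ADDR="127.0.0.1:5005"

# attach_debugger (hypothetical helper): opens an interactive jdb session
# against the JDWP address above.
attach_debugger() {
    jdb -attach "$JDWP_ADDR"
}

# At the jdb prompt, a line breakpoint avoids ambiguity with overloaded
# methods, and `catch` stops at the point where the exception is thrown:
#   stop at org.apache.tika.parser.pdf.PDFParser:178
#   catch java.io.IOException
#   cont
```

Stack traces that carry line numbers suggest the jar includes line tables, which is what `stop at` needs.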
wondering if req'd debug info is included/complete in the runnable jar, i decided to try a clean mvn build
git checkout 2.4.1
mvn clean
mvn -X compile -am -pl :tika-server-standard
which fails
...
[DEBUG] 82 component-reports; 16.90 ms
[WARNING] Excluding coordinates: com.google.guava:guava:31.1-jre
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Tika parent 2.4.1:
[INFO]
[INFO] Apache Tika parent ................................. SUCCESS [ 0.790 s]
[INFO] Apache Tika core ................................... SUCCESS [ 4.806 s]
[INFO] Apache Tika serialization .......................... SUCCESS [ 0.698 s]
[INFO] Apache Tika parser modules ......................... SUCCESS [ 0.045 s]
[INFO] Apache Tika standard parser modules and package .... SUCCESS [ 0.033 s]
[INFO] Apache Tika standard parser modules ................ SUCCESS [ 0.030 s]
[INFO] Apache Tika html commons ........................... SUCCESS [ 0.114 s]
[INFO] Apache Tika digest commons ......................... SUCCESS [ 0.154 s]
[INFO] Apache Tika mail commons ........................... SUCCESS [ 0.078 s]
[INFO] Apache Tika XMP commons ............................ SUCCESS [ 0.120 s]
[INFO] Apache Tika ZIP commons ............................ SUCCESS [ 0.213 s]
[INFO] Apache Tika image parser module .................... SUCCESS [ 0.355 s]
[INFO] Apache Tika OCR parser module ...................... SUCCESS [ 0.302 s]
[INFO] Apache Tika audiovideo parser module ............... SUCCESS [ 0.369 s]
[INFO] Apache Tika text parser module ..................... SUCCESS [ 0.424 s]
[INFO] Apache Tika code parser module ..................... SUCCESS [ 0.205 s]
[INFO] Apache Tika html parser module ..................... SUCCESS [ 0.305 s]
[INFO] Apache Tika font parser module ..................... SUCCESS [ 0.078 s]
[INFO] Apache Tika XML parser module ...................... SUCCESS [ 0.132 s]
[INFO] Apache Tika Microsoft parser module ................ SUCCESS [ 2.600 s]
[INFO] Apache Tika package parser module .................. SUCCESS [ 0.145 s]
[INFO] Apache Tika PDF parser module ...................... SUCCESS [ 0.667 s]
[INFO] Apache Tika Apple parser module .................... SUCCESS [ 0.216 s]
[INFO] Apache Tika cad parser module ...................... SUCCESS [ 0.203 s]
[INFO] Apache Tika mail parser module ..................... SUCCESS [ 0.187 s]
[INFO] Apache Tika miscellaneous office format parser module SUCCESS [ 0.421 s]
[INFO] Apache Tika news parser module ..................... SUCCESS [ 0.163 s]
[INFO] Apache Tika crypto parser module ................... SUCCESS [ 0.106 s]
[INFO] Apache Tika WARC parser module ..................... SUCCESS [ 0.104 s]
[INFO] Apache Tika standard parser package ................ SUCCESS [ 0.565 s]
[INFO] Apache Tika XMP .................................... SUCCESS [ 0.286 s]
[INFO] Apache Tika language detection ..................... SUCCESS [ 0.021 s]
[INFO] Apache Tika langdetect test commons ................ SUCCESS [ 0.057 s]
[INFO] Apache Tika Optimaize langdetect ................... SUCCESS [ 0.108 s]
[INFO] Apache Tika OpenNLP langdetect ..................... SUCCESS [ 0.114 s]
[INFO] Apache Tika pipes .................................. SUCCESS [ 0.018 s]
[INFO] Apache Tika emitters ............................... SUCCESS [ 0.017 s]
[INFO] Apache Tika filesystem emitter ..................... SUCCESS [ 0.065 s]
[INFO] Apache Tika translate .............................. SUCCESS [ 0.446 s]
[INFO] Apache Tika server module .......................... SUCCESS [ 0.019 s]
[INFO] Apache Tika server core ............................ FAILURE [ 0.112 s]
[INFO] Apache Tika standard server ........................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.545 s
[INFO] Finished at: 2022-07-17T09:41:53-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
[ERROR] org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
[ERROR] * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
[ERROR] org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
[ERROR] * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
[ERROR]
[ERROR] Excluded coordinates:
[ERROR] - com.google.guava:guava:31.1-jre
[ERROR]
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
* [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
* [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
Excluded coordinates:
- com.google.guava:guava:31.1-jre
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
at java.lang.reflect.Method.invoke (Method.java:577)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoFailureException: Detected 2 vulnerable components:
org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
* [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
* [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
Excluded coordinates:
- com.google.guava:guava:31.1-jre
at org.sonatype.ossindex.maven.plugin.AuditMojoSupport.execute (AuditMojoSupport.java:257)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
at java.lang.reflect.Method.invoke (Method.java:577)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <args> -rf :tika-server-core
checking the help page referenced above @
https://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
, it says
"Unlike many other errors, this exception is not generated by the Maven core itself but by a plugin. As a rule of thumb, plugins use this error to signal a failure of the build because there is something wrong with the dependencies or sources of a project, e.g. a compilation or a test failure."
so the build failure here comes from the ossindex audit plugin flagging the two jetty 9.4.46 components for CVE-2022-2047, not from a compile or test failure.
in /tmp, immediately after tika-server start,
'/usr/bin/tree -Csup --timefmt "%F %R:%S %z"' /tmp | grep tika
├── [-rw------- tika 0 2022-07-17 09:54:08 -0400] apache-tika-server-forked-tmp-16337036696243797817
├── [drwxr-xr-x tika 80 2022-07-17 09:54:08 -0400] hsperfdata_tika
│ ├── [-rw------- tika 32768 2022-07-17 09:54:04 -0400] 15865
│ └── [-rw------- tika 32768 2022-07-17 09:54:08 -0400] 15902
and the listing is the same -- i.e. nothing added -- after receipt of an email whose tika scan/parse fails.
does anyone have explicit instructions for setting a breakpoint that actually gets hit when attaching to tika-server with 'jdb -attach'?
or, error-free build instructions?
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
Am 16.07.2022 um 18:43 schrieb PGNet Dev:
>
> i don't get any more useful info on failure,
>
> --> https://pastebin.com/raw/DsrLxbeg
You didn't get the exception I mentioned; in that case, set the breakpoint at
parse() to get the fileLen. The current error messages suggest that
bytes have been changed or lost.
IIRC tika saves the PDF to a file in the temp directory before parsing;
maybe look there at that time and compare the length and content with
your own copy.
Tilman
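A minimal sketch of the length-and-content comparison suggested above, assuming you can get hold of both the original PDF and whatever copy reached the tika side. The class name FileFingerprint and the two command-line paths are hypothetical; nothing here is part of Tika or PDFBox, and it uses only JDK classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a Tika or PDFBox class): byte length plus MD5 of a file.
class FileFingerprint {
    final long length;
    final String md5Hex;

    FileFingerprint(long length, String md5Hex) {
        this.length = length;
        this.md5Hex = md5Hex;
    }

    // Streams the file once, counting bytes and feeding them to the digest.
    static FileFingerprint of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md5.update(buf, 0, n);
                total += n;
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return new FileFingerprint(total, hex.toString());
    }

    public static void main(String[] args) throws Exception {
        // args[0]: the PDF as you have it; args[1]: the copy found on the tika side.
        FileFingerprint mine = of(Path.of(args[0]));
        FileFingerprint seen = of(Path.of(args[1]));
        System.out.println("original: " + mine.length + " bytes, md5 " + mine.md5Hex);
        System.out.println("received: " + seen.length + " bytes, md5 " + seen.md5Hex);
        System.out.println(mine.length == seen.length && mine.md5Hex.equals(seen.md5Hex)
                ? "identical" : "DIFFERENT");
    }
}
```

A shorter "received" length would point at truncation; equal lengths with different checksums would point at bytes being changed in transit.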
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tim Allison <ta...@apache.org>.
Right. I think Tilman was suggesting adding new debug logging to
tika-server.
On Sat, Jul 16, 2022 at 12:43 PM PGNet Dev <pg...@gmail.com> wrote:
> with debug log levels set in tika config
>
> cat tika-server-config-custom.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
> <server>
> <params>
> <logLevel>debug</logLevel>
> <port>9998</port>
> <host>127.0.0.1</host>
> <javaPath>/usr/bin/java</javaPath>
> <noFork>false</noFork>
> <forkedJvmArgs>
> <arg>-Xms1g</arg>
> <arg>-Xmx1g</arg>
> <arg>-Dpdfbox.fontcache=/var/tika</arg>
> <arg>-Dlog4j2.debug</arg>
> </forkedJvmArgs>
> ...
>
> i don't get any more useful info on failure,
>
> --> https://pastebin.com/raw/DsrLxbeg
>
> . unless there's more relevant debug info to squeeze out from config
> alone,
>
> On 7/15/22 10:43 PM, Tilman Hausherr wrote:
> > The next that could be done is to debug this, if possible. Tim suggested
> > the file might be truncated.
> >
> > I don't know if it is possible, if you can run tika in a debugger, then
> > stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse() where the
> > exception "Page tree root must be a dictionary" happens. There try to
> > access this.fileLen . Compare that number to your file length.
>
> , I'll figure out how to debug the tika-server java backend, while being
> fed by the dovecot attachment submission task.
> guessing 'jdb',
>
> jdb -version
> This is jdb version 18.0 (Java SE version 18.0.1)
>
> is the right tool for that.
>
>
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
with debug log levels set in tika config
cat tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<server>
<params>
<logLevel>debug</logLevel>
<port>9998</port>
<host>127.0.0.1</host>
<javaPath>/usr/bin/java</javaPath>
<noFork>false</noFork>
<forkedJvmArgs>
<arg>-Xms1g</arg>
<arg>-Xmx1g</arg>
<arg>-Dpdfbox.fontcache=/var/tika</arg>
<arg>-Dlog4j2.debug</arg>
</forkedJvmArgs>
...
i don't get any more useful info on failure,
--> https://pastebin.com/raw/DsrLxbeg
unless there's more relevant debug info to squeeze out from config alone,
On 7/15/22 10:43 PM, Tilman Hausherr wrote:
> The next that could be done is to debug this, if possible. Tim suggested the file might be truncated.
>
> I don't know if it is possible, if you can run tika in a debugger, then stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse() where the exception "Page tree root must be a dictionary" happens. There try to access this.fileLen . Compare that number to your file length.
I'll figure out how to debug the tika-server java backend while it's being fed by the dovecot attachment submission task.
guessing 'jdb',
jdb -version
This is jdb version 18.0 (Java SE version 18.0.1)
is the right tool for that.
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
That's what I also get.
The next thing that could be done is to debug this, if possible. Tim suggested
the file might be truncated.
I don't know if it is possible, but if you can run tika in a debugger, then
stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse() where the
exception "Page tree root must be a dictionary" happens. There, try to
access this.fileLen and compare that number to your file length.
(I'm wondering if we are offering some debug info in the tika server, or
if we could offer it in the future, e.g. telling the length, and/or
offering an MD5 checksum if log debug mode is on)
An alternative would be that 1) I add the file length to the PDFBox
exception and 2) you create a Tika build with the PDFBox snapshot.
Tilman
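To illustrate the kind of debug info wondered about above (length and/or MD5 of what the server actually received), here is a rough sketch of a stream wrapper that records both as the bytes are read. FingerprintingInputStream is a hypothetical name; nothing like it is claimed to exist in tika-server, and the sketch relies only on JDK classes.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical wrapper: counts and digests every byte returned by read().
class FingerprintingInputStream extends FilterInputStream {
    private final MessageDigest md5;
    private long length;

    FingerprintingInputStream(InputStream in) {
        super(in);
        try {
            this.md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != -1) {
            md5.update((byte) b);
            length++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            md5.update(buf, off, n);
            length += n;
        }
        return n;
    }

    // mark/reset would let bytes be read twice, so it is disabled here.
    @Override
    public boolean markSupported() {
        return false;
    }

    long length() {
        return length;
    }

    // Call once after the stream is fully read; digest() resets the MessageDigest.
    String md5Hex() {
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        FingerprintingInputStream in = new FingerprintingInputStream(new ByteArrayInputStream(data));
        in.readAllBytes();
        System.out.println("length=" + in.length() + " md5=" + in.md5Hex());
    }
}
```

Bytes passed over with skip() are not seen by this wrapper, so the numbers are only meaningful when the consumer reads the stream to the end.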
Am 15.07.2022 um 18:26 schrieb PGNet Dev:
> On 7/15/22 12:01 PM, Tim Allison wrote:
>> If you curl the test file (GetStartedWithSmallpdf.pdf) against your
>> tika-server, what do you see? The test file works for me with
>> 2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
>
>
>> If you curl the test file (GetStartedWithSmallpdf.pdf) against your
>> tika-server, what do you see?
>
> in journal log, only this:
>
> Jul 15 12:24:47 mx.loc tika[1143]: INFO [qtp1837533591-23]
> 12:24:47,978 org.apache.tika.server.core.resource.TikaResource /tika
> (application/pdf)
>
> and, @ console, this:
>
> https://pastebin.com/raw/Nu1RCbat
>
>
>
>> Are the files getting truncated somehow?
>
> Perhaps? I'd guess that since curl of the source file against tika ,
> as above, works ok, that what's feeding tika -- namely dovecot's fts
> plugin -- would be a likely candidate.
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by PGNet Dev <pg...@gmail.com>.
On 7/15/22 12:01 PM, Tim Allison wrote:
> If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see? The test file works for me with 2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
> If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see?
in journal log, only this:
Jul 15 12:24:47 mx.loc tika[1143]: INFO [qtp1837533591-23] 12:24:47,978 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
and, @ console, this:
https://pastebin.com/raw/Nu1RCbat
> Are the files getting truncated somehow?
Perhaps? Since a curl of the source file against tika, as above, works ok, I'd guess that what's feeding tika -- namely dovecot's fts plugin -- is the likely candidate.
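For a repeatable version of that direct test that also prints how many bytes were sent, the same kind of request can be made from Java. This is only a sketch: the class name TikaPutCheck is hypothetical, the exact curl invocation used above was not shown, and the URL is taken from the fts_tika setting earlier in the thread. It assumes tika-server's /tika endpoint is accepting PUT requests and needs Java 11 or later for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical test client: PUTs a local PDF to tika-server and reports sizes.
class TikaPutCheck {

    // Builds the PUT request; kept separate so it can be inspected without a running server.
    static HttpRequest buildRequest(URI endpoint, byte[] body) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/pdf")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the test PDF (e.g. GetStartedWithSmallpdf.pdf)
        byte[] pdf = Files.readAllBytes(Path.of(args[0]));
        System.out.println("sending " + pdf.length + " bytes");

        HttpRequest request = buildRequest(URI.create("http://127.0.0.1:9998/tika"), pdf);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("HTTP " + response.statusCode() + ", "
                + response.body().length() + " chars of extracted text");
    }
}
```

If this succeeds with the full byte count while the dovecot-submitted copy fails, that narrows the problem to what dovecot sends rather than to the file or to tika.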
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tim Allison <ta...@apache.org>.
If I truncate the test file with a hex editor, I see this:
INFO [main] 12:04:20,641 org.apache.tika.server.core.TikaServerProcess Starting Apache Tika 2.4.2-SNAPSHOT server
INFO [main] 12:04:20,823 org.apache.tika.server.core.TikaServerProcess loading resource from SPI: class org.apache.tika.server.standard.resource.XMPMetadataResource
INFO [main] 12:04:21,044 org.apache.cxf.endpoint.ServerImpl Setting the server's publish address to be http://localhost:9998/
INFO [main] 12:04:21,111 org.eclipse.jetty.util.log Logging initialized @1675ms to org.eclipse.jetty.util.log.Slf4jLog
INFO [main] 12:04:21,169 org.eclipse.jetty.server.Server jetty-9.4.48.v20220622; built: 2022-06-21T20:42:25.880Z; git: 6b67c5719d1f4371b33655ff2d047d24e171e49a; jvm 11.0.11+9
INFO [main] 12:04:21,205 org.eclipse.jetty.server.AbstractConnector Started ServerConnector@352e787a{HTTP/1.1, (http/1.1)}{localhost:9998}
INFO [main] 12:04:21,205 org.eclipse.jetty.server.Server Started @1771ms
WARN [main] 12:04:21,212 org.eclipse.jetty.server.handler.ContextHandler Empty contextPath
INFO [main] 12:04:21,226 org.eclipse.jetty.server.handler.ContextHandler Started o.e.j.s.h.ContextHandler@408b87aa{/,null,AVAILABLE}
INFO [main] 12:04:21,232 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server fabf267b-a86c-43d7-9845-e15f36d032e2 at http://localhost:9998/
INFO [qtp499951827-28] 12:04:24,324 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
WARN [qtp499951827-28] 12:04:24,683 org.apache.pdfbox.pdfparser.COSParser Skipped incomplete object stream:108 0 R at 67085
WARN [qtp499951827-28] 12:04:24,688 org.apache.tika.server.core.resource.TikaResource tika: Text extraction failed (null)
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5ec70124
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.server.core.resource.TikaResource.lambda$produceOutput$2(TikaResource.java:680) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.io.IOException: Page tree root must be a dictionary
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
... 35 more
ERROR [qtp499951827-28] 12:04:24,705 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$349/0x0000000800394c40, ContentType: text/xml
On Fri, Jul 15, 2022 at 12:01 PM Tim Allison <ta...@apache.org> wrote:
> If you curl the test file (GetStartedWithSmallpdf.pdf) against your
> tika-server, what do you see? The test file works for me with
> 2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
>
>
>
> On Fri, Jul 15, 2022 at 9:41 AM PGNet Dev <pg...@gmail.com> wrote:
>
>> i'm running tika-server 2.4.1 on a linux box,
>>
>> lsb_release -rd
>> Description: Fedora release 36 (Thirty Six)
>> Release: 36
>>
>> uname -rm
>> 5.18.11-200.fc36.x86_64 x86_64
>>
>> java -version
>> Picked up JAVA_TOOL_OPTIONS: -Xmx512M
>> openjdk version "18.0.1" 2022-04-19
>> OpenJDK Runtime Environment 22.3 (build 18.0.1+10)
>> OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10, mixed
>> mode, sharing)
>>
>>
>> ps ax | grep tika-server
>> 1003 ? Ssl 0:12 /usr/bin/java -jar
>> /srv/webapps/tika/tika-server.jar -c
>> /usr/local/etc/tika/tika-server-config-custom.xml
>> 1143 ? Sl 0:37 /usr/bin/java -Xms1g -Xmx1g
>> -Dpdfbox.fontcache=/var/tika -Dlog4j2.info -Djava.awt.headless=true -cp
>> /srv/webapps/tika/tika-server.jar -Dtika.server.id=
>> org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i -c
>> /usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile
>> /tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0
>>
>> it's invoked from a dovecot imap server instance, for attachment parsing,
>>
>> dovecot --version
>> 2.3.19.1 (9b53102964)
>>
>> cat dovecot/conf.d/10-master.com
>> ...
>> plugin {
>> ...
>> fts_tika = http://127.0.0.1:9998/tika/
>> }
>> ...
>>
>> on receipt of an email with a standard attachment/exmaple -- e.g. the
>> example pdf @
>>
>> https://smallpdf.com/edit-pdf
>>
>> , per journal logs, the message is submitted to tika, but fails due to a
>> 'corrupt stream'
>>
>> Jul 15 08:41:27 mx tika[1143]: INFO [qtp1837533591-27]
>> 08:41:27,224 org.apache.tika.server.core.resource.TikaResource /tika
>> (application/pdf)
>> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
>> 08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the stream
>> doesn't point to the correct offset, using workaround to read the stream,
>> stream start position: 104315, length: 356, expected end position: 104671
>> Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
>> 08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
>> corrupt stream due to a DataFormatException
>> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
>> 08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the stream
>> doesn't point to the correct offset, using workaround to read the stream,
>> stream start position: 101699, length: 1472, expected end position: 103171
>> Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
>> 08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
>> corrupt stream due to a DataFormatException
>> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
>> 08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the stream
>> doesn't point to the correct offset, using workaround to read the stream,
>> stream start position: 101509, length: 66, expected end position: 101575
>> Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
>> 08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
>> corrupt stream due to a DataFormatException
>> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
>> 08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the stream
>> doesn't point to the correct offset, using workaround to read the stream,
>> stream start position: 2011, length: 2482, expected end position: 4493
>> Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
>> 08:41:27,752 org.apache.tika.server.core.resource.TikaResource tika/: Text
>> extraction failed (test.pdf)
>> Jul 15 08:41:27 mx tika[1143]:
>> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
>> org.apache.tika.parser.pdf.PDFParser@356fdbd7
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
>> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>> ~[tika-server-standard-2.4.1.jar:2.4.1]
>> Jul 15 08:41:27 mx tika[1143]: at
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tilman Hausherr <TH...@t-online.de>.
I got it to work with tika-server-standard and
curl -X PUT --data-binary @Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
and got the extracted text back, with nothing nasty on the console.
Tilman
Am 15.07.2022 um 18:01 schrieb Tim Allison:
> If you curl the test file (GetStartedWithSmallpdf.pdf) against your
> tika-server, what do you see? The test file works for me with
> 2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?
Posted by Tim Allison <ta...@apache.org>.
If you curl the test file (GetStartedWithSmallpdf.pdf) against your
tika-server, what do you see? The test file works for me with
2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
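
One rough way to test the truncation idea is to check that the bytes reaching Tika still look like a complete PDF. The sketch below is only a heuristic. The helper name pdf_looks_complete is hypothetical and is not part of Tika, PDFBox, or dovecot. It assumes a complete PDF starts with "%PDF-" and carries an "%%EOF" marker within its last 1024 bytes, and that head -c, tail -c, and grep -a are available (as on a typical Fedora install).

```shell
# Hypothetical helper (not part of Tika, PDFBox, or dovecot).
# Returns 0 if the file has a PDF header and an end-of-file marker,
# non-zero otherwise. Passing this check does not prove the file is intact.
pdf_looks_complete() {
    f="$1"
    # The first 5 bytes should be the PDF header magic.
    [ "$(head -c 5 "$f")" = "%PDF-" ] || return 1
    # The end-of-file marker should appear within the last 1024 bytes.
    # -a treats binary data as text, -q suppresses output.
    tail -c 1024 "$f" | grep -a -q '%%EOF'
}
```

For example, after saving the attachment as test.pdf, running pdf_looks_complete test.pdf && echo "looks complete" || echo "possibly truncated" gives a first indication. A file that passes can still have been altered in the middle (for instance by line-ending or encoding changes), so comparing sha256sum output for the original file and the copy that was mailed is a stronger check.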