Posted to user@tika.apache.org by PGNet Dev <pg...@gmail.com> on 2022/07/15 13:41:16 UTC

tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

  i'm running tika-server 2.4.1 on a linux box,

	lsb_release -rd
		Description:    Fedora release 36 (Thirty Six)
		Release:        36

	uname -rm
		5.18.11-200.fc36.x86_64 x86_64

	java -version
		Picked up JAVA_TOOL_OPTIONS: -Xmx512M
		openjdk version "18.0.1" 2022-04-19
		OpenJDK Runtime Environment 22.3 (build 18.0.1+10)
		OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10, mixed mode, sharing)


	ps ax | grep tika-server
	   1003 ?        Ssl    0:12 /usr/bin/java -jar /srv/webapps/tika/tika-server.jar -c /usr/local/etc/tika/tika-server-config-custom.xml
	   1143 ?        Sl     0:37 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.info -Djava.awt.headless=true -cp /srv/webapps/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i  -c /usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0

it's invoked from a dovecot imap server instance, for attachment parsing,

	dovecot --version
		2.3.19.1 (9b53102964)

	cat dovecot/conf.d/10-master.conf
		...
		plugin {
			...
			fts_tika = http://127.0.0.1:9998/tika/
		}
		...

on receipt of an email with a standard attachment/example -- e.g. the example pdf @

	https://smallpdf.com/edit-pdf

, per journal logs, the message is submitted to tika, but fails due to a 'corrupt stream'

	Jul 15 08:41:27 mx tika[1143]: INFO  [qtp1837533591-27] 08:41:27,224 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
	Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 104315, length: 356, expected end position: 104671
	Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
	Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101699, length: 1472, expected end position: 103171
	Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
	Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101509, length: 66, expected end position: 101575
	Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
	Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 2011, length: 2482, expected end position: 4493
	Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,752 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (test.pdf)
	Jul 15 08:41:27 mx tika[1143]: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@356fdbd7
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at java.lang.Thread.run(Thread.java:833) ~[?:?]
	Jul 15 08:41:27 mx tika[1143]: Caused by: java.io.IOException: Page tree root must be a dictionary
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.1.jar:2.4.1]
	Jul 15 08:41:27 mx tika[1143]:         ... 37 more
	Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,767 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$337/0x0000000800eabbf8, ContentType: text/plain

Is this likely an issue with tika-server itself? &/or java/dovecot?

What additional diagnostics can help narrow it down?
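
One diagnostic that may help, sketched here under assumptions: each COSParser warning above reports a stream start, a length and an expected end position, and start + length equals the expected end in every warning shown. If the largest expected end position is greater than the number of bytes tika actually received, that would suggest the upload was cut short on the way in. The function name `max_expected_end` and the regular expression are illustrative only, not part of tika or pdfbox.

```python
import re

# Matches the offsets reported in pdfbox's COSParser warning, as seen in the log above.
_WARNING = re.compile(
    r"stream start position: (\d+), length: (\d+), expected end position: (\d+)"
)

def max_expected_end(log_text: str) -> int:
    """Return the largest 'expected end position' in the log, or 0 if none is found."""
    ends = []
    for start, length, end in _WARNING.findall(log_text):
        # Sanity check: in every warning in this thread, start + length == expected end.
        if int(start) + int(length) == int(end):
            ends.append(int(end))
    return max(ends, default=0)

# Two of the warnings from the log above, with the journal prefix shortened to "...".
sample = (
    "... stream start position: 104315, length: 356, expected end position: 104671\n"
    "... stream start position: 2011, length: 2482, expected end position: 4493\n"
)
print(max_expected_end(sample))  # prints 104671
```

Comparing that number with the size of the PDF as it was sent gives a quick hint about whether the bytes that reached the parser were complete.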

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/15/22 11:22 AM, Tilman Hausherr wrote:
> likely invalid PDFs. Please upload them somewhere for inspection

I'm seeing this with all the PDFs I've tried ... so far.

Including the one I grabbed from the site I referenced in the OP, which I've re-uploaded to:

   https://ufile.io/dkew7k0u




Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 15.07.2022 um 15:41 schrieb PGNet Dev:
>     Jul 15 08:41:27 mx tika[1143]: INFO  [qtp1837533591-27] 
> 08:41:27,224 org.apache.tika.server.core.resource.TikaResource /tika 
> (application/pdf)
>     Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 
> 08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the 
> stream doesn't point to the correct offset, using workaround to read 
> the stream, stream start position: 104315, length: 356, expected end 
> position: 104671
>     Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 
> 08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop 
> reading corrupt stream due to a DataFormatException
>     Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 
> 08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the 
> stream doesn't point to the correct offset, using workaround to read 
> the stream, stream start position: 101699, length: 1472, expected end 
> position: 103171
>     Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 
> 08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop 
> reading corrupt stream due to a DataFormatException
>     Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 
> 08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the 
> stream doesn't point to the correct offset, using workaround to read 
> the stream, stream start position: 101509, length: 66, expected end 
> position: 101575
>     Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 
> 08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop 
> reading corrupt stream due to a DataFormatException
>     Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 
> 08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the 
> stream doesn't point to the correct offset, using workaround to read 
> the stream, stream start position: 2011, length: 2482, expected end 
> position: 4493


>     Jul 15 08:41:27 mx tika[1143]: Caused by: java.io.IOException: 
> Page tree root must be a dictionary

likely invalid PDFs. Please upload them somewhere for inspection

Tilman


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/15/22 10:43 PM, Tilman Hausherr wrote:
> That's what I also get.
> 
> The next thing that could be done is to debug this, if possible. Tim suggested the file might be truncated.
> 
> I don't know if it is possible, but if you can run tika in a debugger, stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse(), where the exception "Page tree root must be a dictionary" happens. There, try to access this.fileLen and compare that number to your file length.
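
Short of a debugger, a rough check of the truncation theory is possible on a saved copy of the bytes. The sketch below is an assumption-laden heuristic, not what pdfbox does: a complete PDF normally starts with `%PDF-` and has an `%%EOF` marker close to its end, so a copy missing that marker has probably been cut short. `looks_complete` and the 1024-byte tail window are hypothetical choices.

```python
def looks_complete(data: bytes, tail_window: int = 1024) -> bool:
    """Heuristic: header at the very start and an %%EOF marker near the end."""
    has_header = data.startswith(b"%PDF-")
    has_eof = b"%%EOF" in data[-tail_window:]
    return has_header and has_eof

# Tiny synthetic stand-in for a PDF, then the same bytes cut short.
whole = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\nstartxref\n9\n%%EOF\n"
cut = whole[:20]

print(len(whole), looks_complete(whole))
print(len(cut), looks_complete(cut))
```

Run against the attachment as it is handed to tika and against the file on disk, a difference in length or a missing marker would point at the handoff rather than at the PDF itself.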

1st stab at debugging this, i launch tika with debug tooling,

	/usr/bin/java \
	 -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:8080,server=y,suspend=n \
	 -jar /srv/tika/tika-server.jar \
	 -c /etc/tika/tika-server-config-custom.xml

in another shell, attach the debugger

	jdb -attach 127.0.0.1:8080

then set the bp

	> stop in org.apache.pdfbox.pdfparser.PDFParser.initialParse
		Deferring breakpoint org.apache.pdfbox.pdfparser.PDFParser.initialParse.
		It will be set after the class is loaded.

i then send/receive the email with PDF attachment -- through dovecot>tika -- as above

i again see the scan-fail error in tika logs, but never see a

	Breakpoint hit: ...

dumping at prompt anyway,

	> dump this.fileLen
		No current thread
		 this.fileLen = null
	> threads
		Group system:
		  (java.lang.ref.Reference$ReferenceHandler)2788 Reference Handler   running
		  (java.lang.ref.Finalizer$FinalizerThread)2789  Finalizer           cond. waiting
		  (java.lang.Thread)2790                         Signal Dispatcher   running
		  (java.lang.Thread)2791                         Notification Thread running
		  (java.lang.Thread)2792                         process reaper      running
		Group main:
		  (java.lang.Thread)1                            main                cond. waiting
		  (java.lang.Thread)2780                         pool-2-thread-1     cond. waiting
		  (java.lang.Thread)2795                         Thread-2            running
		Group InnocuousThreadGroup:
		  (jdk.internal.misc.InnocuousThread)2796        Common-Cleaner      cond. waiting

am i even setting the stop correctly, in order to get at the fail?
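
One possible explanation, offered as a guess rather than a diagnosis: the `ps` listing at the top of the thread shows two JVMs, the one launched with `-jar` and a forked `org.apache.tika.server.core.TikaServerProcess` child, and the parse errors are logged under the child's pid. If the `-agentlib:jdwp` option only reaches the launcher, a debugger attached there would never see the pdfbox class load, which would fit the thread list above having no `qtp...` request threads. The sketch below only inspects command lines as `ps` prints them; `jvms_with_debug_agent` is a made-up helper name, and the pids and shortened arguments in the sample are hypothetical.

```python
def jvms_with_debug_agent(ps_lines):
    """Map each tika JVM's pid to (is_forked_child, has_jdwp_agent)."""
    report = {}
    for line in ps_lines:
        if "tika-server" not in line:
            continue
        pid = line.split()[0]
        is_child = "org.apache.tika.server.core.TikaServerProcess" in line
        has_agent = "-agentlib:jdwp" in line
        report[pid] = (is_child, has_agent)
    return report

# Hypothetical input modelled on the earlier `ps ax` listing (pids invented, arguments shortened).
ps_lines = [
    "2001 ? Ssl 0:12 /usr/bin/java -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:8080,server=y,suspend=n -jar /srv/tika/tika-server.jar",
    "2002 ? Sl 0:37 /usr/bin/java -Xms1g -Xmx1g -cp /srv/tika/tika-server.jar org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998",
]
for pid, (is_child, has_agent) in jvms_with_debug_agent(ps_lines).items():
    print(pid, "forked child" if is_child else "launcher", "jdwp" if has_agent else "no jdwp")
```

If the child turns out to lack the agent, the breakpoint would have to be set from a debugger attached to that child instead.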

> An alternative would be that 1) I add the file length in PDFBox exception 2) you create a Tika build with the PDFBox snapshot.

atm, i'm not building tika-server myself. rather, using just the DL'd runnable jar from

	https://dlcdn.apache.org/tika/2.4.1/tika-server-standard-2.4.1.jar



Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/23/22 2:01 PM, Tilman Hausherr wrote:
> Weird... the only changes I made today are these, in the parent pom and in the tika-pipes pom. They update some versions and remove version dependencies that were needed to avoid conflicts; those conflicts weren't happening anymore, so I removed the entries to simplify version maintenance.

I can't confirm that it started working WITH this update, only that it started working at or prior to it.

entirely possible that another very recent update -- in the last couple of days' snap builds -- fixed it, and that it started working while i was 'looking elsewhere'.

in any case, it appears to be behaving itself for now ... fingers crossed

well enough that I can (mostly) reliably re-scan/index my local server's email collection.

still getting some timeouts in the dovecot handoff if OCR is taking too long on a big source, and currently having some fun figuring out why i can't manage to set/override tesseract OCR params in tika config.xml (new thread on this ML),
but iiuc, neither's related to this^ prior issue. i.e., it now *IS* scanning.

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 23.07.2022 um 19:46 schrieb PGNet Dev:
> with update to
>
>   tika-server-standard-2.4.2-20220723.145242-114.jar
>
> no more errors. so far.
>
> dunno what change fixed it ...
>
Weird... the only changes I made today are these, in the parent pom and
in the tika-pipes pom. They update some versions and remove version
dependencies that were needed to avoid conflicts; those conflicts
weren't happening anymore, so I removed the entries to simplify version
maintenance.

# This patch file was generated by NetBeans IDE
# It uses platform neutral UTF-8 encoding and \n newlines.
--- a/pom.xml (e8ff570)
+++ b/pom.xml (3d8e7ef)
@@ -288,7 +288,7 @@

      <!-- dependency versions -->

-    <aws.version>1.12.265</aws.version>
+    <aws.version>1.12.267</aws.version>
      <google.cloud.version>2.10.0</google.cloud.version>
      <asm.version>9.3</asm.version>
      <boilerpipe.version>1.1.0</boilerpipe.version>
@@ -473,7 +473,7 @@
        <dependency>
          <groupId>com.google.protobuf</groupId>
          <artifactId>protobuf-java</artifactId>
-        <version>3.21.2</version>
+        <version>3.21.3</version>
        </dependency>
        <dependency>
          <groupId>com.ibm.icu</groupId>
@@ -526,11 +526,6 @@
          <version>${javax.annotation.version}</version>
        </dependency>
        <dependency>
-        <groupId>javax.servlet</groupId>
-        <artifactId>servlet-api</artifactId>
-        <version>2.5</version>
-      </dependency>
-      <dependency>
          <groupId>javax.xml.soap</groupId>
          <artifactId>javax.xml.soap-api</artifactId>
          <version>1.4.0</version>


# This patch file was generated by NetBeans IDE
# It uses platform neutral UTF-8 encoding and \n newlines.
--- a/pom.xml (1385efc)
+++ b/Current File
@@ -112,11 +112,6 @@
        </dependency>
        <dependency>
          <groupId>io.netty</groupId>
-        <artifactId>netty-tcnative-classes</artifactId>
-        <version>2.0.53.Final</version>
-      </dependency>
-      <dependency>
-        <groupId>io.netty</groupId>
          <artifactId>netty-transport</artifactId>
          <version>${netty.version}</version>
        </dependency>
@@ -130,16 +125,6 @@
          <artifactId>netty-transport-native-epoll</artifactId>
          <version>${netty.version}</version>
        </dependency>
-      <dependency>
-        <groupId>io.projectreactor.netty</groupId>
-        <artifactId>reactor-netty-http</artifactId>
-        <version>1.0.21</version>
-      </dependency>
-      <dependency>
-        <groupId>io.projectreactor</groupId>
-        <artifactId>reactor-core</artifactId>
-        <version>3.4.21</version>
-      </dependency>
      </dependencies>
    </dependencyManagement>
    <build>


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
with update to

   tika-server-standard-2.4.2-20220723.145242-114.jar

no more errors. so far.

dunno what change fixed it ...


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 20.07.2022 um 22:34 schrieb PGNet Dev:
>
>
>     curl -v  --header "Accept: text/plain" -T 
> ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika


It's mysterious, it works fine at work, but not at home. I don't think 
that's the cause of your problem because you never got THAT bug.

Tilman


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/20/22 2:13 PM, Tilman Hausherr wrote:
> I noticed you have "Accept: text/plain"
> 
> When I try this:
> 
> curl -T Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Accept: text/plain"
> 
> I get
> 
> Caused by: java.util.NoSuchElementException: No value present
>          at java.util.OptionalInt.getAsInt(OptionalInt.java:130) ~[?:?]
>          at org.apache.tika.server.core.ProduceTypeResourceComparator.compareProduceTypes(ProduceTypeResourceComparator.java:136) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>          at org.apache.tika.server.core.ProduceTypeResourceComparator.compare(ProduceTypeResourceComparator.java:97) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>          at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:69) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>          at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:31) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>          at java.util.TreeMap.put(TreeMap.java:795) ~[?:?]
>          at java.util.TreeMap.put(TreeMap.java:534) ~[?:?]
>          at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:551) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> 
> without the header, I get the html output.

i don't see your error with curl, with or without the header spec'd

here, *with* 'text/plain' header specified,

	curl -v  --header "Accept: text/plain" -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika
		*   Trying 127.0.0.1:9998...
		* Connected to 127.0.0.1 (127.0.0.1) port 9998 (#0)
		> PUT /tika HTTP/1.1
		> Host: 127.0.0.1:9998
		> User-Agent: curl/7.82.0
		> Accept: text/plain
		> Content-Length: 69451
		> Expect: 100-continue
		>
		* Mark bundle as not supporting multiuse
		< HTTP/1.1 100 Continue
		* We are completely uploaded and fine
		* Mark bundle as not supporting multiuse
		< HTTP/1.1 200 OK
		< Date: Wed, 20 Jul 2022 20:27:25 GMT
		< Content-Type: text/plain
		< Transfer-Encoding: chunked
		< Server: Jetty(9.4.48.v20220622)
		<

		Welcome to Smallpdf

		Digital Documents—All In One Place

		Access Files Anytime, Anywhere

		Enhance Documents in One Click

		Collaborate With Others

		With the new Smallpdf experience, you can
		freely upload, organize, and share digital
		documents. When you enable the ‘Storage’
		option, we’ll also store all processed files here.

		You can access files stored on Smallpdf from
		your computer, phone, or tablet. We’ll also
		sync files from the Smallpdf Mobile App to our
		online portal

		When you right-click on a file, we’ll present
		you with an array of options to convert,
		compress, or modify it.

		Forget mundane administrative tasks. With
		Smallpdf, you can request e-signatures, send
		large files, or even enable the Smallpdf G Suite
		App for your entire organization.

		Ready to take document management to the next level?

		https://bit.ly/smallpdf-preferences-en
		https://bit.ly/smallpdf-preferences-en
		https://bit.ly/smallpdf-download-en
		https://bit.ly/smallpdf-chrome-extension
		https://bit.ly/smallpdf-chrome-extension

		* Connection #0 to host 127.0.0.1 left intact

it requests & returns text, no error.

and withOUT,

	curl -v  -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika

		*   Trying 127.0.0.1:9998...
		* Connected to 127.0.0.1 (127.0.0.1) port 9998 (#0)
		> PUT /tika HTTP/1.1
		> Host: 127.0.0.1:9998
		> User-Agent: curl/7.82.0
		> Accept: */*
		> Content-Length: 69451
		> Expect: 100-continue
		>
		* Mark bundle as not supporting multiuse
		< HTTP/1.1 100 Continue
		* We are completely uploaded and fine
		* Mark bundle as not supporting multiuse
		< HTTP/1.1 200 OK
		< Date: Wed, 20 Jul 2022 20:28:56 GMT
		< Content-Type: text/xml
		< Transfer-Encoding: chunked
		< Server: Jetty(9.4.48.v20220622)
		<
		<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
		
		    <head>
		
		        <meta name="pdf:PDFVersion" content="1.7"/>
		
		        <meta name="xmp:CreatorTool" content="Adobe InDesign 15.1 (Macintosh)"/>
		
		        <meta name="pdf:hasXFA" content="false"/>
		
		        <meta name="access_permission:modify_annotations" content="true"/>
		
		        <meta name="access_permission:can_print_degraded" content="true"/>
		
		        <meta name="dcterms:created" content="2020-10-14T15:08:10Z"/>
		
		        <meta name="dcterms:modified" content="2020-10-14T15:08:10Z"/>
		
		        <meta name="dc:format" content="application/pdf; version=1.7"/>
		
		        <meta name="xmpMM:DocumentID" content="xmp.id:7a865d84-8dbf-4015-96b7-fdae89a9603b"/>
		
		        <meta name="pdf:docinfo:creator_tool" content="Adobe InDesign 15.1 (Macintosh)"/>
		
		        <meta name="access_permission:fill_in_form" content="true"/>
		
		        <meta name="pdf:docinfo:modified" content="2020-10-14T15:08:10Z"/>
		
		        <meta name="pdf:hasCollection" content="false"/>
		
		        <meta name="pdf:encrypted" content="false"/>
		
		        <meta name="xmp:CreateDate" content="2020-10-14T17:08:10Z"/>
		
		        <meta name="Content-Length" content="69451"/>
		
		        <meta name="pdf:hasMarkedContent" content="false"/>
		
		        <meta name="Content-Type" content="application/pdf"/>
		
		        <meta name="xmp:ModifyDate" content="2020-10-14T17:08:10Z"/>
		
		        <meta name="xmp:MetadataDate" content="2020-10-14T17:08:10Z"/>
		
		        <meta name="dc:language" content="en-US"/>
		
		        <meta name="pdf:producer" content="Adobe PDF Library 15.0"/>
		
		        <meta name="X-TIKA:digest:SHA256" content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>
		
		        <meta name="access_permission:extract_for_accessibility" content="true"/>
		
		        <meta name="access_permission:assemble_document" content="true"/>
		
		        <meta name="xmpTPg:NPages" content="1"/>
		
		        <meta name="pdf:hasXMP" content="true"/>
		
		        <meta name="access_permission:extract_content" content="true"/>
		
		        <meta name="xmpMM:DerivedFrom:DocumentID" content="xmp.did:b47e2f57-0029-45c5-8e1d-97f7c1535615"/>
		
		        <meta name="access_permission:can_print" content="true"/>
		
		        <meta name="pdf:docinfo:trapped" content="False"/>
		
		        <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
		
		        <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
		
		        <meta name="xmpMM:DerivedFrom:InstanceID" content="xmp.iid:20710a9c-3691-41fa-bd81-adf858100386"/>
		
		        <meta name="access_permission:can_modify" content="true"/>
		
		        <meta name="pdf:docinfo:producer" content="Adobe PDF Library 15.0"/>
		
		        <meta name="pdf:docinfo:created" content="2020-10-14T15:08:10Z"/>
		
		        <title>&#0;</title>
		
		    </head>
		
		    <body>
		        <div class="page">
		            <p/>
		
		            <p>Welcome to Smallpdf
		</p>
		
		            <p>Digital Documents—All In One Place
		</p>
		
		            <p>Access Files Anytime, Anywhere
		</p>
		
		            <p>Enhance Documents in One Click
		</p>
		
		            <p>Collaborate With Others
		</p>
		
		            <p>With the new Smallpdf experience, you can
		freely upload, organize, and share digital
		documents. When you enable the ‘Storage’
		option, we’ll also store all processed files here.
		</p>

		            <p>You can access files stored on Smallpdf from
		your computer, phone, or tablet. We’ll also
		sync files from the Smallpdf Mobile App to our
		online portal
		</p>

		            <p>When you right-click on a file, we’ll present
		you with an array of options to convert,
		compress, or modify it.
		</p>

		            <p>Forget mundane administrative tasks. With
		Smallpdf, you can request e-signatures, send
		large files, or even enable the Smallpdf G Suite
		App for your entire organization.
		</p>

		            <p>Ready to take document management to the next level? </p>

		            <p/>

		            <div class="annotation">
		                <a href="https://bit.ly/smallpdf-preferences-en">https://bit.ly/smallpdf-preferences-en</a>
		            </div>

		            <div class="annotation">
		                <a href="https://bit.ly/smallpdf-preferences-en">https://bit.ly/smallpdf-preferences-en</a>
		            </div>

		            <div class="annotation">
		                <a href="https://bit.ly/smallpdf-download-en">https://bit.ly/smallpdf-download-en</a>
		            </div>

		            <div class="annotation">
		                <a href="https://bit.ly/smallpdf-chrome-extension">https://bit.ly/smallpdf-chrome-extension</a>
		            </div>

		            <div class="annotation">
		                <a href="https://bit.ly/smallpdf-chrome-extension">https://bit.ly/smallpdf-chrome-extension</a>
		            </div>

		        </div>

		    </body>
		</html>
		* Connection #0 to host 127.0.0.1 left intact

, requests '*/*' and returns 'text/xml'

just to check, if I use your at-the-end header arg placement

	curl -v -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika --header "Accept: text/plain"

i again see no error,

	*   Trying 127.0.0.1:9998...
	* Connected to 127.0.0.1 (127.0.0.1) port 9998 (#0)
	> PUT /tika HTTP/1.1
	> Host: 127.0.0.1:9998
	> User-Agent: curl/7.82.0
	> Accept: text/plain
	> Content-Length: 69451
	> Expect: 100-continue
	>
	* Mark bundle as not supporting multiuse
	< HTTP/1.1 100 Continue
	* We are completely uploaded and fine
	* Mark bundle as not supporting multiuse
	< HTTP/1.1 200 OK
	< Date: Wed, 20 Jul 2022 20:32:00 GMT
	< Content-Type: text/plain
	< Transfer-Encoding: chunked
	< Server: Jetty(9.4.48.v20220622)
	<

	Welcome to Smallpdf

	Digital Documents—All In One Place

	Access Files Anytime, Anywhere

	Enhance Documents in One Click

	Collaborate With Others

	With the new Smallpdf experience, you can
	freely upload, organize, and share digital
	documents. When you enable the ‘Storage’
	option, we’ll also store all processed files here.

	You can access files stored on Smallpdf from
	your computer, phone, or tablet. We’ll also
	sync files from the Smallpdf Mobile App to our
	online portal

	When you right-click on a file, we’ll present
	you with an array of options to convert,
	compress, or modify it.

	Forget mundane administrative tasks. With
	Smallpdf, you can request e-signatures, send
	large files, or even enable the Smallpdf G Suite
	App for your entire organization.

	Ready to take document management to the next level?

	https://bit.ly/smallpdf-preferences-en
	https://bit.ly/smallpdf-preferences-en
	https://bit.ly/smallpdf-download-en
	https://bit.ly/smallpdf-chrome-extension
	https://bit.ly/smallpdf-chrome-extension

	* Connection #0 to host 127.0.0.1 left intact


this is with

	curl -V
		curl 7.82.0 (x86_64-redhat-linux-gnu) libcurl/7.82.0 OpenSSL/3.0.5 zlib/1.2.11 brotli/1.0.9 libidn2/2.3.3 libpsl/0.21.1 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.46.0 OpenLDAP/2.6.2
		Release-Date: 2022-03-05
		Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
		Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets

and

	tika-server-standard-2.4.2-20220720.025305-98.jar




Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/20/22 4:10 PM, PGNet Dev wrote:
> what _should_ it be for programmatic submission (e.g. via dovecot fts-tika) to tika?  text or html?

it *appears* to me that the flow is

   email+attachment -> dovecot/fts-tika ------[ pdf attachment ]-----> tika-backend -----[ text result ]-----> dovecot/fts-flatcurve

where flatcurve is the indexer.  it's expecting data in text format.

from the curl results, seems that for tika-backend to return the text/plain result, it needs the "Accept: text/plain"

so, not unexpectedly  "Accept: text/plain" is passed in the PUT
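
to reproduce outside of dovecot what fts-tika appears to send, curl can be given the same headers seen in the tika debug log. a minimal sketch, assuming curl switches to a chunked upload when handed an explicit Transfer-Encoding header; the function name and the DRY_RUN switch are made up for illustration, not anything dovecot or tika provide:

```shell
# Print (DRY_RUN=1) or run a curl PUT that mimics the request headers
# logged for dovecot's fts-tika. The helper name is hypothetical.
tika_put_like_dovecot() {
    file=$1
    url=${2:-http://127.0.0.1:9998/tika/}
    # Build the curl command line as the positional parameters.
    set -- curl -sS -T "$file" \
        -H "Content-Type: application/pdf" \
        -H "Content-Disposition: attachment; filename=\"$(basename "$file")\"" \
        -H "Accept: text/plain" \
        -H "Transfer-Encoding: chunked" \
        "$url"
    if [ -n "${DRY_RUN:-}" ]; then
        # Dry run: show one argument per line instead of contacting the server.
        printf '%s\n' "$@"
    else
        "$@"
    fi
}
```

with DRY_RUN=1 it only prints the curl arguments, one per line; without it, the PUT goes to the given URL (default http://127.0.0.1:9998/tika/).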

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/20/22 2:13 PM, Tilman Hausherr wrote:
> I noticed you have "Accept: text/plain"
> 
> When I try this:
> 
> curl -T Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Accept: text/plain"
> 
> I get
> 
> Caused by: java.util.NoSuchElementException: No value present
>          at java.util.OptionalInt.getAsInt(OptionalInt.java:130) ~[?:?]
>          at org.apache.tika.server.core.ProduceTypeResourceComparator.compareProduceTypes(ProduceTypeResourceComparator.java:136) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>          at org.apache.tika.server.core.ProduceTypeResourceComparator.compare(ProduceTypeResourceComparator.java:97) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>          at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:69) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>          at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:31) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
>          at java.util.TreeMap.put(TreeMap.java:795) ~[?:?]
>          at java.util.TreeMap.put(TreeMap.java:534) ~[?:?]
>          at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:551) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
> 
> without the header, I get the html output.

interesting catch. what _should_ it be for programmatic submission (e.g. via dovecot fts-tika) to tika?  text or html?

it's reported here in the tika logs I posted, earliest at

	...
	DEBUG [qtp485047320-28] 11:01:15,794 org.eclipse.jetty.server.HttpChannel REQUEST for //127.0.0.1:9998/tika/ on HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=1}
	PUT //127.0.0.1:9998/tika/ HTTP/1.1
	Host: 127.0.0.1:9998
	Date: Wed, 20 Jul 2022 15:01:15 GMT
	Transfer-Encoding: chunked
	Connection: keep-alive
	Content-Type: application/pdf
	Content-Disposition: attachment; filename="Get_Started_With_Smallpdf.pdf"
!!	Accept: text/plain
	...


which appears to be the PUT, I assume, pushed by the dovecot-end of the handshake.

checking dovecot source, it hails from here,

	https://github.com/dovecot/core/blob/main/src/plugins/fts/fts-parser-tika.c#L170

		if (parser_context->content_disposition != NULL)
				http_client_request_add_header(http_req, "Content-Disposition",
							       parser_context->content_disposition);
!!	170		http_client_request_add_header(http_req, "Accept", "text/plain");

			parser->http_req = http_req;
			return &parser->parser;
		}

The '"Accept", "text/plain"' has been there awhile; e.g., quick-checking old release source for v2.3.8, from Oct 8, 2019,

	https://github.com/dovecot/core/blob/release-2.3.8/src/plugins/fts/fts-parser-tika.c#L163



Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
I noticed you have "Accept: text/plain"

When I try this:

curl -T Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Accept: text/plain"

I get

Caused by: java.util.NoSuchElementException: No value present
         at java.util.OptionalInt.getAsInt(OptionalInt.java:130) ~[?:?]
         at org.apache.tika.server.core.ProduceTypeResourceComparator.compareProduceTypes(ProduceTypeResourceComparator.java:136) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
         at org.apache.tika.server.core.ProduceTypeResourceComparator.compare(ProduceTypeResourceComparator.java:97) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
         at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:69) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
         at org.apache.cxf.jaxrs.model.OperationResourceInfoComparator.compare(OperationResourceInfoComparator.java:31) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
         at java.util.TreeMap.put(TreeMap.java:795) ~[?:?]
         at java.util.TreeMap.put(TreeMap.java:534) ~[?:?]
         at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:551) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]

without the header, I get the html output.

Tilman


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
hi,

On 7/20/22 10:07 AM, Tim Allison wrote:
> Sorry...just catching up on this.  If you want the digest of the incoming bytes and you can configure tika-server via a config file, try this as the config (e.g. tika-config-digest.xml)
>
> <properties>
>    <server>
>      <params>
>        <digest>sha256</digest>
>      </params>
>    </server>
> </properties>
>
> then start the server: java -jar tika-server-standard-xyz.jar -c tika-config-digest.xml
>
> Then send the file: curl -T ~/Downloads/Get_Started_With_Smallpdf.pdf http://localhost:9998/tika

i'm already normally launching tika service as,

	cat  /etc/systemd/system/tika.service
		[Unit]
		Description=Apache Tika server
		After=network-online.target
		Requires=network-online.target

		[Service]
		SyslogIdentifier=tika
		User=tika
		Group=tika
		ExecStart=/usr/bin/java \
		 -jar /srv/tika/tika-server.jar \
!!		 -c /etc/tika/tika-server-config-custom.xml

		[Install]
		WantedBy=multi-user.target

where

	cat /etc/tika/tika-server-config-custom.xml
		<?xml version="1.0" encoding="UTF-8"?>
		<properties>
		  <server>
		    <params>
		      <logLevel>debug</logLevel>
		      <port>9998</port>
		      <host>127.0.0.1</host>
		      <javaPath>/usr/bin/java</javaPath>
		      <noFork>false</noFork>
		      <forkedJvmArgs>
		        <arg>-Xms1g</arg>
		        <arg>-Xmx1g</arg>
		        <arg>-Dpdfbox.fontcache=/var/tika</arg>
		        <arg>-Dlog4j2.debug</arg>
		      </forkedJvmArgs>

!!		      <digest>sha256</digest>
		      <enableUnsecureFeatures>false</enableUnsecureFeatures>
		      <id></id>
		      <maxFiles>100000</maxFiles>
		      <maxForkedStartupMillis>120000</maxForkedStartupMillis>
		      <maxRestarts>-1</maxRestarts>
		      <minimumTimeoutMillis>30000</minimumTimeoutMillis>
		      <returnStackTrace>false</returnStackTrace>
		      <taskPulseMillis>10000</taskPulseMillis>
		      <taskTimeoutMillis>300000</taskTimeoutMillis>

		      <endpoints>
		        <endpoint>tika</endpoint>
		        <endpoint>status</endpoint>
		        <endpoint>rmeta</endpoint>
		      </endpoints>

		    </params>
		  </server>
		</properties>

DL'ing the _latest_ build

	F="tika-server-standard-2.4.2-20220720.025305-98.jar"
	D="/srv/tika"
	cd ${D}
	rm -rf TMP
	mkdir -p TMP/mod
	cd TMP
	rm -f ${F}*
	wget https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server-standard/2.4.2-SNAPSHOT/${F}
	cd mod

extract

	jar -xfv ../${F}

mod logging

	perl -pi -e 's|Root level="info"|Root level="debug"|g' log4j2.xml

repack

	jar -cvmf META-INF/MANIFEST.MF ../mod.jar *

my usual target symlink

	cd ${D}
	ln -sf TMP/mod.jar tika-server.jar

stop tika service, if any

	systemctl stop tika
	systemctl disable tika
	systemctl status tika  -ln0
		○ tika.service - Apache Tika server
		     Loaded: loaded (/etc/systemd/system/tika.service; disabled; vendor preset: disabled)
		     Active: inactive (dead)
	ps ax | grep tika
		(empty)

start manually

	/usr/bin/java \
	 -jar /srv/tika/tika-server.jar \
	 -c /etc/tika/tika-server-config-custom.xml

		...
		INFO  [main] 10:49:37,925 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server  at http://127.0.0.1:9998/

, console persists here for this active process

	ps ax | grep tika
		29181 pts/0    Sl+    0:07 /usr/bin/java -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
		29202 pts/0    Sl+    0:16 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i  -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-9024552766199524298 -numRestarts 0


exec in other shell window

	curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika

@ console for the *curl* command, I see

	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
	    <head>
	        <meta name="pdf:PDFVersion" content="1.7"/>
	        ...
	        <meta name="X-TIKA:digest:SHA256" content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>

but nothing seemingly relevant/informative in the java/tika console session; lots of DEBUG etc, but no sha256sum info

in any case, for this scenario, checking original

	sha256sum ~/Get_Started_With_Smallpdf.pdf
		91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2  /root/Get_Started_With_Smallpdf.pdf

it's a match.
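
the same check can be scripted instead of eyeballed. a rough sketch, assuming tika's XHTML output carries the digest in the meta tag shown above; both helper names are made up here:

```shell
# Pull the SHA256 digest out of Tika's XHTML output (read from stdin).
tika_sha256_from_xhtml() {
    sed -n 's/.*<meta name="X-TIKA:digest:SHA256" content="\([0-9a-fA-F]*\)".*/\1/p' | head -n 1
}

# Succeed only if the local file's sha256sum equals the digest Tika reported.
digest_matches() {
    file=$1
    reported=$2
    local_sum=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$local_sum" = "$reported" ]
}

# Usage (needs a running tika-server with <digest>sha256</digest> configured):
#   f=~/Get_Started_With_Smallpdf.pdf
#   d=$(curl -s -T "$f" http://127.0.0.1:9998/tika | tika_sha256_from_xhtml)
#   digest_matches "$f" "$d" && echo match || echo MISMATCH
```

this only covers the direct curl path; it says nothing about what arrives via dovecot.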

but that's NOT testing the fail scenario.

THAT scenario is email send/receive -> dovecot -> dovecot fts-tika plugin -> tika-server.

config'ing dovecot to use fts-tika scanning

	fts_tika = http://127.0.0.1:9998/tika/

& generate verbose debug logs

	mail_debug = yes

when I exec that send/receive -- from, e.g., an external gmail account to my server

I see the attachment handoff.  1st, sent from dovecot fts-tika

	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: queue http://127.0.0.1:9998: Connection to peer 127.0.0.1:9998 claimed request [Req1: PUT http://127.0.0.1:9998/tika/]
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Claimed request [Req1: PUT http://127.0.0.1:9998/tika/]
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Sent header
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 5562, buffered=5570)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: peer 127.0.0.1:9998: No more requests to service for this peer (1 connections exist, 0 pending)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 3409, buffered=3416)
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Finished sending payload
	2022-07-20 11:07:02 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish

	==> /var/log/dovecot/dovecot-info.log <==
	2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Info: sieve: msgid=<8e...@fastmail.fm>: stored mail into mailbox 'INBOX'

	==> /var/log/dovecot/dovecot-debug.log <==
	2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<8e...@fastmail.fm>: Finish implicit keep action
	2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<8e...@fastmail.fm>: Finishing actions
	2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<8e...@fastmail.fm>: Finished executing result (final, status=ok, keep=yes)
	2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: multi-script: Sequence finished (status=ok, keep=yes)
	2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: multi-script: Destroy
	2022-07-20 11:07:02 lmtp(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw>: Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt user01@example.com: duplicate db: Cleanup
	2022-07-20 11:07:02 lmtp(39463): Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt user01@example.com: User session is finished
	2022-07-20 11:07:02 lmtp(39463): Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt user01@example.com: dict(file): dict destroyed

	==> /var/log/dovecot/dovecot-info.log <==
	2022-07-20 11:07:02 lmtp(39463): Info: Disconnect from local: Logged out (state=READY)

	==> /var/log/dovecot/dovecot-debug.log <==
	2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Got 200 response for request [Req1: PUT http://127.0.0.1:9998/tika/]: OK (took 3327 ms + 217 ms in queue)
	2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Response payload stream destroyed (20 ms after initial response)
	2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Finished
	2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: queue http://127.0.0.1:9998: Dropping request [Req1: PUT http://127.0.0.1:9998/tika/]
	2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: host 127.0.0.1: Host is idle (timeout = 100 msecs)
	2022-07-20 11:07:06 indexer-worker(user01@example.com)<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Free (requests left=1)

at this point, dovecot's 'done' with the attachment as far as tika is involved, and it's 'in' tika-backend's control. dovecot DOES of course continue to process, and ultimately deliver, the email+attachment to my inbox, where, as reported earlier, I can verify that the RECEIVED attachment is identical in size/sha256sum to the original.

i do see the handoff to tika-backend,

	...
	DEBUG [qtp485047320-28] 11:01:15,794 org.eclipse.jetty.server.HttpChannel REQUEST for //127.0.0.1:9998/tika/ on HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=1}
	PUT //127.0.0.1:9998/tika/ HTTP/1.1
	Host: 127.0.0.1:9998
	Date: Wed, 20 Jul 2022 15:01:15 GMT
	Transfer-Encoding: chunked
	Connection: keep-alive
	Content-Type: application/pdf
	Content-Disposition: attachment; filename="Get_Started_With_Smallpdf.pdf"
	Accept: text/plain


	DEBUG [qtp485047320-28] 11:01:15,799 org.eclipse.jetty.server.HttpConnection HttpConnection@7d858986::SocketChannelEndPoint@7f055fae{l=/127.0.0.1:9998,r=/127.0.0.1:59150,OPEN,fill=-,flush=-,to=43/200000}{io=0/0,kio=0,kro=1}->HttpConnection@7d858986[p=HttpParser{s=CHUNKED_CONTENT,0 of -1},g=HttpGenerator@127a4f1e{s=START}]=>HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=6} parsed true HttpParser{s=CHUNKED_CONTENT,0 of -1}
	...
	TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
	DEBUG [qtp485047320-31] 11:07:03,442 org.apache.cxf.transport.http.Headers Request Headers: {Accept=[text/plain], Authorization=[***], connection=[keep-alive], Content-Disposition=[attachment; filename="Get_Started_With_Smallpdf.pdf"], content-type=[application/pdf], Date=[Wed, 20 Jul 2022 15:07:02 GMT], Host=[127.0.0.1:9998], Proxy-Authorization=[***], transfer-encoding=[chunked]}
	TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
	...

but no trace, that I can find in any log, of sha256sum generated by tika, as in the curl case above.

THAT is the necessary bit here -- getting at, and confirming, the size/sha256sum of what Tika has received -- from dovecot's fts-tika handoff.

how/where to get tika to spit out THAT info?
either as loggable/logged response to dovecot's http-client connection, on successful handoff,
in its own logs,
or, just trapping the file and checking manually?
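
for that last option, one crude approach would be to poll the temp dir and copy tika's spool file before it gets deleted. a sketch only, assuming the spool files are named apache-tika-*.tmp under /tmp, as in the PDFParser DEBUG line; it is racy by design, and the function name and arguments are invented for illustration:

```shell
# Copy any apache-tika-*.tmp files from src_dir to dest_dir, polling `rounds` times.
# Racy: a file can appear and vanish between two polls, or be copied half-written.
trap_tika_tmp() {
    src_dir=$1
    dest_dir=$2
    rounds=${3:-600}
    mkdir -p "$dest_dir"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for f in "$src_dir"/apache-tika-*.tmp; do
            # An unmatched glob stays literal, so check the file really exists.
            [ -f "$f" ] && cp -p "$f" "$dest_dir"/ 2>/dev/null
        done
        i=$((i + 1))
        sleep 0.1
    done
}

# Usage: start polling in the background, then send the test email:
#   trap_tika_tmp /tmp /root/tika-trap 600 &
#   ... send email+attachment ...
#   sha256sum /root/tika-trap/*
```

if anything gets caught, its size/sha256sum can then be compared to the original attachment.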


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tim Allison <ta...@apache.org>.
Sorry...just catching up on this.  If you want the digest of the incoming
bytes and you can configure tika-server via a config file, try this as the
config (e.g. tika-config-digest.xml)

<properties>
  <server>
    <params>
      <digest>sha256</digest>
    </params>
  </server>
</properties>

then start the server: java -jar tika-server-standard-xyz.jar -c tika-config-digest.xml

Then send the file: curl -T ~/Downloads/Get_Started_With_Smallpdf.pdf http://localhost:9998/tika

This should be in the output: <meta name="X-TIKA:digest:SHA256" content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>

I confirmed this value with shasum -a 256.



On Tue, Jul 19, 2022 at 1:11 PM PGNet Dev <pg...@gmail.com> wrote:

> On 7/19/22 12:24 PM, Tilman Hausherr wrote:
> > The checkstyle violation is about the coding style. You can delete that
> part in the tika-parent/pom.xml if you want, or add <skip>true</skip> below
> "<configuration>" in that plugin. Same for the ossindex-maven-plugin and
> the forbiddenapis plugin.
>
> > If the debugger didn't stop, then the breakpoint was at the wrong place.
> Or it's not possible to debug.
>
> I'll give the pom mod a try in a bit.
>
> As to which breakpoint, I certainly don't know the tika/java internals
> well enough to say what is/isn't correct, yet.
>
> > Re "is there anything informative in that now-more-verbose DEBUG output?
> " well yes, the MD5 output. This proves that the file is different. (ok,
> the different length showed that too)
>
> I've asked over at Dovecot ML what, specifically, dovecot 'sends' to the
> tika backend via their fts-tika plugin:
>
>    the original/complete/unmodified attachment, suggesting that the file
> size / MD5 hash should be the same as what tika's trapping
>
> or,
>
>    some modification to the file is made (trimmed, or add'l headers, etc
> etc), and that the size/hash are not _expected_ to be the same
>
> we'll see what i hear
>
>

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/19/22 12:24 PM, Tilman Hausherr wrote:
> The checkstyle violation is about the coding style. You can delete that part in the tika-parent/pom.xml if you want, or add <skip>true</skip> below "<configuration>" in that plugin. Same for the ossindex-maven-plugin and the forbiddenapis plugin.

> If the debugger didn't stop, then the breakpoint was at the wrong place. Or it's not possible to debug.

I'll give the pom mod a try in a bit.

As to which breakpoint, I certainly don't know the tika/java internals well enough to say what is/isn't correct, yet.

> Re "is there anything informative in that now-more-verbose DEBUG output? " well yes, the MD5 output. This proves that the file is different. (ok, the different length showed that too)

I've asked over at Dovecot ML what, specifically, dovecot 'sends' to the tika backend via their fts-tika plugin:

   the original/complete/unmodified attachment, suggesting that the file size / MD5 hash should be the same as what tika's trapping

or,

   some modification to the file is made (trimmed, or add'l headers, etc etc), and that the size/hash are not _expected_ to be the same

we'll see what i hear
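
in the meantime, the length/md5 pair from tika's PDFParser DEBUG line can be checked against the original file mechanically. a small sketch, with a made-up helper name, assuming md5sum and wc are available:

```shell
# Compare a local file with the "length: N, md5: H" values from Tika's DEBUG line.
# Prints each field that differs; returns 0 only if both match.
compare_with_tika_debug() {
    file=$1
    logged_len=$2
    logged_md5=$3
    actual_len=$(wc -c < "$file" | tr -d ' ')
    actual_md5=$(md5sum "$file" | cut -d' ' -f1)
    status=0
    if [ "$actual_len" != "$logged_len" ]; then
        echo "length differs: local=$actual_len tika=$logged_len"
        status=1
    fi
    if [ "$actual_md5" != "$logged_md5" ]; then
        echo "md5 differs: local=$actual_md5 tika=$logged_md5"
        status=1
    fi
    return "$status"
}

# Usage, with the values from the DEBUG line reported in this thread:
#   compare_with_tika_debug ~/Get_Started_With_Smallpdf.pdf 104932 092bf24b2cac33fac27965549c99613a
```

a mismatch here only shows THAT the bytes differ, not where or why.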


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
The checkstyle violation is about the coding style. You can delete that 
part in the tika-parent/pom.xml if you want, or add <skip>true</skip> 
below "<configuration>" in that plugin. Same for the 
ossindex-maven-plugin and the forbiddenapis plugin.

If the debugger didn't stop, then the breakpoint was at the wrong place. 
Or it's not possible to debug.

Re "is there anything informative in that now-more-verbose DEBUG output?" 
well yes, the MD5 output. This proves that the file is different. (ok, 
the different length showed that too)

Tilman


Am 19.07.2022 um 11:37 schrieb PGNet Dev:
> On 7/18/22 11:05 PM, Tilman Hausherr wrote:
>> Yes the file is deleted...
>
>>
>> Alternatively, grab the source code from the trunk, and add this line 
>> in the file
>> tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java 
>>
>>
>> Files.write(Paths.get("/tmp/yourfile.pdf"), 
>> Files.readAllBytes(tstream.getPath()));
>>
>> after the line that has ", md5: ".
>>
>> Then build the parser module, and then the standard server subproject 
>> with "mvn -DskipTests install".
>
> 1st, attempting the build, FAILs
>
>     cd src/tika
>     EDIT 
> tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
>
>             ...
>     168       if (LOG.isDebugEnabled() && tstream != null) {
>                 LOG.debug("File: " + tstream.getPath() + ", length: " 
> + tstream.getLength() +
>                         ", md5: " + calcMD5(tstream.getPath()));
>         +        Files.write(Paths.get("/tmp/yourfile.pdf"), 
> Files.readAllBytes(tstream.getPath()));
>             }
>             ...
>
>
>     mvn install -pl tika-parsers -am
>     mvn -DskipTests install
>         ...
>         [INFO] BUILD FAILURE
>         [INFO] 
> ------------------------------------------------------------------------
>         [INFO] Total time:  31.493 s
>         [INFO] Finished at: 2022-07-19T04:48:43-04:00
>         [INFO] 
> ------------------------------------------------------------------------
>         [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-checkstyle-plugin:3.1.2:check 
> (validate) on project tika-parser-pdf-module: You have 1 Checkstyle 
> violation. -> [Help 1]
>
>
>> try setting a breakpoint in org.apache.tika.parser.pdf.PDFParser so 
>> that you get that file.
>
> next, run in debugger instead,
>
>     sudo -u tika /usr/bin/jdb \
>      -classpath /srv/tika/tika-server.jar \
>      org.apache.tika.server.core.TikaServerCli \
>      -c /etc/tika/tika-server-config-custom.xml
>
>         Initializing jdb ...
>
> set breakpoint
>
>     > stop in org.apache.tika.parser.pdf.PDFParser
>     Deferring breakpoint org.apache.tika.parser.pdf.PDFParser.
>     It will be set after the class is loaded.
>
> run it
>
>     > run
>     run org.apache.tika.server.core.TikaServerCli -c 
> /etc/tika/tika-server-config-custom.xml
>     Set uncaught java.lang.Throwable
>     Set deferred uncaught java.lang.Throwable
>     >
>     VM Started: DEBUG [pool-2-thread-1] 05:21:37,469 
> org.apache.tika.server.core.TikaServerWatchDog forked process 
> commandline: [/usr/bin/java, -Xms1g, -Xmx1g, 
> -Dpdfbox.fontcache=/var/tika, -Dlog4j2.debug, 
> -Djava.awt.headless=true, -cp, /srv/tika/tika-server.jar, 
> -Dtika.server.id=, org.apache.tika.server.core.TikaServerProcess, -h, 
> 127.0.0.1, -p, 9998, -i, , -c, 
> /etc/tika/tika-server-config-custom.xml, -forkedStatusFile, 
> /tmp/apache-tika-server-forked-tmp-11335114907490900739, -numRestarts, 0]
>     ...
>     DEBUG [main] 05:21:50,871 org.apache.cxf.endpoint.ServerImpl 
> register the server to serverRegistry
>     TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor 
> class org.apache.tika.server.core.ServerStatusWatcher
>     INFO  [main] 05:21:50,906 
> org.apache.tika.server.core.TikaServerProcess Started Apache Tika 
> server  at http://127.0.0.1:9998/
>
> receive email+attachment
>
> *lots* of debug logs @ jdb console,
>
>     -> https://pastebin.com/HDtR9RKP
>
> NOTE, there,
>
>     ...
>     DEBUG [qtp485047320-31] 05:22:58,423 
> org.apache.tika.parser.pdf.PDFParser File: 
> /tmp/apache-tika-11251774738482156793.tmp, length: 104932, md5: 
> 092bf24b2cac33fac27965549c99613a
>     ...
>
> but, no file captured
>
>     ls -al /tmp/apache-tika*tmp
>         ls: cannot access '/tmp/apache-tika*tmp': No such file or 
> directory
>
> is there anything informative in that now-more-verbose DEBUG output?
>
>
>


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/18/22 11:05 PM, Tilman Hausherr wrote:
> Yes the file is deleted...

> 
> Alternatively, grab the source code from the trunk, and add this line in the file
> tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java
> 
> Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));
> 
> after the line that has ", md5: ".
> 
> Then build the parser module, and then the standard server subproject with "mvn -DskipTests install".

1st, attempting the build, FAILs

	cd src/tika
	EDIT tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

			...
	168	   if (LOG.isDebugEnabled() && tstream != null) {
				LOG.debug("File: " + tstream.getPath() + ", length: " + tstream.getLength() +
						", md5: " + calcMD5(tstream.getPath()));
		+		Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));
			}
			...


	mvn install -pl tika-parsers -am
	mvn -DskipTests install
		...
		[INFO] BUILD FAILURE
		[INFO] ------------------------------------------------------------------------
		[INFO] Total time:  31.493 s
		[INFO] Finished at: 2022-07-19T04:48:43-04:00
		[INFO] ------------------------------------------------------------------------
		[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:3.1.2:check (validate) on project tika-parser-pdf-module: You have 1 Checkstyle violation. -> [Help 1]


> try setting a breakpoint in org.apache.tika.parser.pdf.PDFParser so that you get that file.

next, run in debugger instead,

	sudo -u tika /usr/bin/jdb \
	 -classpath /srv/tika/tika-server.jar \
	 org.apache.tika.server.core.TikaServerCli \
	 -c /etc/tika/tika-server-config-custom.xml

		Initializing jdb ...

set breakpoint

	> stop in org.apache.tika.parser.pdf.PDFParser
	Deferring breakpoint org.apache.tika.parser.pdf.PDFParser.
	It will be set after the class is loaded.

run it

	> run
	run org.apache.tika.server.core.TikaServerCli -c /etc/tika/tika-server-config-custom.xml
	Set uncaught java.lang.Throwable
	Set deferred uncaught java.lang.Throwable
	>
	VM Started: DEBUG [pool-2-thread-1] 05:21:37,469 org.apache.tika.server.core.TikaServerWatchDog forked process commandline: [/usr/bin/java, -Xms1g, -Xmx1g, -Dpdfbox.fontcache=/var/tika, -Dlog4j2.debug, -Djava.awt.headless=true, -cp, /srv/tika/tika-server.jar, -Dtika.server.id=, org.apache.tika.server.core.TikaServerProcess, -h, 127.0.0.1, -p, 9998, -i, , -c, /etc/tika/tika-server-config-custom.xml, -forkedStatusFile, /tmp/apache-tika-server-forked-tmp-11335114907490900739, -numRestarts, 0]
	...
	DEBUG [main] 05:21:50,871 org.apache.cxf.endpoint.ServerImpl register the server to serverRegistry
	TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.server.core.ServerStatusWatcher
	INFO  [main] 05:21:50,906 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server  at http://127.0.0.1:9998/

receive email+attachment

*lots* of debug logs @ jdb console,

	-> https://pastebin.com/HDtR9RKP

NOTE, there,

	...
	DEBUG [qtp485047320-31] 05:22:58,423 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-11251774738482156793.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a
	...

but, no file captured

	ls -al /tmp/apache-tika*tmp
		ls: cannot access '/tmp/apache-tika*tmp': No such file or directory

is there anything informative in that now-more-verbose DEBUG output?




Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Yes the file is deleted... try setting a breakpoint in 
org.apache.tika.parser.pdf.PDFParser so that you get that file.

Alternatively, grab the source code from the trunk, and add this line in 
the file
tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java

Files.write(Paths.get("/tmp/yourfile.pdf"), 
Files.readAllBytes(tstream.getPath()));

after the line that has ", md5: ".

Then build the parser module, and then the standard server subproject 
with "mvn -DskipTests install".

The file tika-server-standard-2.4.2-SNAPSHOT.jar will be in

tika-main\tika-server\tika-server-standard\target

I can also do it for you and upload the jar file somewhere, but 
obviously that's risky.

Tilman

Am 19.07.2022 um 03:53 schrieb PGNet Dev:
>
>>
>> I've just improved the output, I'm adding an MD5 checksum. This would 
>> be another indicator that something is wrong (or not).
>
> indeed.
>
> i now see in the logs
>
>     Jul 18 21:28:23 mx-test tika[18970]: DEBUG [qtp977522995-24] 
> 21:28:23,264 org.apache.tika.parser.pdf.PDFParser File: 
> /tmp/apache-tika-9115808773791090696.tmp, length: 104932, md5: 
> 092bf24b2cac33fac27965549c99613a
>
> checking the original attachment
>
>     ls -al Get_Started_With_Smallpdf.pdf
>         -rw-r--r-- 1 root root 68K Jul 15 12:16 
> Get_Started_With_Smallpdf.pdf
>
>     file Get_Started_With_Smallpdf.pdf
>         Get_Started_With_Smallpdf.pdf: PDF document, version 1.7
>
>     md5sum Get_Started_With_Smallpdf.pdf
>         14266e428c6a5f371c5abe164026c762 Get_Started_With_Smallpdf.pdf
>
> checking,
>
>     ls -al /tmp/apache-tika-9115808773791090696.tmp
>         ls: cannot access '/tmp/apache-tika-9115808773791090696.tmp': 
> No such file or directory
>
> is not persisted.
>
> in any case, the  /tmp file's NOT the same size as the orig pdf -- 
> oddly, LARGER than the original file.
> dunno what to make of that yet.
>
> fwiw, the received attachment is verified to be identical to the sent 
> original.



Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/18/22 11:56 AM, Tilman Hausherr wrote:
> Something doesn't work properly on your side, I get a lot of "DEBUG" lines. I opened tika-server-standard-2.4.2-SNAPSHOT.jar with 7zip, extracted it, changed it, and put it back. This is how it looks (comment removed):
> 
> <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
> <Configuration status="WARN">
>    <Appenders>
>      <Console name="Console" target="SYSTEM_ERR">
>        <PatternLayout pattern="%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n"/>
>      </Console>
>    </Appenders>
>    <Loggers>
>      <Root level="debug">
>        <AppenderRef ref="Console"/>
>      </Root>
>    </Loggers>
> </Configuration>

editing log4j2.xml directly in the jar and repacking works.  no idea why the other method (-Dlog4j.configuration=...) doesn't.

	D="/srv/tika"
	F="tika-server-standard-2.4.2-20220718.165252-94.jar"
	cd ${D}
	rm -rf TMP
	mkdir -p TMP/mod
	cd TMP
	rm -f ${F}*
	wget https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server-standard/2.4.2-SNAPSHOT/${F}
	cd mod
	jar -xfv ../${F}
	perl -pi -e 's|Root level="info"|Root level="debug"|g' log4j2.xml
	jar -cvmf META-INF/MANIFEST.MF ../mod.jar *
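
The extract/edit/repack steps above can also be done in place with the JDK's zip file system. A rough sketch, not a tested replacement for the commands above: `JarLogLevel` and `setRootDebug` are hypothetical names, the entry is assumed to sit at the jar root as `log4j2.xml` (as in the perl step), and the substitution is the same plain string replace. Best run against a copy of the jar.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class JarLogLevel {

    /**
     * Replaces Root level="info" with Root level="debug" in /log4j2.xml
     * inside the given jar. Returns true if the entry was changed.
     */
    static boolean setRootDebug(Path jar) throws IOException {
        // The cast picks the (Path, ClassLoader) overload on newer JDKs.
        try (FileSystem fs = FileSystems.newFileSystem(jar, (ClassLoader) null)) {
            Path cfg = fs.getPath("/log4j2.xml");
            String xml = new String(Files.readAllBytes(cfg), StandardCharsets.UTF_8);
            String patched = xml.replace("Root level=\"info\"", "Root level=\"debug\"");
            if (patched.equals(xml)) {
                return false; // nothing to change
            }
            Files.write(cfg, patched.getBytes(StandardCharsets.UTF_8));
            return true;
        } // closing the file system writes the updated jar back to disk
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JarLogLevel.java <path-to-jar>");
            return;
        }
        System.out.println(setRootDebug(Paths.get(args[0])) ? "patched" : "unchanged");
    }
}
```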

launch tika using 'mod.jar'

verify

	ls -al /srv/tika/tika-server.jar
		lrwxrwxrwx 1 root root 11 Jul 18 14:46 /srv/tika/tika-server.jar -> TMP/mod.jar

	systemctl status tika -ln0
		● tika.service - Apache Tika server
		     Loaded: loaded (/etc/systemd/system/tika.service; enabled; vendor preset: disabled)
		     Active: active (running) since Mon 2022-07-18 21:24:40 EDT; 18s ago
		   Main PID: 18935 (java)
		      Tasks: 54 (limit: 8811)
		     Memory: 174.0M
		        CPU: 24.491s
		     CGroup: /system.slice/tika.service
		             ├─ 18935 /usr/bin/java -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
		             └─ 18970 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-1104448251575803884 -numRestarts 0

re-send message with attachment ...

verbose/DEBUG logs

	journalctl -f -u dovecot

		->	https://pastebin.com/raw/sk5xevAM

> The output contains a line with "DEBUG" and "org.apache.tika.parser.pdf.PDFParser".
> 
> I've just improved the output, I'm adding an MD5 checksum. This would be another indicator that something is wrong (or not).

indeed.

i now see in the logs

	Jul 18 21:28:23 mx-test tika[18970]: DEBUG [qtp977522995-24] 21:28:23,264 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-9115808773791090696.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a

checking the original attachment

	ls -al Get_Started_With_Smallpdf.pdf
		-rw-r--r-- 1 root root 68K Jul 15 12:16 Get_Started_With_Smallpdf.pdf

	file Get_Started_With_Smallpdf.pdf
		Get_Started_With_Smallpdf.pdf: PDF document, version 1.7

	md5sum Get_Started_With_Smallpdf.pdf
		14266e428c6a5f371c5abe164026c762  Get_Started_With_Smallpdf.pdf

checking,

	ls -al /tmp/apache-tika-9115808773791090696.tmp
		ls: cannot access '/tmp/apache-tika-9115808773791090696.tmp': No such file or directory

the temp file is not persisted.

in any case, the /tmp file is NOT the same as the orig pdf: the logged length is 104932 bytes vs. 68K for the original (oddly, LARGER than the original file), and the md5 differs too.
dunno what to make of that yet.

fwiw, the received attachment is verified to be identical to the sent original.
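
To compare any candidate file against the `length: ..., md5: ...` pair in the DEBUG line, something along these lines would do. It is a minimal sketch using only the JDK; `FileFingerprint` and `md5Hex` are made-up names, and the output format just mimics the wording of the log line for easy eyeballing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileFingerprint {

    /** Returns the lowercase hex MD5 of the file's bytes, read in chunks. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java FileFingerprint.java <file>");
            return;
        }
        Path p = Paths.get(args[0]);
        // Same two values the DEBUG line reports for the temp file.
        System.out.println("length: " + Files.size(p) + ", md5: " + md5Hex(p));
    }
}
```

Running it on the original attachment and on a captured copy of the temp file would show directly whether the two differ in size, in content, or in both.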

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Something doesn't work properly on your side, I get a lot of "DEBUG" 
lines. I opened tika-server-standard-2.4.2-SNAPSHOT.jar with 7zip, 
extracted it, changed it, and put it back. This is how it looks (comment 
removed):

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Configuration status="WARN">
   <Appenders>
     <Console name="Console" target="SYSTEM_ERR">
       <PatternLayout pattern="%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n"/>
     </Console>
   </Appenders>
   <Loggers>
     <Root level="debug">
       <AppenderRef ref="Console"/>
     </Root>
   </Loggers>
</Configuration>

The output contains a line with "DEBUG" and 
"org.apache.tika.parser.pdf.PDFParser".

I've just improved the output, I'm adding an MD5 checksum. This would be 
another indicator that something is wrong (or not).

Tilman


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/17/22 11:10 PM, Tilman Hausherr wrote:
> Alternatively, make your own, and use it with
> -Dlog4j.configuration=file:./log4j.xml
> when starting the server.

per

	https://logging.apache.org/log4j/2.x/manual/configuration.html

create

	cat /etc/tika/log4j2.xml
		<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
		<Configuration status="WARN">
		  <Appenders>
		    <Console name="Console" target="SYSTEM_ERR">
		      <PatternLayout pattern="%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n"/>
		    </Console>
		  </Appenders>
		  <Loggers>
		    <Logger name="org.apache.pdfbox" level="debug">
		      <AppenderRef ref="Console"/>
		    </Logger>
		    <Root level="debug">
		      <AppenderRef ref="Console"/>
		    </Root>
		  </Loggers>
		</Configuration>

launch

	systemctl status tika -ln0
		● tika.service - Apache Tika server
		     Loaded: loaded (/etc/systemd/system/tika.service; enabled; vendor preset: disabled)
		     Active: active (running) since Mon 2022-07-18 07:42:34 EDT; 2min 54s ago
		   Main PID: 52278 (java)
		      Tasks: 54 (limit: 8811)
		     Memory: 205.8M
		        CPU: 27.392s
		     CGroup: /system.slice/tika.service
	!	             ├─ 52278 /usr/bin/java -Dlog4j.configuration=file:/etc/tika/log4j2.xml -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
		                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
		             └─ 52313 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-4132939610106343699 -numRestarts 0
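
One possible reason the -D approach shows no effect, offered as an assumption rather than a diagnosis: `log4j.configuration` is the Log4j 1.x property name, while Log4j 2 reads `log4j2.configurationFile` (older alias `log4j.configurationFile`). Also, in the status output above the option sits only on the parent process (PID 52278); the forked TikaServerProcess (PID 52313), which is the one emitting the parser log lines, does not carry it. A sketch of passing it through `<forkedJvmArgs>` instead, reusing the element names from the tika-server-config-custom.xml shown elsewhere in this thread; untested here:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <server>
    <params>
      <forkedJvmArgs>
        <!-- existing args stay as they are -->
        <arg>-Dlog4j2.configurationFile=file:/etc/tika/log4j2.xml</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>
```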


logs,

	journalctl -f
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:01 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.eclipse.jetty.util.log.Slf4jLog
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
		Jul 18 08:05:02 mx-test tika[53433]: INFO  [qtp1401737458-25] 08:05:02,442 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.io.TemporaryResources
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.renderer.pdf.pdfbox.PDFBoxRenderer
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:02 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: WARN  [qtp1401737458-25] 08:05:03,067 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 104319, length: 366, expected end position: 104685
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: ERROR [qtp1401737458-25] 08:05:03,134 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
		Jul 18 08:05:03 mx-test tika[53433]: WARN  [qtp1401737458-25] 08:05:03,431 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101704, length: 1475, expected end position: 103179
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.commons.logging.impl.SLF4JLogFactory
		Jul 18 08:05:03 mx-test tika[53433]: ERROR [qtp1401737458-25] 08:05:03,445 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
		Jul 18 08:05:03 mx-test tika[53433]: WARN  [qtp1401737458-25] 08:05:03,447 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101514, length: 66, expected end position: 101580
		Jul 18 08:05:03 mx-test tika[53433]: ERROR [qtp1401737458-25] 08:05:03,449 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
		Jul 18 08:05:03 mx-test tika[53433]: WARN  [qtp1401737458-25] 08:05:03,459 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 2011, length: 2482, expected end position: 4493
		Jul 18 08:05:03 mx-test tika[53433]: WARN  [qtp1401737458-25] 08:05:03,466 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (Get_Started_With_Smallpdf.pdf)
		Jul 18 08:05:03 mx-test tika[53433]: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@4f3e230b
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at java.lang.Thread.run(Thread.java:833) ~[?:?]
		Jul 18 08:05:03 mx-test tika[53433]: Caused by: java.io.IOException: Page tree root must be a dictionary
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:291) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:178) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 18 08:05:03 mx-test tika[53433]:         ... 37 more
		Jul 18 08:05:03 mx-test tika[53433]: ERROR [qtp1401737458-25] 08:05:03,587 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$344/0x0000000800eb2e78, ContentType: text/plain
		Jul 18 08:05:03 mx-test tika[53433]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
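
The COSParser warnings above each carry three numbers, and in every one of them start + length equals the expected end position (e.g. 104319 + 366 = 104685). So the declared stream lengths are at least self-consistent; what PDFBox reports is that the data at that offset isn't what it expects. A small sketch for pulling those numbers out of a journal for triage: `StreamWarning` and its methods are hypothetical helpers, and the regex is written against the message text seen in this thread only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamWarning {

    private static final Pattern FIELDS = Pattern.compile(
            "stream start position: (\\d+), length: (\\d+), expected end position: (\\d+)");

    final long start;
    final long length;
    final long expectedEnd;

    private StreamWarning(long start, long length, long expectedEnd) {
        this.start = start;
        this.length = length;
        this.expectedEnd = expectedEnd;
    }

    /** Extracts the three offsets from a log line, or empty if the line is not such a warning. */
    static Optional<StreamWarning> parse(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new StreamWarning(
                Long.parseLong(m.group(1)),
                Long.parseLong(m.group(2)),
                Long.parseLong(m.group(3))));
    }

    /** True when start + length matches the expected end position in the message. */
    boolean isConsistent() {
        return start + length == expectedEnd;
    }

    /** True when the expected end position lies within a file of the given length. */
    boolean fitsIn(long fileLength) {
        return expectedEnd <= fileLength;
    }

    public static void main(String[] args) {
        String line = "WARN  [qtp1401737458-25] 08:05:03,067 org.apache.pdfbox.pdfparser.COSParser "
                + "The end of the stream doesn't point to the correct offset, using workaround to read the stream, "
                + "stream start position: 104319, length: 366, expected end position: 104685";
        // 104932 is the temp-file length reported in the DEBUG line elsewhere in this thread.
        parse(line).ifPresent(w -> System.out.println(
                "consistent=" + w.isConsistent() + ", fits=" + w.fitsIn(104932L)));
    }
}
```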




Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
There's a file log4j2.xml in the jar file, edit that one and put it back 
into the jar.

Alternatively, make your own, and use it with

-Dlog4j.configuration=file:./log4j.xml

when starting the server.

Tilman


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/17/22 11:52 AM, Tilman Hausherr wrote:
> https://issues.apache.org/jira/browse/TIKA-3819
> This will show filename and length but only if logging is in DEBUG log level. The modified version will appear at
> https://repository.apache.org/content/groups/snapshots/org/apache/tika/
> in a few hours.

thx o/

checking

	https://issues.apache.org/jira/browse/TIKA-3819

i see

	Fix Version/s: 2.4.2
	https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/697/
	Build #697 (Jul 17, 2022, 3:47:56 PM)

i installed

	tika-server-standard-2.4.2-20220717.154907-90.jar

set

	cat tika-server-config-custom.xml
		<?xml version="1.0" encoding="UTF-8"?>

		<properties>
		  <server>
		    <params>
!		      <logLevel>debug</logLevel>
		      ...
		      <forkedJvmArgs>
		        ...
!		        <arg>-Dlog4j2.debug</arg>
		        ...

and launched,

	systemctl status tika -l
		● tika.service - Apache Tika server
		     Loaded: loaded (/etc/systemd/system/tika.service; enabled; vendor preset: disabled)
		     Active: active (running) since Sun 2022-07-17 20:51:36 EDT; 5min ago
		   Main PID: 25001 (java)
		      Tasks: 54 (limit: 8811)
		     Memory: 208.3M
		        CPU: 31.115s
		     CGroup: /system.slice/tika.service
		             ├─ 25001 /usr/bin/java -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
		             └─ 25039 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-8013562591697588923 -numRestarts 0

		Jul 17 20:52:15 mx-test tika[25039]:         at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:52:15 mx-test tika[25039]:         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:52:15 mx-test tika[25039]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:52:15 mx-test tika[25039]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:52:15 mx-test tika[25039]:         at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:291) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:52:15 mx-test tika[25039]:         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:178) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:52:15 mx-test tika[25039]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:52:15 mx-test tika[25039]:         ... 37 more
		Jul 17 20:52:15 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:52:15,597 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$344/0x0000000800eb2e78, ContentType: text/plain
		Jul 17 20:52:15 mx-test tika[25039]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger

on receipt of email + pdf attachment, FAIL as before,

	journalctl -f -u tika

		Jul 17 20:59:42 mx-test tika[25039]: INFO  [qtp1401737458-25] 20:59:42,066 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
		Jul 17 20:59:42 mx-test tika[25039]: WARN  [qtp1401737458-25] 20:59:42,243 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 104319, length: 366, expected end position: 104685
		Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:59:42,245 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
		Jul 17 20:59:42 mx-test tika[25039]: WARN  [qtp1401737458-25] 20:59:42,467 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101704, length: 1475, expected end position: 103179
		Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:59:42,469 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
		Jul 17 20:59:42 mx-test tika[25039]: WARN  [qtp1401737458-25] 20:59:42,481 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101514, length: 66, expected end position: 101580
		Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:59:42,482 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
		Jul 17 20:59:42 mx-test tika[25039]: WARN  [qtp1401737458-25] 20:59:42,493 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 2011, length: 2482, expected end position: 4493
		Jul 17 20:59:42 mx-test tika[25039]: WARN  [qtp1401737458-25] 20:59:42,495 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (Get_Started_With_Smallpdf.pdf)
		Jul 17 20:59:42 mx-test tika[25039]: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@4f3e230b
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at java.lang.Thread.run(Thread.java:833) ~[?:?]
		Jul 17 20:59:42 mx-test tika[25039]: Caused by: java.io.IOException: Page tree root must be a dictionary
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:291) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:178) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-20220717.154907-90.jar:2.4.2-SNAPSHOT]
		Jul 17 20:59:42 mx-test tika[25039]:         ... 37 more
		Jul 17 20:59:42 mx-test tika[25039]: ERROR [qtp1401737458-25] 20:59:42,499 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$344/0x0000000800eb2e78, ContentType: text/plain


where, the attachment is,

	pdfinfo Get_Started_With_Smallpdf.pdf
		Creator:         Adobe InDesign 15.1 (Macintosh)
		Producer:        Adobe PDF Library 15.0
		CreationDate:    Wed Oct 14 11:08:10 2020 EDT
		ModDate:         Wed Oct 14 11:08:10 2020 EDT
		Custom Metadata: no
		Metadata Stream: yes
		Tagged:          no
		UserProperties:  no
		Suspects:        no
		Form:            none
		JavaScript:      no
		Pages:           1
		Encrypted:       no
		Page size:       595.276 x 841.89 pts (A4)
		Page rot:        0
		File size:       69451 bytes
		Optimized:       no
		PDF version:     1.7
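
A cross-check that needs nothing beyond the output above: the COSParser warnings give byte offsets into whatever PDFBox was actually handed, and pdfinfo gives the size of the local copy. The Python sketch below (function names are made up for illustration) pulls the offsets out of the journal lines and lists any stream that would end past a given file size. It assumes the pdfinfo'd file is the same one that was attached to the test mail, which is not verified here.

```python
import re

# Matches the tail of PDFBox's COSParser "end of the stream" warning as it
# appears in the journal output.
COS_WARNING = re.compile(
    r"stream start position: (\d+), length: (\d+), expected end position: (\d+)"
)


def stream_ranges(log_text):
    """Return (start, length, expected_end) for every COSParser warning found."""
    return [tuple(int(g) for g in m.groups()) for m in COS_WARNING.finditer(log_text)]


def ranges_past_eof(ranges, file_size):
    """Return the ranges whose expected end lies beyond file_size bytes."""
    return [r for r in ranges if r[2] > file_size]


# Message tails copied from the four WARN lines in the journal output above.
JOURNAL_EXCERPT = """
stream start position: 104319, length: 366, expected end position: 104685
stream start position: 101704, length: 1475, expected end position: 103179
stream start position: 101514, length: 66, expected end position: 101580
stream start position: 2011, length: 2482, expected end position: 4493
"""

LOCAL_FILE_SIZE = 69451  # "File size" reported by pdfinfo for the local copy

for start, length, end in ranges_past_eof(stream_ranges(JOURNAL_EXCERPT), LOCAL_FILE_SIZE):
    print(f"stream at {start} (+{length}) would end at {end}, past byte {LOCAL_FILE_SIZE}")
```

For the four warnings above, three of the expected end positions lie beyond 69451 bytes. That would fit the suspicion that the bytes reaching tika are not the bytes on disk, though it says nothing about where they changed.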

i don't see any additional DEBUG output, nor the filename and length that TIKA-3819 is supposed to log.

are additional steps/config needed to enable the DEBUG output from the snapshot?

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/17/22 11:52 AM, Tilman Hausherr wrote:
> I'll add some logging when in debug mode, maybe this will help in the future. I still believe the error is on your side, but debugging would help "proving" this.

since, per prior advice, I can curl the attachment to tika server with no error, I tend to agree.
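
For anyone repeating that check without curl, a rough Python equivalent is sketched below. It PUTs a file to the /tika endpoint and returns the HTTP status, the number of bytes sent, and the response text. The default URL follows the fts_tika setting earlier in the thread; put_to_tika is a name invented here, and the Content-Type and Accept headers are an assumption about what a plain-text extraction request should carry, not a copy of what dovecot sends.

```python
import urllib.error
import urllib.request
from pathlib import Path


def put_to_tika(pdf_path, url="http://127.0.0.1:9998/tika", timeout=60):
    """PUT a file to a tika-server style endpoint; return (status, bytes_sent, text)."""
    data = Path(pdf_path).read_bytes()
    request = urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )
    # The target is a local server, so ignore any proxy settings in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, len(data), response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Error statuses still carry a body worth reading.
        return err.code, len(data), err.read().decode("utf-8", "replace")
```

Comparing the returned bytes_sent with the size of the file on disk gives a known-good baseline for the direct-submission path.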

as mentioned, I suspect dovecot's fts-tika plugin.

finding the issue to 'prove' it is the challenge.  iiuc, to do that, i need to run debug on tika-server as fed by dovecot/fts-tika -- i.e., in the receive-a-mail use case.
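
One way to look at exactly what fts-tika submits, without attaching a debugger to the JVM, would be to point fts_tika at a throwaway HTTP endpoint that only records request bodies. The Python sketch below does that: each PUT or POST body is written to a numbered file, and its length and SHA-256 are returned in the response. Everything in it is hypothetical scaffolding (make_handler, serve, port 9999, the capture directory). That the plugin sends the attachment as the body of a PUT to the configured URL is an assumption, and since nothing is extracted, dovecot would index whatever short text the stub returns, so this is for a test box only.

```python
import hashlib
import itertools
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path


def make_handler(out_dir):
    """Build a handler class that saves every PUT/POST body under out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counter = itertools.count(1)

    class CaptureHandler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"  # allows keep-alive and 100-continue

        def _read_body(self):
            encoding = self.headers.get("Transfer-Encoding", "").lower()
            if "chunked" not in encoding:
                return self.rfile.read(int(self.headers.get("Content-Length", 0)))
            body = b""
            while True:
                # Chunk header: hex size, optionally followed by ";extensions".
                size = int(self.rfile.readline().split(b";")[0].strip(), 16)
                if size == 0:
                    # Skip optional trailer headers up to the terminating blank line.
                    while self.rfile.readline() not in (b"\r\n", b"\n", b""):
                        pass
                    return body
                body += self.rfile.read(size)
                self.rfile.readline()  # CRLF that terminates the chunk data

        def do_PUT(self):
            body = self._read_body()
            path = out_dir / f"request-{next(counter):04d}.bin"
            path.write_bytes(body)
            digest = hashlib.sha256(body).hexdigest()
            reply = f"{path.name} {len(body)} bytes sha256={digest}\n".encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=UTF-8")
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

        do_POST = do_PUT

        def log_message(self, fmt, *args):
            pass  # keep the console quiet; the saved files are the output

    return CaptureHandler


def serve(out_dir, host="127.0.0.1", port=9999):
    """Run the capture endpoint until interrupted."""
    HTTPServer((host, port), make_handler(out_dir)).serve_forever()
```

On a test box, calling serve("/tmp/tika-capture") and setting fts_tika = http://127.0.0.1:9999/tika/ should leave one file per submitted attachment, which can then be compared byte for byte with the original PDF.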

i've asked on dovecot ML if anyone else can confirm, or not, the error.

i know that both in my case, and on ML, 'this' *was* previously working.

> https://issues.apache.org/jira/browse/TIKA-3819
> This will show filename and length but only if logging is in DEBUG log level. The modified version will appear at
> https://repository.apache.org/content/groups/snapshots/org/apache/tika/
> in a few hours.
> Tilman

i'll set up a test env where i can replicate the problem, and watch for the snap to give it a go

o/

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 17.07.2022 um 17:09 schrieb PGNet Dev:
> On 7/17/22 10:24 AM, Tilman Hausherr wrote:
>> That is in pdfbox, not in tika.
>>
>> There's also a PDFParser.parse() in tika, which then calls 
>> PDDocument.load(). However I don't know if this will use the 
>> InputStream call, or the one with File. If it uses the one with the 
>> file, then check the length and content of the file (tika does 
>> sometimes store streams into a temporary file).
>
> i see the same results -- i.e., nada -- with explicit stop in 
> PDFParser.parse
>
>> Re the failed build: remove the segment with ossindex-maven-plugin 
>> from the parent pom.xml . That plugin (or rather, the company behind 
>> it) has gone crazy, we've partly disabled it in the current trunk.
>
> no idea what specifically to do there.
>
> trying building 'main' with those partial disables, rather than 
> '2.4.1', that also fails, 


I'll add some logging when in debug mode, maybe this will help in the 
future. I still believe the error is on your side, but debugging would 
help "proving" this.

https://issues.apache.org/jira/browse/TIKA-3819

This will show filename and length but only if logging is in DEBUG log 
level. The modified version will appear at

https://repository.apache.org/content/groups/snapshots/org/apache/tika/

in a few hours.

Tilman

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/17/22 10:24 AM, Tilman Hausherr wrote:
> That is in pdfbox, not in tika.
> 
> There's also a PDFParser.parse() in tika, which then calls PDDocument.load(). However I don't know if this will use the InputStream call, or the one with File. If it uses the one with the file, then check the length and content of the file (tika does sometimes store streams into a temporary file).

i see the same results -- i.e., nada -- with explicit stop in PDFParser.parse

> Re the failed build: remove the segment with ossindex-maven-plugin from the parent pom.xml . That plugin (or rather, the company behind it) has gone crazy, we've partly disabled it in the current trunk.

no idea what specifically to do there.
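
In case it helps whoever picks this up: "remove the segment" presumably means deleting the whole <plugin>...</plugin> element whose artifactId is ossindex-maven-plugin from the parent pom. A throwaway Python sketch of that edit follows. strip_plugin is a name invented here, it assumes the standard Maven POM namespace, and because ElementTree drops XML comments (license header included) the result is only meant for a local debug build, never for committing.

```python
import xml.etree.ElementTree as ET

POM_NS = "http://maven.apache.org/POM/4.0.0"


def strip_plugin(pom_xml, artifact_id="ossindex-maven-plugin"):
    """Return (new_xml, removed_count) with matching <plugin> elements removed."""
    # Serialise the POM namespace as the default one instead of an "ns0:" prefix.
    ET.register_namespace("", POM_NS)
    root = ET.fromstring(pom_xml)
    removed = 0
    # Covers <build><plugins> as well as <pluginManagement><plugins>.
    for plugins in list(root.iter(f"{{{POM_NS}}}plugins")):
        for plugin in list(plugins.findall(f"{{{POM_NS}}}plugin")):
            if plugin.findtext(f"{{{POM_NS}}}artifactId") == artifact_id:
                plugins.remove(plugin)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed
```

Typical use would be to read the parent pom as text, pass it through strip_plugin, and write the result back after keeping a copy of the original. Which pom.xml in the checkout is "the parent" is left to the reader to confirm.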

building 'main' (which has those partial disables) rather than '2.4.1' also fails,

INFO  [pool-6-thread-1] 10:59:03,890 org.apache.tika.pipes.PipesClient pipesClientId=2 parse success: myId in 58 ms
ERROR [main] 10:59:03,907 org.apache.tika.pipes.PipesServer oom: myId
java.lang.OutOfMemoryError: oom message
         at jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:67) ~[?:?]
         at java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499) ~[?:?]
         at java.lang.reflect.Constructor.newInstance(Constructor.java:483) ~[?:?]
         at org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:428) ~[test-classes/:?]
         at org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:374) ~[test-classes/:?]
         at org.apache.tika.parser.mock.MockParser.executeAction(MockParser.java:155) ~[test-classes/:?]
         at org.apache.tika.parser.mock.MockParser.parse(MockParser.java:134) ~[test-classes/:?]
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[classes/:?]
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[classes/:?]
         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[classes/:?]
         at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:163) ~[classes/:?]
         at org.apache.tika.pipes.PipesServer.parseRecursive(PipesServer.java:540) ~[classes/:?]
         at org.apache.tika.pipes.PipesServer.parse(PipesServer.java:473) ~[classes/:?]
         at org.apache.tika.pipes.PipesServer.parseIt(PipesServer.java:420) ~[classes/:?]
         at org.apache.tika.pipes.PipesServer.actuallyParse(PipesServer.java:340) ~[classes/:?]
         at org.apache.tika.pipes.PipesServer.parseOne(PipesServer.java:311) ~[classes/:?]
         at org.apache.tika.pipes.PipesServer.processRequests(PipesServer.java:232) ~[classes/:?]
         at org.apache.tika.pipes.PipesServer.main(PipesServer.java:168) ~[classes/:?]

my 1st priority is a stable dovecot search env, so i've removed tika from use & its config.

for now, i'll have to pass this^ on to an admin here who works regularly in a full java env, and who won't have to keep guessing at how to debug the app.


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 17.07.2022 um 15:58 schrieb PGNet Dev:
> On 7/16/22 10:51 PM, Tilman Hausherr wrote:
>> You didn't get the exception I mentioned; then set the breakpoint at 
>> parse() to get the fileLen. The current error messages suggest that 
>> bytes have been changed or have been lost.
>>
>> IIRC tika saves the PDF in a file in the temp directory before 
>> parsing, maybe look there at that time and compare the length and 
>> content with your own.
>
>
> i haven't managed to stop at any *.parse bkpt i set after `jdb -attach`

That is in pdfbox, not in tika.

There's also a PDFParser.parse() in tika, which then calls 
PDDocument.load(). However I don't know if this will use the InputStream 
call, or the one with File. If it uses the one with the file, then check 
the length and content of the file (tika does sometimes store streams 
into a temporary file).
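
The length-and-content check described above can be done with a few lines of Python once a copy of the temporary file (or any other capture of what tika received) is in hand. The sketch below reports both lengths, both SHA-256 digests, and the offset of the first differing byte. compare_files and its return keys are illustrative names, not part of tika or pdfbox.

```python
import hashlib
from pathlib import Path


def compare_files(original_path, received_path):
    """Compare two files byte for byte and summarise where they diverge."""
    original = Path(original_path).read_bytes()
    received = Path(received_path).read_bytes()
    # Offset of the first differing byte within the common prefix, if any.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(original, received)) if a != b), None
    )
    if first_diff is None and len(original) != len(received):
        # One file is a strict prefix of the other.
        first_diff = min(len(original), len(received))
    return {
        "original_len": len(original),
        "received_len": len(received),
        "original_sha256": hashlib.sha256(original).hexdigest(),
        "received_sha256": hashlib.sha256(received).hexdigest(),
        "identical": original == received,
        "first_diff_offset": first_diff,
    }
```

If the two lengths or digests differ, first_diff_offset shows where to start looking with a hex viewer.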

Re the failed build: remove the segment with ossindex-maven-plugin from 
the parent pom.xml . That plugin (or rather, the company behind it) has 
gone crazy, we've partly disabled it in the current trunk.

Tilman


>
> wondering if req'd debug info is included/complete in the runnable 
> jar, i decided to try a clean mvn build
>
>     git checkout 2.4.1
>     mvn clean
>     mvn -X compile -am -pl :tika-server-standard
>
> which fails
>
>     ...
>     [DEBUG] 82 component-reports; 16.90 ms
>     [WARNING] Excluding coordinates: com.google.guava:guava:31.1-jre
>     [INFO] ------------------------------------------------------------------------
>     [INFO] Reactor Summary for Apache Tika parent 2.4.1:
>     [INFO]
>     [INFO] Apache Tika parent ................................. SUCCESS [  0.790 s]
>     [INFO] Apache Tika core ................................... SUCCESS [  4.806 s]
>     [INFO] Apache Tika serialization .......................... SUCCESS [  0.698 s]
>     [INFO] Apache Tika parser modules ......................... SUCCESS [  0.045 s]
>     [INFO] Apache Tika standard parser modules and package .... SUCCESS [  0.033 s]
>     [INFO] Apache Tika standard parser modules ................ SUCCESS [  0.030 s]
>     [INFO] Apache Tika html commons ........................... SUCCESS [  0.114 s]
>     [INFO] Apache Tika digest commons ......................... SUCCESS [  0.154 s]
>     [INFO] Apache Tika mail commons ........................... SUCCESS [  0.078 s]
>     [INFO] Apache Tika XMP commons ............................ SUCCESS [  0.120 s]
>     [INFO] Apache Tika ZIP commons ............................ SUCCESS [  0.213 s]
>     [INFO] Apache Tika image parser module .................... SUCCESS [  0.355 s]
>     [INFO] Apache Tika OCR parser module ...................... SUCCESS [  0.302 s]
>     [INFO] Apache Tika audiovideo parser module ............... SUCCESS [  0.369 s]
>     [INFO] Apache Tika text parser module ..................... SUCCESS [  0.424 s]
>     [INFO] Apache Tika code parser module ..................... SUCCESS [  0.205 s]
>     [INFO] Apache Tika html parser module ..................... SUCCESS [  0.305 s]
>     [INFO] Apache Tika font parser module ..................... SUCCESS [  0.078 s]
>     [INFO] Apache Tika XML parser module ...................... SUCCESS [  0.132 s]
>     [INFO] Apache Tika Microsoft parser module ................ SUCCESS [  2.600 s]
>     [INFO] Apache Tika package parser module .................. SUCCESS [  0.145 s]
>     [INFO] Apache Tika PDF parser module ...................... SUCCESS [  0.667 s]
>     [INFO] Apache Tika Apple parser module .................... SUCCESS [  0.216 s]
>     [INFO] Apache Tika cad parser module ...................... SUCCESS [  0.203 s]
>     [INFO] Apache Tika mail parser module ..................... SUCCESS [  0.187 s]
>     [INFO] Apache Tika miscellaneous office format parser module SUCCESS [  0.421 s]
>     [INFO] Apache Tika news parser module ..................... SUCCESS [  0.163 s]
>     [INFO] Apache Tika crypto parser module ................... SUCCESS [  0.106 s]
>     [INFO] Apache Tika WARC parser module ..................... SUCCESS [  0.104 s]
>     [INFO] Apache Tika standard parser package ................ SUCCESS [  0.565 s]
>     [INFO] Apache Tika XMP .................................... SUCCESS [  0.286 s]
>     [INFO] Apache Tika language detection ..................... SUCCESS [  0.021 s]
>     [INFO] Apache Tika langdetect test commons ................ SUCCESS [  0.057 s]
>     [INFO] Apache Tika Optimaize langdetect ................... SUCCESS [  0.108 s]
>     [INFO] Apache Tika OpenNLP langdetect ..................... SUCCESS [  0.114 s]
>     [INFO] Apache Tika pipes .................................. SUCCESS [  0.018 s]
>     [INFO] Apache Tika emitters ............................... SUCCESS [  0.017 s]
>     [INFO] Apache Tika filesystem emitter ..................... SUCCESS [  0.065 s]
>     [INFO] Apache Tika translate .............................. SUCCESS [  0.446 s]
>     [INFO] Apache Tika server module .......................... SUCCESS [  0.019 s]
>     [INFO] Apache Tika server core ............................ FAILURE [  0.112 s]
>     [INFO] Apache Tika standard server ........................ SKIPPED
>     [INFO] ------------------------------------------------------------------------
>     [INFO] BUILD FAILURE
>     [INFO] ------------------------------------------------------------------------
>     [INFO] Total time:  16.545 s
>     [INFO] Finished at: 2022-07-17T09:41:53-04:00
>     [INFO] ------------------------------------------------------------------------
>     [ERROR] Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
>     [ERROR] org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>     [ERROR]     * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>     [ERROR] org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>     [ERROR]     * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>     [ERROR]
>     [ERROR] Excluded coordinates:
>     [ERROR]   - com.google.guava:guava:31.1-jre
>     [ERROR]
>     [ERROR] -> [Help 1]
>     org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
>       org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>         * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>       org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>         * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>
>     Excluded coordinates:
>       - com.google.guava:guava:31.1-jre
>
>         at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
>         at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
>         at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
>         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
>         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
>         at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
>         at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
>         at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
>         at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
>         at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
>         at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
>         at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
>         at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
>         at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
>         at java.lang.reflect.Method.invoke (Method.java:577)
>         at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
>         at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
>         at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
>         at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
>     Caused by: org.apache.maven.plugin.MojoFailureException: Detected 2 vulnerable components:
>       org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>         * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>       org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>         * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
>
>     Excluded coordinates:
>       - com.google.guava:guava:31.1-jre
>
>         at org.sonatype.ossindex.maven.plugin.AuditMojoSupport.execute (AuditMojoSupport.java:257)
>         at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
>         at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
>         at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
>         at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
>         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
>         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
>         at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
>         at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
>         at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
>         at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
>         at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
>         at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
>         at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
>         at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
>         at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
>         at java.lang.reflect.Method.invoke (Method.java:577)
>         at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
>         at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
>         at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
>         at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
>     [ERROR]
>     [ERROR]
>     [ERROR] For more information about the errors and possible solutions, please read the following articles:
>     [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>     [ERROR]
>     [ERROR] After correcting the problems, you can resume the build with the command
>     [ERROR]   mvn <args> -rf :tika-server-core
>
> checking @
>
>     https://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException 
>
>
>         "Unlike many other errors, this exception is not generated by 
> the Maven core itself but by a plugin. As a rule of thumb, plugins use 
> this error to signal a failure of the build because there is something 
> wrong with the dependencies or sources of a project, e.g. a 
> compilation or a test failure."
>
> in /tmp
>
> immediately after tika-server start
>
>     '/usr/bin/tree -Csup --timefmt "%F %R:%S %z"' /tmp | grep tika
>         ├── [-rw------- tika               0 2022-07-17 09:54:08 
> -0400]  apache-tika-server-forked-tmp-16337036696243797817
>         ├── [drwxr-xr-x tika              80 2022-07-17 09:54:08 
> -0400]  hsperfdata_tika
>         │   ├── [-rw------- tika           32768 2022-07-17 09:54:04 
> -0400]  15865
>         │   └── [-rw------- tika           32768 2022-07-17 09:54:08 
> -0400]  15902
>
> , and, same -- i.e. nothing added -- after receipt of email with 
> failed tika scan/parse
>
> anyone have some explicit instructions for setting a catchable 
> breakpoint in a jdb -attach to tika-server?
> or, error-free build instructions?



Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/16/22 10:51 PM, Tilman Hausherr wrote:
> You didn't get the exception I mentioned; then set the breakpoint at parse() to get the fileLen. The current error messages suggest that bytes have been changed or have been lost.
> 
> IIRC tika saves the PDF in a file in the temp directory before parsing, maybe look there at that time and compare the length and content with your own.


I haven't managed to stop at any *.parse breakpoint I set after `jdb -attach`.

Wondering whether the required debug info is included/complete in the runnable jar, I decided to try a clean mvn build:

	git checkout 2.4.1
	mvn clean
	mvn -X compile -am -pl :tika-server-standard

which fails

	...
	[DEBUG] 82 component-reports; 16.90 ms
	[WARNING] Excluding coordinates: com.google.guava:guava:31.1-jre
	[INFO] ------------------------------------------------------------------------
	[INFO] Reactor Summary for Apache Tika parent 2.4.1:
	[INFO]
	[INFO] Apache Tika parent ................................. SUCCESS [  0.790 s]
	[INFO] Apache Tika core ................................... SUCCESS [  4.806 s]
	[INFO] Apache Tika serialization .......................... SUCCESS [  0.698 s]
	[INFO] Apache Tika parser modules ......................... SUCCESS [  0.045 s]
	[INFO] Apache Tika standard parser modules and package .... SUCCESS [  0.033 s]
	[INFO] Apache Tika standard parser modules ................ SUCCESS [  0.030 s]
	[INFO] Apache Tika html commons ........................... SUCCESS [  0.114 s]
	[INFO] Apache Tika digest commons ......................... SUCCESS [  0.154 s]
	[INFO] Apache Tika mail commons ........................... SUCCESS [  0.078 s]
	[INFO] Apache Tika XMP commons ............................ SUCCESS [  0.120 s]
	[INFO] Apache Tika ZIP commons ............................ SUCCESS [  0.213 s]
	[INFO] Apache Tika image parser module .................... SUCCESS [  0.355 s]
	[INFO] Apache Tika OCR parser module ...................... SUCCESS [  0.302 s]
	[INFO] Apache Tika audiovideo parser module ............... SUCCESS [  0.369 s]
	[INFO] Apache Tika text parser module ..................... SUCCESS [  0.424 s]
	[INFO] Apache Tika code parser module ..................... SUCCESS [  0.205 s]
	[INFO] Apache Tika html parser module ..................... SUCCESS [  0.305 s]
	[INFO] Apache Tika font parser module ..................... SUCCESS [  0.078 s]
	[INFO] Apache Tika XML parser module ...................... SUCCESS [  0.132 s]
	[INFO] Apache Tika Microsoft parser module ................ SUCCESS [  2.600 s]
	[INFO] Apache Tika package parser module .................. SUCCESS [  0.145 s]
	[INFO] Apache Tika PDF parser module ...................... SUCCESS [  0.667 s]
	[INFO] Apache Tika Apple parser module .................... SUCCESS [  0.216 s]
	[INFO] Apache Tika cad parser module ...................... SUCCESS [  0.203 s]
	[INFO] Apache Tika mail parser module ..................... SUCCESS [  0.187 s]
	[INFO] Apache Tika miscellaneous office format parser module SUCCESS [  0.421 s]
	[INFO] Apache Tika news parser module ..................... SUCCESS [  0.163 s]
	[INFO] Apache Tika crypto parser module ................... SUCCESS [  0.106 s]
	[INFO] Apache Tika WARC parser module ..................... SUCCESS [  0.104 s]
	[INFO] Apache Tika standard parser package ................ SUCCESS [  0.565 s]
	[INFO] Apache Tika XMP .................................... SUCCESS [  0.286 s]
	[INFO] Apache Tika language detection ..................... SUCCESS [  0.021 s]
	[INFO] Apache Tika langdetect test commons ................ SUCCESS [  0.057 s]
	[INFO] Apache Tika Optimaize langdetect ................... SUCCESS [  0.108 s]
	[INFO] Apache Tika OpenNLP langdetect ..................... SUCCESS [  0.114 s]
	[INFO] Apache Tika pipes .................................. SUCCESS [  0.018 s]
	[INFO] Apache Tika emitters ............................... SUCCESS [  0.017 s]
	[INFO] Apache Tika filesystem emitter ..................... SUCCESS [  0.065 s]
	[INFO] Apache Tika translate .............................. SUCCESS [  0.446 s]
	[INFO] Apache Tika server module .......................... SUCCESS [  0.019 s]
	[INFO] Apache Tika server core ............................ FAILURE [  0.112 s]
	[INFO] Apache Tika standard server ........................ SKIPPED
	[INFO] ------------------------------------------------------------------------
	[INFO] BUILD FAILURE
	[INFO] ------------------------------------------------------------------------
	[INFO] Total time:  16.545 s
	[INFO] Finished at: 2022-07-17T09:41:53-04:00
	[INFO] ------------------------------------------------------------------------
	[ERROR] Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
	[ERROR]   org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	[ERROR]     * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	[ERROR]   org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	[ERROR]     * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	[ERROR]
	[ERROR] Excluded coordinates:
	[ERROR]   - com.google.guava:guava:31.1-jre
	[ERROR]
	[ERROR] -> [Help 1]
	org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
	  org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	    * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	  org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	    * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1

	Excluded coordinates:
	  - com.google.guava:guava:31.1-jre

	    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
	    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
	    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
	    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
	    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
	    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
	    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
	    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
	    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
	    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
	    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
	    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
	    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
	    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
	    at java.lang.reflect.Method.invoke (Method.java:577)
	    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
	    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
	    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
	    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
	Caused by: org.apache.maven.plugin.MojoFailureException: Detected 2 vulnerable components:
	  org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	    * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	  org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
	    * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1

	Excluded coordinates:
	  - com.google.guava:guava:31.1-jre

	    at org.sonatype.ossindex.maven.plugin.AuditMojoSupport.execute (AuditMojoSupport.java:257)
	    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
	    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
	    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
	    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
	    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
	    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
	    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
	    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
	    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
	    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
	    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
	    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
	    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
	    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
	    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
	    at java.lang.reflect.Method.invoke (Method.java:577)
	    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
	    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
	    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
	    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
	[ERROR]
	[ERROR]
	[ERROR] For more information about the errors and possible solutions, please read the following articles:
	[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
	[ERROR]
	[ERROR] After correcting the problems, you can resume the build with the command
	[ERROR]   mvn <args> -rf :tika-server-core

checking @

	https://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

		"Unlike many other errors, this exception is not generated by the Maven core itself but by a plugin. As a rule of thumb, plugins use this error to signal a failure of the build because there is something wrong with the dependencies or sources of a project, e.g. a compilation or a test failure."

in /tmp, immediately after tika-server start:

	/usr/bin/tree -Csup --timefmt "%F %R:%S %z" /tmp | grep tika
		├── [-rw------- tika               0 2022-07-17 09:54:08 -0400]  apache-tika-server-forked-tmp-16337036696243797817
		├── [drwxr-xr-x tika              80 2022-07-17 09:54:08 -0400]  hsperfdata_tika
		│   ├── [-rw------- tika           32768 2022-07-17 09:54:04 -0400]  15865
		│   └── [-rw------- tika           32768 2022-07-17 09:54:08 -0400]  15902

and the same -- i.e. nothing added -- after receipt of an email with a failed tika scan/parse.

Does anyone have explicit instructions for setting a breakpoint that actually gets hit after a `jdb -attach` to tika-server?
Or error-free build instructions?

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 16.07.2022 at 18:43, PGNet Dev wrote:
>
> i don't get any more useful info on failure,
>
>     --> https://pastebin.com/raw/DsrLxbeg

You didn't get the exception I mentioned; in that case, set the breakpoint at
parse() to get the fileLen. The current error messages suggest that
bytes have been changed or lost.

IIRC Tika saves the PDF to a file in the temp directory before parsing;
maybe look there at that time and compare the length and content with
your own.

Tilman


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tim Allison <ta...@apache.org>.
Right. I think Tilman was suggesting adding new debug logging to
tika-server.

On Sat, Jul 16, 2022 at 12:43 PM PGNet Dev <pg...@gmail.com> wrote:

> with debug log levels set in tika config
>
> cat tika-server-config-custom.xml
>         <?xml version="1.0" encoding="UTF-8"?>
>                 <properties>
>                   <server>
>                     <params>
>                       <logLevel>debug</logLevel>
>                       <port>9998</port>
>                       <host>127.0.0.1</host>
>                       <javaPath>/usr/bin/java</javaPath>
>                       <noFork>false</noFork>
>                       <forkedJvmArgs>
>                         <arg>-Xms1g</arg>
>                         <arg>-Xmx1g</arg>
>                         <arg>-Dpdfbox.fontcache=/var/tika</arg>
>                         <arg>-Dlog4j2.debug</arg>
>                       </forkedJvmArgs>
>                 ...
>
> i don't get any more useful info on failure,
>
>         --> https://pastebin.com/raw/DsrLxbeg
>
> .  unless there's more relevant debug info to squeeze out from config
> alone,
>
> On 7/15/22 10:43 PM, Tilman Hausherr wrote:
> > The next that could be done is to debug this, if possible. Tim suggested
> the file might be truncated.
> >
> > I don't know if it is possible, if you can run tika in a debugger, then
> stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse() where the
> exception "Page tree root must be a dictionary" happens. There try to
> access this.fileLen . Compare that number to your file length.
>
> , I'll figure out how to debug the tika-server java backend, while being
> fed by the dovecot attachment submission task.
> guessing 'jdb',
>
>         jdb -version
>           This is jdb version 18.0 (Java SE version 18.0.1)
>
> is the right tool for that.
>
>

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
with debug log levels set in tika config

cat tika-server-config-custom.xml
	<?xml version="1.0" encoding="UTF-8"?>
		<properties>
		  <server>
		    <params>
		      <logLevel>debug</logLevel>
		      <port>9998</port>
		      <host>127.0.0.1</host>
		      <javaPath>/usr/bin/java</javaPath>
		      <noFork>false</noFork>
		      <forkedJvmArgs>
		        <arg>-Xms1g</arg>
		        <arg>-Xmx1g</arg>
		        <arg>-Dpdfbox.fontcache=/var/tika</arg>
		        <arg>-Dlog4j2.debug</arg>
		      </forkedJvmArgs>
		...

I don't get any more useful info on failure:

	--> https://pastebin.com/raw/DsrLxbeg

Unless there's more relevant debug info to squeeze out from config alone,

On 7/15/22 10:43 PM, Tilman Hausherr wrote:
> The next that could be done is to debug this, if possible. Tim suggested the file might be truncated.
> 
> I don't know if it is possible, if you can run tika in a debugger, then stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse() where the exception "Page tree root must be a dictionary" happens. There try to access this.fileLen . Compare that number to your file length.

I'll figure out how to debug the tika-server Java backend while it is being fed by the dovecot attachment submission task.
I'm guessing 'jdb',

	jdb -version
	  This is jdb version 18.0 (Java SE version 18.0.1)

is the right tool for that.


Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
That's what I also get.

The next thing that could be done is to debug this, if possible. Tim suggested
the file might be truncated.

I don't know if it is possible, but if you can run Tika in a debugger,
stop at org.apache.pdfbox.pdfparser.PDFParser.initialParse(), where the
exception "Page tree root must be a dictionary" happens. There, try to
access this.fileLen and compare that number to your file length.

(I'm wondering if we are offering some debug info in the tika server, or 
if we could offer it in the future, e.g. telling the length, and/or 
offering an MD5 checksum if log debug mode is on)
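
As a rough illustration of the length-plus-checksum idea above, here is a minimal Java sketch. It is not something that exists in tika-server or PDFBox; the class name FileFingerprint and its md5Hex method are hypothetical. The idea is to run it on the original PDF and on whatever copy can be captured on the server side: a different length or checksum would suggest that bytes were changed or lost on the way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of Tika or PDFBox): reports a file's length and MD5. */
final class FileFingerprint {

    /** Streams the file through an MD5 digest and returns the lowercase hex string. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Prints "length  md5  path" for every file named on the command line. */
    public static void main(String[] args) throws Exception {
        for (String arg : args) {
            Path p = Path.of(arg);
            System.out.println(Files.size(p) + "  " + md5Hex(p) + "  " + p);
        }
    }
}
```

With Java 11 or later this can be run as a single source file, e.g. `java FileFingerprint.java GetStartedWithSmallpdf.pdf`.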

An alternative would be that 1) I add the file length to the PDFBox
exception and 2) you create a Tika build with the PDFBox snapshot.

Tilman

On 15.07.2022 at 18:26, PGNet Dev wrote:
> On 7/15/22 12:01 PM, Tim Allison wrote:
>> If you curl the test file (GetStartedWithSmallpdf.pdf) against your 
>> tika-server, what do you see?  The test file works for me with 
>> 2.4.2-SNAPSHOT at least.  Are the files getting truncated somehow?
>
>
>> If you curl the test file (GetStartedWithSmallpdf.pdf) against your 
>> tika-server, what do you see?
>
> in journal log, only this:
>
>     Jul 15 12:24:47 mx.loc tika[1143]: INFO  [qtp1837533591-23] 
> 12:24:47,978 org.apache.tika.server.core.resource.TikaResource /tika 
> (application/pdf)
>
> and, @ console, this:
>
>     https://pastebin.com/raw/Nu1RCbat
>
>
>
>> Are the files getting truncated somehow?
>
> Perhaps?  I'd guess that since curl of the source file against tika , 
> as above, works ok, that what's feeding tika -- namely dovecot's fts 
> plugin -- would be a likely candidate.



Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/15/22 12:01 PM, Tim Allison wrote:
> If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see?  The test file works for me with 2.4.2-SNAPSHOT at least.  Are the files getting truncated somehow?


> If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see?

in journal log, only this:

	Jul 15 12:24:47 mx.loc tika[1143]: INFO  [qtp1837533591-23] 12:24:47,978 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)

and, @ console, this:

	https://pastebin.com/raw/Nu1RCbat



> Are the files getting truncated somehow?

Perhaps?  Since curl of the source file against tika works OK, as above, I'd guess that whatever is feeding tika -- namely dovecot's fts plugin -- is the likely candidate.
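
The curl check discussed above can also be made from Java, which makes it easy to log how many bytes the client intends to send. This is only a sketch under assumptions: it assumes tika-server accepts a PUT of the raw file on its /tika endpoint (which is what `curl -T` sends), it uses the java.net.http client from Java 11 or later, and the class name TikaPutCheck is hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

/** Hypothetical check: PUT a file to a tika-server URL and report sizes and status. */
final class TikaPutCheck {

    /** Builds a PUT request whose body is the file's bytes; asks for plain-text output. */
    static HttpRequest buildRequest(URI endpoint, Path file) throws IOException {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    /** Usage: TikaPutCheck <endpoint-url> <file>, e.g. http://127.0.0.1:9998/tika some.pdf */
    public static void main(String[] args) throws Exception {
        URI endpoint = URI.create(args[0]);
        Path file = Path.of(args[1]);
        HttpRequest request = buildRequest(endpoint, file);

        // Number of body bytes the client will send; -1 if unknown.
        long sent = request.bodyPublisher()
                .map(HttpRequest.BodyPublisher::contentLength)
                .orElse(-1L);

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("bytes sent:      " + sent);
        System.out.println("HTTP status:     " + response.statusCode());
        System.out.println("extracted chars: " + response.body().length());
    }
}
```

If this direct upload parses cleanly while the same attachment fails when it arrives via dovecot, that would be consistent with the bytes being altered before they reach Tika.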

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tim Allison <ta...@apache.org>.
If I truncate the test file with a hexeditor, I see this:

INFO  [main] 12:04:20,641 org.apache.tika.server.core.TikaServerProcess Starting Apache Tika 2.4.2-SNAPSHOT server
INFO  [main] 12:04:20,823 org.apache.tika.server.core.TikaServerProcess loading resource from SPI: class org.apache.tika.server.standard.resource.XMPMetadataResource
INFO  [main] 12:04:21,044 org.apache.cxf.endpoint.ServerImpl Setting the server's publish address to be http://localhost:9998/
INFO  [main] 12:04:21,111 org.eclipse.jetty.util.log Logging initialized @1675ms to org.eclipse.jetty.util.log.Slf4jLog
INFO  [main] 12:04:21,169 org.eclipse.jetty.server.Server jetty-9.4.48.v20220622; built: 2022-06-21T20:42:25.880Z; git: 6b67c5719d1f4371b33655ff2d047d24e171e49a; jvm 11.0.11+9
INFO  [main] 12:04:21,205 org.eclipse.jetty.server.AbstractConnector Started ServerConnector@352e787a{HTTP/1.1, (http/1.1)}{localhost:9998}
INFO  [main] 12:04:21,205 org.eclipse.jetty.server.Server Started @1771ms
WARN  [main] 12:04:21,212 org.eclipse.jetty.server.handler.ContextHandler Empty contextPath
INFO  [main] 12:04:21,226 org.eclipse.jetty.server.handler.ContextHandler Started o.e.j.s.h.ContextHandler@408b87aa{/,null,AVAILABLE}
INFO  [main] 12:04:21,232 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server fabf267b-a86c-43d7-9845-e15f36d032e2 at http://localhost:9998/
INFO  [qtp499951827-28] 12:04:24,324 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
WARN  [qtp499951827-28] 12:04:24,683 org.apache.pdfbox.pdfparser.COSParser Skipped incomplete object stream:108 0 R at 67085
WARN  [qtp499951827-28] 12:04:24,688 org.apache.tika.server.core.resource.TikaResource tika: Text extraction failed (null)
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5ec70124
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.server.core.resource.TikaResource.lambda$produceOutput$2(TikaResource.java:680) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.io.IOException: Page tree root must be a dictionary
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.2-SNAPSHOT.jar:2.4.2-SNAPSHOT]
... 35 more
ERROR [qtp499951827-28] 12:04:24,705 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$349/0x0000000800394c40, ContentType: text/xml
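
The truncation experiment above can also be reproduced without a hex editor. The following is a minimal sketch, assuming a hypothetical helper class TruncateCopy; the number of bytes to keep is arbitrary and is not taken from this thread. It copies the file and cuts the copy to the given length, leaving the original untouched, so the copy can then be sent to tika-server to see whether a similar error shows up.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: writes a truncated copy of a file for reproducing parse errors. */
final class TruncateCopy {

    /**
     * Copies source to target (overwriting target) and cuts the copy to keepBytes.
     * If keepBytes is not smaller than the file size, the copy is left at full length.
     */
    static Path truncatedCopy(Path source, Path target, long keepBytes) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        try (FileChannel ch = FileChannel.open(target, StandardOpenOption.WRITE)) {
            ch.truncate(keepBytes);
        }
        return target;
    }

    /** Usage: TruncateCopy <source> <target> <keepBytes> */
    public static void main(String[] args) throws Exception {
        Path source = Path.of(args[0]);
        Path target = Path.of(args[1]);
        long keepBytes = Long.parseLong(args[2]);
        truncatedCopy(source, target, keepBytes);
        System.out.println(Files.size(source) + " -> " + Files.size(target) + " bytes");
    }
}
```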

On Fri, Jul 15, 2022 at 12:01 PM Tim Allison <ta...@apache.org> wrote:

> If you curl the test file (GetStartedWithSmallpdf.pdf) against your
> tika-server, what do you see?  The test file works for me with
> 2.4.2-SNAPSHOT at least.  Are the files getting truncated somehow?

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tilman Hausherr <TH...@t-online.de>.
I got it to work with tika-server-standard and

curl -X PUT --data-binary @Get_Started_With_Smallpdf.pdf \
    http://localhost:9998/tika --header "Content-type: application/pdf"

and got the extracted text back, with no errors on the console.

Tilman
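
Since a direct curl PUT of the file parses cleanly, one way to narrow things down is to check whether the attachment bytes stored in the mail message still match the original file. The following is only a sketch under assumptions: it assumes the raw message can be read from disk as bytes, and the function name pdf_attachment_digests is made up for illustration. It uses only the Python standard library.

```python
import hashlib
from email import message_from_bytes, policy


def pdf_attachment_digests(raw_message: bytes) -> dict[str, str]:
    """Map each PDF attachment's filename to the SHA-256 of its decoded bytes.

    Hypothetical helper: raw_message is the complete RFC 822 message
    exactly as stored by the mail server.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    digests = {}
    for part in msg.walk():
        if part.get_content_type() == "application/pdf":
            # decode=True undoes the base64 / quoted-printable transfer encoding
            payload = part.get_payload(decode=True)
            name = part.get_filename() or "(unnamed)"
            digests[name] = hashlib.sha256(payload).hexdigest()
    return digests
```

If the digest printed for the attachment differs from the sha256sum of the file that works with curl, the bytes were changed somewhere before they reached tika-server; if the digests are identical, the difference is more likely in how the request is sent.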

On 15.07.2022 at 18:01, Tim Allison wrote:
> If you curl the test file (GetStartedWithSmallpdf.pdf) against your 
> tika-server, what do you see?  The test file works for me with 
> 2.4.2-SNAPSHOT at least.  Are the files getting truncated somehow?

Re: tika-server 2.4.1 'corrupt stream' error scanning attachments, via dovecot fts plugin ?

Posted by Tim Allison <ta...@apache.org>.
If you curl the test file (GetStartedWithSmallpdf.pdf) against your
tika-server, what do you see?  The test file works for me with
2.4.2-SNAPSHOT at least.  Are the files getting truncated somehow?
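
A rough way to check the truncation question on a saved copy of the bytes is sketched below. This is a heuristic, not a PDF validator: it only looks for the %PDF- header, the trailing %%EOF marker, and a startxref offset that lies inside the data. The function name pdf_truncation_hints is an assumption made for this example, and an empty result does not prove the file is intact.

```python
import re


def pdf_truncation_hints(data: bytes) -> list[str]:
    """Return human-readable hints that the PDF bytes may be incomplete."""
    hints = []
    # The header is expected at (or very near) the start of the file.
    if b"%PDF-" not in data[:1024]:
        hints.append("missing %PDF- header")
    tail = data[-1024:]
    # A complete PDF normally ends with an %%EOF marker.
    if b"%%EOF" not in tail:
        hints.append("no %%EOF marker in the last 1024 bytes")
    # The last startxref keyword is followed by the byte offset of the
    # cross-reference section, which must lie inside the file.
    offsets = re.findall(rb"startxref\s+(\d+)", tail)
    if not offsets:
        hints.append("no startxref entry near the end of the data")
    elif int(offsets[-1]) >= len(data):
        hints.append("startxref offset points past the end of the data")
    return hints
```

Calling it on the bytes of the file that was actually submitted, for example pdf_truncation_hints(open("test.pdf", "rb").read()), returns an empty list when none of these simple checks fire.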



On Fri, Jul 15, 2022 at 9:41 AM PGNet Dev <pg...@gmail.com> wrote:

>   i'm running tika-server 2.4.1 on a linux box,
>
>         lsb_release -rd
>                 Description:    Fedora release 36 (Thirty Six)
>                 Release:        36
>
>         uname -rm
>                 5.18.11-200.fc36.x86_64 x86_64
>
>         java -version
>                 Picked up JAVA_TOOL_OPTIONS: -Xmx512M
>                 openjdk version "18.0.1" 2022-04-19
>                 OpenJDK Runtime Environment 22.3 (build 18.0.1+10)
>                 OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10, mixed
> mode, sharing)
>
>
>         ps ax | grep tika-server
>            1003 ?        Ssl    0:12 /usr/bin/java -jar
> /srv/webapps/tika/tika-server.jar -c
> /usr/local/etc/tika/tika-server-config-custom.xml
>            1143 ?        Sl     0:37 /usr/bin/java -Xms1g -Xmx1g
> -Dpdfbox.fontcache=/var/tika -Dlog4j2.info -Djava.awt.headless=true -cp
> /srv/webapps/tika/tika-server.jar -Dtika.server.id=
> org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i  -c
> /usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile
> /tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0
>
> it's invoked from a dovecot imap server instance, for attachment parsing,
>
>         dovecot --version
>                 2.3.19.1 (9b53102964)
>
>         cat dovecot/conf.d/10-master.com
>                 ...
>                 plugin {
>                         ...
>                         fts_tika = http://127.0.0.1:9998/tika/
>                 }
>                 ...
>
> on receipt of an email with a standard attachment/example -- e.g. the
> example pdf @
>
>         https://smallpdf.com/edit-pdf
>
> , per journal logs, the message is submitted to tika, but fails due to a
> 'corrupt stream'
>
>         Jul 15 08:41:27 mx tika[1143]: INFO  [qtp1837533591-27]
> 08:41:27,224 org.apache.tika.server.core.resource.TikaResource /tika
> (application/pdf)
>         Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27]
> 08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the stream
> doesn't point to the correct offset, using workaround to read the stream,
> stream start position: 104315, length: 356, expected end position: 104671
>         Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
> 08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
> corrupt stream due to a DataFormatException
>         Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27]
> 08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the stream
> doesn't point to the correct offset, using workaround to read the stream,
> stream start position: 101699, length: 1472, expected end position: 103171
>         Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
> 08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
> corrupt stream due to a DataFormatException
>         Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27]
> 08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the stream
> doesn't point to the correct offset, using workaround to read the stream,
> stream start position: 101509, length: 66, expected end position: 101575
>         Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
> 08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading
> corrupt stream due to a DataFormatException
>         Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27]
> 08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the stream
> doesn't point to the correct offset, using workaround to read the stream,
> stream start position: 2011, length: 2482, expected end position: 4493
>         Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27]
> 08:41:27,752 org.apache.tika.server.core.resource.TikaResource tika/: Text
> extraction failed (test.pdf)
>         Jul 15 08:41:27 mx tika[1143]:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pdf.PDFParser@356fdbd7
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.Server.handle(Server.java:516)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> java.lang.Thread.run(Thread.java:833) ~[?:?]
>         Jul 15 08:41:27 mx tika[1143]: Caused by: java.io.IOException:
> Page tree root must be a dictionary
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> ~[tika-server-standard-2.4.1.jar:2.4.1]
>         Jul 15 08:41:27 mx tika[1143]:         ... 37 more
>         Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
> 08:41:27,767 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the
> data, class
> org.apache.tika.server.core.resource.TikaResource$$Lambda$337/0x0000000800eabbf8,
> ContentType: text/plain
>
> Is this likely an issue with tika-server itself, and/or with java or dovecot?
>
> What additional diagnostics can help narrow it down?
>