Posted to users@pdfbox.apache.org by PGNet Dev <pg...@gmail.com> on 2022/07/19 22:45:59 UTC

finding/setting correct breakpoint in jdb debugging of pdfbox/pdfparser in tika?

hi,

i'm debugging a problem with email attachment scanning by tika-server.

dovecot imap server receives email+attachment, then hands off the attachment (modified, or unmodified, dunno yet) via its 'fts-tika' plugin.

with

	dovecot 2.3.19.1
	tika 2.4.2-snapshot
	openjdk version "18.0.1" 2022-04-19

this used to work with earlier versions (haven't bisected the problem yet).

with that^ version mix, it's failing.

it appears to be failing @ ~ PDFParser.

i've been trying to debug in this thread,

	https://lists.apache.org/thread/pztsq8tb8xqz3s4kmjpnt9p3zt07y05k

but have hit a current (temporary?) impasse.

at both Tika & Dovecot mailing lists, it's suggested to capture the /tmp/file @ failure.

to do so, i've -- per instruction -- set a jdb bkpt @

	org.apache.tika.parser.pdf.PDFParser

, but on exec, the errant file's not persisted

one suggestion as to why not is,

	"If the debugger didn't stop, then the breakpoint was at the wrong place. Or it's not possible to debug."

seems *really* odd that it can't be debugged ... thought best to ask _here_ first.

Q:

	IS it possible to debug?

	what's the RIGHT breakpoint to set to make sure to halt, & catch the tmp file?


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: finding/setting correct breakpoint in jdb debugging of pdfbox/pdfparser in tika?

Posted by Tilman Hausherr <TH...@t-online.de>.
Within pdfbox, set a breakpoint here:


If execution stops at that point, then the file still exists.
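
While the JVM is halted at the breakpoint, the temp file can be copied from a second shell before it is cleaned up. A minimal sketch, assuming the file sits directly under /tmp and was written within the last few minutes. The function name `save_recent_tmp` and the destination directory are hypothetical, not part of Tika or PDFBox:

```shell
# Copy regular files modified within the last N minutes from one directory
# into a holding directory, so they survive any later cleanup.
# Usage: save_recent_tmp SRC DEST [MINUTES]   (hypothetical helper)
save_recent_tmp() {
	src="$1"
	dest="$2"
	minutes="${3:-5}"
	mkdir -p "$dest" || return 1
	# -maxdepth and -mmin are GNU/BSD find extensions, not strict POSIX
	find "$src" -maxdepth 1 -type f -mmin "-$minutes" -exec cp -p {} "$dest"/ \;
}
```

For example, `save_recent_tmp /tmp /srv/tika/captured 5` run as a user that can read the file, followed by `cont` in jdb to let the server continue. Matching on modification time rather than on a file name avoids guessing the temp file's name; the Tika debug output should identify which of the copied files is the one of interest.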
Tilman

On 20.07.2022 at 12:32, PGNet Dev wrote:
> On 7/19/22 10:46 PM, Tilman Hausherr wrote:
>> You can also put a breakpoint in PDFBox, then go to
>> org.apache.pdfbox.pdfparser.PDFParser.parse()
>> and when it does breakpoint-stop there (it definitely passes that
>> point!), then look into your /tmp directory for the file that is 
>> mentioned in the tika debug output and copy it somewhere else.
>
> the bkpt guessing and various builds i've attempted (other thread) 
> haven't been fruitful
>
> at this point, it'd be helpful to be specific about the correct 
> breakpoint
>
> what do you explicitly intend for "set a breakpoint" and "go to"?
>
> atm, i DL
>
>     D="/srv/tika"
>     F="tika-server-standard-2.4.2-20220720.025305-98.jar"
>     cd ${D}
>     rm -rf TMP
>     mkdir -p TMP/mod
>     cd TMP
>     rm -f ${F}*
>     wget https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server-standard/2.4.2-SNAPSHOT/${F}
>     cd mod
>
> extract & turn debug logging ON
>
>     jar -xfv ../${F}
>     perl -pi -e 's|Root level="info"|Root level="debug"|g' log4j2.xml
>
> repack
>
>     jar -cvmf META-INF/MANIFEST.MF ../mod.jar *
>
> get main class
>
>     cd ${D}
>     unzip -p TMP/mod.jar META-INF/MANIFEST.MF | grep Main-Class
>         Main-Class: org.apache.tika.server.core.TikaServerCli
>
> launch under jdb
>
>     sudo -u tika /usr/bin/jdb \
>      -classpath /srv/tika/TMP/mod.jar \
>      org.apache.tika.server.core.TikaServerCli \
>      -c /etc/tika/tika-server-config-custom.xml
>
> now, what specific breakpoint(s) to set here
>
>     > stop in ???
>
> so that on
>
>     > run
>
> and stop on/after email receipt + failed scan, I will find that 
> trapped file in /tmp ?
>

Re: finding/setting correct breakpoint in jdb debugging of pdfbox/pdfparser in tika?

Posted by PGNet Dev <pg...@gmail.com>.
On 7/19/22 10:46 PM, Tilman Hausherr wrote:
> You can also put a breakpoint in PDFBox, then go to
> org.apache.pdfbox.pdfparser.PDFParser.parse()
> and when it does breakpoint-stop there (it definitely passes that point!), then look into your /tmp directory for the file that is mentioned in the tika debug output and copy it somewhere else.

the bkpt guessing and various builds i've attempted (other thread) haven't been fruitful

at this point, it'd be helpful to be specific about the correct breakpoint

what do you explicitly intend for "set a breakpoint" and "go to"?

atm, i DL

	D="/srv/tika"
	F="tika-server-standard-2.4.2-20220720.025305-98.jar"
	cd ${D}
	rm -rf TMP
	mkdir -p TMP/mod
	cd TMP
	rm -f ${F}*
	wget https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server-standard/2.4.2-SNAPSHOT/${F}
	cd mod

extract & turn debug logging ON

	jar -xfv ../${F}
	perl -pi -e 's|Root level="info"|Root level="debug"|g' log4j2.xml
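
The perl substitution reports no error when the pattern does not match, so the repacked jar could still log at the old level. A small check to run before repacking, assuming log4j2.xml is in the current directory as in the command above. `root_is_debug` is a hypothetical helper name:

```shell
# Return 0 if the given log4j2 config sets the Root logger level to "debug",
# non-zero otherwise. (hypothetical helper, for illustration only)
root_is_debug() {
	grep -q 'Root level="debug"' "$1"
}
```

For example, `root_is_debug log4j2.xml || echo "Root level not changed" >&2` run in TMP/mod right after the perl step.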

repack

	jar -cvmf META-INF/MANIFEST.MF ../mod.jar *

get main class

	cd ${D}
	unzip -p TMP/mod.jar META-INF/MANIFEST.MF | grep Main-Class
		Main-Class: org.apache.tika.server.core.TikaServerCli

launch under jdb

	sudo -u tika /usr/bin/jdb \
	 -classpath /srv/tika/TMP/mod.jar \
	 org.apache.tika.server.core.TikaServerCli \
	 -c /etc/tika/tika-server-config-custom.xml

now, what specific breakpoint(s) to set here

	> stop in ???

so that on

	> run

and stop on/after email receipt + failed scan, I will find that trapped file in /tmp ?






Re: finding/setting correct breakpoint in jdb debugging of pdfbox/pdfparser in tika?

Posted by Tilman Hausherr <TH...@t-online.de>.
I'm also here 😂

You can also put a breakpoint in PDFBox, then go to
org.apache.pdfbox.pdfparser.PDFParser.parse()
and when it does breakpoint-stop there (it definitely passes that
point!), then look into your /tmp directory for the file that is 
mentioned in the tika debug output and copy it somewhere else.
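
In jdb terms, `stop in` expects a class name followed by a method name, so the method above would be given as `org.apache.pdfbox.pdfparser.PDFParser.parse`. A minimal sketch that writes this as a jdb startup command, relying on jdb reading `jdb.ini` or `.jdbrc` from user.home or the working directory. The helper name `write_jdbrc` is hypothetical, and whether the breakpoint is reached still depends on that class being loaded in the JVM that jdb controls:

```shell
# Write a jdb startup file into the given directory. jdb reads startup
# commands from "jdb.ini" or ".jdbrc" in user.home or the working directory.
# Usage: write_jdbrc DIR   (hypothetical helper, for illustration only)
write_jdbrc() {
	dir="$1"
	# The breakpoint is deferred until the class is loaded; type "run"
	# at the jdb prompt afterwards to start the server.
	cat > "$dir/.jdbrc" <<'EOF'
stop in org.apache.pdfbox.pdfparser.PDFParser.parse
EOF
}
```

For example, `write_jdbrc /srv/tika`, then launch jdb from that directory and type `run`. The same `stop in` line can also be typed directly at the jdb prompt before `run`. A breakpoint given as a bare class name, with no method, gives jdb no method to stop in, which may be one reason an earlier attempt never halted.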

Tilman

On 20.07.2022 at 00:45, PGNet Dev wrote:
> hi,
>
> i'm debugging a problem with email attachment scanning by tika-server.
>
> dovecot imap server receives email+attachment, then hands off the 
> attachment (modified, or unmodified, dunno yet) via its 'fts-tika' 
> plugin.
>
> with
>
>     dovecot 2.3.19.1
>     tika 2.4.2-snapshot
>     openjdk version "18.0.1" 2022-04-19
>
> this used to work with earlier versions (haven't bisected the problem 
> yet).
>
> with that^ version mix, it's failing.
>
> it appears to be failing @ ~ PDFParser.
>
> i've been trying to debug in this thread,
>
>     https://lists.apache.org/thread/pztsq8tb8xqz3s4kmjpnt9p3zt07y05k
>
> but have hit a current (temporary?) impasse.
>
> at both Tika & Dovecot mailing lists, it's suggested to capture the 
> /tmp/file @ failure.
>
> to do so, i've -- per instruction -- set a jdb bkpt @
>
>     org.apache.tika.parser.pdf.PDFParser
>
> , but on exec, the errant file's not persisted
>
> one suggestion as to why not is,
>
>     "If the debugger didn't stop, then the breakpoint was at the wrong 
> place. Or it's not possible to debug."
>
> seems *really* odd that it can't be debugged ... thought best to ask 
> _here_ first.
>
> Q:
>
>     IS it possible to debug?
>
>     what's the RIGHT breakpoint to set to make sure to halt, & catch 
> the tmp file?
>

