Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/03/15 06:49:00 UTC

[jira] [Comment Edited] (TIKA-3319) Caused by: java.lang.NullPointerException (and more!)

    [ https://issues.apache.org/jira/browse/TIKA-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301426#comment-17301426 ] 

Tilman Hausherr edited comment on TIKA-3319 at 3/15/21, 6:48 AM:
-----------------------------------------------------------------

Re the warnings, here's what I do:
{noformat}
java -cp "tika-app-1.26-SNAPSHOT.jar;lib/*" org.apache.tika.cli.TikaCLI
{noformat}
and in lib, I have these files:
{noformat}
jai-imageio-core-1.4.0.jar
jai-imageio-jpeg2000-1.4.1-SNAPSHOT.jar
sqlite-jdbc-3.34.0.jar
{noformat}
(The jar names and versions may differ slightly for you because I am using in-development builds, and the main class is probably {{org.apache.tika.gui.TikaGUI}} for you.) For people on Linux who read this: use ":" instead of ";" in the classpath.
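
The separator rule above can be sketched in Python, for anyone launching Tika from a script. This is a minimal sketch, not part of Tika: the helper name build_tika_command is hypothetical, and the jar and class names are simply the ones from the command above. os.pathsep is ";" on Windows and ":" on Linux and macOS.

```python
import os


def build_tika_command(app_jar, lib_dir="lib",
                       main_class="org.apache.tika.cli.TikaCLI",
                       pathsep=os.pathsep):
    """Return the argv list for starting Tika with the extra jars in lib_dir."""
    # "lib/*" is a classpath wildcard that the java launcher expands itself,
    # so it is passed through literally rather than globbed here.
    classpath = pathsep.join([app_jar, f"{lib_dir}/*"])
    return ["java", "-cp", classpath, main_class]


if __name__ == "__main__":
    # Prints the command for the current platform.
    print(" ".join(build_tika_command("tika-app-1.26-SNAPSHOT.jar")))
```

Passing the list to subprocess.run without a shell keeps the "*" from being expanded by the shell before java sees it.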


was (Author: tilman):
Re the warnings, here's what I do:
{noformat}
java -cp "tika-app-1.26-SNAPSHOT.jar;lib/*" org.apache.tika.cli.TikaCLI
{noformat}
and in lib, I have these files:
{noformat}
jai-imageio-core-1.4.0.jar
jai-imageio-jpeg2000-1.4.1-SNAPSHOT.jar
sqlite-jdbc-3.34.0.jar
{noformat}
(The jar names and versions may differ slightly for you because I am using in-development builds, and the main class is probably TikaGUI for you.) For people on Linux who read this: use ":" instead of ";" in the classpath.

> Caused by: java.lang.NullPointerException (and more!)
> -----------------------------------------------------
>
>                 Key: TIKA-3319
>                 URL: https://issues.apache.org/jira/browse/TIKA-3319
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.24.1
>         Environment: Windows 10
> Tika 1.24.1.jar
> Tika 1.24 python module
> python 3.9.2
> tesseract-ocr-w64-setup-v5.0.0-alpha.20201127
> (anything else that may be relevant?)
>            Reporter: Richard Kraus
>            Priority: Major
>
> So...in sum
>  1) it somehow doesn't "point" to a parser? (but it kinda does...)
>  2) it says that I'm excluding tesseract from tika....I don't know how this happened to begin with
>  3) and now...urllib in python by using the tika package suddenly can't figure out tika exists...
> Please assist. Thank you in advance. 
> 01 Tika-1.24.1.jar and 1.24 python module have been running well for months on my machine.
>  02 Then I get tesseract and a couple other things to integrate with it.
>  03 Then I upgrade python from 3.8.2 to 3.9.2
>  04 So I have always set the windows 10 $env: variable to something like TIKA_SERVER_JAR="<yourpath>/tika-server.jar"
>  05 Then I run the tika python module. I get this urllib problem....
>  urllib.error.URLError: <urlopen error unknown url type: c>
>  06 Supposedly this is fixed by setting the $env: variable to something like...
>  TIKA_SERVER_JAR="file:///<yourpath>/tika-server.jar"
>  07 So I do this and mess around with it; no dice.
>  08 So then I'm trying to run Tika on powershell right?
>  java -jar "C:\PATH\TO\tika-app-1.24.1.jar" --gui
>  brings up the gui but it gives me these "Warnings" now...
>  
> {quote}Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
>  See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]
>  for optional dependencies.
> Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
>  you've excluded the TesseractOCRParser from the default parser.
>  Tesseract may dramatically slow down content extraction (TIKA-2359).
>  As of Tika 1.15 (and prior versions), Tesseract is automatically called.
>  In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
>  Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
> {quote}
> 09 so now when I try to use the --gui to parse a file I have parsed before it shows this message...
>  
> {quote}Apache Tika was unable to parse the document at C:\CODING\Apache Tika\Test03.pdf.
>  The full exception stack trace is included below:
>  org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>  at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
>  at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
>  at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
>  at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:267)
>  at java.desktop/javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1967)
>  at java.desktop/javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2308)
>  at java.desktop/javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:405)
>  at java.desktop/javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:262)
>  at java.desktop/javax.swing.AbstractButton.doClick(AbstractButton.java:369)
>  at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1020)
>  at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1064)
>  at java.desktop/java.awt.Component.processMouseEvent(Component.java:6636)
>  at java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342)
>  at java.desktop/java.awt.Component.processEvent(Component.java:6401)
>  at java.desktop/java.awt.Container.processEvent(Container.java:2263)
>  at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5012)
>  at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321)
>  at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
>  at java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4919)
>  at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4548)
>  at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4489)
>  at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307)
>  at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2764)
>  at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
>  at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772)
>  at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721)
>  at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715)
>  at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
>  at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
>  at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95)
>  at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745)
>  at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743)
>  at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
>  at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
>  at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742)
>  at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203)
>  at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124)
>  at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113)
>  at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109)
>  at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
>  at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90)
>  Caused by: java.lang.NullPointerException
>  at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractXMPXFA(AbstractPDF2XHTML.java:209)
>  at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:678)
>  at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267)
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96)
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  ... 44 more
> {quote}
> 10 most notably these lines...
> {quote}A) org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
>  B) Caused by: java.lang.NullPointerException
> {quote}
> 11 now here's my java -jar tika-app-1.24.1.jar --dump-current-config
> {quote}Mar 14, 2021 10:15:23 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
>  See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]
>  for optional dependencies.
> Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
>  you've excluded the TesseractOCRParser from the default parser.
>  Tesseract may dramatically slow down content extraction (TIKA-2359).
>  As of Tika 1.15 (and prior versions), Tesseract is automatically called.
>  In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
>  Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
>  <?xml version="1.0" encoding="UTF-8" standalone="no"?>
>  <properties>
>    <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
>    <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
>    <encodingDetectors>
>      <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
>    </encodingDetectors>
>    <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
>    <detectors>
>      <detector class="org.apache.tika.detect.DefaultDetector"/>
>    </detectors>
>    <parsers>
>      <parser class="org.apache.tika.parser.DefaultParser"/>
>    </parsers>
>  </properties>
> {quote}
> 12 any help would be greatly appreciated. 
>  13A the odd thing is when I run something like...
>  java -jar tika-app-1.24.1.jar -t Test03.pdf output.txt
> 13B it will print the document text in powershell then print this below it (which I have never gotten before)...
> {quote}Exception in thread "main" java.net.MalformedURLException: no protocol: output.txt
>  at java.base/java.net.URL.<init>(URL.java:672)
>  at java.base/java.net.URL.<init>(URL.java:568)
>  at java.base/java.net.URL.<init>(URL.java:515)
>  at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:488)
>  at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
> {quote}
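
On item 13: the stack trace shows TikaCLI.process handing output.txt to java.net.URL, which suggests the CLI treats every non-option argument as an input (an existing file, or else a URL) and prints its result on standard output. A minimal Python sketch, assuming that behaviour, which captures standard output into a file instead of passing the output name as an argument. The helper names are hypothetical, and the jar and document names are the ones from the report.

```python
import subprocess


def tika_text_command(app_jar, document, java="java"):
    """Build the argv for plain-text extraction; only the input document is passed."""
    return [java, "-jar", app_jar, "-t", document]


def extract_text(app_jar, document, output_path, java="java"):
    """Run Tika and write whatever it prints on stdout to output_path."""
    with open(output_path, "wb") as out:
        # stdout=out plays the role of "> output.txt" in a shell.
        # check=True raises CalledProcessError if java exits with an error.
        subprocess.run(tika_text_command(app_jar, document, java),
                       stdout=out, check=True)
```

Called as extract_text("tika-app-1.24.1.jar", "Test03.pdf", "output.txt"), this is roughly what a shell redirect (java -jar tika-app-1.24.1.jar -t Test03.pdf > output.txt) does.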
>  
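
On items 05 to 07: the "unknown url type: c" message is consistent with a bare Windows path being parsed as a URL, because the drive letter in front of the colon is read as the URL scheme. The sketch below shows that with the standard library and builds a file:/// URL from the same path. The path is a placeholder and the helper name windows_path_to_file_url is hypothetical; whether the tika Python module then accepts such a URL in TIKA_SERVER_JAR is the reporter's assumption and is not verified here.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote, urlparse


def windows_path_to_file_url(path):
    """Turn a Windows drive path such as C:\\dir\\x.jar into a file:/// URL."""
    posix_style = PureWindowsPath(path).as_posix()  # backslashes become forward slashes
    # Keep "/" and the drive colon as they are; percent-encode spaces and the like.
    return "file:///" + quote(posix_style, safe="/:")


jar_path = r"C:\PATH\TO\tika-server.jar"  # placeholder path, not a real location

# The drive letter before ":" is taken as the scheme of the "URL".
print(urlparse(jar_path).scheme)

jar_url = windows_path_to_file_url(jar_path)
print(jar_url)
print(urlparse(jar_url).scheme)
```

The first print shows the scheme "c", matching the error text; the converted value is file:///C:/PATH/TO/tika-server.jar, whose scheme is "file".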



--
This message was sent by Atlassian Jira
(v8.3.4#803005)