Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/02/01 18:34:00 UTC

[jira] [Comment Edited] (TIKA-3657) Microsoft documents are not text parsed when running under Docker

    [ https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485409#comment-17485409 ] 

Tim Allison edited comment on TIKA-3657 at 2/1/22, 6:33 PM:
------------------------------------------------------------

I physically removed a detector from the jar/war, hoping that would prevent the classes after it from loading, but it doesn't.  I also configured a misspelled detector with the same hope, and that doesn't prevent it either.

If I set a value > Integer.MAX_VALUE in the config file, I get a visible error rather than a silent class-loading failure:

{noformat}
Root Cause
java.lang.NumberFormatException: For input string: "13423423424322217728"
	java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	java.base/java.lang.Integer.parseInt(Integer.java:652)
	java.base/java.lang.Integer.<init>(Integer.java:1105)
	java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
	org.apache.tika.config.Param.getTypedValue(Param.java:282)
	org.apache.tika.config.Param.load(Param.java:188)
	org.apache.tika.config.TikaConfig$XmlLoader.getParams(TikaConfig.java:793)
	org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:682)
	org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:621)
	org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:155)
	org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:141)
	org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:133)
	org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:129)
	MyServlet.doPut(MyServlet.java:47)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:684)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:764)
	org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
{noformat}
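
For illustration only, here is a minimal standalone sketch of the failure in that trace. It assumes, based on the frames alone, that the value is turned into an Integer by reflectively calling the Integer(String) constructor; the class and method names (IntegerParamOverflow, toInteger, describe) are hypothetical and are not Tika code.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;

public class IntegerParamOverflow {

    // Hypothetical helper (not Tika's Param class): builds an Integer from its
    // string form via the reflective Integer(String) constructor, which is
    // what the Constructor.newInstance and Integer.<init> frames suggest.
    static Integer toInteger(String raw) throws ReflectiveOperationException {
        Constructor<Integer> ctor = Integer.class.getConstructor(String.class);
        return ctor.newInstance(raw);
    }

    // Returns a short description of the outcome instead of throwing.
    static String describe(String raw) {
        try {
            return "parsed " + toInteger(raw);
        } catch (InvocationTargetException e) {
            // Reflection wraps the constructor's exception; unwrap it.
            Throwable cause = e.getCause();
            return cause.getClass().getSimpleName() + ": " + cause.getMessage();
        } catch (ReflectiveOperationException e) {
            return "reflection failed: " + e;
        }
    }

    public static void main(String[] args) {
        // Integer.MAX_VALUE itself still fits in an int.
        System.out.println(describe(String.valueOf(Integer.MAX_VALUE)));
        // The value from the trace does not fit, so parsing fails loudly.
        System.out.println(describe("13423423424322217728"));
    }
}
```

The point of the sketch is only that an out-of-range integer string fails with an exception at parse time, so this kind of config error is visible rather than silent.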

If I misspell the DefaultZipContainerDetector, it simply doesn't load, but the file is still parsed by the PackageParser and then the XML parser, so there's still a bunch of content.
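
As a generic illustration of how a misspelled class name can go unnoticed, here is a minimal sketch of a lenient by-name loader. This is not Tika's loading code; LenientLoader, loadAll and the deliberately misspelled name are hypothetical, and whether Tika behaves this way internally is an assumption, not something confirmed here.

```java
import java.util.ArrayList;
import java.util.List;

public class LenientLoader {

    // Hypothetical lenient loader: resolves each configured class name and
    // records names that cannot be found instead of failing the whole load.
    static List<Class<?>> loadAll(List<String> classNames, List<String> skipped) {
        List<Class<?>> loaded = new ArrayList<>();
        for (String name : classNames) {
            try {
                loaded.add(Class.forName(name));
            } catch (ClassNotFoundException e) {
                // A fully silent loader would drop the name here without a trace;
                // this sketch records it so the caller can report it.
                skipped.add(name);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        // The second name is misspelled on purpose.
        List<Class<?>> loaded = loadAll(
                List.of("java.util.ArrayList", "java.util.ArrayLisst"), skipped);
        System.out.println("loaded:  " + loaded);
        System.out.println("skipped: " + skipped);
    }
}
```

With a loader like this, the remaining classes still load and processing continues, which would match the observation that other components keep producing content.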



> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
>                 Key: TIKA-3657
>                 URL: https://issues.apache.org/jira/browse/TIKA-3657
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, depedency
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Tim Barrett
>            Priority: Major
>             Fix For: 2.2.2
>
>         Attachments: scenario traces.txt, tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor = new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class, nalyticsEmbeddedDocumentExtractor);
> This all works fine for us and has been used in production for a few years. It also works under Tika 2.2.0 when running in development environments (Eclipse, Apache Tomcat). However, when running under Docker, the text within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under Docker, the Microsoft documents are fully parsed, so this problem was introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via context.set, the same problem occurs. The same problem also occurs if the standard Tika Embedded Document Extractor is used. Our Docker image contains our application's code, which uses Tika, as well as Apache DS. The problem occurs when running Docker on Ubuntu, Mac OS and Windows.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)