Posted to dev@tika.apache.org by "Rob Tulloh (Created) (JIRA)" <ji...@apache.org> on 2011/12/29 19:21:31 UTC

[jira] [Created] (TIKA-835) TNEF parsing unstable

TNEF parsing unstable
---------------------

                 Key: TIKA-835
                 URL: https://issues.apache.org/jira/browse/TIKA-835
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
         Environment: CentOS 4.x/5.x/6.x 
Java 6
            Reporter: Rob Tulloh


We are seeing problems in Solr where Tika throws exceptions while parsing TNEF content. Sometimes we see an OutOfMemoryError (OOM) like this:

{noformat}
SEVERE: java.lang.OutOfMemoryError: Java heap space
        at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
        at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
        at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
{noformat}

Other times, we see errors like this one:

{noformat}
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
        at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:302)
        at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:53)
        at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
        at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 26 more
{noformat}

I am able to reproduce these failures with tika-app-1.0.jar. I cannot share the content because it is proprietary. The OOM error is particularly problematic: it crashes Solr, and our document indexing pipeline backs up while it waits for Solr to restart. Please also see Solr ticket https://issues.apache.org/jira/browse/SOLR-2990, which contains the original posting of the problem and some details of the environment where the tests are being performed.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-835) TNEF parsing unstable

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177574#comment-13177574 ] 

Nick Burch commented on TIKA-835:
---------------------------------

winmail.dat is a TNEF file, which POI supports through HMEF. All the sample files we have in POI open fine, but it appears they don't cover all the possible cases. (TNEF is only partly documented.)

If you're able to help look into it, then either the dev@poi.apache.org mailing list or a new bug in the POI Bugzilla is the appropriate place to do this (it's a POI issue, not a Tika one).
                

        

[jira] [Commented] (TIKA-835) TNEF parsing unstable

Posted by "Rob Tulloh (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177547#comment-13177547 ] 

Rob Tulloh commented on TIKA-835:
---------------------------------

If you can tell me how to debug this, I'll be glad to try and help you identify the problem. 

I believe the file in question is named winmail.dat, which I understand is a standard kind of Microsoft attachment file. If so, it may be possible to find an example file without my having to disclose the proprietary content.
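
One way to describe a failing file without disclosing its content is to list only the attribute headers. The sketch below assumes the commonly described TNEF layout (a 4-byte signature 0x223E9F78, a 2-byte key, then attributes made of a 1-byte level, 2-byte id, 2-byte type, 4-byte length, payload, and 2-byte checksum, all little-endian). `TnefHeaderDump` and `MAX_PLAUSIBLE_LENGTH` are hypothetical names, not part of POI or Tika, and the cap is an arbitrary assumption.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic: lists TNEF attribute headers, never payload bytes. */
final class TnefHeaderDump {
    static final int TNEF_SIGNATURE = 0x223E9F78;
    /** Assumed sanity cap; anything larger is reported and the walk stops. */
    static final long MAX_PLAUSIBLE_LENGTH = 64L * 1024 * 1024;

    static List<String> dump(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> lines = new ArrayList<>();
        // DataInputStream is big-endian, so reverse bytes for little-endian fields.
        int signature = Integer.reverseBytes(data.readInt());
        if (signature != TNEF_SIGNATURE) {
            throw new IOException("Not a TNEF stream: 0x" + Integer.toHexString(signature));
        }
        data.readUnsignedShort(); // key, not needed here
        long offset = 6;
        int level;
        while ((level = data.read()) != -1) {
            try {
                int id = Short.reverseBytes(data.readShort()) & 0xFFFF;
                int type = Short.reverseBytes(data.readShort()) & 0xFFFF;
                long length = Integer.reverseBytes(data.readInt()) & 0xFFFFFFFFL;
                lines.add(String.format("offset=%d level=%d id=0x%04X type=0x%04X length=%d",
                        offset, level, id, type, length));
                if (length > MAX_PLAUSIBLE_LENGTH) {
                    lines.add("implausible length, stopping");
                    break;
                }
                skipFully(data, length + 2); // payload plus 2-byte checksum
                offset += 1 + 2 + 2 + 4 + length + 2;
            } catch (EOFException e) {
                lines.add("stream truncated inside attribute at offset " + offset);
                break;
            }
        }
        return lines;
    }

    /** Skips exactly count bytes or throws EOFException. */
    private static void skipFully(DataInputStream data, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            int skipped = data.skipBytes((int) Math.min(remaining, Integer.MAX_VALUE));
            if (skipped <= 0) {
                if (data.read() == -1) {
                    throw new EOFException("Truncated attribute");
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
    }
}
```

Calling `TnefHeaderDump.dump(new FileInputStream("winmail.dat"))` and posting the resulting lines would show where the attribute chain stops making sense, without exposing any message content.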
                

        

[jira] [Commented] (TIKA-835) TNEF parsing unstable

Posted by "Rob Tulloh (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177625#comment-13177625 ] 

Rob Tulloh commented on TIKA-835:
---------------------------------

Opened POI ticket: https://issues.apache.org/bugzilla/show_bug.cgi?id=52400
                

        

[jira] [Closed] (TIKA-835) TNEF parsing unstable

Posted by "Rob Tulloh (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Tulloh closed TIKA-835.
---------------------------

    Resolution: Won't Fix

Moving to POI.
                

        

[jira] [Commented] (TIKA-835) TNEF parsing unstable

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177545#comment-13177545 ] 

Nick Burch commented on TIKA-835:
---------------------------------

Without a file, it's going to be very hard for us to identify what's wrong. (It could well be that we're misreading the previous attribute, and then finding junk where the next one should be.)
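
To picture that failure mode: if a length field is read from the wrong position, the value can be arbitrarily large, and allocating a buffer of that size can exhaust the heap before any data is read. A minimal sketch, assuming a 4-byte little-endian length followed by the payload; `BoundedAttributeReader` and `MAX_PAYLOAD_BYTES` are hypothetical names and an assumed cap, not POI code:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper showing a guarded allocation; not part of POI or Tika. */
final class BoundedAttributeReader {
    /** Assumed cap on one attribute payload; tune it to your own data. */
    static final int MAX_PAYLOAD_BYTES = 16 * 1024 * 1024;

    /** Reads a 4-byte little-endian length, then exactly that many payload bytes. */
    static byte[] readPayload(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        // DataInputStream reads big-endian, so reverse to get the little-endian value.
        int length = Integer.reverseBytes(data.readInt());
        if (length < 0 || length > MAX_PAYLOAD_BYTES) {
            // A junk length is rejected here rather than passed to "new byte[length]".
            throw new IOException("Implausible attribute length: " + length);
        }
        byte[] payload = new byte[length];
        data.readFully(payload); // throws EOFException if the stream is truncated
        return payload;
    }
}
```

With a check like this, a misread length surfaces as an ordinary `IOException` that the caller can handle, instead of an `OutOfMemoryError` that takes the whole JVM down.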

Alas, the TNEF format doesn't have nearly as much public documentation as most of the other Microsoft formats, so reverse engineering is often needed (which needs sample files to work against).

Finally, this is a POI bug, so we should move the discussion of how you can identify the problem parts of your file there.
                