Posted to dev@tika.apache.org by ha...@avident-it.se on 2019/11/25 20:08:17 UTC

Re: [EXTERNAL] Tika Python questions

Hi

Time flies. I am just wondering whether there is any news on this.

Is 1.23 being rolled out soon?

 

Kind regards

Hans

 

From: Tim Allison <ta...@apache.org>
Sent: 14 October 2019 13:53
To: hans.meijer@avident-it.se
Cc: dev@tika.apache.org
Subject: Re: [EXTERNAL] Tika Python questions

 

Sorry for the late reply. Once POI is released, we’ll probably roll out 1.23, probably in 3-4 weeks?

 

Fellow devs, WDYT?

 

On Mon, Oct 14, 2019 at 6:55 AM <hans.meijer@avident-it.se> wrote:

Hi,

Sorry to disturb you. I do see the commit, but are there any hints on when it can be released?

I assume it will be in a new version of Apache Tika. The current version seems to be 1.22, so this would be in 1.23?

 

Kind regards

Hans

 

From: Tim Allison <tallison@apache.org>
Sent: 10 October 2019 05:05
To: hans.meijer@avident-it.se
Cc: dev@tika.apache.org
Subject: Re: [EXTERNAL] Tika Python questions

 

Thank you for this report!  I just bumped the max record length for a blob by 10x in POI, which should be released fairly soon.

 

r1868211

 

On Wed, Oct 9, 2019 at 10:20 AM <hans.meijer@avident-it.se> wrote:

Hi,
It is an "old" Excel spreadsheet (.xls) that is causing it. If you would like, I can send the file as well.

I hope this gives you what you need from the tika-server stacktrace:
INFO  rmeta/text (autodetecting type)
WARN  Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1186956, but 1000000 is the maximum for this record type.
If the file is not corrupt, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
        at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
        at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
        at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
        at org.apache.poi.hpsf.Blob.read(Blob.java:33)
        at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:166)
        at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:176)
        at org.apache.poi.hpsf.Property.<init>(Property.java:179)
        at org.apache.poi.hpsf.Section.<init>(Section.java:241)
        at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:497)
        at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:195)
        at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
        at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:232)
        at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
        at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
        at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
        at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
        at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
        at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
        at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.Server.handle(Server.java:505)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
        at java.lang.Thread.run(Thread.java:748)

/Kind regards
Hans

-----Original Message-----
From: Tim Allison <tallison@apache.org>
Sent: 9 October 2019 14:04
To: Luís Filipe Nassif <lfcnassif@gmail.com>
Cc: dev@tika.apache.org; hans.meijer@avident-it.se
Subject: Re: [EXTERNAL] Tika Python questions

Yep, that's why we added those limits.

Hans, if you can send the full stacktrace, that will allow me to see which record type you're hitting this with, and we may be able to increase its limit in POI before the next release.
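
To illustrate the kind of limit being discussed: the guard checks the length a record declares before any memory is allocated for it, so a corrupt or hostile file cannot request an enormous buffer. The following is a minimal Python sketch of that idea, not POI's actual code. The names `safely_allocate`, `RecordFormatError`, and `DEFAULT_MAX_RECORD_LENGTH` are hypothetical; the numbers are taken from the error message quoted in this thread.

```python
class RecordFormatError(ValueError):
    """Raised when a record declares a length above the allowed maximum."""


# Hypothetical default, chosen to match the 1000000 limit in the error message.
DEFAULT_MAX_RECORD_LENGTH = 1_000_000


def safely_allocate(declared_length, max_length=DEFAULT_MAX_RECORD_LENGTH):
    """Allocate a buffer only after validating the declared record length."""
    if declared_length < 0:
        raise RecordFormatError(f"Negative record length: {declared_length}")
    if declared_length > max_length:
        raise RecordFormatError(
            f"Tried to allocate an array of length {declared_length}, "
            f"but {max_length} is the maximum for this record type."
        )
    # The length is within bounds, so allocating is considered safe.
    return bytearray(declared_length)
```

With the default limit, a record of length 1186956 is rejected; raising the limit tenfold (`max_length=10_000_000`) lets the same record through, which is the effect of the change described in this thread.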

On Tue, Oct 8, 2019 at 2:10 PM Luís Filipe Nassif <lfcnassif@gmail.com> wrote:
>
> I think it is not related to file size, but to the maximum record size
> handled by POI. It is a protection against OutOfMemoryErrors. I
> increased this limit to 10M because I was seeing many of them. I do not
> know if it is configurable in tika server.
>
> Regards,
> Luis
>
> On Tue, Oct 8, 2019 at 17:46, Chris Mattmann <mattmann@apache.org>
> wrote:
>
> > Hi,
> >
> >
> >
> > Thanks for your question. Yes, the same way you set the byte size
> > property in Tika-App (I think through parser configuration) is how
> > you would do it for Tika-Server. You would just start the Tika
> > Server yourself with a custom config file that sets this property,
> > and start it on the default port (making sure any other instances
> > are killed first). Tika-Python will then use your own Tika Server
> > with the custom config.
> >
> >
> >
> > As for catching errors, it will try its best to do that, but it does
> > not catch all of them. If you find something it doesn’t catch, let
> > us know and we will work to fix it.
> >
> >
> >
> > Thanks,
> >
> > Chris
> >
> > From: "hans.meijer@avident-it.se" <hans.meijer@avident-it.se>
> > Organization: Avident-IT
> > Date: Tuesday, October 8, 2019 at 6:06 AM
> > To: "Mattmann, Chris A (US 1761)" <chris.a.mattmann@jpl.nasa.gov>
> > Subject: [EXTERNAL] Tika Python questions
> >
> >
> >
> > Hi
> >
> > I have had the pleasure of testing the Tika-python library. I am
> > testing it out in a new application that is being developed for customers.
> >
> > It has very good performance, especially for parsing XLSX and XLS files.
> >
> >
> >
> > However, I have two questions:
> > The Tika-Server only handles files up to a maximum byte size. I get
> > this error:
> > org.apache.poi.util.RecordFormatException: Tried to allocate an
> > array of length 1186956, but 1000000 is the maximum for this record type.
> > If the file is not corrupt, please open an issue on bugzilla to request
> > increasing the maximum allowable size for this record type.
> > As a temporary workaround, consider setting a higher override value
> > with IOUtils.setByteArrayMaxOverride()
> >
> > I have tried the Tika-App python (jar file), and it does handle
> > files that are larger than 1000000.
> >
> > In the Tika documentation it says to set MaxBytes to -1 to override 
> > and handle larger files.
> >
> > Is there any way to handle this via Tika-Python? To set the max file
> > size to unlimited, as the “Tika-App” handles it?
> >
> >
> > How can errors be caught via the Tika-python library, for example
> > when files are encrypted, corrupt, etc.?
> >
> >
> >
> >
> > Kind regards
> >
> >
> >
> > HANS MEIJER
> >
> >
> >
> >
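
Chris's suggestion in this thread (start your own Tika Server, then let the client talk to it) and the question about catching errors can be sketched together. The example below deliberately bypasses Tika-Python and uses only the Python standard library to talk to a server over HTTP, so it makes no claims about Tika-Python's own API. It rests on several assumptions: a tika-server you started yourself is listening on `http://localhost:9998`, the `rmeta/text` endpoint seen in the server log accepts a `PUT` with the file bytes and returns a JSON list of metadata objects, and parse problems show up either as a non-200 HTTP status or as metadata keys starting with `X-TIKA:EXCEPTION`. The function names are hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import List

# Assumption: a tika-server you started yourself (with your own config) listens here.
TIKA_SERVER = "http://localhost:9998"


def build_rmeta_request(data: bytes, server: str = TIKA_SERVER) -> urllib.request.Request:
    """Build a PUT request for the rmeta/text endpoint seen in the server log."""
    return urllib.request.Request(
        url=f"{server.rstrip('/')}/rmeta/text",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def find_parse_problems(status: int, body: str) -> List[str]:
    """Collect readable problems from an HTTP status and an rmeta JSON body."""
    if status != 200:
        # Assumption: failures such as encrypted or corrupt files surface as non-200.
        return [f"HTTP status {status}"]
    problems = []
    for index, metadata in enumerate(json.loads(body)):
        for key, value in metadata.items():
            # Assumption: exceptions are recorded under keys with this prefix.
            if key.startswith("X-TIKA:EXCEPTION"):
                first_line = str(value).split("\n", 1)[0]
                problems.append(f"document {index}: {key}: {first_line}")
    return problems


def parse_with_own_server(path: str, server: str = TIKA_SERVER) -> List[str]:
    """Send a file to the server and return any problems that were detected."""
    with open(path, "rb") as handle:
        request = build_rmeta_request(handle.read(), server)
    try:
        with urllib.request.urlopen(request) as response:
            return find_parse_problems(response.status, response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        return find_parse_problems(error.code, "")
```

Calling `parse_with_own_server("report.xls")` (a hypothetical file name) returns an empty list when nothing was flagged. A warning like the `RecordFormatException` above would be listed only if the server records it in the metadata, which should be checked against the actual server response.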