Posted to corpora-dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2020/07/29 16:35:36 UTC

Re: Error when parsing of Excel files

> as for files, in my case they are from customer and I don't want to share
> them.

https://corpora.tika.apache.org/datasette/corpora-metadata?sql=select+file_path%2C+orig_stack_trace%0D%0Afrom+containers+c%0D%0Ajoin+profiles+p+on+p.container_id%3Dc.container_id%0D%0Ajoin+PARSE_EXCEPTIONS+e+on+p.id%3De.id%0D%0Awhere+orig_stack_trace+like+%27%250x203%25%27%0D%0Aorder+by+file_path+limit+101

triggering file available here:
https://corpora.tika.apache.org/base/docs/bug_trackers/poi/POI-47251-4.xls

Victory for our regression corpus!
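
The datasette link above carries its SQL in the URL-encoded `sql` parameter, which is hard to read as posted. Here is a small stand-alone sketch that decodes it. The class name DecodeCorpusQuery is only for illustration, and URLDecoder.decode(String, Charset) assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class DecodeCorpusQuery {
    // Value of the `sql` query parameter, copied verbatim from the link above.
    static final String ENCODED_SQL =
        "select+file_path%2C+orig_stack_trace%0D%0Afrom+containers+c%0D%0A"
        + "join+profiles+p+on+p.container_id%3Dc.container_id%0D%0A"
        + "join+PARSE_EXCEPTIONS+e+on+p.id%3De.id%0D%0A"
        + "where+orig_stack_trace+like+%27%250x203%25%27%0D%0A"
        + "order+by+file_path+limit+101";

    static String decode(String encoded) {
        // URLDecoder maps '+' to a space and %XX escapes to characters;
        // the CRLF line breaks are normalised to '\n' for printing.
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8)
                         .replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        System.out.println(decode(ENCODED_SQL));
    }
}
```

The decoded query joins containers, profiles and PARSE_EXCEPTIONS, and keeps rows whose orig_stack_trace contains 0x203.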

On Wed, Jul 29, 2020 at 12:14 PM Slava G <sl...@gmail.com> wrote:

> Thanks Tim.
> Will do, as for files, in my case they are from customer and I don't want
> to share them.
> Thanks
>
> On Wed, Jul 29, 2020, 19:06 Tim Allison <ta...@apache.org> wrote:
>
>> Looks like I identified that one in our regression
>> corpus here: https://bz.apache.org/bugzilla/show_bug.cgi?id=60833#c10
>>
>> Please open an issue on POI's bug tracker.  If you need an example file,
>> we can dig one up.
>>
>> On Wed, Jul 29, 2020 at 10:10 AM Slava G <sl...@gmail.com> wrote:
>>
>>> Hi,
>>> I have some Excel files that open fine in Excel or in Numbers on Mac, but
>>> TIKA (in-process and app) throws an exception:
>>>
>>>
>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@408ae0bf
>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>>> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>>> at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
>>> at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
>>> at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
>>> at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:267)
>>> at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2022)
>>> at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2348)
>>> at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402)
>>> at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259)
>>> at javax.swing.AbstractButton.doClick(AbstractButton.java:376)
>>> at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:842)
>>> at com.apple.laf.AquaMenuItemUI.doClick(AquaMenuItemUI.java:157)
>>> at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:886)
>>> at java.awt.Component.processMouseEvent(Component.java:6539)
>>> at javax.swing.JComponent.processMouseEvent(JComponent.java:3324)
>>> at java.awt.Component.processEvent(Component.java:6304)
>>> at java.awt.Container.processEvent(Container.java:2239)
>>> at java.awt.Component.dispatchEventImpl(Component.java:4889)
>>> at java.awt.Container.dispatchEventImpl(Container.java:2297)
>>> at java.awt.Component.dispatchEvent(Component.java:4711)
>>> at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4904)
>>> at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4535)
>>> at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4476)
>>> at java.awt.Container.dispatchEventImpl(Container.java:2283)
>>> at java.awt.Window.dispatchEventImpl(Window.java:2746)
>>> at java.awt.Component.dispatchEvent(Component.java:4711)
>>> at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:760)
>>> at java.awt.EventQueue.access$500(EventQueue.java:97)
>>> at java.awt.EventQueue$3.run(EventQueue.java:709)
>>> at java.awt.EventQueue$3.run(EventQueue.java:703)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
>>> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:84)
>>> at java.awt.EventQueue$4.run(EventQueue.java:733)
>>> at java.awt.EventQueue$4.run(EventQueue.java:731)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
>>> at java.awt.EventQueue.dispatchEvent(EventQueue.java:730)
>>> at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:205)
>>> at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
>>> at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
>>> at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
>>> at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
>>> at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
>>> Caused by: org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException: Initialisation of record 0x203(NumberRecord) left 4 bytes remaining still to be read.
>>> at org.apache.poi.hssf.record.RecordInputStream.hasNextRecord(RecordInputStream.java:188)
>>> at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:235)
>>> at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:168)
>>> at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:129)
>>> at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:343)
>>> at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:172)
>>> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
>>> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>> ... 46 more
>>>
>>> I'm using TIKA 1.24.1 but it also happens in the previous version.
>>> Thanks
>>>
>>>
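
For anyone reading the "left 4 bytes remaining" message cold: in the BIFF8 (.xls) format each record starts with a 2-byte type id and a 2-byte payload length, and a NUMBER record (0x0203) payload is normally 14 bytes (row, column, XF index, 8-byte double). The message suggests that the record in the failing file declares 4 more bytes than the parser consumes. The sketch below is not POI code. It is a stand-alone illustration with made-up bytes, and the names NumberRecordLeftover, sampleRecord and leftoverAfterNumber are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class NumberRecordLeftover {
    static final int NUMBER_SID = 0x0203;
    // Bytes a NUMBER payload normally holds: 2 row + 2 column + 2 XF index + 8 double.
    static final int NUMBER_PAYLOAD = 14;

    /** Builds a synthetic NUMBER record whose header declares the given payload length. */
    static byte[] sampleRecord(int declaredLength) {
        if (declaredLength < NUMBER_PAYLOAD) {
            throw new IllegalArgumentException("payload too short for a NUMBER record");
        }
        ByteBuffer buf = ByteBuffer.allocate(4 + declaredLength).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) NUMBER_SID);      // record type id
        buf.putShort((short) declaredLength);  // declared payload length
        buf.putShort((short) 0);               // row
        buf.putShort((short) 0);               // column
        buf.putShort((short) 15);              // XF (cell format) index, arbitrary here
        buf.putDouble(42.0);                   // cell value
        return buf.array();                    // any extra declared bytes stay zero
    }

    /** Reads the known fields and returns how many declared payload bytes were not consumed. */
    static int leftoverAfterNumber(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN);
        int sid = buf.getShort() & 0xFFFF;
        int declared = buf.getShort() & 0xFFFF;
        if (sid != NUMBER_SID) {
            throw new IllegalArgumentException("not a NUMBER record: 0x" + Integer.toHexString(sid));
        }
        int start = buf.position();
        buf.getShort();   // row
        buf.getShort();   // column
        buf.getShort();   // XF index
        buf.getDouble();  // value
        int consumed = buf.position() - start;
        return declared - consumed;
    }

    public static void main(String[] args) {
        System.out.println(leftoverAfterNumber(sampleRecord(14)));
        System.out.println(leftoverAfterNumber(sampleRecord(18)));
    }
}
```

A well-formed record leaves nothing unread. A record that declares 18 bytes leaves 4, which matches the count in the exception message above. Whether the real file has exactly this layout is an assumption.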

Re: Error when parsing of Excel files

Posted by Slava G <sl...@gmail.com>.
Thanks Tim,
I see there's a bug that has been open since 2017, so I'll vote for it, but I
don't think opening a new one will help.
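
Until that bug is fixed, one way to keep a batch run going is to recognise this particular failure and set the file aside instead of aborting. A minimal sketch follows, assuming the caller catches the wrapper exception and inspects its root cause. It uses stand-in exception classes so it runs without Tika or POI on the classpath, and it matches on the class name seen in the trace (RecordInputStream$LeftoverDataException). The names RootCauseCheck and isLeftoverRecordData are hypothetical.

```java
final class RootCauseCheck {
    /** Stand-in for POI's nested LeftoverDataException, only so the demo is self-contained. */
    static class LeftoverDataException extends RuntimeException {
        LeftoverDataException(String message) { super(message); }
    }

    /** Follows getCause() to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** True when the innermost exception's class name ends with LeftoverDataException. */
    static boolean isLeftoverRecordData(Throwable t) {
        // Matching by name avoids a compile-time dependency on POI in this sketch.
        return rootCause(t).getClass().getName().endsWith("LeftoverDataException");
    }

    public static void main(String[] args) {
        Throwable cause = new LeftoverDataException(
            "Initialisation of record 0x203(NumberRecord) left 4 bytes remaining still to be read.");
        // Stand-in for the TikaException wrapper shown in the trace.
        Exception wrapper = new Exception("Unexpected RuntimeException", cause);

        System.out.println(isLeftoverRecordData(wrapper));
        System.out.println(isLeftoverRecordData(new Exception("some other failure")));
    }
}
```

In real code the check would sit in the catch block around the parse call, and files that match could be logged for later reprocessing.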

On Wed, Jul 29, 2020 at 7:35 PM Tim Allison <ta...@apache.org> wrote:

> > as for files, in my case they are from customer and I don't want to
> > share them.
>
>
> https://corpora.tika.apache.org/datasette/corpora-metadata?sql=select+file_path%2C+orig_stack_trace%0D%0Afrom+containers+c%0D%0Ajoin+profiles+p+on+p.container_id%3Dc.container_id%0D%0Ajoin+PARSE_EXCEPTIONS+e+on+p.id%3De.id%0D%0Awhere+orig_stack_trace+like+%27%250x203%25%27%0D%0Aorder+by+file_path+limit+101
>
> triggering file available here:
> https://corpora.tika.apache.org/base/docs/bug_trackers/poi/POI-47251-4.xls
>
> Victory for our regression corpus!
>
> On Wed, Jul 29, 2020 at 12:14 PM Slava G <sl...@gmail.com> wrote:
>
>> Thanks Tim.
>> Will do, as for files, in my case they are from customer and I don't want
>> to share them.
>> Thanks
>>
>> On Wed, Jul 29, 2020, 19:06 Tim Allison <ta...@apache.org> wrote:
>>
>>> Looks like I identified that one in our regression
>>> corpus here: https://bz.apache.org/bugzilla/show_bug.cgi?id=60833#c10
>>>
>>> Please open an issue on POI's bug tracker.  If you need an example file,
>>> we can dig one up.
>>>
>>> On Wed, Jul 29, 2020 at 10:10 AM Slava G <sl...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I have some Excel files that open fine in Excel or in Numbers on Mac, but
>>>> TIKA (in-process and app) throws an exception:
>>>>
>>>>
>>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
>>>> from org.apache.tika.parser.microsoft.OfficeParser@408ae0bf
>>>> at
>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
>>>> at
>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>> at
>>>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>>>> at
>>>> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>>>> at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
>>>> at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
>>>> at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
>>>> at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:267)
>>>> at
>>>> javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2022)
>>>> at
>>>> javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2348)
>>>> at
>>>> javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402)
>>>> at
>>>> javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259)
>>>> at javax.swing.AbstractButton.doClick(AbstractButton.java:376)
>>>> at
>>>> javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:842)
>>>> at com.apple.laf.AquaMenuItemUI.doClick(AquaMenuItemUI.java:157)
>>>> at
>>>> javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:886)
>>>> at java.awt.Component.processMouseEvent(Component.java:6539)
>>>> at javax.swing.JComponent.processMouseEvent(JComponent.java:3324)
>>>> at java.awt.Component.processEvent(Component.java:6304)
>>>> at java.awt.Container.processEvent(Container.java:2239)
>>>> at java.awt.Component.dispatchEventImpl(Component.java:4889)
>>>> at java.awt.Container.dispatchEventImpl(Container.java:2297)
>>>> at java.awt.Component.dispatchEvent(Component.java:4711)
>>>> at
>>>> java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4904)
>>>> at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4535)
>>>> at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4476)
>>>> at java.awt.Container.dispatchEventImpl(Container.java:2283)
>>>> at java.awt.Window.dispatchEventImpl(Window.java:2746)
>>>> at java.awt.Component.dispatchEvent(Component.java:4711)
>>>> at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:760)
>>>> at java.awt.EventQueue.access$500(EventQueue.java:97)
>>>> at java.awt.EventQueue$3.run(EventQueue.java:709)
>>>> at java.awt.EventQueue$3.run(EventQueue.java:703)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at
>>>> java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
>>>> at
>>>> java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:84)
>>>> at java.awt.EventQueue$4.run(EventQueue.java:733)
>>>> at java.awt.EventQueue$4.run(EventQueue.java:731)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at
>>>> java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
>>>> at java.awt.EventQueue.dispatchEvent(EventQueue.java:730)
>>>> at
>>>> java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:205)
>>>> at
>>>> java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
>>>> at
>>>> java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
>>>> at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
>>>> at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
>>>> at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
>>>> Caused by:
>>>> org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException:
>>>> Initialisation of record 0x203(NumberRecord) left 4 bytes remaining still
>>>> to be read.
>>>> at
>>>> org.apache.poi.hssf.record.RecordInputStream.hasNextRecord(RecordInputStream.java:188)
>>>> at
>>>> org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:235)
>>>> at
>>>> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:168)
>>>> at
>>>> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:129)
>>>> at
>>>> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:343)
>>>> at
>>>> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:172)
>>>> at
>>>> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
>>>> at
>>>> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>>>> at
>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>> ... 46 more
>>>>
>>>> I'm using TIKA 1.24.1 but it also happens in the previous version.
>>>> Thanks
>>>>
>>>>
