Posted to user@tika.apache.org by "Van Tassell, Kristian" <kr...@siemens.com> on 2012/02/15 14:22:54 UTC

Error on parsing Visio file

Hi everyone,

We're using a Solr environment and will soon be utilizing Lucene as well. The bulk of our data is xml and images, but we do have a small percentage of data in other formats such as VSD. Our test suite contains roughly 100 filetypes (xml, pdf, word, vsd, etc). Thus far, we've successfully indexed 200 VSD files but I came across one that just appears to hang when OfficeParser.parse is called (An exception is set to be caught and logged, but I don't seem to be getting one...still checking into that).

The file opens fine in Visio. I've tried both Tika 0.9 and 1.0.

Is this the proper method to parse a .vsd file? Or do you have other suggestions?

TikaConfig tc = TikaConfig.getDefaultConfig();
ParseContext context = new ParseContext();
Metadata metadata = new Metadata();
ContentHandler handler = new WriteOutContentHandler(10*1024*1024);

InputStream fis = new URL(url.toString()).openStream();

OfficeParser officeParser = new OfficeParser();
officeParser.parse(fis, handler, metadata, context); // hangs here


Thanks for any information you can provide!
-Kristian

RE: Error on parsing Visio file

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 15 Feb 2012, Van Tassell, Kristian wrote:
> Thanks for the idea to try it from tika-app, having come from Solr, I 
> was unaware of this (great) tool. I opened a vsd file I knew was parsing 
> fine and it loaded fine, as expected. I then opened the problem file and 
> finally saw the stack trace occurring. As you suspected, it looks 
> perhaps more like POI is the offender.

Yup, looks like a POI bug. It seems that, for some reason, it's expecting 
to find a string beyond the length of the chunk. Your best bet is to open 
a bug in the POI Bugzilla and upload the file there.

That said, I'm not sure why you're not getting the failure + exception 
when you call Tika from your code. That might be a second issue.
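
One possible explanation, not confirmed anywhere in this thread: the trace shows CompositeParser wrapping the failure in a TikaException, whereas code that calls OfficeParser.parse directly would receive the raw ArrayIndexOutOfBoundsException. A catch clause that lists only checked exceptions would not log it. The sketch below uses only the JDK; GuardedParse, runGuarded, and the outcome strings are hypothetical names, and the Callable stands in for the real parse call. It runs the call on a worker thread with a timeout so that both a genuine hang and an unchecked exception become visible.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or POI.
class GuardedParse {

    /** Returns "ok", "timeout", "interrupted", or "failed: <exception class name>". */
    static String runGuarded(Callable<Void> parseCall, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Void> future = executor.submit(parseCall);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "ok";
        } catch (TimeoutException e) {
            // The call really did not finish in time; ask the worker to stop.
            future.cancel(true);
            return "timeout";
        } catch (ExecutionException e) {
            // getCause() is whatever the call threw, checked or unchecked.
            return "failed: " + e.getCause().getClass().getName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for officeParser.parse(fis, handler, metadata, context):
        // throws the same unchecked exception type seen in the stack trace.
        Callable<Void> failing = () -> {
            throw new ArrayIndexOutOfBoundsException("Illegal offset 8");
        };
        System.out.println(runGuarded(failing, 1000));
        // prints "failed: java.lang.ArrayIndexOutOfBoundsException"
    }
}
```

In real code the Callable body would invoke the parser; a "timeout" result would then point at an actual hang, while a "failed" result would point at an exception the original catch block never saw.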

Nick

RE: Error on parsing Visio file

Posted by "Van Tassell, Kristian" <kr...@siemens.com>.
Nick,

Thanks for the idea to try it from tika-app, having come from Solr, I was unaware of this (great) tool. I opened a vsd file I knew was parsing fine and it loaded fine, as expected. I then opened the problem file and finally saw the stack trace occurring. As you suspected, it looks perhaps more like POI is the offender. 

Apache Tika was unable to parse the document
at ...\myfile.vsd.

The full exception stack trace is included below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5b202f4d
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
	at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
	at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
	at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
	at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
	at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
	at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
	at javax.swing.AbstractButton.doClick(Unknown Source)
	at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
	at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
	at java.awt.Component.processMouseEvent(Unknown Source)
	at javax.swing.JComponent.processMouseEvent(Unknown Source)
	at java.awt.Component.processEvent(Unknown Source)
	at java.awt.Container.processEvent(Unknown Source)
	at java.awt.Component.dispatchEventImpl(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
	at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
	at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Window.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
	at java.awt.EventQueue.access$000(Unknown Source)
	at java.awt.EventQueue$1.run(Unknown Source)
	at java.awt.EventQueue$1.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown Source)
	at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown Source)
	at java.awt.EventQueue$2.run(Unknown Source)
	at java.awt.EventQueue$2.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown Source)
	at java.awt.EventQueue.dispatchEvent(Unknown Source)
	at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
	at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
	at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
	at java.awt.EventDispatchThread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Illegal offset 8 (String data is of length 8)
	at org.apache.poi.util.StringUtil.getFromUnicodeLE(StringUtil.java:70)
	at org.apache.poi.hdgf.chunks.Chunk.processCommands(Chunk.java:203)
	at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:180)
	at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:59)
	at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:93)
	at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
	at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
	at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:106)
	at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:55)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:214)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:177)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 43 more
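
For readers who want to check programmatically which library threw, the "Caused by" chain above can be walked with plain JDK calls. This is a minimal sketch; RootCause, rootCause, and throwingClass are hypothetical names, and the exceptions in main are constructed by hand purely for illustration.

```java
// Hypothetical helper, not part of Tika or POI.
class RootCause {

    /** Follows getCause() until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Class name of the top stack frame of the root cause, or "" if no frames are recorded. */
    static String throwingClass(Throwable t) {
        StackTraceElement[] frames = rootCause(t).getStackTrace();
        return frames.length == 0 ? "" : frames[0].getClassName();
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the wrapped exception in the trace above.
        Exception wrapped = new Exception(
                "Unexpected RuntimeException from parser",
                new ArrayIndexOutOfBoundsException("Illegal offset 8 (String data is of length 8)"));
        System.out.println(rootCause(wrapped).getClass().getName());
        System.out.println(throwingClass(wrapped));
    }
}
```

Applied to the exception in this thread, the top frame of the root cause is org.apache.poi.util.StringUtil, so a check such as throwingClass(e).startsWith("org.apache.poi.") would mark it as a POI-side failure.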

RE: Error on parsing Visio file

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 15 Feb 2012, Van Tassell, Kristian wrote:
> The file is actually local to the server, so I've been able to verify it 
> exists, test opening it with Visio, etc.

Does it fail with tika-app also?

> I am able to share the file. Would I just send it directly to your email 
> (or whoever)?

Best bet is to open a new bug in JIRA, and attach it there. Just so you're 
aware, it might turn out to be a bug in POI rather than Tika, so you may 
need to report it there (depends on exactly where the problem turns out to 
be)

Nick

RE: Error on parsing Visio file

Posted by "Van Tassell, Kristian" <kr...@siemens.com>.
Nick,

The file is actually local to the server, so I've been able to verify it exists, test opening it with Visio, etc. Prior to this code, I'm getting some meta info from the file, such as filesize readings, so I'm assuming it is ok in that regard. I could certainly try another method though if you think there may be something I'm not thinking of.

I am able to share the file. Would I just send it directly to your email (or whoever)?

Thanks for your help!
-Kristian

Re: Error on parsing Visio file

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 15 Feb 2012, Van Tassell, Kristian wrote:
> Thus far, we've successfully indexed 200 VSD files but I came across one 
> that just appears to hang when OfficeParser.parse is called (An 
> exception is set to be caught and logged, but I don't seem to be getting 
> one...still checking into that).

Are you able to share the problematic file?

> InputStream fis = new URL(url.toString()).openStream();

Did you try streaming that file to disk, and trying to parse it from 
there? i.e. ensuring it couldn't be a problem with fetching from the 
server?
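
A minimal sketch of that check, using only JDK classes (java.nio.file needs Java 7 or later): copy the stream to a temporary file first, then parse from the local copy. SpoolToDisk and spool are hypothetical names, and the ".vsd" suffix is just a convenience for this thread's case.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika.
class SpoolToDisk {

    /** Copies everything readable from the URL into a new temp file and returns its path. */
    static Path spool(URL source) throws IOException {
        Path tmp = Files.createTempFile("tika-input-", ".vsd");
        try (InputStream in = source.openStream()) {
            // createTempFile already created the file, so the copy must overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Self-contained demo: a small local file stands in for the remote document.
        Path original = Files.createTempFile("sample-", ".bin");
        Files.write(original, new byte[] {1, 2, 3, 4});

        Path copy = spool(original.toUri().toURL());
        System.out.println(Files.size(copy)); // prints 4

        // The parser would then read from the local copy, for example:
        // try (InputStream fis = Files.newInputStream(copy)) { ... }
    }
}
```

If parsing the local copy fails in the same way, the fetch step can be ruled out as the cause.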

Nick