Posted to dev@poi.apache.org by Nick Burch <ni...@torchbox.com> on 2007/06/19 13:23:40 UTC

HDGF - Horrible DiaGram Format for visio

Hi All

I want to be able to extract text from visio documents, so I can chuck 
them into lucene.

So, I've made use of all the handy documentation from vsdump 
(http://www.gnome.ru/projects/vsdump_en.html), and I've committed some 
basic code for visio files to the scratchpad, as hdgf.

The code is able to parse the pointers and streams, which seem to be the 
main building blocks of visio files. It also has a command line tool to 
print out the streams+pointers, and what their parent-child relationships 
are.
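
As a rough illustration of the pointer/stream tree described above, here is 
a minimal sketch in Java. It is not the actual hdgf code: the class names 
(StreamTreeSketch, Pointer, Stream), the fields (type, offset, length) and 
the sample values are all made up for the example. It only shows the idea 
of each stream being located by a pointer, holding child streams, and being 
printed with indentation to show the parent-child relationships.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamTreeSketch {
    /** Hypothetical pointer: says what kind of stream it refers to and where it sits. */
    static final class Pointer {
        final int type;
        final int offset;
        final int length;

        Pointer(int type, int offset, int length) {
            this.type = type;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Hypothetical stream: the pointer that located it, plus any child streams. */
    static final class Stream {
        final Pointer pointer;
        final List<Stream> children = new ArrayList<>();

        Stream(Pointer pointer) {
            this.pointer = pointer;
        }

        /** Adds a child and returns this stream, so calls can be chained. */
        Stream add(Stream child) {
            children.add(child);
            return this;
        }
    }

    /** Returns one line per stream, indented two spaces per level of nesting. */
    static String dump(Stream root) {
        StringBuilder out = new StringBuilder();
        dump(root, 0, out);
        return out.toString();
    }

    private static void dump(Stream stream, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        Pointer p = stream.pointer;
        out.append("stream type=0x").append(Integer.toHexString(p.type))
           .append(" offset=").append(p.offset)
           .append(" length=").append(p.length)
           .append('\n');
        for (Stream child : stream.children) {
            dump(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        // Made-up values, purely to show the shape of the output.
        Stream root = new Stream(new Pointer(0x14, 36, 120))
            .add(new Stream(new Pointer(0x15, 200, 64))
                .add(new Stream(new Pointer(0x40, 300, 16))))
            .add(new Stream(new Pointer(0x16, 400, 32)));
        System.out.print(dump(root));
    }
}
```

Running main prints the root stream first, with each child stream indented 
beneath its parent.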

Annoyingly, I haven't figured out how to get strings out of a strings 
stream, so I can't actually use it with lucene. Hopefully the vsdump guys 
will get that cracked shortly, and I can add the functionality in.

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: HDGF - Horrible DiaGram Format for visio

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 19 Jun 2007, Nick Burch wrote:
> Annoyingly, I haven't figured out how to get strings out of a strings 
> stream, so I can't actually use it with lucene. Hopefully the vsdump 
> guys will get that cracked shortly, and I can add the functionality in.

With yet more help from the guy behind vsdump, I've now got basic text 
extraction working. It also now has some basic processing for the commands 
stored in chunks, so quite a bit of what you might want to read from a 
visio file is supported. dev.VSDDumper will show you what parts of the 
file we can now process.

All code is in the scratchpad, ought to be fully unit tested, and there's 
some documentation on the main site. I've also done a list of the steps I 
think are needed to support writing visio files back out again, in case 
anyone's interested in doing that.

Nick
