Posted to dev@poi.apache.org by Nick Burch <ni...@torchbox.com> on 2007/06/19 13:23:40 UTC
HDGF - Horrible DiaGram Format for visio
Hi All
I want to be able to extract text from visio documents, so I can chuck
them into lucene.
So, I've made use of all the handy documentation from vsdump
(http://www.gnome.ru/projects/vsdump_en.html), and I've committed some
basic code for visio files to the scratchpad, as hdgf.
The code is able to parse the pointers and streams, which seem to be the
main building blocks of visio files. It also has a command line tool to
print out the streams+pointers, and what their parent-child relationships
are.
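As a rough illustration of that kind of parent-child dump, here is a minimal sketch. It is not the actual hdgf code: the StreamTreeDump and Node names, the fields, and the offsets and lengths are all hypothetical, chosen only to show how a tree of pointers and the streams they refer to could be printed with indentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, NOT the real hdgf classes: each node stands for a
// pointer plus the stream it refers to, and may own child nodes.
public class StreamTreeDump {

    static final class Node {
        final String kind;   // illustrative label only, e.g. "Strings"
        final int offset;    // where the pointer says the stream starts
        final int length;    // how many bytes the stream occupies
        final List<Node> children = new ArrayList<>();

        Node(String kind, int offset, int length) {
            this.kind = kind;
            this.offset = offset;
            this.length = length;
        }

        // Attach a child and return this node so calls can be chained.
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Depth-first walk; indentation shows the parent-child relationship.
    static String dump(Node node) {
        StringBuilder out = new StringBuilder();
        dump(node, 0, out);
        return out.toString();
    }

    private static void dump(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.kind)
           .append(" @").append(node.offset)
           .append(" len=").append(node.length)
           .append('\n');
        for (Node child : node.children) {
            dump(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        // Made-up offsets and lengths, purely for illustration.
        Node root = new Node("Root", 36, 128)
            .add(new Node("Pointers", 200, 64)
                .add(new Node("Strings", 300, 40)))
            .add(new Node("Chunks", 400, 96));
        System.out.print(dump(root));
    }
}
```

Each level of nesting adds two spaces of indentation, so a stream's children appear directly beneath it.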
Annoyingly, I haven't figured out how to get strings out of a strings
stream, so I can't actually use it with lucene. Hopefully the vsdump guys
will get that cracked shortly, and I can add the functionality in.
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
Re: HDGF - Horrible DiaGram Format for visio
Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 19 Jun 2007, Nick Burch wrote:
> Annoyingly, I haven't figured out how to get strings out of a strings
> stream, so I can't actually use it with lucene. Hopefully the vsdump
> guys will get that cracked shortly, and I can add the functionality in.
With yet more help from the guy behind vsdump, I've now got basic text
extraction working. It also now has some basic processing for the commands
stored in chunks, so quite a bit of what you might want to read from a
visio file is supported. dev.VSDDumper will show you what parts of the
file we can now process.
All code is in the scratchpad, ought to be fully unit tested, and there's
some documentation on the main site. I've also done a list of the steps I
think are needed to support writing visio files back out again, in case
anyone's interested in doing that.
Nick