Posted to dev@poi.apache.org by Nick Burch <ni...@torchbox.com> on 2007/12/30 19:43:06 UTC

Initial OOXML support

If you've been watching the commit messages over the last few days, you'll
have seen that I've made a quick stab at some ooxml support.

The code I've committed is powered by two other projects:
* xml beans - http://xmlbeans.apache.org/
* openxml4j - http://www.openxml4j.org/

OpenXML4J provides a nice library to get at the underlying zip file
format, grab the relationships between different bits of the file etc. In
many ways, it's the ooxml equivalent of poifs.
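
To make the "zip file plus relationships" idea concrete, here is a rough sketch using only java.util.zip and the JDK's DOM parser. It does not use the OpenXML4J API; the class and method names (PackageSketch, readParts, rootTargets) and the relationship Type value are made up for illustration, and the sample package is built in memory rather than taken from a real Office file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI or OpenXML4J.
class PackageSketch {
    /** Builds a tiny in-memory zip that mimics the layout of an OOXML package. */
    static byte[] buildSamplePackage() throws IOException {
        String rels =
            "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
            + "<Relationship Id=\"rId1\" Type=\"urn:example:main-document\" Target=\"xl/workbook.xml\"/>"
            + "</Relationships>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            put(zip, "_rels/.rels", rels);
            put(zip, "xl/workbook.xml", "<workbook/>");
        }
        return bytes.toByteArray();
    }

    private static void put(ZipOutputStream zip, String name, String body) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(body.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    /** Reads every part into memory, keyed by its name inside the zip. */
    static Map<String, byte[]> readParts(byte[] pkg) throws IOException {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                // read() returns -1 at the end of the current entry
                for (int n = zip.read(chunk); n != -1; n = zip.read(chunk)) {
                    buf.write(chunk, 0, n);
                }
                parts.put(e.getName(), buf.toByteArray());
            }
        }
        return parts;
    }

    /** Returns the Target of each Relationship declared in the root _rels/.rels part. */
    static List<String> rootTargets(Map<String, byte[]> parts) throws Exception {
        List<String> targets = new ArrayList<>();
        byte[] rels = parts.get("_rels/.rels");
        if (rels == null) {
            return targets;
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        NodeList nodes = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(rels))
            .getElementsByTagNameNS("*", "Relationship");
        for (int i = 0; i < nodes.getLength(); i++) {
            targets.add(((Element) nodes.item(i)).getAttribute("Target"));
        }
        return targets;
    }
}
```

A real library also has to deal with content types, per-part relationship files and relative targets, which this sketch ignores.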

Then, I'm using xmlbeans + the microsoft supplied xsds to build up the
low level objects to work with the different streams. These objects are
much like our record and record factory stuff.


On top of all this, I've written some classes to handle getting at the
interesting low level parts of the files (HSSFXML, HWPFXML and HSLFXML).
I've stubbed out some usermodel equivalents, but not done anything else
with them. Finally, I've written some text extractors, which use the low
level beans to get the text out into a format you can stuff into lucene.
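
As a rough illustration of what a text extractor boils down to, the sketch below pulls the text out of every element whose local name is "t", one per line. It uses the JDK's DOM parser rather than the generated xmlbeans classes, so it is not the committed code; TextExtractorSketch and extractText are hypothetical names, and treating "t" as the text-carrying element is an assumption that should be checked against the specs for each format.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

// Hypothetical helper, not one of the extractors in svn.
class TextExtractorSketch {
    /** Joins the text of every element whose local name is "t", one per line. */
    static String extractText(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the local-name lookup below
        NodeList runs = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml))
            .getElementsByTagNameNS("*", "t");
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < runs.getLength(); i++) {
            text.append(runs.item(i).getTextContent()).append('\n');
        }
        return text.toString();
    }
}
```

The resulting string is plain text, which is the shape an indexer such as lucene expects.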


Couple of snags to be aware of:
* openxml4j is java 1.5, so we're going to want to keep this all separate
   whatever happens, so people who don't want ooxml can continue to use
   poi with java 1.3 / 1.4
* openxml4j haven't done even an alpha release yet, so we're working off a
   jar I tested and built, hosted off people.apache.org/~nick/
* the ooxml xsds haven't been confirmed to be under an ASL compatible
   licence, so ant just downloads everyone their own copy of them, and
   they're not in svn
* everything ooxml related has its own ant tasks - compile-ooxml,
   test-ooxml and jar-ooxml. The existing ant tasks (eg compile, jar,
   dist) will all ignore the ooxml stuff, so you'll have to take positive
   steps to get it
* to confirm, if you do a dist or a jar, you won't get it, so it won't
   interfere with the 3.0.2 release process
* there's no formal documentation for it, just unit tests and javadocs.
   I'm holding off writing any until other people have sanity checked the
   api structure :)
* you're going to need to read the ECMA specs if you want to make much use
   of it as it stands, unless you know the ole2 equivalents really well
   and can spot how they've stuffed it all into xml...


Next up is probably write support. This may require some tweaking and
thought, as there are three objects relating to each stream:
* the PackagePart (xml file in the zip)
* the Document (bean for the root of the xml file) eg WorkbookDocument
* the main bean, eg CTWorkbook
As someone using the API, you'll want the CTWorkbook, as that's the thing
with the actual data on it. However, to save the changes, you need to get
the document bean the ct bean came from, and trigger the write from there,
and stuff the resulting bytes back into the PackagePart. So, we'll need to
track all these bits internally, so we can give the user the bean they
want, but still have everything available to write it out.
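
One possible shape for that internal tracking is sketched below. It is only a sketch under assumptions: Part, DocumentBean and MainBean are simplified stand-ins for the PackagePart, WorkbookDocument and CTWorkbook roles rather than the real openxml4j or xmlbeans classes, and WriteTrackingSketch is a hypothetical name. The idea is to keep an identity-keyed map from the bean handed to the user back to the document bean and part it came from.

```java
import java.nio.charset.StandardCharsets;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical tracker; all nested types are stand-ins, not real library classes.
class WriteTrackingSketch {
    /** Stand-in for a package part: a named blob inside the zip. */
    static final class Part {
        final String name;
        byte[] content = new byte[0];
        Part(String name) { this.name = name; }
    }

    /** Stand-in for the main bean: holds the data the user actually edits. */
    static final class MainBean {
        String sheetName = "Sheet1";
    }

    /** Stand-in for the document bean: the root object that can serialise itself. */
    static final class DocumentBean {
        final MainBean main = new MainBean();
        byte[] toBytes() {
            return ("<workbook><sheet name=\"" + main.sheetName + "\"/></workbook>")
                .getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Pairs a document bean with the part it was loaded from. */
    private static final class Origin {
        final DocumentBean document;
        final Part part;
        Origin(DocumentBean document, Part part) {
            this.document = document;
            this.part = part;
        }
    }

    // Keyed by identity, so the beans need not implement equals/hashCode.
    private final Map<MainBean, Origin> origins = new IdentityHashMap<>();

    /** Hands the caller the main bean while remembering where it came from. */
    MainBean open(Part part, DocumentBean document) {
        origins.put(document.main, new Origin(document, part));
        return document.main;
    }

    /** Serialises the owning document and stores the bytes back in its part. */
    void save(MainBean bean) {
        Origin origin = origins.get(bean);
        if (origin == null) {
            throw new IllegalArgumentException("bean was not opened through this tracker");
        }
        origin.part.content = origin.document.toBytes();
    }
}
```

The user only ever sees the main bean, and save() does the walk back to the document and the part. An identity map holds strong references, so a real implementation would need to decide when entries get dropped.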

(One option might be to nobble xmlbeans so that we can attach the
PackagePart onto the Document, and get back at the document from the main
bean, but that might prove to be far too much work, so we'll have to see)

If anyone has any good ideas for how to do the writing stuff, do pipe up.
It looks like it's going to be a little while before I get a copy of
office 2007, so there's no point me trying to knock up write support
before I have something to test opening with, which gives us a gap to
figure it out in :)


Oh, and I've put all the code in src/scratchpad/ooxml-src/ and
src/scratchpad/ooxml-testcases/, to indicate it's of scratchpad completion
levels, but different directories as it needs java 1.5. Once it's a bit
more stable, we'll probably want to move it to its own top level area
under src.

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: Initial OOXML support

Posted by "Andrew C. Oliver" <ac...@buni.org>.
I think it is too early to do that.

Avik Sengupta wrote:
> Cool!
> 
> How about creating this as a separate subproject in svn? 
> (poi.apache.org/ooxml) 
> 
> Just a thought. 
> 
> Regards
> -
> Avik


-- 
Buni Meldware Communication Suite
http://buni.org
Multi-platform and extensible Email,
Calendaring (including freebusy),
Rich Webmail, Web-calendaring, ease
of installation/administration.

Re: Initial OOXML support

Posted by Nick Burch <ni...@torchbox.com>.
On Mon, 31 Dec 2007, Andrew C. Oliver wrote:
> BTW is it horrible?  Last I looked it wasn't bad but that was some time
> ago I admit.

As xml formats go, yes. Compared to what they might've come up with,
starting from the ole2 stuff, no. So, I was pleasantly surprised by it, but my
friends who "do" xml aren't fans. Just depends where you come from :)

Nick



Re: Initial OOXML support

Posted by "Andrew C. Oliver" <ac...@buni.org>.
BTW is it horrible?  Last I looked it wasn't bad but that was some time 
ago I admit.

Nick Burch wrote:
> On Mon, 31 Dec 2007, Avik Sengupta wrote:
>> How about creating this as a separate subproject in svn?
>> (poi.apache.org/ooxml)
> 
> I've gone for the name HXF (horrible xml format) for a few little bits
> 
> At this stage, I'm not sure how much of the usermodel code from the other
> projects we're going to be able to re-use. I guess if we manage to re-use
> lots of it, we probably don't need to put it somewhere else. However, if
> we only end up defining a bunch of interfaces from the current code, and
> implementing those for ooxml, then a subproject might make sense.
> 
> We can decide all that once we know how it'll all work :)
> 
> Nick



Re: Initial OOXML support

Posted by Nick Burch <ni...@torchbox.com>.
On Mon, 31 Dec 2007, Avik Sengupta wrote:
> How about creating this as a separate subproject in svn?
> (poi.apache.org/ooxml)

I've gone for the name HXF (horrible xml format) for a few little bits.

At this stage, I'm not sure how much of the usermodel code from the other
projects we're going to be able to re-use. I guess if we manage to re-use
lots of it, we probably don't need to put it somewhere else. However, if
we only end up defining a bunch of interfaces from the current code, and
implementing those for ooxml, then a subproject might make sense.

We can decide all that once we know how it'll all work :)

Nick



Re: Initial OOXML support

Posted by Avik Sengupta <av...@itellix.com>.
Cool!

How about creating this as a separate subproject in svn? 
(poi.apache.org/ooxml) 

Just a thought. 

Regards
-
Avik



Re: Initial OOXML support

Posted by "Andrew C. Oliver" <ac...@buni.org>.
This is tremendous.  Great work, Nick.


