Posted to dev@poi.apache.org by Yegor Kozlov <ye...@dinom.ru> on 2009/11/16 17:53:48 UTC

a 'lite' version of ooxml-schemas jar

Hi All,

As we discussed at ApacheCon, one way to optimize the size of POI distributions is to create a 'lite' version of the 
ooxml-schemas jar.
The idea is simple: remove all unused classes and resources from the jar generated by XMLBeans. Rough estimates made 
at the Barcamp showed that POI uses less than 30% of the OOXML schemas, hence the optimized jar should be significantly 
smaller.

With this in mind I created a simple utility called OOXMLLite, see 
http://svn.apache.org/repos/asf/poi/trunk/src/ooxml/java/org/apache/poi/util/OOXMLLite.java

The process includes four simple steps:

  - run all ooxml unit tests
  - see which classes from the ooxml-schemas.jar are loaded in the JVM
  - copy the loaded classes into some directory
  - copy the binary resources (.xsb)

  A good acceptance test is to run the ooxml unit tests against the 'lite' classes - all should pass. There is an 
accompanying Ant task ooxml-xsds-lite for that, see build.xml.
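
For illustration, the two copy steps could look roughly like the sketch below. This is not the actual OOXMLLite code. The class name LiteCopySketch and both method names are made up for this example, and it assumes that the full jar has already been unpacked into a directory and that the names of the loaded classes were collected separately.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of POI.
final class LiteCopySketch {

    // Copies the .class file of every loaded class that exists under fullRoot
    // into the same relative location under liteRoot. Names that do not map to
    // a file under fullRoot (e.g. JDK classes) are skipped.
    static int copyLoadedClasses(Path fullRoot, Path liteRoot,
                                 Collection<String> loadedClassNames) throws IOException {
        int copied = 0;
        for (String name : loadedClassNames) {
            String relative = name.replace('.', '/') + ".class";
            Path src = fullRoot.resolve(relative);
            if (!Files.isRegularFile(src)) {
                continue;
            }
            Path dst = liteRoot.resolve(relative);
            Files.createDirectories(dst.getParent());
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
            copied++;
        }
        return copied;
    }

    // Copies every .xsb resource, without tracking which ones are really needed.
    static int copyXsbResources(Path fullRoot, Path liteRoot) throws IOException {
        List<Path> xsbFiles;
        try (Stream<Path> walk = Files.walk(fullRoot)) {
            xsbFiles = walk.filter(Files::isRegularFile)
                           .filter(p -> p.toString().endsWith(".xsb"))
                           .collect(Collectors.toList());
        }
        for (Path src : xsbFiles) {
            Path dst = liteRoot.resolve(fullRoot.relativize(src).toString());
            Files.createDirectories(dst.getParent());
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
        }
        return xsbFiles.size();
    }
}
```

Copying all .xsb files unconditionally keeps the sketch simple, at the cost of a somewhat larger output directory.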

The resulting 'lite' jar is much smaller: ooxml-schemas-lite-3.6-beta1.jar is only 3.5 MB while the 'big' 
ooxml-schemas-1.0.jar is 14.5 MB. In theory, the size can be trimmed down below 3 MB, because my utility currently 
copies all .xsb files and does not yet track resource dependencies.

I propose to include ooxml-schemas-lite in the release cycle. The artifact name is ooxml-schemas-lite-${version.id}.jar.
Interested projects (first of all I mean Apache Tika) can set up their Maven POMs to use 
<artifactId>poi-ooxml-lite</artifactId>  instead of <artifactId>poi-ooxml</artifactId>. This will reduce the 
distribution size by approximately 10 MB.

Yegor

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Fwd: a 'lite' version of ooxml-schemas jar

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

Some great news from POI, see below!

BR,

Jukka Zitting

---------- Forwarded message ----------
From: Jukka Zitting <ju...@gmail.com>
Date: Wed, Dec 2, 2009 at 7:28 PM
Subject: Re: a 'lite' version of ooxml-schemas jar
To: POI Developers List <de...@poi.apache.org>


Hi,

On Tue, Nov 24, 2009 at 11:02 AM, Yegor Kozlov <ye...@dinom.ru> wrote:
> For Maven this change is transparent - POM for the poi-ooxml module depends
> on poi-ooxml-schemas instead of ooxml-schemas, this means Maven users will
> only need to update the version of POI from 3.5-FINAL to 3.6, the rest will
> be handled by Maven automatically.

I just had a chance to test this with Tika, and it works beautifully.
After upgrading to a POI 3.6-beta1-20091202 snapshot the size of the
tika-app jar dropped from 25MB to 15MB. That's a major improvement,
thanks! I can't wait for the next POI release.

The only odd thing about the upgrade was that I needed to comment out
a piece of Tika extraction code that uses the
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark
class as returned from XWPFParagraph.getCTP().getBookmarkStartArray().
It looks like that class is not included in the poi-ooxml-schemas jar
even though the CTP class with the getBookmarkStartArray() method is
there.

BR,

Jukka Zitting

Re: a 'lite' version of ooxml-schemas jar

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Dec 3, 2009 at 5:58 PM, Yegor Kozlov <ye...@dinom.ru> wrote:
> the problem should be fixed in r886733.
> At least, Tika trunk compiles OK against poi-ooxml-schemas produced from POI
> trunk. JUnits run OK too.

Excellent, thanks!

BR,

Jukka Zitting



Re: a 'lite' version of ooxml-schemas jar

Posted by Yegor Kozlov <ye...@dinom.ru>.
The problem should be fixed in r886733.
At least, Tika trunk compiles OK against poi-ooxml-schemas produced from POI trunk. The JUnit tests run OK too.

Regards,
Yegor

> Hi,
> 
> On Wed, Dec 2, 2009 at 7:58 PM, Yegor Kozlov <ye...@dinom.ru> wrote:
>> Can you point me at the place in Tika where getBookmarkStartArray() is used?
> 
> See line 78 of o.a.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator [1].
> 
> [1] http://svn.apache.org/viewvc/lucene/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java?revision=820962&view=markup
> 
> BR,
> 
> Jukka Zitting
> 


Re: a 'lite' version of ooxml-schemas jar

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Dec 2, 2009 at 7:58 PM, Yegor Kozlov <ye...@dinom.ru> wrote:
> Can you point me at the place in Tika where getBookmarkStartArray() is used?

See line 78 of o.a.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator [1].

[1] http://svn.apache.org/viewvc/lucene/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java?revision=820962&view=markup

BR,

Jukka Zitting



Re: a 'lite' version of ooxml-schemas jar

Posted by Yegor Kozlov <ye...@dinom.ru>.
> I can't wait for the next POI release.
> 
> The only odd thing about the upgrade was that I needed to comment out
> a piece of Tika extraction code that uses the
> org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark
> class as returned from XWPFParagraph.getCTP().getBookmarkStartArray().
> It looks like that class is not included in the poi-ooxml-schemas jar
> even though the CTP class with the getBookmarkStartArray() method is
> there.
> 

Can you point me at the place in Tika where getBookmarkStartArray() is used?
ooxml-lite only includes classes that are loaded while the JUnit tests run. getBookmarkStartArray() is not covered by the tests, 
which explains why the CTBookmark class is missing.
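
A downstream project that wants to check whether a particular schema class made it into the lite jar can probe the classpath. The helper below is only a sketch: SchemaClassProbe is a made-up name and not part of POI, and it uses nothing beyond Class.forName from the JDK.

```java
// Hypothetical helper, not part of POI.
final class SchemaClassProbe {

    // Returns true if the named class can be found by this class's class loader.
    // The class is looked up without being initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, SchemaClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

Calling isOnClasspath("org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark") with only the lite jar on the classpath would be one way to confirm the situation described above.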

Yegor

> BR,
> 
> Jukka Zitting
> 


Re: a 'lite' version of ooxml-schemas jar

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Nov 24, 2009 at 11:02 AM, Yegor Kozlov <ye...@dinom.ru> wrote:
> For Maven this change is transparent - POM for the poi-ooxml module depends
> on poi-ooxml-schemas instead of ooxml-schemas, this means Maven users will
> only need to update the version of POI from 3.5-FINAL to 3.6, the rest will
> be handled by Maven automatically.

I just had a chance to test this with Tika, and it works beautifully.
After upgrading to a POI 3.6-beta1-20091202 snapshot the size of the
tika-app jar dropped from 25MB to 15MB. That's a major improvement,
thanks! I can't wait for the next POI release.

The only odd thing about the upgrade was that I needed to comment out
a piece of Tika extraction code that uses the
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark
class as returned from XWPFParagraph.getCTP().getBookmarkStartArray().
It looks like that class is not included in the poi-ooxml-schemas jar
even though the CTP class with the getBookmarkStartArray() method is
there.

BR,

Jukka Zitting



Re: a 'lite' version of ooxml-schemas jar

Posted by Yegor Kozlov <ye...@dinom.ru>.
Finally it settled in my head. I finished improving build.xml - all seems to be working OK.
I named the new artifact poi-ooxml-schemas; the prefix poi- clearly indicates that it is a POI derivative of 
ooxml-schemas.jar.

I also think it is reasonable to include poi-ooxml-schemas in the dist target and make it the default provider of 
the OOXML XMLBeans classes. This means that the "big" ooxml-schemas-1.0.jar is used only for development. Normal POI 
releases as well as Maven POMs will use the "lite" jar.

Below is the current structure of a POI release bundle:

#lib is unchanged
lib/commons-logging-1.1.jar
lib/junit-3.8.1.jar
lib/log4j-1.2.13.jar

#ooxml-schemas-1.0.jar is excluded
ooxml-lib/dom4j-1.6.1.jar
ooxml-lib/geronimo-stax-api_1.0_spec-1.0.jar
ooxml-lib/xmlbeans-2.3.0.jar

poi-3.6-beta1-20091124.jar
poi-scratchpad-3.6-beta1-20091124.jar
poi-contrib-3.6-beta1-20091124.jar
poi-ooxml-3.6-beta1-20091124.jar
poi-ooxml-schemas-3.6-beta1-20091124.jar  #new artifact, replaces ooxml-schemas-1.0.jar
poi-examples-3.6-beta1-20091124.jar       #new artifact, was requested in Bugzilla

For Maven this change is transparent - POM for the poi-ooxml module depends on poi-ooxml-schemas instead of 
ooxml-schemas, this means Maven users will only need to update the version of POI from 3.5-FINAL to 3.6, the rest will 
be handled by Maven automatically.

Yegor

> Hi Yegor,
> 
> +1
> 
> This will have effects on the website re-write.
> 
> (1) The "How to Build" page has a list of common targets. Here is what I 
> have currently:
> 
> clean -- Erase all build work products (i.e. everything in the build 
> directory)
> compile    -- Compiles all files from main, contrib and scratchpad
> test -- Run all unit tests from main, contrib and scratchpad (JUnit)
> jar -- Produce jar files
> docs -- Generate all documentation for the system (Apache Forrest)
> dist -- Create a distribution (JUnit and Apache Forrest)
> 
> This should always be part of the dist target. Should we add a target 
> for building a "lite" ooxml, or should this always be part of jar and test?
> 
> I think we should have a "lite" target separate from jar and test.
> 
> (2) I am reworking the home page. There is a table of components that 
> appear there.
> 
> Document -- Component -- JAR -- Maven artifactId
> OLE2 Filesystem -- POIFS -- poi-version-yyyymmdd.jar -- poi
> OLE2 Property Sets -- HPSF -- poi-version-yyyymmdd.jar -- poi
> Excel XLS -- HSSF -- poi-version-yyyymmdd.jar -- poi
> Excel XLSX -- XSSF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
> PowerPoint PPT -- HSLF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
> PowerPoint PPTX -- XSLF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
> Word DOC -- HWPF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
> Word DOCX -- XWPF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
> Visio VSD -- HDGF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
> Publisher PUB -- HPBF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
> Outlook MSG -- HSMF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
> 
> I am missing the OOXML schemas in my list. With this new lite version I 
> need two rows.
> 
> OOXML Schemas -- OpenXML4J -- ooxml-schemas-yyyymmdd.jar -- poi-ooxml
> OOXML Lite -- OpenXML4J -- ooxml-schemas-lite-yyyymmdd.jar -- poi-ooxml-lite
> 
> We will need to include poi-ooxml-version-yyyymmdd.jar in the 
> poi-ooxml-lite target as well. I'll mark the XLSX, XWPF, and XSLF rows 
> appropriately.
> 
> Correct?
> 
> (3) I'll rewrite your description as a new page within the currently 
> very sparse OOXML documentation.
> 
> BTW - the www.openxml4j.org domain has gone away and I am going to need 
> help from you in deciding additional documentation and OPC examples that 
> we should include for the OOXML sub-project.
> 
> Regards,
> Dave
> 
> On Nov 16, 2009, at 8:53 AM, Yegor Kozlov wrote:
> 
>> Hi All,
>>
>> As we discussed at Apachecon, one way to optimize the size of POI 
>> distributions is to create a 'lite' version of the ooxml-schemas jar.
>> The idea is simple: remove all unused classes and resources from the 
>> jar generated by XMLBeans. Rough estimations made at the Barcamp 
>> showed that POI uses less than 30% of the OOXML schemas, hence the 
>> optimized jar should be significantly smaller.
>>
>> With this in mind I created a simple utility called OOXMLLite, see 
>> http://svn.apache.org/repos/asf/poi/trunk/src/ooxml/java/org/apache/poi/util/OOXMLLite.java 
>>
>>
>> The process includes four simple steps:
>>
>> - run all ooxml unit tests
>> - see what classes from the ooxml-schemas.jar are loaded in the JVM
>> - copy the loaded classes into some directory.
>> - copy the binary resources (.xsb)
>>
>> A good acceptance test is to run the ooxml unit tests against the 
>> 'lite' classes - all should pass. There is an accompanying Ant task 
>> ooxml-xsds-lite for that, see build.xml.
>>
>> The resulting 'lite' jar is much smaller: 
>> ooxml-schemas-lite-3.6-beta1.jar is only 3.5 MB while the 'big' 
>> ooxml-schemas-1.0.jar is 14.5 MB. In theory, the size can be trimmed 
>> down below 3 MB  - my utility copies all .xsb files and does not yet 
>> track resource dependencies.
>>
>> I propose to include ooxml-schemas-lite in the release cycle. The 
>> artifact name is ooxml-schemas-lite-${version.id}.jar.
>> Interested projects (first of all I mean Apache Tika) can setup their 
>> Maven poms to use <artifactId>poi-ooxml-lite</artifactId>  instead of 
>> <artifactId>poi-ooxml</artifactId>. This will reduce the distribution 
>> size by approximately 10 MB.
>>
>> Yegor
>>


Re: a 'lite' version of ooxml-schemas jar

Posted by David Fisher <df...@jmlafferty.com>.
Hi Yegor,

>> I've been digging deeper into the dependencies in maven. I think  
>> that "lite" should become the usual way to build.
>> (1) maven/poi-ooxml.pom is missing two dependencies:
>> xmlbeans-2.3.0.jar
>>  <property name="ooxml.xmlbeans.jar" location="${ooxml.lib}/xmlbeans-2.3.0.jar"/>
>>  <property name="ooxml.xmlbeans.url" value="${repository.m2}/maven2/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar"/>
>
> it is a chained dependency, xmlbeans-2.3.0.jar is included in
> ooxml-schemas.pom.
>
>> geronimo-stax-api_1.0_spec-1.0.jar
>>  <property name="ooxml.jsr173.jar" location="${ooxml.lib}/geronimo-stax-api_1.0_spec-1.0.jar"/>
>>  <property name="ooxml.jsr173.url" value="${repository.m2}/maven2/org/apache/geronimo/specs/geronimo-stax-api_1.0_spec/1.0/geronimo-stax-api_1.0_spec-1.0.jar"/>
>> Is there a reason we let these out of the pom?
> we need geronimo-stax only to build ooxml-schemas.jar, it is not
> needed at runtime. Maven users don't need it either.

Which build targets require JSR 173? Those will download geronimo-stax
and the schemas in order to build ooxml-schemas.

>>>> I propose to include ooxml-schemas-lite in the release cycle. The  
>>>> artifact name is ooxml-schemas-lite-${version.id}.jar.
>>>> Interested projects (first of all I mean Apache Tika) can setup  
>>>> their Maven poms to use <artifactId>poi-ooxml-lite</artifactId>   
>>>> instead of <artifactId>poi-ooxml</artifactId>. This will reduce  
>>>> the distribution size by approximately 10 MB.
>> (2) You propose a new artifact-id of ooxml-schemas-lite. I think a  
>> name like ooxml-poi, poi-ooxml-schemas, or poi-opc would be better.
>
>
>> There are a few points to make here:
>> - ooxml-schemas has a different versioning - it is version 1.0. It  
>> should not change much. We should have a documented build target  
>> for this.
>> - ooxml lite - should follow the poi versioning schema since newer  
>> versions of POI will cover more of the schema. So, it is not really  
>> quite a sub of ooxml-schema as much as it is a cross reference  
>> between ooxml-schema and poi-ooxml.
>> Which version should poi-ooxml use "lite" or ooxml-schemas? I think  
>> we should always use "lite" and distribute lite. We can put the  
>> "lite" classes in one of two places:
>
> My plan was to distribute 'lite' only as a supplemental jar. Also, I  
> will consider switching the project to use "lite" only when I have  
> feedback from users.

OK. Then it will be made by a new build target. What is that target?

>> (a) In the poi-ooxml jar as part of that build.
>> (b) In its own jar under a new maven artifact-id. I like ooxml-poi
>> I think (b) is better, but if a user is working on ooxml support in  
>> poi-ooxml then they it is likely that they will be covering parts  
>> of the schema not yet covered by "lite"
>> Users will still want to work with the full schemas they need to  
>> make a choice when they build - either with a special target or by  
>> copying the big jar in ooxml-lib/
>
> Development builds should always use the "full" jar and /ooxml-lib
> should contain ooxml-schemas-1.0.jar and no other versions of
> ooxml-schemas. Actually it is the only way it can work - the "lite" jar is
> a derivative from the "full" jar, it is not an alternative.

OK. The lite jar is acceptable as a supplemental jar and I will "document" why
it should be used.

Is there a reason for Maven users to care about the "lite" jar? If there
is, then we need to worry about alternative dependencies for
ooxml-schemas. If not, then we are good.


>> In general users will want to use the "lite" jar. We can provide  
>> access to the full ooxml-schema as a replacement. Is it possible to  
>> have "selective" targets in a maven pom? Can we make poi-ooxml  
>> dependent on either "ooxml-poi" or "ooxml-schema"?
>> For the build I think that an explicit target should be used called  
>> "ooxml" - this will perform your full task and make sure that the  
>> build environment is using "lite" and not "full". I suspect that  
>> this target may move some files around. We'll need to explain that  
>> adding support for parts of the schema means adding unit tests.  
>> These unit test should help us with documentation on the OOXML  
>> formats.
>
>>> (2) I am reworking the home page. There is a table of components  
>>> that appear there.
>>>
>>> Document -- Component -- JAR -- Maven artifactId
>>> OLE2 Filesystem -- POIFS -- poi-version-yyyymmdd.jar -- poi
>>> OLE2 Property Sets -- HPSF -- poi-version-yyyymmdd.jar -- poi
>>> Excel XLS -- HSSF -- poi-version-yyyymmdd.jar -- poi
>>> Excel XLSX -- XSSF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
>>> PowerPoint PPT -- HSLF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>>> PowerPoint PPTX -- XSLF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
>>> Word DOC -- HWPF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>>> Word DOCX -- XWPF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
>>> Visio VSD -- HDGF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>>> Publisher PUB -- HPBF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>>> Outlook MSG -- HSMF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>>>
>>> I am missing the OOXML schemas in my list. With this new lite  
>>> version I need two rows.
>>>
>
>
>>> OOXML Schemas -- OpenXML4J -- ooxml-schemas-yyyymmdd.jar -- poi-ooxml
>>> OOXML Lite -- OpenXML4J -- ooxml-schemas-lite-yyyymmdd.jar -- poi-ooxml-lite
>>>
>>> We will need to include poi-ooxml-version-yyyymmdd.jar in the
>>> poi-ooxml-lite target as well. I'll mark the XLSX, XWPF, and XSLF rows
>>> appropriately.
>>>
>>> Correct?
>>>
> Not quite.
> OpenXML4J  is a general-purpose API to work with OPC packages, it is  
> a direct counterpart of POIFS. So, it should stay separate.

And that is part of poi-ooxml. Correct?

> As to OOXML Schemas, I would rather not advertise them on the web  
> site - it is a detail of our internal implementation. Users are  
> advised to use common interfaces.

I will document them on the website because they are dependencies of
the Maven POMs.

I will say that one would not normally want to build the
ooxml-schemas. However, I could see people wanting to understand what
changing the schema might mean.

With a lite version, though, I can see a developer of new ooxml
features wanting to test their coverage and write unit tests.
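
One possible way to look at that coverage is to compare the class entries of the two jars. The sketch below is only an illustration and not part of the POI build: JarCoverageSketch and its method names are made up here, and it relies on nothing beyond java.util.jar from the JDK.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of POI.
final class JarCoverageSketch {

    // Returns the sorted names of all .class entries in the given jar.
    static SortedSet<String> classEntries(String jarPath) throws IOException {
        SortedSet<String> names = new TreeSet<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(".class")) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    // Returns the class entries present in the full jar but absent from the lite jar.
    static SortedSet<String> missingFromLite(String fullJar, String liteJar) throws IOException {
        SortedSet<String> missing = classEntries(fullJar);
        missing.removeAll(classEntries(liteJar));
        return missing;
    }
}
```

Any schema class that a new feature needs and that shows up in the result of missingFromLite is a hint that a unit test exercising it is still missing.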

Regards,
Dave



>
>
> Yegor
>
>>> (3) I'll rewrite your description as a new page within the
>>> currently very sparse OOXML documentation.
>>>
>>> BTW - the www.openxml4j.org domain has gone away and I am going to  
>>> need help from you in deciding additional documentation and OPC  
>>> examples that we should include for the OOXML sub-project.
>>>
>>> Regards,
>>> Dave
>>>
>>> On Nov 16, 2009, at 8:53 AM, Yegor Kozlov wrote:
>>>
>>>> Hi All,
>>>>
>>>> As we discussed at Apachecon, one way to optimize the size of POI  
>>>> distributions is to create a 'lite' version of the ooxml-schemas  
>>>> jar.
>>>> The idea is simple: remove all unused classes and resources from  
>>>> the jar generated by XMLBeans. Rough estimations made at the  
>>>> Barcamp showed that POI uses less than 30% of the OOXML schemas,  
>>>> hence the optimized jar should be significantly smaller.
>>>>
>>>> With this in mind I created a simple utility called OOXMLLite,  
>>>> see http://svn.apache.org/repos/asf/poi/trunk/src/ooxml/java/org/apache/poi/util/OOXMLLite.java
>>>>
>>>> The process includes four simple steps:
>>>>
>>>> - run all ooxml unit tests
>>>> - see what classes from the ooxml-schemas.jar are loaded in the JVM
>>>> - copy the loaded classes into some directory.
>>>> - copy the binary resources (.xsb)
>>>>
>>>> A good acceptance test is to run the ooxml unit tests against the  
>>>> 'lite' classes - all should pass. There is an accompanying Ant  
>>>> task ooxml-xsds-lite for that, see build.xml.
>>>>
>>>> The resulting 'lite' jar is much smaller:
>>>> ooxml-schemas-lite-3.6-beta1.jar is only 3.5 MB while the 'big'
>>>> ooxml-schemas-1.0.jar is 14.5 MB. In theory, the size can be trimmed
>>>> down below 3 MB  - my utility copies all .xsb files and does not yet
>>>> track resource dependencies.
>>>>
>>>> I propose to include ooxml-schemas-lite in the release cycle. The  
>>>> artifact name is ooxml-schemas-lite-${version.id}.jar.
>>>> Interested projects (first of all I mean Apache Tika) can setup  
>>>> their Maven poms to use <artifactId>poi-ooxml-lite</artifactId>   
>>>> instead of <artifactId>poi-ooxml</artifactId>. This will reduce  
>>>> the distribution size by approximately 10 MB.
>>>>
>>>> Yegor
>>>>


Re: a 'lite' version of ooxml-schemas jar

Posted by Yegor Kozlov <ye...@dinom.ru>.
Hi Dave,

> 
> I've been digging deeper into the dependencies in maven. I think that 
> "lite" should become the usual way to build.
> 
> (1) maven/poi-ooxml.pom is missing two dependencies:
> 
> xmlbeans-2.3.0.jar
>   <property name="ooxml.xmlbeans.jar" 
> location="${ooxml.lib}/xmlbeans-2.3.0.jar"/>
>   <property name="ooxml.xmlbeans.url" 
> value="${repository.m2}/maven2/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar"/> 
> 
> 

It is a chained (transitive) dependency: xmlbeans-2.3.0.jar is included in ooxml-schemas.pom.

> geronimo-stax-api_1.0_spec-1.0.jar
>   <property name="ooxml.jsr173.jar" 
> location="${ooxml.lib}/geronimo-stax-api_1.0_spec-1.0.jar"/>
>   <property name="ooxml.jsr173.url" 
> value="${repository.m2}/maven2/org/apache/geronimo/specs/geronimo-stax-api_1.0_spec/1.0/geronimo-stax-api_1.0_spec-1.0.jar"/> 
> 
> 
> Is there a reason we let these out of the pom?
We need geronimo-stax only to build ooxml-schemas.jar; it is not needed at runtime. Maven users don't need it either.

> 
>>> I propose to include ooxml-schemas-lite in the release cycle. The 
>>> artifact name is ooxml-schemas-lite-${version.id}.jar.
>>> Interested projects (first of all I mean Apache Tika) can setup their 
>>> Maven poms to use <artifactId>poi-ooxml-lite</artifactId>  instead of 
>>> <artifactId>poi-ooxml</artifactId>. This will reduce the distribution 
>>> size by approximately 10 MB.
> 
> (2) You propose a new artifact-id of ooxml-schemas-lite. I think a name 
> like ooxml-poi, poi-ooxml-schemas, or poi-opc would be better.
> 


> There are a few points to make here:
> 
> - ooxml-schemas has a different versioning - it is version 1.0. It 
> should not change much. We should have a documented build target for this.
> 
> - ooxml lite - should follow the poi versioning schema since newer 
> versions of POI will cover more of the schema. So, it is not really 
> quite a sub of ooxml-schema as much as it is a cross reference between 
> ooxml-schema and poi-ooxml.
> 
> Which version should poi-ooxml use "lite" or ooxml-schemas? I think we 
> should always use "lite" and distribute lite. We can put the "lite" 
> classes in one of two places:
> 

My plan was to distribute 'lite' only as a supplemental jar. Also, I will consider switching the project to use "lite" 
only when I have feedback from users.


> (a) In the poi-ooxml jar as part of that build.
> (b) In its own jar under a new maven artifact-id. I like ooxml-poi
> 
> I think (b) is better, but if a user is working on ooxml support in 
> poi-ooxml then they it is likely that they will be covering parts of the 
> schema not yet covered by "lite"
> 
> Users will still want to work with the full schemas they need to make a 
> choice when they build - either with a special target or by copying the 
> big jar in ooxml-lib/
> 

Development builds should always use the "full" jar, and /ooxml-lib should contain ooxml-schemas-1.0.jar and no other 
versions of ooxml-schemas. Actually it is the only way it can work - the "lite" jar is a derivative of the "full" jar, 
not an alternative to it.

> In general users will want to use the "lite" jar. We can provide access 
> to the full ooxml-schema as a replacement. Is it possible to have 
> "selective" targets in a maven pom? Can we make poi-ooxml dependent on 
> either "ooxml-poi" or "ooxml-schema"?
> 
> For the build I think that an explicit target should be used called 
> "ooxml" - this will perform your full task and make sure that the build 
> environment is using "lite" and not "full". I suspect that this target 
> may move some files around. We'll need to explain that adding support 
> for parts of the schema means adding unit tests. These unit test should 
> help us with documentation on the OOXML formats.
> 

>> (2) I am reworking the home page. There is a table of components that 
>> appear there.
>>
>> Document -- Component -- JAR -- Maven artifactId
>> OLE2 Filesystem -- POIFS -- poi-version-yyyymmdd.jar -- poi
>> OLE2 Property Sets -- HPSF -- poi-version-yyyymmdd.jar -- poi
>> Excel XLS -- HSSF -- poi-version-yyyymmdd.jar -- poi
>> Excel XLSX -- XSSF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
>> PowerPoint PPT -- HSLF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>> PowerPoint PPTX -- XSLF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
>> Word DOC -- HWPF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>> Word DOCX -- XWPF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
>> Visio VSD -- HDGF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>> Publisher PUB -- HPBF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>> Outlook MSG -- HSMF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
>>
>> I am missing the OOXML schemas in my list. With this new lite version 
>> I need two rows.
>>


>> OOXML Schemas -- OpenXML4J -- ooxml-schemas-yyyymmdd.jar -- poi-ooxml
>> OOXML Lite -- OpenXML4J -- ooxml-schemas-lite-yyyymmdd.jar -- poi-ooxml-lite
>>
>> We will need to include poi-ooxml-version-yyyymmdd.jar in the 
>> poi-ooxml-lite target as well. I'll mark the XLSX, XWPF, and XSLF rows 
>> appropriately.
>>
>> Correct?
>>
Not quite.
OpenXML4J is a general-purpose API for working with OPC packages; it is a direct counterpart of POIFS, so it should stay 
separate.

As to OOXML Schemas, I would rather not advertise them on the web site - they are a detail of our internal implementation. 
Users are advised to use the common interfaces.


Yegor

>> (3) I'll rewrite your description as a new page within the currently 
>> very sparse OOXML documentation.
>>
>> BTW - the www.openxml4j.org domain has gone away and I am going to 
>> need help from you in deciding additional documentation and OPC 
>> examples that we should include for the OOXML sub-project.
>>
>> Regards,
>> Dave
>>




Re: a 'lite' version of ooxml-schemas jar

Posted by David Fisher <df...@jmlafferty.com>.
Hi Yegor,

I've been digging deeper into the Maven dependencies. I think that
"lite" should become the usual way to build.

(1) maven/poi-ooxml.pom is missing two dependencies:

xmlbeans-2.3.0.jar
   <property name="ooxml.xmlbeans.jar" location="${ooxml.lib}/xmlbeans-2.3.0.jar"/>
   <property name="ooxml.xmlbeans.url" value="${repository.m2}/maven2/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar"/>

geronimo-stax-api_1.0_spec-1.0.jar
   <property name="ooxml.jsr173.jar" location="${ooxml.lib}/geronimo-stax-api_1.0_spec-1.0.jar"/>
   <property name="ooxml.jsr173.url" value="${repository.m2}/maven2/org/apache/geronimo/specs/geronimo-stax-api_1.0_spec/1.0/geronimo-stax-api_1.0_spec-1.0.jar"/>

Is there a reason we left these out of the pom?
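
For illustration, the two missing entries might look roughly like this in
the pom. This is only a sketch: the groupId, artifactId and version values
are inferred from the repository paths in the properties above, and where
exactly they belong in poi-ooxml.pom (and with what scope) is an assumption.

```xml
<!-- Sketch only: coordinates inferred from the repository URLs above -->
<dependencies>
  <dependency>
    <groupId>org.apache.xmlbeans</groupId>
    <artifactId>xmlbeans</artifactId>
    <version>2.3.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.geronimo.specs</groupId>
    <artifactId>geronimo-stax-api_1.0_spec</artifactId>
    <version>1.0</version>
  </dependency>
</dependencies>
```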

>> I propose to include ooxml-schemas-lite in the release cycle. The  
>> artifact name is ooxml-schemas-lite-${version.id}.jar.
>> Interested projects (first of all I mean Apache Tika) can setup  
>> their Maven poms to use <artifactId>poi-ooxml-lite</artifactId>   
>> instead of <artifactId>poi-ooxml</artifactId>. This will reduce the  
>> distribution size by approximately 10 MB.

(2) You propose a new artifact-id of ooxml-schemas-lite. I think a  
name like ooxml-poi, poi-ooxml-schemas, or poi-opc would be better.

There are a few points to make here:

- ooxml-schemas has a different versioning - it is version 1.0. It  
should not change much. We should have a documented build target for  
this.

- ooxml lite - should follow the POI versioning scheme, since newer
versions of POI will cover more of the schema. So it is not so much a
subset of ooxml-schemas as a cross-reference between ooxml-schemas
and poi-ooxml.

Which version should poi-ooxml use, "lite" or ooxml-schemas? I think we
should always use "lite" and distribute "lite". We can put the "lite"
classes in one of two places:

(a) In the poi-ooxml jar as part of that build.
(b) In its own jar under a new Maven artifact-id. I like ooxml-poi.

I think (b) is better, but if a user is working on ooxml support in
poi-ooxml then it is likely that they will be covering parts of
the schema not yet covered by "lite".

Some users will still want to work with the full schemas. They need to
make a choice when they build - either with a special target or by
copying the big jar into ooxml-lib/.

In general, users will want to use the "lite" jar. We can provide
access to the full ooxml-schemas as a replacement. Is it possible to
have "selective" targets in a Maven pom? Can we make poi-ooxml
dependent on either "ooxml-poi" or "ooxml-schemas"?
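
Maven has no direct way to declare an "either A or B" dependency, but a
dependency can list exclusions. One possible arrangement, sketched here
under assumptions, is for poi-ooxml to depend on the "lite" jar by default
and for projects that need the full schemas to exclude it and declare
ooxml-schemas themselves. The artifactId ooxml-poi is just the name floated
above, and the groupId and version numbers are placeholders rather than
published coordinates.

```xml
<!-- Sketch for a consuming project's pom.xml. The artifactId "ooxml-poi",
     the groupId and the versions are placeholders for illustration. -->
<dependencies>
  <dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.6</version>
    <exclusions>
      <!-- drop the lite schemas that poi-ooxml would pull in transitively -->
      <exclusion>
        <groupId>org.apache.poi</groupId>
        <artifactId>ooxml-poi</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <!-- and depend on the full schemas instead -->
  <dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.0</version>
  </dependency>
</dependencies>
```

With this layout the default stays small, and only projects that touch
uncovered parts of the schema pay for the big jar.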

For the build, I think we should use an explicit target called
"ooxml" - this will perform your full task and make sure that the
build environment is using "lite" and not "full". I suspect that this
target may move some files around. We'll need to explain that adding
support for parts of the schema means adding unit tests. These unit
tests should help us with documentation on the OOXML formats.
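
A rough sketch of what such a target could look like in build.xml. The
ooxml-xsds-lite name is the Ant task Yegor mentioned; the "ooxml" target
name is the one proposed above, and the body is a placeholder rather than
what the build does today.

```xml
<!-- Hypothetical sketch for build.xml; not the actual POI build file -->
<target name="ooxml" depends="ooxml-xsds-lite"
        description="Build the OOXML components against the lite schemas jar">
    <!-- placeholder: the real target would compile and test against the lite classes -->
    <echo message="OOXML build is using the lite schemas jar"/>
</target>
```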

Regards,
Dave





Re: a 'lite' version of ooxml-schemas jar

Posted by David Fisher <df...@jmlafferty.com>.
Hi Yegor,

+1

This will affect the website rewrite.

(1) The "How to Build" page has a list of common targets. Here is what  
I have currently:

clean -- Erase all build work products (i.e. everything in the build directory)
compile -- Compile all files from main, contrib and scratchpad
test -- Run all unit tests from main, contrib and scratchpad (JUnit)
jar -- Produce jar files
docs -- Generate all documentation for the system (Apache Forrest)
dist -- Create a distribution (JUnit and Apache Forrest)

The "lite" build should always be part of the dist target. Should we add
a target for building a "lite" ooxml, or should this always be part of
jar and test?

I think we should have a "lite" target separate from jar and test.

(2) I am reworking the home page. There is a table of components that
appears there.

Document -- Component -- JAR -- Maven artifactId
OLE2 Filesystem -- POIFS -- poi-version-yyyymmdd.jar -- poi
OLE2 Property Sets -- HPSF -- poi-version-yyyymmdd.jar -- poi
Excel XLS -- HSSF -- poi-version-yyyymmdd.jar -- poi
Excel XLSX -- XSSF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
PowerPoint PPT -- HSLF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
PowerPoint PPTX -- XSLF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
Word DOC -- HWPF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
Word DOCX -- XWPF -- poi-ooxml-version-yyyymmdd.jar -- poi-ooxml
Visio VSD -- HDGF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
Publisher PUB -- HPBF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad
Outlook MSG -- HSMF -- poi-scratchpad-version-yyyymmdd.jar -- poi-scratchpad

I am missing the OOXML schemas in my list. With this new lite version  
I need two rows.

OOXML Schemas -- OpenXML4J -- ooxml-schemas-yyyymmdd.jar -- poi-ooxml
OOXML Lite -- OpenXML4J -- ooxml-schemas-lite-yyyymmdd.jar -- poi-ooxml-lite

We will need to include poi-ooxml-version-yyyymmdd.jar in the
poi-ooxml-lite target as well. I'll mark the XSSF, XWPF, and XSLF rows
appropriately.

Correct?

(3) I'll rewrite your description as a new page within the currently
very sparse OOXML documentation.

BTW - the www.openxml4j.org domain has gone away, and I am going to
need your help in deciding which additional documentation and OPC
examples we should include for the OOXML sub-project.

Regards,
Dave


