Posted to dev@poi.apache.org by "Murphy, Mark" <mu...@metalexmfg.com> on 2016/06/01 12:50:15 UTC

Musings on POI Architecture

I want to apologize in advance for this stream-of-consciousness post. I hope it makes sense to someone.

At work I have been using the SS side of POI, and have become fairly comfortable with it. I realize that some things still need to be done, and that some issues with XML Beans have been discussed, but it seems fairly well organized. Recently I have also been working with the WP side, and it is obviously still a work in progress. Likely there are fewer developers contributing there. But as I sat here considering the best way to get the things done that I need, I thought about the need for a common POI architecture across the pieces of the project. This may exist; I just haven't found it yet.

I have found that XWPF does not yet have a clear separation between the model and the usermodel. For example, to build headers and footers, the user must dip into the model to get a key object that has not yet been exposed in the usermodel. And significant parts still require use of the CT and ST classes. This is likely due to the early level of development of the WP portion of POI, but I feel that this is a great place to start if we intend to replace XML Beans.

I would like to propose a change to the POI architecture with respect to SS, as it already has a well-defined architecture. This change would allow us to more easily move away from XML Beans, and potentially reduce memory consumption in the XML format space. It seems to me that one of the reasons we use XML Beans is that it allows us to update XML documents in place. Unfortunately, XML is a highly inefficient format, and maybe it would be better, with respect to memory use, to model documents internally in a more efficient format, and at save time convert the document to its binary or XML format as necessary. In this case, the model would be the internal representation of the document, and the usermodel would be the API we expose to users of the library. In this manner we could have a single model and usermodel for each document type: spreadsheet, word processor, diagram, etc. Then on write we would convert to the binary or XML format as requested.

In addition to the potential memory savings, this would enable a few things:

- We could more easily support additional formats (such as .ods and .csv), because we would not have to manipulate those formats internally.
- We could move XML Beans or its replacement to the periphery, making it easier to swap out that piece.
- We would not run into issues such as the one we currently have with the swapRows() method in XSSF, where the file data is hard to sort because of the tight coupling with XML Beans.
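
To make the proposal above a little more concrete, here is a rough sketch of the layering I have in mind. Every type name in it (DocumentModel, SketchDocument, FormatWriter and so on) is made up for illustration and does not exist in POI; it only shows a format-neutral model, a usermodel facade on top of it, and format writers pushed out to the edge.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types exist in POI.

/** Internal model: plain Java data, no XML nodes or binary records. */
final class ParagraphModel {
    final List<String> runs = new ArrayList<>();
}

final class DocumentModel {
    final List<ParagraphModel> paragraphs = new ArrayList<>();
}

/** Format layer at the periphery; swapping the XML binding touches only this. */
interface FormatWriter {
    String write(DocumentModel model);
}

/** Usermodel: the only API callers see; it never exposes format classes. */
final class SketchDocument {
    private final DocumentModel model = new DocumentModel();

    SketchDocument addParagraph(String text) {
        ParagraphModel p = new ParagraphModel();
        p.runs.add(text);
        model.paragraphs.add(p);
        return this;
    }

    /** Conversion to a concrete format happens only at save time. */
    String save(FormatWriter writer) {
        return writer.write(model);
    }
}

/** Writes a simplified XML-like form (not real WordprocessingML). */
final class XmlLikeWriter implements FormatWriter {
    @Override
    public String write(DocumentModel model) {
        StringBuilder sb = new StringBuilder("<body>");
        for (ParagraphModel p : model.paragraphs) {
            sb.append("<p>");
            for (String run : p.runs) {
                sb.append("<r>").append(escape(run)).append("</r>");
            }
            sb.append("</p>");
        }
        return sb.append("</body>").toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

/** A second target format needs only another writer, not another model. */
final class PlainTextWriter implements FormatWriter {
    @Override
    public String write(DocumentModel model) {
        StringBuilder sb = new StringBuilder();
        for (ParagraphModel p : model.paragraphs) {
            sb.append(String.join("", p.runs)).append('\n');
        }
        return sb.toString();
    }
}
```

With something like this, new SketchDocument().addParagraph("Hello").save(new PlainTextWriter()) and the same call with new XmlLikeWriter() share one model, and the caller never touches a CT class. A real design would of course need a matching reader per format as well.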

The WP side is a perfect place to try this out since it does not really have a well-defined separation between model and usermodel. If I go on any more, this thought will totally fall apart, so I will leave this open for discussion, and I hope that no one feels that I am stepping on toes. That is not my intention.

Mark Murphy

Re: Musings on POI Architecture

Posted by Javen O'Neal <ja...@gmail.com>.
> Unfortunately, XML is a highly inefficient format, and maybe it would be
> better, with respect to memory use, to model documents internally in a more
> efficient format, and at save time convert the document to its binary or
> XML format as necessary.

This is done in the H??F classes, where each field is read from the binary
stream, doing the little-endian conversion for multi-byte values. This
means that each class instance uses roughly as much memory as the number
of bytes corresponding to that element in the binary stream, as long as the
class does not include additional data structures to improve performance.
Meanwhile, most X??F classes store these fields in data types that are
often larger (short->int) and unpack multiple codes that were encoded in one
short into multiple 1-byte boolean fields. This is usually done while keeping
the XML nodes in memory and writing changes to the nodes. Full
deserialization and reserialization would be more memory efficient, but
requires us to implement every feature that could exist on that element
(otherwise updating a document could result in loss of data or corruption).
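
As a rough illustration of the packed versus unpacked trade-off described above, here is a small sketch. The 2-byte layout (bit 0 = bold, bit 1 = italic, bits 2-15 = height) and both class names are invented for the example and are not a real POI record.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical 2-byte record layout for illustration; not a real POI record.
// bit 0 = bold, bit 1 = italic, bits 2..15 = height

/** Compact form: keeps the raw short and decodes on demand. */
final class PackedFontFlags {
    private static final int BOLD_MASK = 0x0001;
    private static final int ITALIC_MASK = 0x0002;
    private static final int HEIGHT_SHIFT = 2;

    private final short raw; // same width as the value in the stream

    private PackedFontFlags(short raw) {
        this.raw = raw;
    }

    /** Reads one little-endian short starting at the given offset. */
    static PackedFontFlags read(byte[] stream, int offset) {
        short value = ByteBuffer.wrap(stream, offset, 2)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getShort();
        return new PackedFontFlags(value);
    }

    boolean isBold() {
        return (raw & BOLD_MASK) != 0;
    }

    boolean isItalic() {
        return (raw & ITALIC_MASK) != 0;
    }

    int getHeight() {
        // Mask to 16 bits first so a negative short does not sign-extend.
        return (raw & 0xFFFF) >>> HEIGHT_SHIFT;
    }
}

/** Unpacked form: same information, widened into separate fields. */
final class UnpackedFontFlags {
    final boolean bold;
    final boolean italic;
    final int height;

    UnpackedFontFlags(PackedFontFlags packed) {
        this.bold = packed.isBold();
        this.italic = packed.isItalic();
        this.height = packed.getHeight();
    }
}
```

The packed class carries two bytes of field data per instance, while the unpacked one carries two booleans plus an int for the same information. Actual per-object cost also depends on JVM object headers and alignment, so treat this as a picture of the idea rather than a measurement.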

I think reading the XML into regular Java data structures and discarding
the XML nodes from memory at read, then recreating the XML would be a good
direction to aim for, but it's such a large task that no one has done it.
As difficult as it is for me to ask IT at my day job to provide 16GB of RAM
to engineers who use internal POI-powered applications, it's less work than
memory-optimizing just the subset of XSSF classes that we use.
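
A minimal sketch of that direction, using only the JDK's StAX reader: stream the XML once into plain Java lists, keep no XML nodes around, and regenerate the XML on write. The <sheet>/<row>/<c> vocabulary and the RowReader class are simplified stand-ins invented for this example, not the real SpreadsheetML schema or a POI class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: RowReader and the <sheet>/<row>/<c> format are made up.
final class RowReader {

    /** Streams the XML into plain lists; no DOM or bean tree is retained. */
    static List<List<String>> read(String xml) {
        List<List<String>> rows = new ArrayList<>();
        List<String> current = null;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                        continue;
                    }
                    String name = reader.getLocalName();
                    if (name.equals("row")) {
                        current = new ArrayList<>();
                        rows.add(current);
                    } else if (name.equals("c") && current != null) {
                        current.add(reader.getElementText());
                    }
                    // Anything else is silently dropped: this is exactly the
                    // data-loss risk of not modelling every feature.
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return rows;
    }

    /** Recreates the XML from the in-memory rows at save time. */
    static String write(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("<sheet>");
        for (List<String> row : rows) {
            sb.append("<row>");
            for (String cell : row) {
                sb.append("<c>").append(escape(cell)).append("</c>");
            }
            sb.append("</row>");
        }
        return sb.append("</sheet>").toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Once the rows are ordinary lists, reordering them is a plain collection operation such as Collections.swap(rows, 0, 1) followed by RowReader.write(rows). The cost is the one noted above: every attribute and child element this reader does not model is lost on round trip.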

Don't let the magnitude of this task turn you off. Chisel away at bloated
classes as you are able/interested.

RE: Musings on POI Architecture

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 1 Jun 2016, Murphy, Mark wrote:
> Is there a main developer for XWPF?

Nope!

On Wed, 1 Jun 2016, Murphy, Mark wrote:
> Do you use the stuff in the wp package, or was that simply 
> experimentation?

It wasn't common enough to be able to swap the Tika stuff onto it, so no.
I'd like to for Tika at some point though.

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: Musings on POI Architecture

Posted by "Murphy, Mark" <mu...@metalexmfg.com>.
Do you use the stuff in the wp package, or was that simply experimentation?


Re: Musings on POI Architecture

Posted by Javen O'Neal <ja...@gmail.com>.
> create a branch and start experimenting! :)
Forking the Git mirror might be the easiest way to manage these
contributions.

RE: Musings on POI Architecture

Posted by "Murphy, Mark" <mu...@metalexmfg.com>.
Is there a main developer for XWPF?


Re: Musings on POI Architecture

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 1 Jun 2016, Murphy, Mark wrote:
> At work I have been using the SS side of POI, and have become fairly 
> comfortable with it. I realize that there are some things still that 
> need to be done, and some issues with XML Beans that have been 
> discussed, but it seems fairly well organized. Recently I have also been 
> working with the WP side as well, and it is obviously still a work in 
> progress.

There's not a lot of link between HWPF and XWPF. I tried to put one in,
but the formats have a surprising number of differences in concepts and
approaches, more so than HSSF/XSSF. Couple that with fewer XWPF
contributions, and HWPF needing lots of love after the loss of the main
developer, and that's how we end up in the situation today...

> I have found that XWPF does not yet have a clear separation between the 
> model and the usermodel.

For anything done by POI committers, it should do. However, we've taken a
lot of community contributions, and many of those steer more towards "get
it done" than "build a full solution perfectly". That's why you see a lot
of "leakages" of the low-level XML stuff. It'd be great to wrap all of
that stuff up! It's also required for dropping xmlbeans: we need to get
everyone off the CT classes if we want to be able to replace them.

> I would like to propose a change to the POI architecture with respect to 
> SS, as it already has a well-defined architecture. This change would 
> allow us to more easily move away from XML Beans, and potentially reduce 
> memory consumption in the XML format space. It seems to me that one of 
> the reasons we use XML Beans is that it allows us to update XML 
> documents in place.

On the whole, you can buy/beg/rent more memory, or faster machines. The 
resource we really lack in POI is contributors writing code or 
documentation or tests. xmlbeans makes development of the X??F stuff 
quicker, and that's what we tend to optimise for!

> Unfortunately, XML is a highly inefficient format, and maybe it would be 
> better, with respect to memory use, to model documents internally in a 
> more efficient format, and at save time convert the document to its 
> binary or XML format as necessary.

The binary and XML formats have more differences than you'd ideally expect
or like, which in part is why we don't have more shared stuff between
them. Not saying that this plan wouldn't work, just that it might not be
as clean as you'd like, especially for more fiddly stuff like formatting,
colours or the like.

> The WP side is a perfect place to try this out since it does not really 
> have a well-defined separation between model and usermodel. If I go on 
> any more, this thought will totally fall apart, so I will leave this 
> open for discussion, and I hope that no one feels that I am stepping on 
> toes. That is not my intention.

As long as it doesn't make new contributions to POI harder or slower (we 
need more contributions!), and as long as you want to do the work, create 
a branch and start experimenting! :)

Nick
