You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@poi.apache.org by moparisthebest <ad...@moparisthebest.com> on 2017/05/30 21:05:04 UTC

[PATCHES] poi as calculation engine

Hi all,

This project I inherited uses POI (XSSF currently) as a sort of
calculation engine, it creates a new spreadsheet, fills it with pages of
data, then fills a page with formulas evaluating them one at a time,
sending the results back to the client.

Today their current biggest set is 49k formulas, and it takes ~21
minutes to evaluate.  If I apply these 2 patches to disable double
formula validation:

https://bz.apache.org/bugzilla/show_bug.cgi?id=61033

The same thing executes in 11 minutes.

In the interest of completeness, I tried SXSSF with a -1 row limit but
killed it at the ~50 minute mark. However, I determined that if I run
the thing through HSSF it executes in about 70 seconds. But, I need a
patch that increases HSSF's limits to XSSF limits, with the
understanding you can't read or write that out, it only works in memory.
 Patch 1 in this 3 part series was actually an existing bug that has
been merged here:

https://bz.apache.org/bugzilla/show_bug.cgi?id=61059

Patch 2 can be merged stand-alone and will speed up existing uses of
HSSF formulas as-is:

https://code.moparisthebest.com/moparisthebest/poi/commit/8ab388eb780b0759367f904455690b5721e7ee6c

The last patch allows you to send in an arbitrary SpreadsheetVersion to
create a blank HSSFWorkbook, I'm guessing this might be somewhat
controversial to merge:

https://code.moparisthebest.com/moparisthebest/poi/commit/4727ab718d374ab86f465e51a83a3ef0018a944a

(these are also available at github if you prefer
https://github.com/moparisthebest/poi/commits/calc_engine )

All current tests pass, no API was changed (only added to), and behavior
is also unchanged unless you create your workbook like 'new
HSSFWorkbook(SpreadsheetVersion.EXCEL2007)', in which case it will let
you write XLSX's limit of rows and formula calculations etc.

Would a patch like this be acceptable?  Any other ideas?

Thanks much,
Travis

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org

Re: [PATCHES] poi as calculation engine

Posted by Javen O'Neal <on...@apache.org>.

> How about we adopt a full lazy-evaluation approach for Formula objects?
I opened bug 61136 [1] and took a stab at full lazy evaluation [2].

Travis, would you be willing to test the memory consumption and
execution speed with these changes in HSSF and XSSF relative to trunk
and your patch (8ab388eb78)?

[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=61136
[2] https://bz.apache.org/bugzilla/attachment.cgi?id=35014&action=diff

On Tue, May 30, 2017 at 8:49 PM, Javen O'Neal <on...@apache.org> wrote:
> How about we adopt a full lazy-evaluation approach for Formula objects?
> Either _byteEncoding and _encodedTokenLen should be provided or
> _ptgTokens should be provided. Only when the other is needed should
> Ptg.readTokens or Ptg.serializeTokens be called.
>
> As I understand it, reading and serializing tokens is pretty
> expensive, and eager evaluation is expensive if the results are never
> used.
>
> There should be:
>> private Formula(final byte[] byteEncoding, final int encodedTokenLen);
>> private Formula(final Ptg[] cachedTokens);
> I don't think a third constructor is necessary with the current code
> and enough lazy evaluation, but here's a 3rd constructor
>> private Formula(final byte[] byteEncoding, final int encodedTokenLen, final Ptg[] cachedTokens);
>
> I think we're less likely to rewrite org.apache.poi.hssf.model with
> SpreadsheetVersion code since it would encourage incorrect usage (the
> mailing list would explode). However, if you can get to the bottom of
> why HSSF formula evaluation is faster with some profiling (perhaps
> it's the expensive xmlbean reading and writing that's slowing XSSF
> down), then we could go that avenue.
> 1. Find and fix the largest contributors to the overall formula
> evaluation time for your XSSF and HSSF test cases.
> 2. Create a GenericSSEvaluationWorkbook that stores sheets, cells, and
> rows in plain old java objects without the underlying HSSF bytestream
> or XSSF/SXSSF xml data structures that are needed for serialization.
> This wouldn't need to be a writeable workbook. This could plug into
> the current formula evaluation code.
> 3. Rewrite some of the XSSF classes that wrap an xmlbean to fully read
> their data from an xmlbean, and only recreate the bean when writing
> out. We want to go this direction with XSSF as it would make POI
> faster, use less memory, and make it easier to transition to a
> different XML library. Maybe I'm overstating, but long term, this
> would make it easier to merge HSSF, XSSF, SXSSF, and other SS
> interfaces, enabling format-agnostic classes that could convert
> between xls and xlsx.
> 4. Other ideas?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org

Re: [PATCHES] poi as calculation engine

Posted by Javen O'Neal <on...@apache.org>.

How about we adopt a full lazy-evaluation approach for Formula objects?
Either _byteEncoding and _encodedTokenLen should be provided or
_ptgTokens should be provided. Only when the other is needed should
Ptg.readTokens or Ptg.serializeTokens be called.

As I understand it, reading and serializing tokens is pretty
expensive, and eager evaluation is expensive if the results are never
used.

There should be:
> private Formula(final byte[] byteEncoding, final int encodedTokenLen);
> private Formula(final Ptg[] cachedTokens);
I don't think a third constructor is necessary with the current code
and enough lazy evaluation, but here's a 3rd constructor
> private Formula(final byte[] byteEncoding, final int encodedTokenLen, final Ptg[] cachedTokens);

I think we're less likely to rewrite org.apache.poi.hssf.model with
SpreadsheetVersion code since it would encourage incorrect usage (the
mailing list would explode). However, if you can get to the bottom of
why HSSF formula evaluation is faster with some profiling (perhaps
it's the expensive xmlbean reading and writing that's slowing XSSF
down), then we could go that avenue.
1. Find and fix the largest contributors to the overall formula
evaluation time for your XSSF and HSSF test cases.
2. Create a GenericSSEvaluationWorkbook that stores sheets, cells, and
rows in plain old java objects without the underlying HSSF bytestream
or XSSF/SXSSF xml data structures that are needed for serialization.
This wouldn't need to be a writeable workbook. This could plug into
the current formula evaluation code.
3. Rewrite some of the XSSF classes that wrap an xmlbean to fully read
their data from an xmlbean, and only recreate the bean when writing
out. We want to go this direction with XSSF as it would make POI
faster, use less memory, and make it easier to transition to a
different XML library. Maybe I'm overstating, but long term, this
would make it easier to merge HSSF, XSSF, SXSSF, and other SS
interfaces, enabling format-agnostic classes that could convert
between xls and xlsx.
4. Other ideas?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org