You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@poi.apache.org by bu...@apache.org on 2015/04/21 13:25:11 UTC

[Bug 57840] New: [PATCH] Support for structured references with Excel tables.

https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

            Bug ID: 57840
           Summary: [PATCH]  Support for structured references with Excel
                    tables.
           Product: POI
           Version: 3.11-FINAL
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: XSSF
          Assignee: dev@poi.apache.org
          Reporter: daniel.livshen@intel.com

Created attachment 32671
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=32671&action=edit
Structured reference support for POI

Added new support for the Structured Reference syntax that was introduced in MS
Excel 2007. 
You can find more information about structured references in :
https://support.office.com/en-us/article/Using-structured-references-with-Excel-tables-F5ED2452-2337-4F71-BED3-C8AE6D2B276E
The new patch adds the ability of parsing structured references (Only works in
XSSF as in HSSF you don't have tables) and converting it to normal (and well
implemented) Area Reference.
Also added support for Indirect evaluations with structured references.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

GW <gr...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #9 from GW <gr...@gmail.com> ---
(In reply to Javen O'Neal from comment #7)
> Thanks for the quick turn around Greg!
> I'm slowly reviewing what you have. I committed StructuredReferences.xlsx in
> r1747482.
> I also added svn:eol-style=native to all the files in your patch so the
> diffs are shorter when reviewing these changes on a different OS.
> The import change on Match.java isn't needed.
> 
> 1. Could you change String Table.isStructuredReference to a compiled Pattern
> for performance?
> 2. We're trying to move all junit3 tests to junit4 tests. Could you update
> TestStructuredReferences.java to use junit4 (org.junit.Test with @Test
> decorators, see TestXSSFFormulaParser.java for an example)
> 3. The changes to XSSFRowShifter looks like formulas with structured
> references cannot be row-shifted. Is that correct? If so, could you either
> fix that in your patch or open a new bug that depends on bug 57840?
> 4. Add a javadoc to XSSFWorkbook.getTable. This method rebuilds the table
> cache and returns a single item. If the cache isn't used elsewhere, then a
> non-caching linear search would be faster and use less memory.

Updated attachment has the requested changes.  Thanks for the feedback, I've
been a long time user but this is my first patch.  Here are responses to your
questions and comments:

1. Yes, changed.  I missed this from the original patch I started from, thanks.

2. Done.  The example test I picked turned out to be one not yet updated :)

3. Structured Reference syntax doesn't reference individual rows directly, so
shifting things doesn't affect these.  See the MS documentation link in the
initial description.  All structured reference syntax is to table columns and
parts (#Data, #Headers, etc.)  Formula evaluation dynamically constructs a
3DArea for the reference based on the current Table definition (I'm only
caching references to the XSSFTable objects, not their start/end cell
definitions).  Since those objects already existed, my assumption was that
inserting/deleting rows above a XSSFTable already updates the table start/end. 
If that's not the case, that's an existing bug - I haven't tried it.

4. Added JavaDoc

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #36 from Andreas Beeker <ki...@apache.org> ---
TL;DR: please comment non obvious performance hacks


As I usually delete all *ssf bug notifications, I wasn't aware of those
autoboxing changes and after findbugs complaint about them, I changed them back
to the state before the changes (i.e. as there were quite a lot of svn changes
and I haven't checked the history, when going through the findbugs issues)

After I've committed the autoboxing, I thought there might be something wrong,
otherwise such simple issues would have been already removed.
So now the code for XSSFRow is reverted again and I've added the necessary
ignore rules to the findbugs-filters, but for future modifications, it would be
good if you add a comment, when doing something on purpose against usual
optimizations, otherwise the next patch might throw out those lines again ...

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #38 from Greg Woolsey <gr...@gmail.com> ---
Anything else needed I can help with before this issue is resolved?  Table
reference syntax works for all my samples, and performance is now vastly
improved (although still not where I think it should be, but that's a different
set of bugs and patches I'll file eventually).

The FindBugs issues have been addressed, as have unit tests.

Perhaps we still need some comments as to why the explicit boxing is happening,
so they don't get undone?

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

daniel.livshen@intel.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|3.11-FINAL                  |3.12-dev

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #19 from Javen O'Neal <on...@apache.org> ---
(In reply to Greg Woolsey from comment #18)
> Created attachment 33937 [details]
> fix to reuse XSSFEvaluationWorkbook inside itself

Applied in r1747754.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #4 from Nick Burch <ap...@gagravarr.org> ---
(In reply to GW from comment #3)
> First, though, I have to test against 3.12, as a library I use (Vaadin
> Spreadsheet) doesn't like something that changed in the API between 3.12 and
> trunk, as that's where my POC has to work for my day job :)

We've tried reaching out to the Vaadin Spreadsheet folks a few times now, but
never had any response. If you have contacts there, please ping them to get in
touch! (IIRC it was conditional formatting stuff we were most recently
interested in collaborating on)

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #2 from Javen O'Neal <on...@apache.org> ---
attachment 32671 has merge conflicts against the trunk. Additional work will be
needed to bring this back to something that can be committed to the trunk.

A few more unit tests are probably also needed for the new functionality that
were added to make sure we don't break structured references in the future.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #39 from Javen O'Neal <on...@apache.org> ---
(In reply to Greg Woolsey from comment #38)
> Perhaps we still need some comments as to why the explicit boxing is happening,
> so they don't get undone?
Applied in r1748479.

I kept this bug open in case you had anything else you discovered from your
testing. I'll close it now.
These changes will be available in POI 3.15 beta 2, to be released in the next
couple months.

I'm assuming that general formula evaluation is faster as a whole with
r1747840, not just structured references.

In future bugs, we should look at:
 * comment 5: get Vaadin Spreadsheet working with POI versions higher than
3.12.
 * comment 12: When evaluating multiple formulas in a sheet, cache the formula
result of cells with the same expression
 * comment 21: add to POI documentation that "-XX:AutoBoxCacheMax=1048704" may
improve performance of working with large worksheets (perhaps this is no longer
true now that we're using explicit boxing).
 * comment 23: use HashMap instead of TreeMap to hold XSSFRows and XSSFCells in
XSSFSheet and XSSFRows, respectively. Obviously this would make moving rows or
getting all rows in a range extremely expensive.
 * comment 34: trigger a cache reset, either with an explicit public method or
implicitly with an internal method (this could make regular workbook operations
slower since any cell that is part of a table or any cell in a table that
references a value that changes in the workbook). Several methods in XSSFTable
still use caches that may contain stale data.
 * comment 38: continue to improve formula evaluation performance

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #12 from Greg Woolsey <gr...@gmail.com> ---
(In reply to Javen O'Neal from comment #11)
> Moved getTable cache from XSSFWorkbook to XSSFEvaluationWorkbook to reduce
> the frequency of problems with a stale cache in r1747615-r1747616.

This kills performance - the reason I moved it to XSSFWorkbook in the first
place was performance.

Parsing and evaluating formulas on a moderately sized workbook (one table with
6 columns and 50,00 rows, one computed column) and a sheet with a small table
of formulas referencing the data table (VLOOKUPs mostly) took over an hour on a
P5 with 2GB allocated to the VM.

Moving it to XSSFWorkbook reduced this to 6 minutes, which is still awful, the
the problem at that point was the auto-boxing of rownums in all the row methods
in XSSFSheet (15% of total time spent in Integer.compareTo).  Lots of
references via Google for performance tests with auto-boxing vs. primitives vs.
explicitly referenced and retained primitive wrappers (Integer/Long).

The table lookup was performed hundreds of thousands of times, as every cell
evaluation created a new instance of XSSFEvaluationWorkbook, causing the cache
to be rebuilt.  Even just doing a linear search, if there are 3 or 4 tables,
regardless of size, is hugely expensive because the XMLBeans package is
terribly slow for repeated lookups.

I tried to balance the inconvenience/problem of a stale cache with performance.

I think, but haven't verified yet, that the real performance issue is that when
evaluating all the cell formulas in a workbook, if multiple cells reference the
same expression, the expression isn't cached, but recalculated over and over
for all those cells.  This is probably worst with range functions, when a
column of cells do the same VLOOKUP over and over with different inputs against
the same range.  Especially if that range contains formula cells.

All that to say, moving this back is going to have an extreme performance cost,
and the real issues are deep.  They probably deserve their own separate issues:

1. formula evaluations that require access to XML elements/attributes
2. XSSFSheet rownum auto-boxing
3. Formula evaluation intermediate result caching

Eventually I'll try to capture just how many table lookups by name are done vs.
how many formulas reference the table, to see how bad it is.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #31 from Greg Woolsey <gr...@gmail.com> ---
Created attachment 33941
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33941&action=edit
adds a unit test file and method that make heavy use of structured reference
formulas, for performance testing

My latest patch attachment,
patch-57840-heavy-structured-ref-formula-use-test.tar.gz, has my sample file
(got approval for the sanitized version) and a test method to loop through and
evaluate all the formula cells.

It runs successfully, but still takes a while to process the 7,274 formula
cells.

most have IF() calls that execute VLOOKUP() using a structured reference to a
named table with over 3,700 rows in 8 columns.  This exercises things enough it
can be profiled using Java Mission Control or other tools to help find the
remaining code hot spots.  The test could be augmented with an upper bound
timing test if desired, to fail if the eval run takes longer than say 5 minutes
(my last run was 90 sec.).  That would flag big performance regressions, like
introducing a code path that misses the evaluator caches.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #22 from Javen O'Neal <on...@apache.org> ---
Doing some Google searching, it looks like using new Integer(int) rather than
Integer.valueOf (the explicit version of implicit auto-boxing) may use more
memory due to duplicated objects wrapping primitives, but is faster because it
doesn't look up lookup cached values to save this memory.

Did you notice any performance difference changing the boxing on the code
relating to this bug?

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #32 from Greg Woolsey <gr...@gmail.com> ---
Created attachment 33942
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33942&action=edit
lazily cache the cell cache key hash

Last one for now - I'm still unclear exactly why, but lazy calculating then
storing the hash for the evaluation cell cache key speeds things up quite a bit
(10-20%).  It looks like for some uses only equals() is used on a newly created
key, so avoiding the hash calculation helps, while on others (the ones stored
in the map?) hashcode() is called multiple times, so caching it once calculated
saves time.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #30 from Javen O'Neal <on...@apache.org> ---
Nevermind. I found
https://svn.apache.org/viewvc/poi/trunk/src/resources/devtools/findbugs-filters.xml
Looks like this is the place to do it.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |55413

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #33 from Javen O'Neal <on...@apache.org> ---
Applied to trunk.
comment 31   attachment 33941   r1747878
comment 32   attachment 33942   r1747881

I disabled the stress test since it's a demonstration of performance, not
functionality. Perhaps we should have an ant target for testing the performance
of POI.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #21 from Greg Woolsey <gr...@gmail.com> ---
(In reply to Javen O'Neal from comment #14)
> What's remaining before this bug can be closed:
> * performance evaluation of any explicit boxing/unboxing to see if any speed
> increase can be gained without adding complexity to the code.

I found adding the VM parameter 

-XX:AutoBoxCacheMax=1048704 

improved performance of my slow test by 12% (can't share it yet, waiting for
approval of my sanitized version of proprietary data).

That's SpreadsheetVersion.EXCEL2007.getMaxRows() + 128 (-127>0 are also cached
by Integer).

That forces Integer to cache all the values up to max rows.  Obviously don't do
this unless are are also setting the VM heap size larger too.  Since POI
already uses lots of memory for documents, this won't be a problem for people
who need it.

I think with that documentation note somewhere the autoboxing issue can be
considered dealt with for now.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

GW <gr...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|Linux                       |All
            Version|3.12-dev                    |3.15-dev
           Hardware|PC                          |All

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

Vlad Skarzhevskyy <sk...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |skarzhevskyy@gmail.com

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |57721

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #3 from GW <gr...@gmail.com> ---
I have this working against trunk, all tests pass, but no new tests yet.  I'll
work on a patch today and tomorrow with tests and some tweaks to the original
patch. For one, it had company network specific changes to build.xml that
leaked into the patch (I've removed them). There is some code that could
benefit from turning a series of constants into an Enum, and it performs very
slowly for large workbooks/large tables.  I can see a couple of things that
would be quick, small changes for some big gains.  Again, with tests.

First, though, I have to test against 3.12, as a library I use (Vaadin
Spreadsheet) doesn't like something that changed in the API between 3.12 and
trunk, as that's where my POC has to work for my day job :)

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |PatchAvailable

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #14 from Javen O'Neal <on...@apache.org> ---
Fixed behavior of [#Totals] when Table does not have a Totals row and [#This
Row] and [@] when rowIndex is outside the table start and stop rows.
Added remaining unit tests for new functionality in r1747655.

reintegrated xssf_structured_references branch to trunk in r1747657 and deleted
branch in r1747657.

What's remaining before this bug can be closed:
* unit test for XSSFTable.findColumnIndex and getters added in [1]
* unit test that demonstrates performance need to cache
XSSFWorkbook.getTable(String name), either by decreasing the visibility of this
method with @Internal or protected/package-private accessibility or by adding a
method so that users can force a refresh of the cache if needed.
* performance evaluation of any explicit boxing/unboxing to see if any speed
increase can be gained without adding complexity to the code.
* any other unit tests that I missed

[1]
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFTable.java?r1=1747657&r2=1747656&pathrev=1747657

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

GW <gr...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #33927|0                           |1
        is obsolete|                            |

--- Comment #8 from GW <gr...@gmail.com> ---
Created attachment 33932
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33932&action=edit
Updated patch

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #18 from Greg Woolsey <gr...@gmail.com> ---
Created attachment 33937
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33937&action=edit
fix to reuse XSSFEvaluationWorkbook inside itself

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #37 from Greg Woolsey <gr...@gmail.com> ---
(In reply to Andreas Beeker from comment #36)
> TL;DR: please comment non obvious performance hacks
> 
Yes, I agree, I should have added comments.  I'm typically verbose, but got
ahead of myself this time.  I didn't notice FindBugs in the mix, but
regardless, I agree, especially on large distributed projects commenting every
change is a good default rule.  Thanks for taking care of this one, and the
reminder.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

GW <gr...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #32671|0                           |1
        is obsolete|                            |

--- Comment #6 from GW <gr...@gmail.com> ---
Created attachment 33927
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33927&action=edit
Trunk patch adding support for Structured References in formulas, with unit
tests

The new patch (same date as comment) is against TRUNK, and contains new passing
unit tests to exercise the new code and notice new bugs.

The original patch was missing some corner cases around special characters and
escapes in table names, now covered by tests and handled by the code.

(why does Excel allow named ranges and tables to start with a backslash!?)

There are new comments for some existing and some new code that is there to
improve execution performance at the cost of static cached metadata - changes
to the underlying OOXML objects may not be seen unless the cached data is
reset.  There was already some of that, but it wasn't called out in the JavaDoc
comments.  Not likely in normal use, as most uses are either building or
reading, but not building, reading, modifying, then calculating again, but
since the underlying objects are exposed through public setters, it can happen.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #15 from Nick Burch <ap...@gagravarr.org> ---
We do have a stated policy that an evaluator will cache for performance, and
it's up to you as a user to notify it / clear it / create a new one if you go
changing cells after you create the evaluator -
https://poi.apache.org/spreadsheet/eval.html#Performance

Could we extend / follow that principle to help with the caching/performance?

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #34 from Nick Burch <ap...@gagravarr.org> ---
I may still be missing something, but why not just put the table cache on the
XSSFWorkbook, provide an internal "clear cache" method, then have the table
related methods on XSSFSheet and XSSFTable trigger the cache-clear when they've
done something that would invalidate it?

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #29 from Javen O'Neal <on...@apache.org> ---
Findbugs complained [1] about explicit boxing of the Integers and not
typechecking before casting. Any way we can suppress these? This SO post [2]
suggests either a edu.umd.cs.findbugs.annotations.SuppressWarnings annotation
or adding filters to the findbugs configuration in build.xml.
[1] https://builds.apache.org/job/POI/1336/findbugsResult/new/
[2]
http://stackoverflow.com/questions/1829904/is-there-a-way-to-ignore-a-single-findbugs-warning

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #1 from GW <gr...@gmail.com> ---
This is great, just what I needed for 

https://bz.apache.org/bugzilla/show_bug.cgi?id=57721

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #11 from Javen O'Neal <on...@apache.org> ---
Moved getTable cache from XSSFWorkbook to XSSFEvaluationWorkbook to reduce the
frequency of problems with a stale cache in r1747615-r1747616.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #17 from Greg Woolsey <gr...@gmail.com> ---
(In reply to Javen O'Neal from comment #16)
> (In reply to Nick Burch from comment #15)
> > We do have a stated policy that an evaluator will cache for performance 
> The problem is that Greg stated  in comment 12 that there's a performance
> hit if caching isn't done in XSSFWorkbook, not XSSFEvaluationWorkbook. I am
> comfortable with single-use caching in XSSFEvaluationWorkbook, but
> uncomfortable with the same code in XSSFWorkbook.

I would also prefer not to cache in XSSFWorkbook, but a new
XSSFEvaluationWorkbook is created every time a cell is evaluated, rather than
reusing the same instance for all cells in a calculation/evaluation run.

In fact, the main method in XSSFEvaluationWorkbook,
getFormulaTokens(EvaluationCell), creates a new instance rather than using
itself!  I don't see any shared state in that class or the parent class, so
that seems a likely easy fix - just use itself in the call to
FormulaParser.parse() rather than a new instance.

See attached patch-57840-reuse-XSSFEvaluationWorkbook.txt for the tiny fix. 
Unit tests run a little faster with it for me, too.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

GW <gr...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |greg.woolsey@gmail.com

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #20 from Javen O'Neal <on...@apache.org> ---
(In reply to Javen O'Neal from comment #14)
> What's remaining before this bug can be closed:
> * unit test for XSSFTable.findColumnIndex and getters added in [1]

Applied in r1747762 and r1747771.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #10 from Javen O'Neal <on...@apache.org> ---
To make it easier to get this to a trunk-eligible state, I've created
xssf_structured_references branch [1] in r1747607 so you and others can
contribute unit tests and improvements with smaller diffs against this branch.
I've applied your changes from attachment 33932 in r1747612.

When this is trunk-eligible, I'll merge this back with trunk.

[1] https://svn.apache.org/viewvc/poi/branches/xssf_structured_references/

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #25 from Greg Woolsey <gr...@gmail.com> ---
Created attachment 33939
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33939&action=edit
re-use XSSFEvaluationWorkbook when expanding a shared formula

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |61722


Referenced Bugs:

https://bz.apache.org/bugzilla/show_bug.cgi?id=61722
[Bug 61722] Review ways to avoid new Integer()/Integer.valueOf() in performance
sensitive places
-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #24 from Greg Woolsey <gr...@gmail.com> ---
Created attachment 33938
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33938&action=edit
avoid auto-boxing ints for row/column TreeTable lookups

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #5 from GW <gr...@gmail.com> ---
(In reply to Nick Burch from comment #4)
> (In reply to GW from comment #3)
> > First, though, I have to test against 3.12, as a library I use (Vaadin
> > Spreadsheet) doesn't like something that changed in the API between 3.12 and
> > trunk, as that's where my POC has to work for my day job :)
> 
> We've tried reaching out to the Vaadin Spreadsheet folks a few times now,
> but never had any response. If you have contacts there, please ping them to
> get in touch! (IIRC it was conditional formatting stuff we were most
> recently interested in collaborating on)

I will do that.  I'm lobbying to go all-in with POI and Vaadin Spreadsheet for
the next big thing for my new gig.  I've been a user of POI for over 10 years,
and Vaadin for 5.  If I get my way, we will be paying Vaadin customers and can
have some say in product direction.

For example, I found a bug in their conditional formatting stuff yesterday :) 
After I finish the patch unit tests, I'll be filing a bug for them about it. 
It may relate to Excel Table structured reference syntax in the conditional
formatting formula, in which case they definitely should be talking to this
team!

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |58227

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #27 from Greg Woolsey <gr...@gmail.com> ---
I broke out my improvements today into individual patch diffs for easier
inspection and consumption.  The last is the biggest improvement, none are huge
code changes.

I'm still waiting for approval on my sanitized test data file.  Hopefully I can
upload that tomorrow.  Currently takes 93 seconds still to evaluate all 7,274
formula cells.

I wonder if it is worth having VLOOKUP and similar range functions track
previously evaluated ranges, and if the same range shows up multiple times
index or hash its values for future reference?  Hmm.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #16 from Javen O'Neal <on...@apache.org> ---
(In reply to Nick Burch from comment #15)
> We do have a stated policy that an evaluator will cache for performance 
The problem is that Greg stated  in comment 12 that there's a performance hit
if caching isn't done in XSSFWorkbook, not XSSFEvaluationWorkbook. I am
comfortable with single-use caching in XSSFEvaluationWorkbook, but
uncomfortable with the same code in XSSFWorkbook.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #23 from Greg Woolsey <gr...@gmail.com> ---
(In reply to Javen O'Neal from comment #22)
> Doing some Google searching, it looks like using new Integer(int) rather
> than Integer.valueOf (the explicit version of implicit auto-boxing) may use
> more memory due to duplicated objects wrapping primitives, but is faster
> because it doesn't look up lookup cached values to save this memory.
> 
> Did you notice any performance difference changing the boxing on the code
> relating to this bug?

Setting the Integer cache size, as noted above, helped significantly.  

Attached is a patch to XSSFSheet and XSSFRow, to explicitly create new Integer
instances everywhere ints were autoboxed for calls to the TreeMap instances in
those classes.

This adds another 4% boost over just the cache setting, it turns out.

Then there is the big winner so far, reducing my test time by > 30%: 

cache XSSFCEvaluationSheet refs so we can cache XSSFEvaluationCell refs in a
HashMap, using a custom lightweight key object (int row/col fields, using good
hash practice).

HashMap lookups by direct row/column (which is what we have down in the
function evaluators) is way, way faster than TreeMap lookups.

For my sample, with 7,274 formula cells, most of which include one or more
VLOOKUP calls to a Structured Reference with 3,726 rows, this was a huge
improvement.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #40 from Greg Woolsey <gr...@gmail.com> ---
(In reply to Javen O'Neal from comment #39)
Thanks.  I'm moving on to the Vaadin Spreadsheet side of things for my POC, as
the POI feature set and performance are acceptable at this stage.  I'll likely
file new issues/patches in the future if I get the project accepted.

I've submitted Vaadin bug https://dev.vaadin.com/ticket/19952 with a very
significant improvement to their current formula performance, especially
related to conditional formatting.

They were creating a new evaluation context for every _cell_ evaluated.  And to
do it, they were inserting a new row in an existing sheet, and modifying the
first cell's formula.  The row # is in the middle of the data for one of my
sample workbooks, and that whole scheme was an ugly hack.

I showed them how they could evaluate a formula without a Cell using an
existing WorkbookEvaluator, using code I found in the unit test framework. 
That, along with the cached results it could now leverage, took my sample
workbook from HOURS to marginally more time than the unit test I added to POI. 
I also have a patch to use POI 3.15 that I hope they take.  But that has more
in it, as they have some ugly code I couldn't bear to leave alone.  I want a
real-looking API along with the updated compatibility.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #7 from Javen O'Neal <on...@apache.org> ---
Thanks for the quick turn around Greg!
I'm slowly reviewing what you have. I committed StructuredReferences.xlsx in
r1747482.
I also added svn:eol-style=native to all the files in your patch so the diffs
are shorter when reviewing these changes on a different OS.
The import change on Match.java isn't needed.

1. Could you change String Table.isStructuredReference to a compiled Pattern
for performance?
2. We're trying to move all junit3 tests to junit4 tests. Could you update
TestStructuredReferences.java to use junit4 (org.junit.Test with @Test
decorators, see TestXSSFFormulaParser.java for an example)
3. The changes to XSSFRowShifter looks like formulas with structured references
cannot be row-shifted. Is that correct? If so, could you either fix that in
your patch or open a new bug that depends on bug 57840?
4. Add a javadoc to XSSFWorkbook.getTable. This method rebuilds the table cache
and returns a single item. If the cache isn't used elsewhere, then a
non-caching linear search would be faster and use less memory.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #35 from Javen O'Neal <on...@apache.org> ---
I think Greg's attachment 33939 patch, applied in r1747840, solve most of the
caching problems.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #28 from Javen O'Neal <on...@apache.org> ---
Applied attachments from comment 24, 25, and 26 in r1747837, r1747840, and
r1747838, respectively to trunk.

Thanks for the thorough profiling and simple, standalone patches! Very easy to
review.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #13 from Javen O'Neal <on...@apache.org> ---
Could you generate and upload a workbook with either a lot of tables or large
tables that demonstrates the performance problems?

Unfortunately we don't have a meaningful way to track execution time in our
continuous integration tests because of differing resources provided by the
executors. Nonetheless, you could write a unit test that calculates the time
spent evaluating structured references for A/B comparisons (such as the change
from comment 11).

Are tables being looked up on the XSSFWorkbook or the XSSFEvaluationWorkbook?

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57840] [PATCH] Support for structured references with Excel tables.

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #26 from Greg Woolsey <gr...@gmail.com> ---
Created attachment 33940
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33940&action=edit
cache XSSFCellEvaluator instances and look up by hash key - much faster

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org