You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Raffaele Castagno <ra...@gruppotesi.com> on 2012/02/27 09:41:21 UTC

Apache POI formula cell duplication very slow

I'm generating an excel file using Apache POI 3.8 , and have the need to replicate some existing row n° times.

This because I have some complex formula which I use as a template to create new lines, replacing cell indexes with regexps.

The problem is that performance are awful, it takes 2h to generate some 4000 rows.

I've pinpointed the problem to be not in the regexp part, as I initially thought, but in the duplication of formula cells.

I actually use this to replicate formula cells:

case Cell.CELL_TYPE_FORMULA:
     newCell.setCellType(oldCell.getCellType());
     newCell.setCellFormula(oldCell.getCellFormula());
     break;

If I copy the formula as text like this:

case Cell.CELL_TYPE_FORMULA:
     newCell.setCellType(Cell.CELL_TYPE_STRING);
     newCell.setCellValue("="+oldCell.getCellFormula());
     break;

it's instead pretty fast, even with my regexp in place.

Anyway, this is an imperfect solution, because the formula has english keywords (ie IF()), when I need to write in italian format.

More, cells with formula inserted like that need to be forcefully re-evaluated in excel with something like "replace all -> '=' with '='".

The problem seems to rely in the setCellFormula(), because of the HSSFFormulaParser.parse().

What's strange, is that parsing time seems to grow exponentially:

100 rows ->  6785ms
200 rows -> 23933ms
300 rows -> 51388ms
400 rows -> 88586ms

What it seems, is that each time I copy a formula, the POI library re-evaluates or re-parses or re-something all preceding rows.

Do anyone know how can solve this problem? Thanks in advance.

Raffaele C.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Raffaele Castagno <ra...@gruppotesi.com>.
Correction: can't use XSSF, since I read and write binary XLS files, not XLSX.

Raffaele

On 27/02/2012 10:01, Raffaele Castagno wrote:
> I use HSSF, should I try XSSF? As far as I know, there are compatibility issues with older versions of Excel, but if
> it can make a difference, I'll try.
>
> Thanks
>
> Raffaele C.
>
> On 27/02/2012 09:54, Yegor Kozlov wrote:
>> HSSF or XSSF or both?
>>
>> Yegor
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Raffaele Castagno <ra...@gruppotesi.com>.
I use HSSF, should I try XSSF? As far as I know, there are compatibility issues with older versions of Excel, but if
it can make a difference, I'll try.

Thanks

Raffaele C.

On 27/02/2012 09:54, Yegor Kozlov wrote:
> HSSF or XSSF or both?
>
> Yegor


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Yegor Kozlov <ye...@dinom.ru>.
HSSF or XSSF or both?

Yegor

On Mon, Feb 27, 2012 at 12:41 PM, Raffaele Castagno
<ra...@gruppotesi.com> wrote:
> I'm generating an excel file using Apache POI 3.8 , and have the need to
> replicate some existing row n° times.
>
> This because I have some complex formula which I use as a template to create
> new lines, replacing cell indexes with regexps.
>
> The problem is that performance are awful, it takes 2h to generate some 4000
> rows.
>
> I've pinpointed the problem to be not in the regexp part, as I initially
> thought, but in the duplication of formula cells.
>
> I actually use this to replicate formula cells:
>
> case Cell.CELL_TYPE_FORMULA:
>    newCell.setCellType(oldCell.getCellType());
>    newCell.setCellFormula(oldCell.getCellFormula());
>    break;
>
> If I copy the formula as text like this:
>
> case Cell.CELL_TYPE_FORMULA:
>    newCell.setCellType(Cell.CELL_TYPE_STRING);
>    newCell.setCellValue("="+oldCell.getCellFormula());
>    break;
>
> it's instead pretty fast, even with my regexp in place.
>
> Anyway, this is an imperfect solution, because the formula has english
> keywords (ie IF()), when I need to write in italian format.
>
> More, cells with formula inserted like that need to be forcefully
> re-evaluated in excel with something like "replace all -> '=' with '='".
>
> The problem seems to rely in the setCellFormula(), because of the
> HSSFFormulaParser.parse().
>
> What's strange, is that parsing time seems to grow exponentially:
>
> 100 rows ->  6785ms
> 200 rows -> 23933ms
> 300 rows -> 51388ms
> 400 rows -> 88586ms
>
> What it seems, is that each time I copy a formula, the POI library
> re-evaluates or re-parses or re-something all preceding rows.
>
> Do anyone know how can solve this problem? Thanks in advance.
>
> Raffaele C.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Raffaele Castagno <ra...@gruppotesi.com>.
On 27/02/2012 11:06, Yegor Kozlov wrote:
> Calling setCellType is extra in both cases. Cell.setCellFormula
> automatically changes the cell type to CELL_TYPE_FORMULA and
> setCellValue to the appropriate time.
>
> Is POI slow with replacing any formulas or only when setting your
> special ones: you mentioned that your formulas are complex and it
> might be the case.
>
> You can estimate parsing time by calling the formula parser explicitly:
>
>          Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);
>
> Are you sure it is the bottleneck?
>
> Yegor
>
>
Pretty sure, since if I remove all my regexp post processing I gain only few seconds, and there's really nothing more than that.

Also, there is my other example:

>  case Cell.CELL_TYPE_FORMULA:
>      newCell.setCellType(Cell.CELL_TYPE_STRING);
>      newCell.setCellValue("="+oldCell.getCellFormula());
>      break;

my purpose was to elaborate the same amount of data, excluding the parse phase.

This version takes 7sec against about 100sec of the "setCellFormula()" version.

I'll try to estimate parsing time as you indicated.

Thanks for helping.

Raffaele C.



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Raffaele Castagno <ra...@gruppotesi.com>.
Done.
Evaluating 10 big formula for 1000 times takes few seconds,
so must be some other step of the setCellFormula() method?

BTW, the example I linked is broken, since I do:

copyRow(workbook, sheet, 0, 1);

that is, I overwrite the same row multiple times.

If I change it to

copyRow(workbook, sheet, 0, i+1);

it's way faster.

Looking right now if I'm doing something stupid like that in the real code.

Thanks.

Raffaele

On 27/02/2012 14:37, Nick Burch wrote:
> On Mon, 27 Feb 2012, Raffaele Castagno wrote:
>>> You can estimate parsing time by calling the formula parser explicitly:
>>>
>>>          Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);
>>>
>>> Are you sure it is the bottleneck?
>>
>> Tried some variations, been unable to test explicit formula parsing
>
> You should try it, the code Yegor posted above is all you need. That will let you be sure if it's the formula parsing or something else
>
> Nick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Raffaele Castagno <ra...@gruppotesi.com>.
I've made some test, and it seems this is the solution.
I've reverted all my changes, and replaced

// If the row exist in destination, push down all rows by 1 else create a new row
if (newRow != null) {
     worksheet.shiftRows(destinationRowNum, worksheet.getLastRowNum(), 1);
} else {
     newRow = worksheet.createRow(destinationRowNum);
}

with

if(newRow == null)
{
     newRow = worksheet.createRow(destinationRowNum);
}

and now the generated excel is identical to the one I had before, but is generated in a fraction of the time.

Thanks everyone.

Raffaele

On 27/02/2012 19:19, Raffaele Castagno wrote:
> Oh my...I think I found it...
>
> Original was:
>
> // If the row exist in destination, push down all rows by 1 else create a new row
> if (newRow != null) {
>     worksheet.shiftRows(destinationRowNum, worksheet.getLastRowNum(), 1);
> } else {
>     newRow = worksheet.createRow(destinationRowNum);
> }
>
> I've commented all leaving only
>
>     newRow = worksheet.createRow(destinationRowNum);
>
> And now I'm down to 60sec to process all rows!
>
> Probably, there's some dirt in my template which was causing POI to shift everything at each iteration.
>
> I'll have to double-check my final excel for correctness, but I think I solved it!
>
> Thanks everyone!
>
> Raffaele C.
>
> On 27/02/2012 14:37, Nick Burch wrote:
>> On Mon, 27 Feb 2012, Raffaele Castagno wrote:
>>>> You can estimate parsing time by calling the formula parser explicitly:
>>>>
>>>>          Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);
>>>>
>>>> Are you sure it is the bottleneck?
>>>
>>> Tried some variations, been unable to test explicit formula parsing
>>
>> You should try it, the code Yegor posted above is all you need. That will let you be sure if it's the formula parsing or something else
>>
>> Nick
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Raffaele Castagno <ra...@gruppotesi.com>.
Oh my...I think I found it...

Original was:

// If the row exist in destination, push down all rows by 1 else create a new row
if (newRow != null) {
     worksheet.shiftRows(destinationRowNum, worksheet.getLastRowNum(), 1);
} else {
     newRow = worksheet.createRow(destinationRowNum);
}

I've commented all leaving only

     newRow = worksheet.createRow(destinationRowNum);

And now I'm down to 60sec to process all rows!

Probably, there's some dirt in my template which was causing POI to shift everything at each iteration.

I'll have to double-check my final excel for correctness, but I think I solved it!

Thanks everyone!

Raffaele C.

On 27/02/2012 14:37, Nick Burch wrote:
> On Mon, 27 Feb 2012, Raffaele Castagno wrote:
>>> You can estimate parsing time by calling the formula parser explicitly:
>>>
>>>          Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);
>>>
>>> Are you sure it is the bottleneck?
>>
>> Tried some variations, been unable to test explicit formula parsing
>
> You should try it, the code Yegor posted above is all you need. That will let you be sure if it's the formula parsing or something else
>
> Nick
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 27 Feb 2012, Raffaele Castagno wrote:
>> You can estimate parsing time by calling the formula parser explicitly:
>>
>>          Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);
>> 
>> Are you sure it is the bottleneck?
>
> Tried some variations, been unable to test explicit formula parsing

You should try it, the code Yegor posted above is all you need. That will 
let you be sure if it's the formula parsing or something else

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Raffaele Castagno <ra...@gruppotesi.com>.
Here's a link to my example:
https://www2.dropbox.com/sh/fqxqwip68qsvdy7/UXF9Kn_rSQ/poi_slow_copy_example.zip

Just create a lib folder with POI 3.8 dependencies.

I've created an empty xls with some of the formula I have in my template, and replicate the row 1000times.

If you run RowCopy.bat, you will see that it starts fast, and progressively slows down to a crawl.

Thanks for helping.

Raffaele

On 27/02/2012 13:24, Yegor Kozlov wrote:
> Can you narrow it down to a isolated unit test so that we can debug
> the problem on our side?
>
> without a test file we can't do much.
>
> Yegor
>
> On Mon, Feb 27, 2012 at 4:10 PM, Raffaele Castagno
> <ra...@gruppotesi.com>  wrote:
> >  On 27/02/2012 11:06, Yegor Kozlov wrote:
> >>
> >>  Calling setCellType is extra in both cases. Cell.setCellFormula
> >>  automatically changes the cell type to CELL_TYPE_FORMULA and
> >>  setCellValue to the appropriate time.
> >>
> >>  Is POI slow with replacing any formulas or only when setting your
> >>  special ones: you mentioned that your formulas are complex and it
> >>  might be the case.
> >>
> >>  You can estimate parsing time by calling the formula parser explicitly:
> >>
> >>           Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);
> >>
> >>  Are you sure it is the bottleneck?
> >>
> >
> >  Tried some variations, been unable to test explicit formula parsing, but the
> >  "setCellFormula()" seems to be the bottleneck.
> >
> >  If I remove some of the formula from my template, speed improves, but anyway
> >  speed degrades as it processes more rows,
> >  so the problem is not in the complexity of the formulas, but it's something
> >  related in general to formula processing.
> >
> >  I'm running out of ideas...
> >
> >
> >  Raffaele C.
> >
> >
> >  ---------------------------------------------------------------------
> >  To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> >  For additional commands, e-mail: user-help@poi.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Yegor Kozlov <ye...@dinom.ru>.
Can you narrow it down to a isolated unit test so that we can debug
the problem on our side?

without a test file we can't do much.

Yegor

On Mon, Feb 27, 2012 at 4:10 PM, Raffaele Castagno
<ra...@gruppotesi.com> wrote:
> On 27/02/2012 11:06, Yegor Kozlov wrote:
>>
>> Calling setCellType is extra in both cases. Cell.setCellFormula
>> automatically changes the cell type to CELL_TYPE_FORMULA and
>> setCellValue to the appropriate time.
>>
>> Is POI slow with replacing any formulas or only when setting your
>> special ones: you mentioned that your formulas are complex and it
>> might be the case.
>>
>> You can estimate parsing time by calling the formula parser explicitly:
>>
>>         Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);
>>
>> Are you sure it is the bottleneck?
>>
>
> Tried some variations, been unable to test explicit formula parsing, but the
> "setCellFormula()" seems to be the bottleneck.
>
> If I remove some of the formula from my template, speed improves, but anyway
> speed degrades as it processes more rows,
> so the problem is not in the complexity of the formulas, but it's something
> related in general to formula processing.
>
> I'm running out of ideas...
>
>
> Raffaele C.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Raffaele Castagno <ra...@gruppotesi.com>.
On 27/02/2012 11:06, Yegor Kozlov wrote:
> Calling setCellType is extra in both cases. Cell.setCellFormula
> automatically changes the cell type to CELL_TYPE_FORMULA and
> setCellValue to the appropriate time.
>
> Is POI slow with replacing any formulas or only when setting your
> special ones: you mentioned that your formulas are complex and it
> might be the case.
>
> You can estimate parsing time by calling the formula parser explicitly:
>
>          Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);
>
> Are you sure it is the bottleneck?
>

Tried some variations, been unable to test explicit formula parsing, but the "setCellFormula()" seems to be the bottleneck.

If I remove some of the formula from my template, speed improves, but anyway speed degrades as it processes more rows,
so the problem is not in the complexity of the formulas, but it's something related in general to formula processing.

I'm running out of ideas...

Raffaele C.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Yegor Kozlov <ye...@dinom.ru>.
On Mon, Feb 27, 2012 at 12:41 PM, Raffaele Castagno
<ra...@gruppotesi.com> wrote:
> I'm generating an excel file using Apache POI 3.8 , and have the need to
> replicate some existing row n° times.
>
> This because I have some complex formula which I use as a template to create
> new lines, replacing cell indexes with regexps.
>
> The problem is that performance are awful, it takes 2h to generate some 4000
> rows.
>
> I've pinpointed the problem to be not in the regexp part, as I initially
> thought, but in the duplication of formula cells.
>
> I actually use this to replicate formula cells:
>
> case Cell.CELL_TYPE_FORMULA:
>    newCell.setCellType(oldCell.getCellType());
>    newCell.setCellFormula(oldCell.getCellFormula());
>    break;
>
> If I copy the formula as text like this:
>
> case Cell.CELL_TYPE_FORMULA:
>    newCell.setCellType(Cell.CELL_TYPE_STRING);
>    newCell.setCellValue("="+oldCell.getCellFormula());
>    break;
>
> it's instead pretty fast, even with my regexp in place.
>

Calling setCellType is extra in both cases. Cell.setCellFormula
automatically changes the cell type to CELL_TYPE_FORMULA and
setCellValue to the appropriate time.

Is POI slow with replacing any formulas or only when setting your
special ones: you mentioned that your formulas are complex and it
might be the case.

You can estimate parsing time by calling the formula parser explicitly:

        Ptg[] ptgs = HSSFFormulaParser.parse(formula, workbook);

Are you sure it is the bottleneck?

Yegor

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI formula cell duplication very slow

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 27 Feb 2012, Raffaele Castagno wrote:
> What it seems, is that each time I copy a formula, the POI library 
> re-evaluates or re-parses or re-something all preceding rows.

POI doesn't re-evaluate formula values unless you tell it to. You should 
make sure you only do this at the end, and not on each row / cell

See http://poi.apache.org/spreadsheet/eval.html for more information on 
how to trigger a formula re-evaluation when you're done with your cells

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org