Posted to user@poi.apache.org by Wabi Sabi <wa...@gmail.com> on 2022/05/20 14:40:48 UTC

Multi-threading behavior

Hello,

I am trying to parallelize Excel processing and I am noticing a
bizarre behavior - single threaded processing is actually faster...

I am not doing anything fancy. I just open an XSSFWorkbook, fill out some
values, run formula calcs and read the output. If I run single threaded, the
initial run takes a few seconds to complete (I assume because the JVM needs
to load POI + all the XML, schemas, etc.), but performance improves
and subsequent runs all take about 100-200 ms.

The same logic executed in separate threads easily runs for 5 seconds in each
thread. So single-threaded processing of, say, 10 files takes 4.5 seconds,
but multithreaded processing easily takes 5-6. No files are shared
among threads.

The hotspots are in POIXMLDocument.load. Thread behavior also looks
correct. File contention is out of the picture too - reading a different
file each time.

Any ideas as to why, or pointers to POI multithreading best practices,
are greatly appreciated.  Thank you very much in advance!

Re: Multi-threading behavior

Posted by Wabi Sabi <wa...@gmail.com>.
Thank you! 🙏 🙏🙏 This worked like a charm! Performance bump easily ~ 20%.
Thread-wise should be safe as files are never shared among different
threads, but I will keep an eye on it making sure things are ok.

On Wed, May 25, 2022 at 3:21 PM PJ Fanning <fa...@yahoo.com.invalid>
wrote:

> Maybe you could try to make XmlBeans unsynchronized using this method?
>
>
> https://xmlbeans.apache.org/docs/5.0.0/org/apache/xmlbeans/XmlOptions.html#setUnsynchronized()
>
> Look at https://poi.apache.org/components/configuration.html and the bit
> about
> org.apache.poi.ooxml.POIXMLTypeLoader.DEFAULT_XML_OPTIONS
>
> In theory, this should compile but I haven't tested it.
>
>
> org.apache.poi.ooxml.POIXMLTypeLoader.DEFAULT_XML_OPTIONS.setUnsynchronized()
>
> There is no guarantee that you won't get concurrent updates interfering
> with each other if you do this. Presumably, XmlBeans synchronization is
> there for a good reason.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org

Re: Multi-threading behavior

Posted by PJ Fanning <fa...@yahoo.com.INVALID>.
Maybe you could try to make XmlBeans unsynchronized using this method?

https://xmlbeans.apache.org/docs/5.0.0/org/apache/xmlbeans/XmlOptions.html#setUnsynchronized()

Look at https://poi.apache.org/components/configuration.html and the bit about
org.apache.poi.ooxml.POIXMLTypeLoader.DEFAULT_XML_OPTIONS

In theory, this should compile but I haven't tested it.

org.apache.poi.ooxml.POIXMLTypeLoader.DEFAULT_XML_OPTIONS.setUnsynchronized()

There is no guarantee that you won't get concurrent updates interfering with each other if you do this. Presumably, XmlBeans synchronization is there for a good reason.
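
One way to stay on the safe side of that caveat is to confine each workbook to a single task, so that no object is ever touched by two threads. The sketch below shows only that confinement pattern, using the JDK's ExecutorService. FakeWorkbook is a hypothetical stand-in for an XSSFWorkbook (it is not a POI class), and the one-time setUnsynchronized() call quoted above is assumed to have been made elsewhere at startup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConfinedProcessing {

    /** Hypothetical stand-in for a workbook: plain, unsynchronized state. */
    static final class FakeWorkbook {
        private final List<Double> cells = new ArrayList<>();

        void setCell(double value) {
            cells.add(value);
        }

        double sum() {
            double total = 0;
            for (double c : cells) {
                total += c;
            }
            return total;
        }
    }

    /** Each task builds, fills and reads its own workbook; only the result escapes. */
    static double processOne(int fileIndex) {
        FakeWorkbook wb = new FakeWorkbook(); // created inside the task, never shared
        for (int i = 1; i <= 100; i++) {
            wb.setCell(fileIndex * i);
        }
        return wb.sum();
    }

    /** Runs one task per "file" on a fixed pool and collects results in submission order. */
    static List<Double> processAll(int fileCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Double>> tasks = new ArrayList<>();
            for (int f = 1; f <= fileCount; f++) {
                final int index = f;
                tasks.add(() -> processOne(index));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("processing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With real workbooks, the body of processOne would open the file, fill in values, evaluate formulas and close the workbook, all inside the same task.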

On Wednesday 25 May 2022, 20:12:23 IST, Wabi Sabi <wa...@gmail.com> wrote: 

Thank you so much for your feedback! After profiling the app a bit, it
looks like the top hotspot is in
org.apache.xmlbeans.impl.values.XmlObjectBase.monitor(). It's locking access
for most cell ops e.g.:

public final void setBigDecimalValue(BigDecimal obj) {
    if (obj == null) {
        this.setNil();
    } else {
        synchronized(this.monitor()) {
            this.set_prepare();
            this.set_BigDecimal(obj);
            this.set_commit();
        }
    }
}

It basically hogs execution the most. I wonder what the best way to fix
it is. Replace the abstract class with a custom non-synchronized
implementation?

Re: Multi-threading behavior

Posted by Wabi Sabi <wa...@gmail.com>.
Thank you so much for your feedback! After profiling the app a bit, it
looks like the top hotspot is in
org.apache.xmlbeans.impl.values.XmlObjectBase.monitor(). It's locking access
for most cell ops e.g.:

public final void setBigDecimalValue(BigDecimal obj) {
    if (obj == null) {
        this.setNil();
    } else {
        synchronized(this.monitor()) {
            this.set_prepare();
            this.set_BigDecimal(obj);
            this.set_commit();
        }
    }
}

It basically hogs execution the most. I wonder what the best way to fix
it is. Replace the abstract class with a custom non-synchronized
implementation?

On Sat, May 21, 2022 at 6:03 AM Andreas Reichel <
andreas@manticore-projects.com> wrote:

> One more thing: Swapping/Paging!
>
> The least you need to ensure is that you have enough RAM to hold
> your 10 WorkSheets in memory without any paging/swapping.
> Depending on your workbook, that can be a huge amount of memory!
>
> As soon as swapping/paging kicks in, any performance measurement is
> useless because IO will be the bottleneck and dominate your tests.
>
> On Unix/Linux, run vmstat and make sure no si/so (swap-in/swap-out)
> activity is shown during your test.
>
> Good luck
> Andreas
>

Re: Multi-threading behavior

Posted by Andreas Reichel <an...@manticore-projects.com>.
One more thing: Swapping/Paging!

The least you need to ensure is that you have enough RAM to hold
your 10 WorkSheets in memory without any paging/swapping.
Depending on your workbook, that can be a huge amount of memory!

As soon as swapping/paging kicks in, any performance measurement is
useless because IO will be the bottleneck and dominate your tests.

On Unix/Linux, run vmstat and make sure no si/so (swap-in/swap-out)
activity is shown during your test.

Good luck
Andreas
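
As a rough complement to watching vmstat, the number of workbooks processed at once can be capped by a heap budget. The sketch below is only an illustration under stated assumptions: the per-workbook size of about 200 MB and the 256 MB reserve are made-up figures that would have to be measured for real files, and HeapBudget is a hypothetical helper, not part of POI. It bounds JVM heap use only; it cannot detect OS-level swapping.

```java
final class HeapBudget {

    /** Bytes of heap currently in use by this JVM (approximate; includes uncollected garbage). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * How many workbooks of the given (measured or estimated) size fit in the heap
     * at once while leaving reserveBytes untouched. Always returns at least 1.
     */
    static int maxConcurrentWorkbooks(long maxHeapBytes, long bytesPerWorkbook, long reserveBytes) {
        if (bytesPerWorkbook <= 0) {
            throw new IllegalArgumentException("bytesPerWorkbook must be positive");
        }
        long usable = Math.max(0, maxHeapBytes - reserveBytes);
        return (int) Math.max(1, Math.min(Integer.MAX_VALUE, usable / bytesPerWorkbook));
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        long perWorkbook = 200L * 1024 * 1024;               // assumption: ~200 MB per loaded workbook
        long reserve = usedHeapBytes() + 256L * 1024 * 1024; // assumption: keep 256 MB spare
        int byMemory = maxConcurrentWorkbooks(maxHeap, perWorkbook, reserve);
        int byCpu = Runtime.getRuntime().availableProcessors();
        System.out.println("threads to use: " + Math.min(byMemory, byCpu));
    }
}
```

The maximum heap itself (-Xmx) still has to fit in physical RAM for the paging concern above to be addressed.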

Re: Multi-threading behavior

Posted by Andreas Reichel <an...@manticore-projects.com>.
Wabi,

just guessing:

XSSFWorkbook workbook = new XSSFWorkbook(new BufferedInputStream(new FileInputStream("src/main/resources/customer.xlsx")));

You operate with exactly ONE STATIC FILE and repeat that 10 times.
I would not be surprised if a recent JVM detects this and runs it only once, eliminating the other 9 executions of the same code.
JVM code elimination is incredibly smart these days. I'd expect that the Workbook object stays in cache and is re-used 9 times.

This applies to the Serial Test Case. The Parallel Test Case runs everything at the same time, so nothing can be eliminated.
Of course, this is just a theory and needs proof:

a) use randomly generated sheets instead of 1 static sheet for your tests
b) forcefully destroy/finalise the Worksheet object by tampering with the GC settings
c) pre-warm the JVM before running your tests (so that the Parallel Test Case also has all the cached objects available)

d) better, use a proper microbenchmarking framework (like the Java Microbenchmark Harness, "JMH"), which takes care of those considerations
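
Suggestion (c) can be sketched with the JDK alone. This is a minimal, hand-rolled harness, not a replacement for JMH: work(int) is a hypothetical stand-in for "open workbook, set values, evaluate, read", and the task count of 10 and pool size of 4 are arbitrary. Both variants are exercised before anything is timed, and each returns a checksum so the work cannot simply be discarded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class WarmupTiming {

    /** Hypothetical stand-in for processing one workbook; the result depends on the input. */
    static long work(int seed) {
        long acc = seed;
        for (int i = 0; i < 200_000; i++) {
            acc = acc * 31 + i;
        }
        return acc;
    }

    /** Runs all tasks one after another on the calling thread. */
    static long runSerial(int tasks) {
        long checksum = 0;
        for (int t = 0; t < tasks; t++) {
            checksum += work(t);
        }
        return checksum;
    }

    /** Runs the same tasks on a fixed-size thread pool. */
    static long runParallel(int tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> jobs = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                final int seed = t;
                jobs.add(() -> work(seed));
            }
            long checksum = 0;
            for (Future<Long> f : pool.invokeAll(jobs)) {
                checksum += f.get();
            }
            return checksum;
        } catch (Exception e) {
            throw new IllegalStateException("parallel run failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Warm-up: exercise both paths so class loading and JIT compilation
        // are not charged to whichever variant happens to be timed first.
        for (int i = 0; i < 5; i++) {
            runSerial(10);
            runParallel(10, 4);
        }
        long t0 = System.nanoTime();
        long serial = runSerial(10);
        long t1 = System.nanoTime();
        long parallel = runParallel(10, 4);
        long t2 = System.nanoTime();
        System.out.printf("serial   %d ms (checksum %d)%n", (t1 - t0) / 1_000_000, serial);
        System.out.printf("parallel %d ms (checksum %d)%n", (t2 - t1) / 1_000_000, parallel);
    }
}
```

Timings from a harness like this are indicative at best; JMH handles forking, warm-up iterations and dead-code elimination far more rigorously.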

Good luck!
Andreas


On Fri, 2022-05-20 at 11:15 -0400, Wabi Sabi wrote:
> Thank you for taking a look! It is indeed.
> 
> I also tested the same logic on Win JDK 11 and Mac OS X JDK 1.8. The
> overall pattern is the same: the initial run is super slow (7 seconds on
> Mac and 1.5 seconds on Win), subsequent runs are dramatically better (down
> to 150-200 ms on both systems).


Re: Multi-threading behavior

Posted by Wabi Sabi <wa...@gmail.com>.
Thank you for taking a look! It is indeed.

I also tested the same logic on Win JDK 11 and Mac OS X JDK 1.8. The
overall pattern is the same: the initial run is super slow (7 seconds on
Mac and 1.5 seconds on Win), subsequent runs are dramatically better (down
to 150-200 ms on both systems).

On Fri, May 20, 2022 at 10:47 AM PJ Fanning <fa...@yahoo.com.invalid>
wrote:

> Is this related to
> https://stackoverflow.com/questions/72310943/poi-single-vs-multithreaded-performance
> ?

Re: Multi-threading behavior

Posted by PJ Fanning <fa...@yahoo.com.INVALID>.
Is this related to https://stackoverflow.com/questions/72310943/poi-single-vs-multithreaded-performance ?