You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Andreas Reichel <an...@manticore-projects.com> on 2022/05/16 10:49:06 UTC

Thread Safety of sheet operations

Greetings!

Of course I am fully aware that POI is not thread safety on workbook
level and threads can only be used as long as every sheet is processed
within its own thread (without altering styles).
Still I ran into a small problem when processing several sheets in
parallel: After creating all styles, I opened the threads for each
sheet and CREATED each sheet inside the thread. This failed.

The correct solution was, to

1) Create all Sheets serially (from an Array of names)
2) Spawn the threads (from an Array of names) and
3) inside of each thread, to get the sheet by name

So far so good.

I only wonder, if POI should have blocked Sheet sheet =
workbook.createSheet(p[4]); by itself, e. g. using a semaphore and
print a warning. This would help a) the developers understanding and b)
also ensure, that no undetected mistakes were made when compiling the
workbook.

Please let me know what you think, thanks a lot in advance.
Best regards
Andreas

Re: Thread Safety of sheet operations

Posted by Andreas Reichel <an...@manticore-projects.com>.
On Thu, 2022-05-19 at 21:03 +0200, Dominik Stadler wrote:
> I would be open to add some specific locks if it helps your case as
> long as
> we consider performance impact small. So if you send a PR and none of
> the
> active developers objects, we can see what we can do.

Thanks a lot Dominik, this was the feedback I was looking for.
I will give it a try and maybe I will come up with something and maybe
it will prove useful.

Cheers
Andreas

Re: Thread Safety of sheet operations

Posted by Dominik Stadler <do...@gmx.at>.
Hi,

The current official guarantee that we make is described in the FAQs at
https://poi.apache.org/help/faq.html and it actually already states our
stance on the "per-sheet" level locking:

20. Can I access/modify workbooks/documents/slideshows in multiple threads?
What are the multi-threading guarantees that Apache POI makes

In short: *Handling different document-objects in different threads will
work. Accessing the same document in multiple threads will not work.*

This means the workbook/document/slideshow objects are not checked for
thread safety, but any globally held object like global caches or other
data structures are guarded against multi threaded access accordingly.

There have been discussions
<https://mail-archives.apache.org/mod_mbox/poi-user/201109.mbox/%3C1314859350817-4757295.post@n5.nabble.com%3E>
about accessing different Workbook-sheets in different threads
concurrently. While this may work to some degree, it may lead to very hard
to track errors as multi-threading issues typically only manifest after
long runtime when many threads are active and the system is under high
load, i.e. in production use! Also it might break in future versions of
Apache POI as we do not specifically test using the library this way.


There are likely much more than 0.0x% people interested in performance when
handling large workbooks, both single-threaded or multi-threaded with
per-workbook threads. I rather think the "I need to parallelise per sheet"
a fairly uncommon use-case.

However I also don't think performance would suffer in any way if we add a
simple synchronized-lock, especially if only done for a small amount of
code during sheet creation.

But fully making it 100% safe on that level is hard to guarantee with many
caches and things kept on workbook-level..

So currently you are basically lucky if it works for you already that far!

I would be open to add some specific locks if it helps your case as long as
we consider performance impact small. So if you send a PR and none of the
active developers objects, we can see what we can do.

However I would strongly opt against officially declaring support at that
level because it would require a very thorough and specific set of tests to
uncover issues that might creep up and it also would limit us from doing
performance optimizations in the future.

Dominik.

On Thu, May 19, 2022 at 8:12 PM Andreas Reichel <
andreas@manticore-projects.com> wrote:

> Thank you for writing, Nick.
>
> On Thu, 2022-05-19 at 18:09 +0100, Nick Burch wrote:
> > On Mon, 16 May 2022, Andreas Reichel wrote:
> > > Of course I am fully aware that POI is not thread safety on
> > > workbook
> > > level and threads can only be used as long as every sheet is
> > > processed
> > > within its own thread (without altering styles).
> >
> > We only ensure that you are safe if each workbook is created in its
> > own
> > thread, we generally advise against having multiple threads working
> > on the
> > same workbook as it often goes wrong. It sounds like you've pushed
> > things
> > well beyond our normally recommended limits already...
>
> Oh! I thought we would be thread safe down to sheet level?
> Because I am able to create MANY and LARGE sheets in parallel just fine
> -- as long as actions within each thread refer to only the sheet.
>
> Styles and sheets are to be created before the threads spawn.
> Workbook formula evaluation is done after the threads joined.
>
> The only confusion was about:
>
> Sheet sheet = workbook.createSheet(p[4]);
>
> Because this returns the sheet (and so would have been expected within
> the thread), but in fact its Workbook related and must happen before
> the threads spawn.
>
> > Won't that slow down the 99.9% case, where people are just using a
> > single
> > thread for the whole workbook?
>
> Locking a Semaphore does not take (much) time.
> The 99.9% case creates some simple sheets and will not notice 100 ms
> for the locking.
>
> The remaining 0.01% who are pushing it and create 500 MB Excel files
> will save multiple minutes from parallelism.
> (My report builder reduce time from 20 mins to less than 2 minutes just
> by writing to the sheets in parallel. Its a huge benefit for us!)
>
> > And doesn't it risk people trying to do
> > things that aren't generally thread-safe because "that bit seemed
> > fine"
> > and getting stuck later?
>
> Very correct and agreed. I am still curious what the ACTUAL limitations
> are: Are we just overly defensive/careful declaring it "threadsafe as
> per Workbook" only?
> Or are there real show stoppers to work on different sheets in parallel
> (each sheet in its own thread without cross references of course).
>
> I feel like its the second one, and then we could synchronize those
> methods .
>
> Cheers
> Andreas
>

Re: Thread Safety of sheet operations

Posted by Andreas Reichel <an...@manticore-projects.com>.
Thank you for writing, Nick.

On Thu, 2022-05-19 at 18:09 +0100, Nick Burch wrote:
> On Mon, 16 May 2022, Andreas Reichel wrote:
> > Of course I am fully aware that POI is not thread safety on
> > workbook
> > level and threads can only be used as long as every sheet is
> > processed
> > within its own thread (without altering styles).
> 
> We only ensure that you are safe if each workbook is created in its
> own 
> thread, we generally advise against having multiple threads working
> on the 
> same workbook as it often goes wrong. It sounds like you've pushed
> things 
> well beyond our normally recommended limits already...

Oh! I thought we would be thread safe down to sheet level?
Because I am able to create MANY and LARGE sheets in parallel just fine
-- as long as actions within each thread refer to only the sheet.

Styles and sheets are to be created before the threads spawn.
Workbook formula evaluation is done after the threads joined.

The only confusion was about:

Sheet sheet = workbook.createSheet(p[4]);

Because this returns the sheet (and so would have been expected within
the thread), but in fact its Workbook related and must happen before
the threads spawn.

> Won't that slow down the 99.9% case, where people are just using a
> single 
> thread for the whole workbook? 

Locking a Semaphore does not take (much) time.
The 99.9% case creates some simple sheets and will not notice 100 ms
for the locking.

The remaining 0.01% who are pushing it and create 500 MB Excel files
will save multiple minutes from parallelism.
(My report builder reduce time from 20 mins to less than 2 minutes just
by writing to the sheets in parallel. Its a huge benefit for us!)

> And doesn't it risk people trying to do 
> things that aren't generally thread-safe because "that bit seemed
> fine" 
> and getting stuck later?

Very correct and agreed. I am still curious what the ACTUAL limitations
are: Are we just overly defensive/careful declaring it "threadsafe as
per Workbook" only?
Or are there real show stoppers to work on different sheets in parallel
(each sheet in its own thread without cross references of course).

I feel like its the second one, and then we could synchronize those
methods .

Cheers
Andreas

Re: Thread Safety of sheet operations

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 16 May 2022, Andreas Reichel wrote:
> Of course I am fully aware that POI is not thread safety on workbook
> level and threads can only be used as long as every sheet is processed
> within its own thread (without altering styles).

We only ensure that you are safe if each workbook is created in its own 
thread, we generally advise against having multiple threads working on the 
same workbook as it often goes wrong. It sounds like you've pushed things 
well beyond our normally recommended limits already...

> So far so good.
>
> I only wonder, if POI should have blocked Sheet sheet =
> workbook.createSheet(p[4]); by itself, e. g. using a semaphore and
> print a warning.

Won't that slow down the 99.9% case, where people are just using a single 
thread for the whole workbook? And doesn't it risk people trying to do 
things that aren't generally thread-safe because "that bit seemed fine" 
and getting stuck later?

Nick