Posted to user@arrow.apache.org by Brandon Chinn <br...@gmail.com> on 2022/01/06 22:47:18 UTC

[Python] Does write_file() finish when returning?

When `pyarrow.parquet.write_file()` returns, is the parquet file finished
writing on disk, or is it still writing?

Context:
https://www.reddit.com/r/learnpython/comments/rxmq43/help_with_python_file_flakily_not_returning_full/hrj99tq/?context=3

Thanks!
Brandon Chinn

Re: [Python] Does write_file() finish when returning?

Posted by Brandon Chinn <br...@gmail.com>.
Oh! The reader and writer are on the same thread & process, but it's
possible for other threads to call `write_table` simultaneously. That
probably accounts for the race condition here. Thanks!
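If several threads in one process can touch the same path, one way to rule out that kind of race is to serialize the write and the read behind a lock. This is only a sketch under assumptions: `_file_lock` and `write_then_read` are hypothetical names, and the plain `open()` calls stand in for `write_table` and the subsequent read, so that the example runs with the standard library alone.

```python
import os
import tempfile
import threading

_file_lock = threading.Lock()  # hypothetical: one lock guarding one output path


def write_then_read(path, payload):
    """Write `payload` to `path` and read it back while holding the lock."""
    with _file_lock:
        with open(path, "wb") as f:  # stand-in for write_table(table, path)
            f.write(payload)
        with open(path, "rb") as f:  # stand-in for reading the file back
            return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.bin")
    results = {}

    def worker(i):
        # Each thread writes its own distinct payload to the shared path.
        payload = bytes([i]) * 4096
        results[i] = write_then_read(path, payload) == payload

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# True only if every thread read back exactly what it wrote.
all_round_trips_ok = len(results) == 8 and all(results.values())
```

Because the lock covers both the write and the read, no thread can overwrite the file between another thread's write and its read. A lock like this only helps within a single process.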

On Thu, Jan 6, 2022 at 4:03 PM Weston Pace <we...@gmail.com> wrote:

> I'm guessing you mean write_table?  Assuming you are passing a
> filename / string (and not an open output stream) to write_table I
> would expect that any files opened during the call have been closed
> before the call returns.
>
> Pedantically, this is not quite the same thing as "finished writing on
> disk" but more accurately, "finished writing to the OS".  A power
> outage shortly after a call to write_table completes could lead to
> partial loss of a file.
>
> However, this should not matter for your case if I am understanding
> your problem statement in that reddit post.  As long as you open that
> file handle to read after you have finished the call to write_table
> you should see all of the contents immediately.
>
> There is always the opportunity for bugs but many of our unit tests
> write files and then turn around and immediately read them and we
> don't typically have trouble here.  I'm assuming your reader & writer
> are on the same thread & process?  If not, your read may run while the
> write is still in progress, and then no guarantees are made.

Re: [Python] Does write_file() finish when returning?

Posted by Weston Pace <we...@gmail.com>.
I'm guessing you mean write_table?  Assuming you are passing a
filename / string (and not an open output stream) to write_table I
would expect that any files opened during the call have been closed
before the call returns.

Pedantically, this is not quite the same thing as "finished writing on
disk" but more accurately, "finished writing to the OS".  A power
outage shortly after a call to write_table completes could lead to
partial loss of a file.
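When that distinction does matter, the gap between "handed to the OS" and "on disk" can be closed explicitly. A minimal sketch using only the standard library: `fsync_path` is a hypothetical helper, and the plain `open()`/`write()` stands in for a call such as `write_table(table, path)` that has already returned.

```python
import os
import tempfile


def fsync_path(path):
    """Ask the OS to flush its cached data for `path` to the storage device."""
    # Opened read-write because some platforms refuse to sync a read-only handle.
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    with open(path, "wb") as f:  # stand-in for write_table(table, path)
        f.write(b"x" * 1024)
    # The file is closed here, so its contents are visible to any reader,
    # but they may still live only in the OS page cache.
    fsync_path(path)
    size_after_sync = os.path.getsize(path)
```

On POSIX systems the containing directory may also need an fsync for the new file name itself to survive a crash. None of this is needed just to read the file back in the same session.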

However, this should not matter for your case if I am understanding
your problem statement in that reddit post.  As long as you open that
file handle to read after you have finished the call to write_table
you should see all of the contents immediately.

There is always the opportunity for bugs but many of our unit tests
write files and then turn around and immediately read them and we
don't typically have trouble here.  I'm assuming your reader & writer
are on the same thread & process?  If not, your read may run while the
write is still in progress, and then no guarantees are made.
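If a reader in another thread or process cannot be ruled out, a common workaround is to write to a temporary name and rename it over the final path once the write has completed. The sketch below is an assumption-laden illustration, not something pyarrow does for you: `write_atomically` and `write_bytes` are hypothetical helpers, and `write_fn` stands in for something like `lambda p: pq.write_table(table, p)`.

```python
import os
import tempfile


def write_atomically(path, write_fn):
    """Call write_fn(tmp_path), then rename tmp_path over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # write_fn reopens the file by name
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def write_bytes(payload):
    """Build a writer callable; a stand-in for a Parquet writer."""
    def _write(p):
        with open(p, "wb") as f:
            f.write(payload)
    return _write


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.bin")
    write_atomically(target, write_bytes(b"first"))
    write_atomically(target, write_bytes(b"second version"))
    with open(target, "rb") as f:
        final_contents = f.read()
    leftovers = [n for n in os.listdir(tmp) if n.endswith(".tmp")]
```

With this pattern a reader that opens the final path sees either the previous complete file or the new complete file, never a partially written one.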
