Posted to dev@arrow.apache.org by Corey Nolet <cj...@gmail.com> on 2018/07/10 15:45:41 UTC

Pyarrow Plasma client.release() fault

I'm on a system with 12TB of memory and attempting to use Pyarrow's Plasma
client to convert a series of CSV files (via Pandas) into a Parquet store.

I've got a little over 20k CSV files to process, which are about 1-2 GB each.
I'm loading 500 to 1000 files at a time.

In each iteration, I'm loading a series of files, partitioning them by a
time field into separate dataframes, then writing parquet files in
directories for each day.
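
Roughly, each iteration does something like this (a minimal sketch; the
"timestamp" column name and output layout are stand-ins for what I actually
use):

    import os
    import uuid

    import pandas as pd

    def csv_to_daily_parquet(path):
        # "timestamp" is a stand-in for the actual time field
        df = pd.read_csv(path, parse_dates=["timestamp"])
        for day, chunk in df.groupby(df["timestamp"].dt.floor("D")):
            out_dir = os.path.join("parquet", str(day.date()))
            os.makedirs(out_dir, exist_ok=True)
            # one file per chunk; a uuid avoids collisions across workers
            chunk.to_parquet(os.path.join(out_dir, uuid.uuid4().hex + ".parquet"))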

The problem I'm having is that the Plasma client & server appear to lock up
after about 2-3 iterations. It locks up to the point where I can't even
CTRL+C the server. I am able to stop the notebook, but re-running the code
just locks up again as soon as it touches Plasma from Jupyter. There are no
errors in my logs to indicate that anything is wrong.

To make sure I wasn't just being impatient and simply needed to wait for
some background services to finish, I let the code run overnight, and it
was still in the same state when I came in to work this morning. I'm
running the Plasma server with a 4TB maximum.

In an attempt to proactively free up some of the object IDs that I no
longer need, I also tried the client.release() function, but I cannot
figure out how to make it work properly. It crashes my Jupyter kernel each
time I try.

I'm using Pyarrow 0.9.0.

Thanks in advance.

Re: Pyarrow Plasma client.release() fault

Posted by Philipp Moritz <pc...@gmail.com>.
Also, you should avoid calling release() directly, because it is already
called automatically here:

https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx#L222

Instead, you should call "del buffer" on the PlasmaBuffer. I'll submit a PR
to make the release method private.
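
For example, something like this (a minimal sketch assuming the pyarrow
0.9-era API, where plasma.connect takes a manager socket and release delay
and client.get returns PlasmaBuffer objects; the socket path and object ID
are placeholders):

    import pyarrow.plasma as plasma

    # socket path, manager socket, and release delay are placeholders
    client = plasma.connect("/tmp/plasma", "", 0)

    object_id = plasma.ObjectID(20 * b"\x00")  # placeholder ID
    [buf] = client.get([object_id])            # a PlasmaBuffer

    # ... use buf ...

    # dropping the last reference triggers the automatic release;
    # no explicit client.release() call is needed
    del buf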

The only real way to see what is going on is to have code that reproduces
your workload (maybe by shrinking the data size we can make it run on a
smaller machine). Can you reproduce the problem with synthetic data?

-- Philipp.

On Fri, Jul 20, 2018 at 1:48 PM, Robert Nishihara <robertnishihara@gmail.com> wrote:

> [snip]

Re: Pyarrow Plasma client.release() fault

Posted by Robert Nishihara <ro...@gmail.com>.
Hi Corey,

It is possible that the current eviction policy will evict a ton of objects
at once. Since the plasma store is single-threaded, this could make the
store unresponsive while the eviction is happening (though it should not
hang permanently, just temporarily).

You could always try starting the plasma store with a smaller amount of
memory (using the "-m" flag) and see if that changes things.
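
For example, something like (the socket path and size here are just
placeholders):

    import subprocess

    # e.g. cap the store at 200 GB instead of 4 TB ("-m" takes bytes)
    subprocess.Popen(["plasma_store", "-m", str(200 * 10**9), "-s", "/tmp/plasma"])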

Glad to hear that ray is simplifying things.

-Robert

On Fri, Jul 20, 2018 at 1:30 PM Corey Nolet <cj...@gmail.com> wrote:

> [snip]

Re: Pyarrow Plasma client.release() fault

Posted by Corey Nolet <cj...@gmail.com>.
Robert,

Yes, I am using separate Plasma clients in each thread. I also verified
that I am not using up all the file descriptors or reaching the overcommit
limit.

I do see that the Plasma server is evicting objects every so often. I'm
assuming this eviction may be going on in the background? Is it possible
that the lock-up is the result of a massive eviction? I am allocating over
8TB for the Plasma server.

Wes,

Best practices would be great. I did find that the @ray.remote scheduler
from the Ray project has drastically simplified my code.
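
The per-file work now looks roughly like this (a sketch; process_file
stands in for my actual conversion function):

    import ray

    ray.init()

    @ray.remote
    def process_file(path):
        # stand-in for the CSV -> per-day DataFrame conversion
        ...

    paths = ["a.csv", "b.csv"]  # placeholder file list
    results = ray.get([process_file.remote(p) for p in paths])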

I also tried single-node PySpark, but the type conversion I need for going
from CSV to DataFrames was orders of magnitude slower than Pandas and
Python.



On Mon, Jul 16, 2018 at 8:17 PM Wes McKinney <we...@gmail.com> wrote:

> [snip]

Re: Pyarrow Plasma client.release() fault

Posted by Wes McKinney <we...@gmail.com>.
Seems like we might want to write down some best practices for this level
of large-scale usage, essentially a supercomputer-like rig. I wouldn't even
know where to come by a machine with > 2TB of memory for scalability /
concurrency load testing.

On Mon, Jul 16, 2018 at 2:59 PM, Robert Nishihara <ro...@gmail.com> wrote:

> [snip]

Re: Pyarrow Plasma client.release() fault

Posted by Robert Nishihara <ro...@gmail.com>.
Are you using the same plasma client from all of the different threads? If
so, that could cause race conditions as the client is not thread safe.

Alternatively, if you have a separate plasma client for each thread, then
you may be running out of file descriptors somewhere (either the client
process or the store).
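
With multiprocessing, one way to guarantee a separate client per worker is
to create it in a pool initializer, so each worker process gets its own
connection. A rough sketch (the socket path is a placeholder):

    import multiprocessing as mp

    import pyarrow.plasma as plasma

    client = None  # one client per worker process

    def init_worker():
        global client
        client = plasma.connect("/tmp/plasma", "", 0)  # placeholder socket

    def process_file(path):
        # uses the per-process client created in init_worker
        ...

    if __name__ == "__main__":
        with mp.Pool(initializer=init_worker) as pool:
            pool.map(process_file, ["a.csv", "b.csv"])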

Can you check whether the object store is evicting objects (it prints
something to stdout/stderr when this happens)? Could you be running out of
memory but failing to release the objects?

On Tue, Jul 10, 2018 at 9:48 AM Corey Nolet <cj...@gmail.com> wrote:

> [snip]

Re: Pyarrow Plasma client.release() fault

Posted by Corey Nolet <cj...@gmail.com>.
Update:

I'm investigating the possibility that I've reached the overcommit limit in
the kernel as a result of all the parallel processes.
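
To check, I'm comparing the kernel's commit limit against what is currently
committed (Linux-specific sketch):

    # Committed_AS vs. CommitLimit shows how close we are to the limit
    with open("/proc/meminfo") as f:
        info = dict(line.split(":", 1) for line in f)
    print("CommitLimit: ", info["CommitLimit"].strip())
    print("Committed_AS:", info["Committed_AS"].strip())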

This still doesn't fix the client.release() problem, but it might explain
why the processing appears to halt after some time until I restart the
Jupyter kernel.

On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet <cj...@gmail.com> wrote:

> [snip]

Re: Pyarrow Plasma client.release() fault

Posted by Corey Nolet <cj...@gmail.com>.
Wes,

Unfortunately, my code is on a separate network. I'll try to explain what
I'm doing, and if you need further detail, I can certainly pseudocode the
specifics.

I am using multiprocessing.Pool() to fire up a bunch of worker processes
for different filenames. In each worker, I'm performing a pd.read_csv(),
sorting by the timestamp field (rounded to the day), and chunking the
DataFrame into separate DataFrames. I create a new Plasma ObjectID for each
of the chunked DataFrames, convert them to RecordBatch objects, stream the
bytes to Plasma, and seal the objects. Only the object IDs are returned to
the orchestrating process.
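
Per chunk, the write path looks roughly like this (a sketch; the calls
follow the pyarrow 0.9-era Plasma docs, and the ID generation is a
placeholder):

    import numpy as np
    import pyarrow as pa
    import pyarrow.plasma as plasma

    def put_dataframe(client, df):
        batch = pa.RecordBatch.from_pandas(df)
        # first pass: measure the serialized size with a mock sink
        mock_sink = pa.MockOutputStream()
        writer = pa.RecordBatchStreamWriter(mock_sink, batch.schema)
        writer.write_batch(batch)
        writer.close()
        # second pass: stream the batch into a plasma buffer and seal it
        object_id = plasma.ObjectID(np.random.bytes(20))
        buf = client.create(object_id, mock_sink.size())
        writer = pa.RecordBatchStreamWriter(pa.FixedSizeBufferWriter(buf),
                                            batch.schema)
        writer.write_batch(batch)
        writer.close()
        client.seal(object_id)
        return object_id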

In follow-on processing, I'm combining the ObjectIDs for each of the unique
day timestamps into lists, and I'm passing those into a function in
parallel using multiprocessing.Pool(). In this function, I iterate through
the lists of object IDs, load them back into DataFrames, append them
together until their size exceeds some predefined threshold, and perform a
df.to_parquet().
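
And the read-back side looks roughly like this (same caveats as the sketch
above):

    import pandas as pd
    import pyarrow as pa

    def read_back(client, object_ids):
        dfs = []
        for object_id in object_ids:
            [buf] = client.get([object_id])
            reader = pa.RecordBatchStreamReader(pa.BufferReader(buf))
            dfs.append(reader.read_all().to_pandas())
            del buf  # drop our reference; plasma releases it once unused
        return pd.concat(dfs)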

The steps in the two paragraphs above are performed in a loop, batching up
500-1k files at a time for each iteration.

After a few iterations of this loop, the Plasma client eventually locks up.
As for the release() fault, it doesn't seem to matter when or where I call
it (in the orchestrating process or in the workers); it always crashes the
Jupyter kernel. I'm thinking I might be using it wrong; I'm just trying to
figure out where and how.

Thanks again!

On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney <we...@gmail.com> wrote:

> [snip]

Re: Pyarrow Plasma client.release() fault

Posted by Wes McKinney <we...@gmail.com>.
hi Corey,

Can you provide the code (or a simplified version thereof) that shows
how you're using Plasma?

- Wes

On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet <cj...@gmail.com> wrote:

> [snip]