Posted to user@arrow.apache.org by Sam Shleifer <ss...@gmail.com> on 2021/02/27 18:27:11 UTC

Python Plasma Store Best Practices

Hi!

I am trying to use the Plasma store to reduce the memory usage of a PyTorch dataset/dataloader combination, and I have 4 questions. I don’t think any of them require PyTorch knowledge. If you prefer to comment inline, there is a Quip doc with identical content and prettier formatting here: https://quip.com/3mwGAJ9KR2HT

*1)* My script starts the plasma store from Python with 200 GB:

nbytes = (1024 ** 3) * 200
_server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])

where nbytes is chosen arbitrarily. From my experiments it seems that one should start the store as large as possible within the limits of /dev/shm. I wanted to verify whether this is actually the best practice (it would be hard for my app to know its storage needs up front) and also whether there is an automated way to figure out how much storage to allocate.
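
For what it's worth, the only automated approach I have come up with is to size the store from the free space on /dev/shm at startup (a sketch; the 0.9 headroom factor is my own arbitrary choice, not something from the docs):

import shutil
import subprocess

# Size the store from whatever /dev/shm currently has free, instead of a hard-coded 200 GB.
# The 0.9 factor just leaves some headroom for other users of /dev/shm; it is a guess.
shm_free = shutil.disk_usage("/dev/shm").free
nbytes = int(shm_free * 0.9)
_server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])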

*2)* Does the plasma store support simultaneous reads? My code, which has multiple clients all asking for the 6 arrays from the plasma store thousands of times, was segfaulting with different errors, e.g.

Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1

until I added a lock around my client.get call:

from filelock import FileLock  # from the filelock package

if self.use_lock:  # Fix segfault: serialize gets across processes
    with FileLock("/tmp/plasma_lock"):
        ret = self.client.get(self.object_id)
else:
    ret = self.client.get(self.object_id)

which fixes the segfaults.

Here is a full traceback of the failure without the lock: https://gist.github.com/sshleifer/75145ba828fcb4e998d5e34c46ce13fc

Is this expected behavior?
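
For context, here is roughly the access pattern I mean, with each dataloader worker lazily opening its own client rather than sharing one (a simplified sketch; the socket path is a placeholder, not my actual config):

import pyarrow.plasma as plasma

PLASMA_PATH = "/tmp/plasma"  # placeholder socket path

class PlasmaView:
    """Each worker process lazily connects its own client instead of sharing one."""

    def __init__(self, object_id):
        self.object_id = object_id
        self._client = None

    @property
    def client(self):
        if self._client is None:
            self._client = plasma.connect(PLASMA_PATH)
        return self._client

    def get(self):
        # Multiple worker processes call this concurrently for the same object_id.
        return self.client.get(self.object_id)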

*3)* Is there a simple way to add many objects to the plasma store at once? Right now, we are considering changing,

oid = client.put(array)

to

oids = [client.put(x) for x in array]

so that we can fetch one entry at a time, but the writes are much slower.

* 3a) Is there a lower-level interface for bulk writes?

* 3b) Or is it recommended to chunk the array and have different Python processes write simultaneously to make this faster? (A sketch of what I mean is below.)
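
Roughly the chunked, multi-process write I have in mind for 3b (a sketch only; the process count is arbitrary, and I return raw object-id bytes because I am not sure ObjectID instances pickle cleanly across processes):

import numpy as np
import pyarrow.plasma as plasma
from multiprocessing import Pool

PLASMA_PATH = "/tmp/plasma"  # placeholder socket path

def put_chunk(chunk):
    # Each worker process opens its own client and writes its slice of rows.
    client = plasma.connect(PLASMA_PATH)
    return [client.put(row).binary() for row in chunk]

def parallel_put(array, num_procs=4):
    chunks = np.array_split(array, num_procs)
    with Pool(num_procs) as pool:
        # Flatten the per-chunk id lists back into a single list of 20-byte ids.
        return [oid for oids in pool.map(put_chunk, chunks) for oid in oids]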

*4)* Is there a way to save/load the contents of the plasma store to disk without loading everything into memory and then saving it to some other format?
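
The closest I have gotten is streaming entries out one at a time, which avoids holding everything in one process at once but still round-trips through another format, so I am hoping there is something better (a sketch; it assumes I already track the object ids myself, and uses pickle only for illustration):

import pickle
import pyarrow.plasma as plasma

def dump_objects(object_ids, out_path, plasma_path="/tmp/plasma"):
    # Stream one object at a time so only a single entry is materialized in this
    # process; the store itself still holds everything in /dev/shm.
    client = plasma.connect(plasma_path)
    with open(out_path, "wb") as f:
        for oid in object_ids:
            obj = client.get(oid)
            pickle.dump((oid.binary(), obj), f)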

Replication

Setup instructions for fairseq+replicating the segfault: https://gist.github.com/sshleifer/bd6982b3f632f1d4bcefc9feceb30b1a

My code is here: https://github.com/pytorch/fairseq/pull/3287

Thanks!

Sam

Re: Python Plasma Store Best Practices

Posted by Wes McKinney <we...@gmail.com>.
Also to be clear, if someone wants to maintain it, they are more than
welcome to do so.


Re: Python Plasma Store Best Practices

Posted by Sam Shleifer <ss...@gmail.com>.
Thanks, had no idea!


Re: Python Plasma Store Best Practices

Posted by Micah Kornfield <em...@gmail.com>.
Hi Sam,
I think the lack of responses might be because Plasma is not being actively
maintained.  The original authors have forked it into the Ray project.

I'm sorry I don't have the expertise to answer your questions.

-Micah


Re: Python Plasma Store Best Practices

Posted by Sam Shleifer <ss...@gmail.com>.
Partial answers are super helpful!

I'm happy to break this up if it's too much for 1 question @moderators

Sam
