Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2020/08/15 04:55:55 UTC

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

>
> Regarding Plasma, you're right we should have started this conversation
> earlier! The way it's being developed in Ray currently isn't useful as a
> standalone project. We realized that tighter integration with Ray's object
> lifetime tracking could be important, and removing IPCs and making it a
> separate thread in the same process as our scheduler could make a big
> difference for performance. Some of these optimizations wouldn't be easy
> without a tight integration, so there are some trade-offs here.

So I guess the question is whether it is worth continuing to try to
maintain a separate version of Plasma within the Arrow repo?


On Tue, Jul 21, 2020 at 9:28 AM Robert Nishihara <ro...@gmail.com>
wrote:

> Hi all,
>
> Regarding Plasma, you're right we should have started this conversation
> earlier! The way it's being developed in Ray currently isn't useful as a
> standalone project. We realized that tighter integration with Ray's object
> lifetime tracking could be important, and removing IPCs and making it a
> separate thread in the same process as our scheduler could make a big
> difference for performance. Some of these optimizations wouldn't be easy
> without a tight integration, so there are some trade-offs here.
>
> Regarding the Python serialization format, I agree with Antoine that it
> should be deprecated. We began developing it before pickle 5, but now that
> pickle 5 has taken off, it makes less sense (it's useful in its own right,
> but at the end of the day, we were interested in it as a way to serialize
> arbitrary Python objects).
>
> -Robert
>
> On Sun, Jul 12, 2020 at 5:26 PM Wes McKinney <we...@gmail.com> wrote:
>
> > I'll add deprecation warnings to the pyarrow.serialize functions in
> > question, it will be pretty simple.
> >
> > On Sun, Jul 12, 2020, 6:34 PM Neal Richardson <neal.p.richardson@gmail.com>
> > wrote:
> >
> > > This seems like something to investigate after the 1.0 release.
> > >
> > > Neal
> > >
> > > On Sun, Jul 12, 2020 at 11:53 AM Antoine Pitrou <an...@python.org>
> > > wrote:
> > >
> > > >
> > > > I'd certainly like to deprecate our custom Python serialization format,
> > > > and using pickle protocol 5 instead is a very good idea.
> > > >
> > > > We can probably keep it in 1.0 while raising a FutureWarning.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 12/07/2020 at 19:22, Wes McKinney wrote:
> > > > > It appears that the Ray developers have decided to fork Plasma and
> > > > > decouple from the Arrow codebase:
> > > > >
> > > > > https://github.com/ray-project/ray/pull/9154
> > > > >
> > > > > > This is a disappointing development to occur without any discussion on
> > > > > > this mailing list but given the lack of development activity on Plasma
> > > > > > I would like to see how others in the community would like to proceed.
> > > > >
> > > > > > It appears additionally that the Union-based serialization format
> > > > > > implemented by arrow/python/serialize.h and pyarrow/serialize.py
> > > > > > has been dropped in favor of pickle5. If there is no value in
> > > > > > maintaining this code then it would probably be preferable for us to
> > > > > > remove this from the codebase.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > >
> > >
> >
>
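
For readers following the serialization part of this thread, here is a
minimal sketch of the two ideas discussed above: serializing arbitrary
Python objects with pickle protocol 5 (Python 3.8+) using out-of-band
buffers, and keeping an old entry point alive while raising a
FutureWarning. The function names dumps_pickle5, loads_pickle5 and
legacy_serialize are hypothetical and chosen only for illustration; this
is not the code that was added to pyarrow.

```python
import pickle
import warnings


def dumps_pickle5(obj):
    """Serialize obj with pickle protocol 5, keeping large buffers out-of-band."""
    buffers = []
    payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
    return payload, buffers


def loads_pickle5(payload, buffers):
    """Rebuild the object from the pickle stream plus its out-of-band buffers."""
    return pickle.loads(payload, buffers=buffers)


def legacy_serialize(obj):
    """Hypothetical deprecated entry point; not pyarrow's actual function."""
    warnings.warn(
        "legacy_serialize is deprecated; use pickle protocol 5 instead",
        FutureWarning,
        stacklevel=2,
    )
    return dumps_pickle5(obj)


# PickleBuffer marks the bytearray as eligible for out-of-band transfer,
# so its contents are handed to buffer_callback instead of being copied
# into the pickle stream.
record = {"name": "batch-0", "data": pickle.PickleBuffer(bytearray(b"x" * 1024))}
payload, buffers = dumps_pickle5(record)
restored = loads_pickle5(payload, buffers)
```

The payload stays small because the 1024-byte buffer travels separately
in `buffers`; a caller can place those buffers in shared memory or send
them over a transport without an extra copy.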

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Wes McKinney <we...@gmail.com>.
To be clear, if someone wants to step up as the Plasma maintainer in
Apache Arrow, that's completely fine -- that would be a good outcome.
Many of us had already been concerned for a while about Plasma's
maintenance status -- lots of stale PRs and low engagement on JIRA
issues and mailing list discussions, so now that the hard fork has
happened we want to make sure that we aren't creating the wrong
expectations by shipping a piece of software that has lost its
erstwhile maintainers.

On Sun, Sep 27, 2020 at 6:35 AM Niklas B <ni...@enplore.com> wrote:
>
> We too rely heavily on Plasma (we use Ray as well, but also Plasma independently of Ray). I’ve started a thread on the Ray dev list to see if Ray’s Plasma can be used standalone outside of Ray as well. That would allow those of us who use Plasma to move to a standalone “Ray Plasma” when/if it’s removed from Arrow.
>
> > On 26 Sep 2020, at 00:30, Wes McKinney <we...@gmail.com> wrote:
> >
> > I'd suggest as a preliminary that we stop packaging Plasma for 1-2
> > releases to see who is affected by the component's removal. Usage may
> > be more widespread than we realize, and we don't have much telemetry
> > to know for certain.
> >
> > On Tue, Aug 18, 2020 at 1:26 PM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> Also, the fact that Ray has forked Plasma means their implementation
> >> becomes potentially incompatible with Arrow's.  So even if we keep
> >> Plasma in our codebase, we can't guarantee interoperability with Ray.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 18/08/2020 at 19:51, Wes McKinney wrote:
> >>> I do not think there is an urgency to remove Plasma from the Arrow
> >>> codebase (as it currently does not cause much maintenance burden), but
> >>> the reality is that Ray has already hard-forked and so new maintainers
> >>> will need to come out of the woodwork to help support the project if
> >>> it is to continue having a life of its own. I started this thread to
> >>> create more awareness of the issue so that existing Plasma
> >>> stakeholders can make themselves known and possibly volunteer their
> >>> time to develop and maintain the codebase.
> >>>
> >>> On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin
> >>> <ma...@vallentin.net> wrote:
> >>>>
> >>>> We are very interested in Plasma as a stand-alone project. The fork would
> >>>> hit us doubly hard, because it reduces both the appeal of an Arrow-specific
> >>>> use case as well as our planned Ray integration.
> >>>>
> >>>> We are developing effectively a database for network activity data that
> >>>> runs with Arrow as data plane. See https://github.com/tenzir/vast for
> >>>> details. One of our upcoming features is supporting a 1:N output channel
> >>>> using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
> >>>> process the same data set that's exactly materialized in memory once. We
> >>>> currently don't have the developer bandwidth to prioritize this effort, but
> >>>> the concurrent, multi-tool processing capability was one of the main
> >>>> strategic reasons to go with Arrow as data plane. If Plasma has no future,
> >>>> Arrow has a reduced appeal for us in the medium term.
> >>>>
> >>>> We also have Ray as a data consumer on our roadmap, but the dependency
> >>>> chain seems now inverted. If we have to do costly custom plumbing for Ray,
> >>>> with a custom version of Plasma, the Ray integration will lose quite a bit
> >>>> of appeal because it doesn't fit into the existing 1:N model. That is, even
> >>>> though the fork may make sense from a Ray-internal point of view, it
> >>>> decreases the appeal of Ray from the outside. (Again, only speaking shared
> >>>> data plane here.)
> >>>>
> >>>> In the future, we're happy to contribute cycles when it comes to keeping
> >>>> Plasma as a useful standalone project. We recently made sure that static
> >>>> builds work as expected <https://github.com/apache/arrow/pull/7842>. As of
> >>>> now, we unfortunately cannot commit to anything specific though, but our
> >>>> interest extends to Gandiva, Flight, and lots of other parts of the Arrow
> >>>> ecosystem.
> >>>>
> >>>> On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara <ro...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> To answer Wes's question, the Plasma inside of Ray is not currently usable
> >>>>> in a C++ library context, though it wouldn't be impossible to make that
> >>>>> happen.
> >>>>>
> >>>>> I (or someone) could conduct a simple poll via Google Forms on the user
> >>>>> mailing list to gauge demand if we are concerned about breaking a lot of
> >>>>> people's workflow.
> >>>>>
> >>>>> On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou <an...@python.org> wrote:
> >>>>>
> >>>>>>
> >>>>>> On 15/08/2020 at 17:56, Wes McKinney wrote:
> >>>>>>>
> >>>>>>> What isn't clear is whether the Plasma that's in Ray is usable in a
> >>>>>>> C++ library context (e.g. what we currently ship as libplasma-dev e.g.
> >>>>>>> on Ubuntu/Debian). That seems still useful, but if the project isn't
> >>>>>>> being actively maintained / developed (which, given the series of
> >>>>>>> stale PRs over the last year or two, it doesn't seem to be) it's
> >>>>>>> unclear whether we want to keep shipping it.
> >>>>>>
> >>>>>> At least on GitHub, the C++ API seems to be getting little use.  Most
> >>>>>> search results below are forks/copies of the Arrow or Ray codebases.
> >>>>>> There are also a couple stale experiments:
> >>>>>> https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>> Antoine.
> >>>>>>
> >>>>>
>

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Niklas B <ni...@enplore.com>.
We too rely heavily on Plasma (we use Ray as well, but also Plasma independently of Ray). I’ve started a thread on the Ray dev list to see if Ray’s Plasma can be used standalone outside of Ray as well. That would allow those of us who use Plasma to move to a standalone “Ray Plasma” when/if it’s removed from Arrow.

> On 26 Sep 2020, at 00:30, Wes McKinney <we...@gmail.com> wrote:
> 
> I'd suggest as a preliminary that we stop packaging Plasma for 1-2
> releases to see who is affected by the component's removal. Usage may
> be more widespread than we realize, and we don't have much telemetry
> to know for certain.
> 
> On Tue, Aug 18, 2020 at 1:26 PM Antoine Pitrou <an...@python.org> wrote:
>> 
>> 
>> Also, the fact that Ray has forked Plasma means their implementation
>> becomes potentially incompatible with Arrow's.  So even if we keep
>> Plasma in our codebase, we can't guarantee interoperability with Ray.
>> 
>> Regards
>> 
>> Antoine.
>> 
>> 
>> On 18/08/2020 at 19:51, Wes McKinney wrote:
>>> I do not think there is an urgency to remove Plasma from the Arrow
>>> codebase (as it currently does not cause much maintenance burden), but
>>> the reality is that Ray has already hard-forked and so new maintainers
>>> will need to come out of the woodwork to help support the project if
>>> it is to continue having a life of its own. I started this thread to
>>> create more awareness of the issue so that existing Plasma
>>> stakeholders can make themselves known and possibly volunteer their
>>> time to develop and maintain the codebase.
>>> 
>>> On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin
>>> <ma...@vallentin.net> wrote:
>>>> 
>>>> We are very interested in Plasma as a stand-alone project. The fork would
>>>> hit us doubly hard, because it reduces both the appeal of an Arrow-specific
>>>> use case as well as our planned Ray integration.
>>>> 
>>>> We are developing effectively a database for network activity data that
>>>> runs with Arrow as data plane. See https://github.com/tenzir/vast for
>>>> details. One of our upcoming features is supporting a 1:N output channel
>>>> using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
>>>> process the same data set that's exactly materialized in memory once. We
>>>> currently don't have the developer bandwidth to prioritize this effort, but
>>>> the concurrent, multi-tool processing capability was one of the main
>>>> strategic reasons to go with Arrow as data plane. If Plasma has no future,
>>>> Arrow has a reduced appeal for us in the medium term.
>>>> 
>>>> We also have Ray as a data consumer on our roadmap, but the dependency
>>>> chain seems now inverted. If we have to do costly custom plumbing for Ray,
>>>> with a custom version of Plasma, the Ray integration will lose quite a bit
>>>> of appeal because it doesn't fit into the existing 1:N model. That is, even
>>>> though the fork may make sense from a Ray-internal point of view, it
>>>> decreases the appeal of Ray from the outside. (Again, only speaking shared
>>>> data plane here.)
>>>> 
>>>> In the future, we're happy to contribute cycles when it comes to keeping
>>>> Plasma as a useful standalone project. We recently made sure that static
>>>> builds work as expected <https://github.com/apache/arrow/pull/7842>. As of
>>>> now, we unfortunately cannot commit to anything specific though, but our
>>>> interest extends to Gandiva, Flight, and lots of other parts of the Arrow
>>>> ecosystem.
>>>> 
>>>> On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara <ro...@gmail.com>
>>>> wrote:
>>>> 
>>>>> To answer Wes's question, the Plasma inside of Ray is not currently usable
>>>>> in a C++ library context, though it wouldn't be impossible to make that
>>>>> happen.
>>>>>
>>>>> I (or someone) could conduct a simple poll via Google Forms on the user
>>>>> mailing list to gauge demand if we are concerned about breaking a lot of
>>>>> people's workflow.
>>>>>
>>>>> On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou <an...@python.org> wrote:
>>>>>
>>>>>>
>>>>>> On 15/08/2020 at 17:56, Wes McKinney wrote:
>>>>>>>
>>>>>>> What isn't clear is whether the Plasma that's in Ray is usable in a
>>>>>>> C++ library context (e.g. what we currently ship as libplasma-dev e.g.
>>>>>>> on Ubuntu/Debian). That seems still useful, but if the project isn't
>>>>>>> being actively maintained / developed (which, given the series of
>>>>>>> stale PRs over the last year or two, it doesn't seem to be) it's
>>>>>>> unclear whether we want to keep shipping it.
>>>>>>
>>>>>> At least on GitHub, the C++ API seems to be getting little use.  Most
>>>>>> search results below are forks/copies of the Arrow or Ray codebases.
>>>>>> There are also a couple stale experiments:
>>>>>> https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
>>>>>>
>>>>>


Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Wes McKinney <we...@gmail.com>.
I'd suggest as a preliminary that we stop packaging Plasma for 1-2
releases to see who is affected by the component's removal. Usage may
be more widespread than we realize, and we don't have much telemetry
to know for certain.

On Tue, Aug 18, 2020 at 1:26 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Also, the fact that Ray has forked Plasma means their implementation
> becomes potentially incompatible with Arrow's.  So even if we keep
> Plasma in our codebase, we can't guarantee interoperability with Ray.
>
> Regards
>
> Antoine.
>
>
> On 18/08/2020 at 19:51, Wes McKinney wrote:
> > I do not think there is an urgency to remove Plasma from the Arrow
> > codebase (as it currently does not cause much maintenance burden), but
> > the reality is that Ray has already hard-forked and so new maintainers
> > will need to come out of the woodwork to help support the project if
> > it is to continue having a life of its own. I started this thread to
> > create more awareness of the issue so that existing Plasma
> > stakeholders can make themselves known and possibly volunteer their
> > time to develop and maintain the codebase.
> >
> > On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin
> > <ma...@vallentin.net> wrote:
> >>
> >> We are very interested in Plasma as a stand-alone project. The fork would
> >> hit us doubly hard, because it reduces both the appeal of an Arrow-specific
> >> use case as well as our planned Ray integration.
> >>
> >> We are developing effectively a database for network activity data that
> >> runs with Arrow as data plane. See https://github.com/tenzir/vast for
> >> details. One of our upcoming features is supporting a 1:N output channel
> >> using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
> >> process the same data set that's exactly materialized in memory once. We
> >> currently don't have the developer bandwidth to prioritize this effort, but
> >> the concurrent, multi-tool processing capability was one of the main
> >> strategic reasons to go with Arrow as data plane. If Plasma has no future,
> >> Arrow has a reduced appeal for us in the medium term.
> >>
> >> We also have Ray as a data consumer on our roadmap, but the dependency
> >> chain seems now inverted. If we have to do costly custom plumbing for Ray,
> >> with a custom version of Plasma, the Ray integration will lose quite a bit
> >> of appeal because it doesn't fit into the existing 1:N model. That is, even
> >> though the fork may make sense from a Ray-internal point of view, it
> >> decreases the appeal of Ray from the outside. (Again, only speaking shared
> >> data plane here.)
> >>
> >> In the future, we're happy to contribute cycles when it comes to keeping
> >> Plasma as a useful standalone project. We recently made sure that static
> >> builds work as expected <https://github.com/apache/arrow/pull/7842>. As of
> >> now, we unfortunately cannot commit to anything specific though, but our
> >> interest extends to Gandiva, Flight, and lots of other parts of the Arrow
> >> ecosystem.
> >>
> >> On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara <ro...@gmail.com>
> >> wrote:
> >>
> >>> To answer Wes's question, the Plasma inside of Ray is not currently usable
> >>> in a C++ library context, though it wouldn't be impossible to make that
> >>> happen.
> >>>
> >>> I (or someone) could conduct a simple poll via Google Forms on the user
> >>> mailing list to gauge demand if we are concerned about breaking a lot of
> >>> people's workflow.
> >>>
> >>> On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou <an...@python.org> wrote:
> >>>
> >>>>
> >>>> On 15/08/2020 at 17:56, Wes McKinney wrote:
> >>>>>
> >>>>> What isn't clear is whether the Plasma that's in Ray is usable in a
> >>>>> C++ library context (e.g. what we currently ship as libplasma-dev e.g.
> >>>>> on Ubuntu/Debian). That seems still useful, but if the project isn't
> >>>>> being actively maintained / developed (which, given the series of
> >>>>> stale PRs over the last year or two, it doesn't seem to be) it's
> >>>>> unclear whether we want to keep shipping it.
> >>>>
> >>>> At least on GitHub, the C++ API seems to be getting little use.  Most
> >>>> search results below are forks/copies of the Arrow or Ray codebases.
> >>>> There are also a couple stale experiments:
> >>>> https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Antoine Pitrou <an...@python.org>.
Also, the fact that Ray has forked Plasma means their implementation
becomes potentially incompatible with Arrow's.  So even if we keep
Plasma in our codebase, we can't guarantee interoperability with Ray.

Regards

Antoine.


On 18/08/2020 at 19:51, Wes McKinney wrote:
> I do not think there is an urgency to remove Plasma from the Arrow
> codebase (as it currently does not cause much maintenance burden), but
> the reality is that Ray has already hard-forked and so new maintainers
> will need to come out of the woodwork to help support the project if
> it is to continue having a life of its own. I started this thread to
> create more awareness of the issue so that existing Plasma
> stakeholders can make themselves known and possibly volunteer their
> time to develop and maintain the codebase.
> 
> On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin
> <ma...@vallentin.net> wrote:
>>
>> We are very interested in Plasma as a stand-alone project. The fork would
>> hit us doubly hard, because it reduces both the appeal of an Arrow-specific
>> use case as well as our planned Ray integration.
>>
>> We are developing effectively a database for network activity data that
>> runs with Arrow as data plane. See https://github.com/tenzir/vast for
>> details. One of our upcoming features is supporting a 1:N output channel
>> using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
>> process the same data set that's exactly materialized in memory once. We
>> currently don't have the developer bandwidth to prioritize this effort, but
>> the concurrent, multi-tool processing capability was one of the main
>> strategic reasons to go with Arrow as data plane. If Plasma has no future,
>> Arrow has a reduced appeal for us in the medium term.
>>
>> We also have Ray as a data consumer on our roadmap, but the dependency
>> chain seems now inverted. If we have to do costly custom plumbing for Ray,
>> with a custom version of Plasma, the Ray integration will lose quite a bit
>> of appeal because it doesn't fit into the existing 1:N model. That is, even
>> though the fork may make sense from a Ray-internal point of view, it
>> decreases the appeal of Ray from the outside. (Again, only speaking shared
>> data plane here.)
>>
>> In the future, we're happy to contribute cycles when it comes to keeping
>> Plasma as a useful standalone project. We recently made sure that static
>> builds work as expected <https://github.com/apache/arrow/pull/7842>. As of
>> now, we unfortunately cannot commit to anything specific though, but our
>> interest extends to Gandiva, Flight, and lots of other parts of the Arrow
>> ecosystem.
>>
>> On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara <ro...@gmail.com>
>> wrote:
>>
>>> To answer Wes's question, the Plasma inside of Ray is not currently usable
>>> in a C++ library context, though it wouldn't be impossible to make that
>>> happen.
>>>
>>> I (or someone) could conduct a simple poll via Google Forms on the user
>>> mailing list to gauge demand if we are concerned about breaking a lot of
>>> people's workflow.
>>>
>>> On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou <an...@python.org> wrote:
>>>
>>>>
>>>> On 15/08/2020 at 17:56, Wes McKinney wrote:
>>>>>
>>>>> What isn't clear is whether the Plasma that's in Ray is usable in a
>>>>> C++ library context (e.g. what we currently ship as libplasma-dev e.g.
>>>>> on Ubuntu/Debian). That seems still useful, but if the project isn't
>>>>> being actively maintained / developed (which, given the series of
>>>>> stale PRs over the last year or two, it doesn't seem to be) it's
>>>>> unclear whether we want to keep shipping it.
>>>>
>>>> At least on GitHub, the C++ API seems to be getting little use.  Most
>>>> search results below are forks/copies of the Arrow or Ray codebases.
>>>> There are also a couple stale experiments:
>>>> https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Wes McKinney <we...@gmail.com>.
I do not think there is an urgency to remove Plasma from the Arrow
codebase (as it currently does not cause much maintenance burden), but
the reality is that Ray has already hard-forked and so new maintainers
will need to come out of the woodwork to help support the project if
it is to continue having a life of its own. I started this thread to
create more awareness of the issue so that existing Plasma
stakeholders can make themselves known and possibly volunteer their
time to develop and maintain the codebase.

On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin
<ma...@vallentin.net> wrote:
>
> We are very interested in Plasma as a stand-alone project. The fork would
> hit us doubly hard, because it reduces both the appeal of an Arrow-specific
> use case as well as our planned Ray integration.
>
> We are developing effectively a database for network activity data that
> runs with Arrow as data plane. See https://github.com/tenzir/vast for
> details. One of our upcoming features is supporting a 1:N output channel
> using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
> process the same data set that's exactly materialized in memory once. We
> currently don't have the developer bandwidth to prioritize this effort, but
> the concurrent, multi-tool processing capability was one of the main
> strategic reasons to go with Arrow as data plane. If Plasma has no future,
> Arrow has a reduced appeal for us in the medium term.
>
> We also have Ray as a data consumer on our roadmap, but the dependency
> chain seems now inverted. If we have to do costly custom plumbing for Ray,
> with a custom version of Plasma, the Ray integration will lose quite a bit
> of appeal because it doesn't fit into the existing 1:N model. That is, even
> though the fork may make sense from a Ray-internal point of view, it
> decreases the appeal of Ray from the outside. (Again, only speaking shared
> data plane here.)
>
> In the future, we're happy to contribute cycles when it comes to keeping
> Plasma as a useful standalone project. We recently made sure that static
> builds work as expected <https://github.com/apache/arrow/pull/7842>. As of
> now, we unfortunately cannot commit to anything specific though, but our
> interest extends to Gandiva, Flight, and lots of other parts of the Arrow
> ecosystem.
>
> On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara <ro...@gmail.com>
> wrote:
>
> > To answer Wes's question, the Plasma inside of Ray is not currently usable
> > in a C++ library context, though it wouldn't be impossible to make that
> > happen.
> >
> > I (or someone) could conduct a simple poll via Google Forms on the user
> > mailing list to gauge demand if we are concerned about breaking a lot of
> > people's workflow.
> >
> > On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou <an...@python.org> wrote:
> >
> > >
> > > On 15/08/2020 at 17:56, Wes McKinney wrote:
> > > >
> > > > What isn't clear is whether the Plasma that's in Ray is usable in a
> > > > C++ library context (e.g. what we currently ship as libplasma-dev e.g.
> > > > on Ubuntu/Debian). That seems still useful, but if the project isn't
> > > > being actively maintained / developed (which, given the series of
> > > > stale PRs over the last year or two, it doesn't seem to be) it's
> > > > unclear whether we want to keep shipping it.
> > >
> > > At least on GitHub, the C++ API seems to be getting little use.  Most
> > > search results below are forks/copies of the Arrow or Ray codebases.
> > > There are also a couple stale experiments:
> > > https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Matthias Vallentin <ma...@vallentin.net>.
We are very interested in Plasma as a stand-alone project. The fork would
hit us doubly hard, because it reduces both the appeal of an Arrow-specific
use case as well as our planned Ray integration.

We are developing effectively a database for network activity data that
runs with Arrow as data plane. See https://github.com/tenzir/vast for
details. One of our upcoming features is supporting a 1:N output channel
using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
process the same data set that's materialized in memory exactly once. We
currently don't have the developer bandwidth to prioritize this effort, but
the concurrent, multi-tool processing capability was one of the main
strategic reasons to go with Arrow as data plane. If Plasma has no future,
Arrow has a reduced appeal for us in the medium term.

We also have Ray as a data consumer on our roadmap, but the dependency
chain seems now inverted. If we have to do costly custom plumbing for Ray,
with a custom version of Plasma, the Ray integration will lose quite a bit
of appeal because it doesn't fit into the existing 1:N model. That is, even
though the fork may make sense from a Ray-internal point of view, it
decreases the appeal of Ray from the outside. (Again, speaking only of the
shared data plane here.)

In the future, we're happy to contribute cycles when it comes to keeping
Plasma as a useful standalone project. We recently made sure that static
builds work as expected <https://github.com/apache/arrow/pull/7842>. As of
now, we unfortunately cannot commit to anything specific though, but our
interest extends to Gandiva, Flight, and lots of other parts of the Arrow
ecosystem.
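
To make the 1:N idea in this message concrete, here is a rough sketch of
one producer materializing a payload once and several consumers attaching
to it by name. It does not use Plasma's API; as a stand-in it uses
Python's standard-library multiprocessing.shared_memory (Python 3.8+),
and the helper names publish and consume are hypothetical. Plasma itself
adds a store server, object IDs and sealing on top of shared memory, none
of which is modeled here.

```python
from multiprocessing import shared_memory


def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Producer: materialize the payload once in a named shared-memory segment."""
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[: len(payload)] = payload
    return segment


def consume(name: str, length: int) -> bytes:
    """Consumer: attach to the existing segment by name and read from it."""
    segment = shared_memory.SharedMemory(name=name)
    try:
        # Copying to bytes keeps the example simple; a real consumer
        # could operate on segment.buf directly to avoid the copy.
        return bytes(segment.buf[:length])
    finally:
        segment.close()


payload = b"network-activity-batch"
producer = publish(payload)
try:
    # Three consumers read the same segment; the data is written only once.
    results = [consume(producer.name, len(payload)) for _ in range(3)]
finally:
    producer.close()
    producer.unlink()
```

In a real deployment the consumers would be separate processes (or
separate tools) that receive only the segment name and length, which is
the role an object ID plays in a shared object store.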

On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara <ro...@gmail.com>
wrote:

> To answer Wes's question, the Plasma inside of Ray is not currently usable
> in a C++ library context, though it wouldn't be impossible to make that
> happen.
>
> I (or someone) could conduct a simple poll via Google Forms on the user
> mailing list to gauge demand if we are concerned about breaking a lot of
> people's workflow.
>
> On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > On 15/08/2020 at 17:56, Wes McKinney wrote:
> > >
> > > What isn't clear is whether the Plasma that's in Ray is usable in a
> > > C++ library context (e.g. what we currently ship as libplasma-dev e.g.
> > > on Ubuntu/Debian). That seems still useful, but if the project isn't
> > > being actively maintained / developed (which, given the series of
> > > stale PRs over the last year or two, it doesn't seem to be) it's
> > > unclear whether we want to keep shipping it.
> >
> > At least on GitHub, the C++ API seems to be getting little use.  Most
> > search results below are forks/copies of the Arrow or Ray codebases.
> > There are also a couple stale experiments:
> > https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
> >
> > Regards
> >
> > Antoine.
> >
>

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Robert Nishihara <ro...@gmail.com>.
To answer Wes's question, the Plasma inside of Ray is not currently usable
in a C++ library context, though it wouldn't be impossible to make that
happen.

I (or someone) could conduct a simple poll via Google Forms on the user
mailing list to gauge demand if we are concerned about breaking a lot of
people's workflow.

On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou <an...@python.org> wrote:

>
> On 15/08/2020 at 17:56, Wes McKinney wrote:
> >
> > What isn't clear is whether the Plasma that's in Ray is usable in a
> > C++ library context (e.g. what we currently ship as libplasma-dev
> > on Ubuntu/Debian). That seems still useful, but if the project isn't
> > being actively maintained / developed (which, given the series of
> > stale PRs over the last year or two, it doesn't seem to be) it's
> > unclear whether we want to keep shipping it.
>
> At least on GitHub, the C++ API seems to be getting little use.  Most
> search results below are forks/copies of the Arrow or Ray codebases.
> There are also a couple stale experiments:
> https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
>
> Regards
>
> Antoine.
>

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Antoine Pitrou <an...@python.org>.
On 15/08/2020 at 17:56, Wes McKinney wrote:
> 
> What isn't clear is whether the Plasma that's in Ray is usable in a
> C++ library context (e.g. what we currently ship as libplasma-dev
> on Ubuntu/Debian). That seems still useful, but if the project isn't
> being actively maintained / developed (which, given the series of
> stale PRs over the last year or two, it doesn't seem to be) it's
> unclear whether we want to keep shipping it.

At least on GitHub, the C++ API seems to be getting little use.  Most
search results below are forks/copies of the Arrow or Ray codebases.
There are also a couple stale experiments:
https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code

Regards

Antoine.

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

Posted by Wes McKinney <we...@gmail.com>.
On Fri, Aug 14, 2020 at 11:56 PM Micah Kornfield <em...@gmail.com> wrote:
>
> >
> > Regarding Plasma, you're right we should have started this conversation
> > earlier! The way it's being developed in Ray currently isn't useful as a
> > standalone project. We realized that tighter integration with Ray's object
> > lifetime tracking could be important, and removing IPCs and making it a
> > separate thread in the same process as our scheduler could make a big
> > difference for performance. Some of these optimizations wouldn't be easy
> > without a tight integration, so there are some trade-offs here.
>
> So I guess the question is whether it is worth continuing to try to
> maintain a separate version of Plasma within the Arrow repo?
>

What isn't clear is whether the Plasma that's in Ray is usable in a
C++ library context (e.g. what we currently ship as libplasma-dev
on Ubuntu/Debian). That seems still useful, but if the project isn't
being actively maintained / developed (which, given the series of
stale PRs over the last year or two, it doesn't seem to be) it's
unclear whether we want to keep shipping it.

> On Tue, Jul 21, 2020 at 9:28 AM Robert Nishihara <ro...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > Regarding Plasma, you're right we should have started this conversation
> > earlier! The way it's being developed in Ray currently isn't useful as a
> > standalone project. We realized that tighter integration with Ray's object
> > lifetime tracking could be important, and removing IPCs and making it a
> > separate thread in the same process as our scheduler could make a big
> > difference for performance. Some of these optimizations wouldn't be easy
> > without a tight integration, so there are some trade-offs here.
> >
> > Regarding the Python serialization format, I agree with Antoine that it
> > should be deprecated. We began developing it before pickle 5, but now that
> > pickle 5 has taken off, it makes less sense (it's useful in its own right,
> > but at the end of the day, we were interested in it as a way to serialize
> > arbitrary Python objects).
> >
> > -Robert
> >
> > On Sun, Jul 12, 2020 at 5:26 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > I'll add deprecation warnings to the pyarrow.serialize functions in
> > > question, it will be pretty simple.
> > >
> > > On Sun, Jul 12, 2020, 6:34 PM Neal Richardson <neal.p.richardson@gmail.com>
> > > wrote:
> > >
> > > > This seems like something to investigate after the 1.0 release.
> > > >
> > > > Neal
> > > >
> > > > On Sun, Jul 12, 2020 at 11:53 AM Antoine Pitrou <an...@python.org>
> > > > wrote:
> > > >
> > > > >
> > > > > I'd certainly like to deprecate our custom Python serialization format,
> > > > > and using pickle protocol 5 instead is a very good idea.
> > > > >
> > > > > We can probably keep it in 1.0 while raising a FutureWarning.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On 12/07/2020 at 19:22, Wes McKinney wrote:
> > > > > > It appears that the Ray developers have decided to fork Plasma and
> > > > > > decouple from the Arrow codebase:
> > > > > >
> > > > > > https://github.com/ray-project/ray/pull/9154
> > > > > >
> > > > > > This is a disappointing development to occur without any discussion on
> > > > > > this mailing list but given the lack of development activity on Plasma
> > > > > > I would like to see how others in the community would like to proceed.
> > > > > >
> > > > > > It appears additionally that the Union-based serialization format
> > > > > > implemented by arrow/python/serialize.h and pyarrow/serialize.py
> > > > > > has been dropped in favor of pickle5. If there is no value in
> > > > > > maintaining this code then it would probably be preferable for us to
> > > > > > remove this from the codebase.
> > > > > >
> > > > > > Thanks,
> > > > > > Wes
> > > > > >
> > > > >
> > > >
> > >
> >
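
The pickle protocol 5 mechanism referred to in this thread can be
illustrated with a minimal sketch. It assumes Python 3.8 or later and uses
only the standard library; the payload and variable names are hypothetical,
and this is not the code used by Ray or pyarrow. It only shows why protocol
5 is attractive for large buffers: the data can travel out-of-band instead
of being copied into the pickle stream.

```python
import pickle

# Hypothetical payload: a large mutable buffer standing in for array data.
payload = bytearray(b"x" * 1_000_000)

# Out-of-band buffers are collected here instead of being embedded in the
# pickle stream.
buffers = []

# Wrapping the payload in PickleBuffer makes it eligible for out-of-band
# transfer; buffer_callback receives each such buffer.
data = pickle.dumps(
    pickle.PickleBuffer(payload),
    protocol=5,
    buffer_callback=buffers.append,
)

# The receiver must be given the same buffers, in the same order.
restored = pickle.loads(data, buffers=buffers)

# The stream itself stays tiny because the megabyte of data is not in it.
print(len(data), len(buffers))
```

How the buffers reach the receiver (shared memory, a socket, a file) is up
to the application; the pickle stream only records that a buffer is
expected at that position.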