Posted to dev@arrow.apache.org by Bryan Cutler <cu...@gmail.com> on 2019/04/21 05:45:03 UTC

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

The Arrow data format is not yet stable, meaning there are no guarantees on
backwards/forwards compatibility. Once version 1.0 is released, it will
have those guarantees but it's hard to say when that will be. The remaining
work to get there can be seen at
https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
So yes, it is a risk that exposing Spark data as Arrow could cause an issue
if handled by a different version that is not compatible. That being said,
changes to format are not taken lightly and are backwards compatible when
possible. I think it would be fair to mark the APIs exposing Arrow data as
experimental for the time being, and clearly state the version that must be
used to be compatible in the docs. Also, adding features like this and
SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
release. Adding the Arrow dev list to CC.
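
As an illustration of that approach, a minimal sketch in Scala of how such an API could be marked and documented (the object and method names here are hypothetical; only the @Experimental annotation is an existing Spark marker, and the version string is an assumption):

    import org.apache.spark.annotation.Experimental
    import org.apache.spark.sql.{Dataset, Row}

    object ArrowExport {
      // Assumed Arrow columnar format version the exported bytes follow.
      val arrowFormatVersion: String = "0.13.0"

      /**
       * :: Experimental ::
       * Returns each partition of `df` as an Arrow IPC stream payload.
       * Until Arrow 1.0 stabilizes the format, only readers built against
       * `arrowFormatVersion` are guaranteed to decode the result.
       */
      @Experimental
      def toArrowPayloads(df: Dataset[Row]): Array[Array[Byte]] = ???
    }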

Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com>
wrote:

> Okay, that makes sense, but is the Arrow data format stable? If not, we
> risk breakage when Arrow changes in the future and some libraries using
> this feature begin to use the new Arrow code.
>
> Matei
>
> > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
> >
> > I want to be clear that this SPIP is not proposing exposing Arrow
> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
> because of the overlap between the two SPIPs I scaled this one back to
> concentrate just on the columnar processing aspects. Sorry for the
> confusion as I didn't update the JIRA description clearly enough when we
> adjusted it during the discussion on the JIRA.  As part of the columnar
> processing, we plan on providing arrow formatted data, but that will be
> exposed through a Spark owned API.
> >
> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com>
> wrote:
> > FYI, I’d also be concerned about exposing the Arrow API or format as a
> public API if it’s not yet stable. Is stabilization of the API and format
> coming soon on the roadmap there? Maybe someone can work with the Arrow
> community to make that happen.
> >
> > We’ve been bitten lots of times by API changes forced by external
> libraries even when those were widely popular. For example, we used Guava’s
> Optional for a while, which changed at some point, and we also had issues
> with Protobuf and Scala itself (especially how Scala’s APIs appear in
> Java). API breakage might not be as serious in dynamic languages like
> Python, where you can often keep compatibility with old behaviors, but it
> really hurts in Java and Scala.
> >
> > The problem is especially bad for us because of two aspects of how Spark
> is used:
> >
> > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> >
> > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> >
> > If there was some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
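
For illustration, a rough sketch in Scala of what consuming such a byte-array API could look like with the Arrow Java library (where the bytes come from is hypothetical; the Arrow reader classes are real):

    import java.io.ByteArrayInputStream
    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.ipc.ArrowStreamReader

    // `bytes` is assumed to hold one partition of data in Arrow IPC stream
    // format, returned by a hypothetical Spark API that exposes Array[Byte].
    def readArrowBatches(bytes: Array[Byte]): Unit = {
      val allocator = new RootAllocator(Long.MaxValue)
      val reader = new ArrowStreamReader(new ByteArrayInputStream(bytes), allocator)
      try {
        val root = reader.getVectorSchemaRoot // schema plus column vectors
        while (reader.loadNextBatch()) {      // advance to the next record batch
          println(s"read a batch with ${root.getRowCount} rows")
        }
      } finally {
        reader.close()
        allocator.close()
      }
    }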
> >
> > Matei
> >
> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
> > >
> > > I think you misunderstood the point of this SPIP. I responded to your
> comments in the SPIP JIRA.
> > >
> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
> wrote:
> > > I posted my comment in the JIRA. Main concerns here:
> > >
> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
> a 1.0 release someday.
> > > 2. ML/DL systems that can benefit from columnar format are mostly in
> Python.
> > > 3. Simple operations, though they benefit from vectorization, might not be
> worth the data exchange overhead.
> > >
> > > So would an improved Pandas UDF API be good enough? For example,
> SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > >
> > > Sorry, I should have joined the discussion earlier! Hope it is not too
> late :)
> > >
> > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > >
> > >
> > >
> > > From: Jules Damji <dm...@comcast.net>
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler <cu...@gmail.com>
> > > Cc: Dev <de...@spark.apache.org>
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > >
> > >
> > >
> > > + (non-binding)
> > >
> > > Sent from my iPhone
> > >
> > > Pardon the dumb thumb typos :)
> > >
> > >
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com> wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
> > >
> > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > >
> > >
> > >
> > > Jason
> > >
> > >
> > >
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
> <tg...@yahoo.com.invalid> wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > >
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > >
> > >
> > >
> > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > >
> > >
> > >
> > > Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
> > >
> > >
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > >
> > > [ ] +0
> > >
> > > [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Tom Graves
> > >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Bobby Evans <re...@gmail.com>.
Agreed.

Tom, could you cancel the vote?



On Mon, Apr 22, 2019 at 1:07 PM Reynold Xin <rx...@databricks.com> wrote:

> "if others think it would be helpful, we can cancel this vote, update the
> SPIP to clarify exactly what I am proposing, and then restart the vote
> after we have gotten more agreement on what APIs should be exposed"
>
> That'd be very useful. At least I was confused by what the SPIP was about.
> No point voting on something when there is still a lot of confusion about
> what it is.
>
>
> On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans <re...@gmail.com> wrote:
>
>> Xiangrui Meng,
>>
>> I provided some examples in the original discussion thread.
>>
>>
>> https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E
>>
>> But the concrete use case that we have is GPU accelerated ETL on Spark.
>> Primarily as data preparation and feature engineering for ML tools like
>> XGBoost, which by the way exposes a Spark specific scala API, not just a
>> python one. We built a proof of concept and saw decent performance gains.
>> Enough gains to more than pay for the added cost of a GPU, with the
>> potential for even better performance in the future. With that proof of
>> concept, we were able to make all of the processing columnar end-to-end for
>> many queries so there really weren't any data conversion costs to overcome,
>> but we did want the design flexible enough to include a cost-based
>> optimizer.
>>
>> It looks like there is some confusion around this SPIP especially in how
>> it relates to features in other SPIPs around data exchange between
>> different systems. I didn't want to update the text of this SPIP while it
>> was under an active vote, but if others think it would be helpful, we can
>> cancel this vote, update the SPIP to clarify exactly what I am proposing,
>> and then restart the vote after we have gotten more agreement on what APIs
>> should be exposed.
>>
>> Thanks,
>>
>> Bobby
>>
>> On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <me...@gmail.com> wrote:
>>
>> Per Robert's comment on the JIRA, ETL is the main use case for the SPIP.
>> I think the SPIP should list a concrete ETL use case (from POC?) that can
>> benefit from this *public Java/Scala API*, does *vectorization*, and
>> significantly *boosts the performance* even with data conversion overhead.
>>
>> The current mid-term success (Pandas UDF) doesn't match the purpose of
>> the SPIP and it can be done without exposing any public APIs.
>>
>> Depending on how much benefit it brings, we might agree that a public
>> Java/Scala API is needed. Then we might want to step slightly into how. I
>> saw four options mentioned in the JIRA and discussion threads:
>>
>> 1. Expose `Array[Byte]` in Arrow format. Let user decode it using an
>> Arrow library.
>> 2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
>> 3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also
>> used by Spark internals. It makes it hard for us to change Spark internals
>> in the future.
>> 4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
>> maintain conversion between internal `ColumnarBatch` and
>> `SparkRecordBatch`. It might cause conversion overhead in the future if
>> our internals become different from Arrow.
>>
>> Note that both 3 and 4 will make many APIs public to be Arrow compatible.
>> So we should really give concrete ETL cases to prove that it is important
>> for us to do so.
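
As a rough sketch of what option 4 above could look like (every name below is hypothetical, for illustration only):

    // A Spark-owned, Arrow-compatible columnar API. Spark would maintain the
    // conversion from its internal ColumnarBatch, which is where the overhead
    // mentioned in option 4 would live.
    trait SparkColumnVector {
      def numNulls: Int
      def isNullAt(rowId: Int): Boolean
      def getInt(rowId: Int): Int
      // ... further accessors mirroring the Arrow columnar layout
    }

    trait SparkRecordBatch extends AutoCloseable {
      def numRows: Int
      def numCols: Int
      def column(ordinal: Int): SparkColumnVector
    }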
>>
>> On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tg...@yahoo.com> wrote:
>>
>> Given that there is still discussion and Spark Summit is this week, I'm
>> going to extend the vote until Friday the 26th.
>>
>> Tom
>> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <re...@gmail.com>
>> wrote:
>>
>> Yes, it is technically possible for the layout to change. No, it is not
>> going to happen. It is already baked into several different official
>> libraries which are widely used, not just for holding and processing the
>> data, but also for transfer of the data between the various
>> implementations. There would have to be a really serious reason to force an
>> incompatible change at this point. So in the worst case, we can version the
>> layout and bake that into the API that exposes the internal layout of the
>> data. That way code that wants to program against a Java API can do so
>> using the API that Spark provides; those who want to interface with
>> something that expects the data in Arrow format will already have to know
>> what version of the format it was programmed against, and in the worst
>> case, if the layout does change, we can support the new layout if needed.
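
A small sketch in Scala of how that layout versioning could be baked into an exposed API (all names hypothetical):

    // The exposed columnar data advertises which revision of the Arrow-style
    // layout its buffers follow, so callers can check compatibility before
    // handing the buffers to an Arrow library.
    object ColumnarLayout {
      val CurrentVersion: Int = 1
    }

    trait VersionedColumnarData {
      /** Layout revision of the underlying buffers. */
      def layoutVersion: Int
      def buffers: Array[Array[Byte]]
    }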
>>
>> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:
>>
>> The Arrow data format is not yet stable, meaning there are no guarantees
>> on backwards/forwards compatibility. Once version 1.0 is released, it will
>> have those guarantees but it's hard to say when that will be. The remaining
>> work to get there can be seen at
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
>> So yes, it is a risk that exposing Spark data as Arrow could cause an issue
>> if handled by a different version that is not compatible. That being said,
>> changes to format are not taken lightly and are backwards compatible when
>> possible. I think it would be fair to mark the APIs exposing Arrow data as
>> experimental for the time being, and clearly state the version that must be
>> used to be compatible in the docs. Also, adding features like this and
>> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
>> release. Adding the Arrow dev list to CC.
>>
>> Bryan
>>
>> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com>
>> wrote:
>>
>> Okay, that makes sense, but is the Arrow data format stable? If not, we
>> risk breakage when Arrow changes in the future and some libraries using
>> this feature begin to use the new Arrow code.
>>
>> Matei
>>
>> On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
>>
>> I want to be clear that this SPIP is not proposing exposing Arrow
>>
>> APIs/Classes through any Spark APIs. SPARK-24579 is doing that, and
>> because of the overlap between the two SPIPs I scaled this one back to
>> concentrate just on the columnar processing aspects. Sorry for the
>> confusion as I didn't update the JIRA description clearly enough when we
>> adjusted it during the discussion on the JIRA. As part of the columnar
>> processing, we plan on providing arrow formatted data, but that will be
>> exposed through a Spark owned API.
>>
>> On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com>
>>
>> wrote:
>>
>> FYI, I’d also be concerned about exposing the Arrow API or format as a
>>
>> public API if it’s not yet stable. Is stabilization of the API and format
>> coming soon on the roadmap there? Maybe someone can work with the Arrow
>> community to make that happen.
>>
>> We’ve been bitten lots of times by API changes forced by external
>>
>> libraries even when those were widely popular. For example, we used
>> Guava’s Optional for a while, which changed at some point, and we also had
>> issues with Protobuf and Scala itself (especially how Scala’s APIs appear
>> in Java). API breakage might not be as serious in dynamic languages like
>> Python, where you can often keep compatibility with old behaviors, but it
>> really hurts in Java and Scala.
>>
>> The problem is especially bad for us because of two aspects of how
>>
>> Spark is used:
>>
>> 1) Spark is used for production data transformation jobs that people
>>
>> need to keep running for a long time. Nobody wants to make changes to a
>> job that’s been working fine and computing something correctly for years
>> just to get a bug fix from the latest Spark release or whatever. It’s much
>> better if they can upgrade Spark without editing every job.
>>
>> 2) Spark is often used as “glue” to combine data processing code in
>>
>> other libraries, and these might start to require different versions of
>> our dependencies. For example, the Guava class exposed in Spark became a
>> problem when third-party libraries started requiring a new version of
>> Guava: those new libraries just couldn’t work with Spark. Protobuf was
>> especially bad because some users wanted to read data stored as Protobufs
>> (or in a format that uses Protobuf inside), so they needed a different
>> version of the library in their main data processing code.
>>
>> If there was some guarantee that this stuff would remain
>>
>> backward-compatible, we’d be in a much better place. It’s not that hard
>> to keep a storage format backward-compatible: just document the format and
>> extend it only in ways that don’t break the meaning of old data (for
>> example, add new version numbers or field types that are read in a
>> different way). It’s a bit harder for a Java API, but maybe Spark could
>> just expose byte arrays directly and work on those if the API is not
>> guaranteed to stay stable (that is, we’d still use our own classes to
>> manipulate the data internally, and end users could use the Arrow library
>> if they want it).
>>
>> Matei
>>
>> On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
>>
>> I think you misunderstood the point of this SPIP. I responded to your
>>
>> comments in the SPIP JIRA.
>>
>> On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
>>
>> wrote:
>>
>> I posted my comment in the JIRA. Main concerns here:
>>
>> 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a
>> 1.0 release someday.
>> 2. ML/DL systems that can benefit from columnar format are mostly in
>> Python.
>> 3. Simple operations, though they benefit from vectorization, might not be
>> worth the data exchange overhead.
>>
>> So would an improved Pandas UDF API be good enough? For example,
>> SPARK-26412 (a UDF that takes an iterator of Arrow batches).
>>
>> Sorry, I should have joined the discussion earlier! Hope it is not too late :)
>>
>> On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
>> +1 (non-binding) for better columnar data processing support.
>>
>> From: Jules Damji <dm...@comcast.net>
>> Sent: Friday, April 19, 2019 12:21 PM
>> To: Bryan Cutler <cu...@gmail.com>
>> Cc: Dev <de...@spark.apache.org>
>> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
>>
>> Columnar Processing Support
>>
>> + (non-binding)
>>
>> Sent from my iPhone
>>
>> Pardon the dumb thumb typos :)
>>
>> On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
>>
>> +1 (non-binding). Looking forward to seeing better support for
>>
>> processing columnar data.
>>
>> Jason
>>
>> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>>
>> <tg...@yahoo.com.invalid> wrote:
>>
>> Hi everyone,
>>
>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>>
>> extended Columnar Processing Support. The proposal is to extend the
>> support to allow for more columnar processing.
>>
>> You can find the full proposal in the jira at:
>>
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>>
>> Please vote as early as you can, I will leave the vote open until
>>
>> next Monday (the 22nd), 2pm CST to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>>
>> [ ] +0
>>
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thanks!
>>
>> Tom Graves
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
 Ok, I'm cancelling the vote for now then and we will make some updates to the SPIP to try to clarify.
Tom
    On Monday, April 22, 2019, 1:07:25 PM CDT, Reynold Xin <rx...@databricks.com> wrote:  
 
 "if others think it would be helpful, we can cancel this vote, update the SPIP to clarify exactly what I am proposing, and then restart the vote after we have gotten more agreement on what APIs should be exposed"

That'd be very useful. At least I was confused by what the SPIP was about. No point voting on something when there is still a lot of confusion about what it is.

On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans <re...@gmail.com> wrote:


Xiangrui Meng,

I provided some examples in the original discussion thread.

https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E

But the concrete use case that we have is GPU accelerated ETL on Spark. Primarily as data preparation and feature engineering for ML tools like XGBoost, which by the way exposes a Spark specific scala API, not just a python one. We built a proof of concept and saw decent performance gains. Enough gains to more than pay for the added cost of a GPU, with the potential for even better performance in the future. With that proof of concept, we were able to make all of the processing columnar end-to-end for many queries so there really weren't any data conversion costs to overcome, but we did want the design flexible enough to include a cost-based optimizer.

It looks like there is some confusion around this SPIP especially in how it relates to features in other SPIPs around data exchange between different systems. I didn't want to update the text of this SPIP while it was under an active vote, but if others think it would be helpful, we can cancel this vote, update the SPIP to clarify exactly what I am proposing, and then restart the vote after we have gotten more agreement on what APIs should be exposed.

Thanks,

Bobby

On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <mengxr@gmail.com> wrote:


Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I think the SPIP should list a concrete ETL use case (from POC?) that can benefit from this *public Java/Scala API*, does *vectorization*, and significantly *boosts the performance* even with data conversion overhead.

The current mid-term success (Pandas UDF) doesn't match the purpose of the SPIP and it can be done without exposing any public APIs.

Depending on how much benefit it brings, we might agree that a public Java/Scala API is needed. Then we might want to step slightly into how. I saw four options mentioned in the JIRA and discussion threads:

1. Expose `Array[Byte]` in Arrow format. Let user decode it using an Arrow library.
2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used by Spark internals. It makes it hard for us to change Spark internals in the future.
4. Expose something like `SparkRecordBatch` that is Arrow-compatible and maintain conversion between internal `ColumnarBatch` and `SparkRecordBatch`. It might cause conversion overhead in the future if our internals become different from Arrow.

Note that both 3 and 4 will make many APIs public to be Arrow compatible. So we should really give concrete ETL cases to prove that it is important for us to do so.

On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tgraves_cs@yahoo.com> wrote:


Given that there is still discussion and Spark Summit is this week, I'm going to extend the vote until Friday the 26th.

Tom
On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <revans2@gmail.com> wrote:

Yes, it is technically possible for the layout to change. No, it is not going to happen. It is already baked into several different official libraries which are widely used, not just for holding and processing the data, but also for transfer of the data between the various implementations. There would have to be a really serious reason to force an incompatible change at this point. So in the worst case, we can version the layout and bake that into the API that exposes the internal layout of the data. That way code that wants to program against a Java API can do so using the API that Spark provides; those who want to interface with something that expects the data in Arrow format will already have to know what version of the format it was programmed against, and in the worst case, if the layout does change, we can support the new layout if needed.

On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cutlerb@gmail.com> wrote:

The Arrow data format is not yet stable, meaning there are no guarantees on backwards/forwards compatibility. Once version 1.0 is released, it will have those guarantees but it's hard to say when that will be. The remaining work to get there can be seen at https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone. So yes, it is a risk that exposing Spark data as Arrow could cause an issue if handled by a different version that is not compatible. That being said, changes to format are not taken lightly and are backwards compatible when possible. I think it would be fair to mark the APIs exposing Arrow data as experimental for the time being, and clearly state the version that must be used to be compatible in the docs. Also, adding features like this and SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 release. Adding the Arrow dev list to CC.

Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaharia@gmail.com> wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we risk breakage when Arrow changes in the future and some libraries using this feature begin to use the new Arrow code.

Matei


On Apr 20, 2019, at 1:39 PM, Bobby Evans <revans2@gmail.com> wrote:

I want to be clear that this SPIP is not proposing exposing Arrow APIs/Classes through any Spark APIs. SPARK-24579 is doing that, and because of the overlap between the two SPIPs I scaled this one back to concentrate just on the columnar processing aspects. Sorry for the confusion as I didn't update the JIRA description clearly enough when we adjusted it during the discussion on the JIRA. As part of the columnar processing, we plan on providing arrow formatted data, but that will be exposed through a Spark owned API.

On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaharia@gmail.com> wrote:


FYI, I’d also be concerned about exposing the Arrow API or format as a public API if it’s not yet stable. Is stabilization of the API and format coming soon on the roadmap there? Maybe someone can work with the Arrow community to make that happen.

We’ve been bitten lots of times by API changes forced by external libraries even when those were widely popular. For example, we used Guava’s Optional for a while, which changed at some point, and we also had issues with Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API breakage might not be as serious in dynamic languages like Python, where you can often keep compatibility with old behaviors, but it really hurts in Java and Scala.


The problem is especially bad for us because of two aspects of how Spark is used:

1) Spark is used for production data transformation jobs that people need to keep running for a long time. Nobody wants to make changes to a job that’s been working fine and computing something correctly for years just to get a bug fix from the latest Spark release or whatever. It’s much better if they can upgrade Spark without editing every job.

2) Spark is often used as “glue” to combine data processing code in other libraries, and these might start to require different versions of our dependencies. For example, the Guava class exposed in Spark became a problem when third-party libraries started requiring a new version of Guava: those new libraries just couldn’t work with Spark. Protobuf was especially bad because some users wanted to read data stored as Protobufs (or in a format that uses Protobuf inside), so they needed a different version of the library in their main data processing code.

If there was some guarantee that this stuff would remain backward-compatible, we’d be in a much better place. It’s not that hard to keep a storage format backward-compatible: just document the format and extend it only in ways that don’t break the meaning of old data (for example, add new version numbers or field types that are read in a different way). It’s a bit harder for a Java API, but maybe Spark could just expose byte arrays directly and work on those if the API is not guaranteed to stay stable (that is, we’d still use our own classes to manipulate the data internally, and end users could use the Arrow library if they want it).


Matei


On Apr 20, 2019, at 8:38 AM, Bobby Evans <revans2@gmail.com> wrote:

I think you misunderstood the point of this SPIP. I responded to your comments in the SPIP JIRA.

On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <mengxr@gmail.com> wrote:

I posted my comment in the JIRA. Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 release someday.
2. ML/DL systems that can benefit from columnar format are mostly in Python.
3. Simple operations, though they benefit from vectorization, might not be worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).

Sorry, I should have joined the discussion earlier! Hope it is not too late :)



On Fri, Apr 19, 2019 at 1:20 PM <tcondie@gmail.com> wrote:
+1 (non-binding) for better columnar data processing support.

From: Jules Damji <dmatrix@comcast.net>
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler <cutlerb@gmail.com>
Cc: Dev <dev@spark.apache.org>
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support



+ (non-binding)

Sent from my iPhone

Pardon the dumb thumb typos :)

On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutlerb@gmail.com> wrote:

+1 (non-binding)

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jlowe@apache.org> wrote:

+1 (non-binding). Looking forward to seeing better support for processing columnar data.



Jason

On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tgraves_cs@yahoo.com.invalid> wrote:



Hi everyone,

I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support. The proposal is to extend the support to allow for more columnar processing.



You can find the full proposal in the jira at: https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS thread in the dev mailing list.



Please vote as early as you can, I will leave the vote open until next Monday (the 22nd), 2pm CST to give people plenty of time.



[ ] +1: Accept the proposal as an official SPIP

[ ] +0

[ ] -1: I don't think this is a good idea because ...

Thanks!

Tom Graves



---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org




  

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Reynold Xin <rx...@databricks.com>.
"if others think it would be helpful, we can cancel this vote, update the SPIP to clarify exactly what I am proposing, and then restart the vote after we have gotten more agreement on what APIs should be exposed"

That'd be very useful. At least I was confused by what the SPIP was about. No point voting on something when there is still a lot of confusion about what it is.

On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans < revans2@gmail.com > wrote:

> 
> 
> 
> Xiangrui Meng,
> 
> 
> 
> I provided some examples in the original discussion thread.
> 
> 
> 
> https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E
> 
> 
> 
> But the concrete use case that we have is GPU accelerated ETL on Spark.
> Primarily as data preparation and feature engineering for ML tools like
> XGBoost, which by the way exposes a Spark specific scala API, not just a
> python one. We built a proof of concept and saw decent performance gains.
> Enough gains to more than pay for the added cost of a GPU, with the
> potential for even better performance in the future. With that proof of
> concept, we were able to make all of the processing columnar end-to-end
> for many queries so there really weren't any data conversion costs to
> overcome, but we did want the design flexible enough to include a
> cost-based optimizer.
> 
> 
> 
> It looks like there is some confusion around this SPIP especially in how
> it relates to features in other SPIPs around data exchange between
> different systems. I didn't want to update the text of this SPIP while it
> was under an active vote, but if others think it would be helpful, we can
> cancel this vote, update the SPIP to clarify exactly what I am proposing,
> and then restart the vote after we have gotten more agreement on what APIs
> should be exposed.
> 
> 
> 
> Thanks,
> 
> 
> 
> Bobby
> 
> 
> 
> On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <mengxr@gmail.com> wrote:
> 
> 
>> 
>> 
>> Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I
>> think the SPIP should list a concrete ETL use case (from POC?) that can
>> benefit from this *public Java/Scala API*, does *vectorization*, and
>> significantly *boosts the performance* even with data conversion overhead.
>> 
>> 
>> 
>> 
>> The current mid-term success (Pandas UDF) doesn't match the purpose of
>> the SPIP and it can be done without exposing any public APIs.
>> 
>> 
>> 
>> Depending on how much benefit it brings, we might agree that a public
>> Java/Scala API is needed. Then we might want to step slightly into how. I
>> saw four options mentioned in the JIRA and discussion threads:
>> 
>> 
>> 
>> 1. Expose `Array[Byte]` in Arrow format. Let user decode it using an Arrow
>> library.
>> 2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
>> 3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used
>> by Spark internals. It makes it hard for us to change Spark internals in the
>> future.
>> 4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
>> maintain conversion between internal `ColumnarBatch` and
>> `SparkRecordBatch`. It might cause conversion overhead in the future if
>> our internal becomes different from Arrow.
>> 
>> 
>> 
>> Note that both 3 and 4 will make many APIs public to be Arrow compatible.
>> So we should really give concrete ETL cases to prove that it is important
>> for us to do so.
>> 
>> 
>> 
>> On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tgraves_cs@yahoo.com> wrote:
>>
>>> [remainder of the quoted thread trimmed; the earlier messages appear in full in the replies below]


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Bobby Evans <re...@gmail.com>.
Xiangrui Meng,

I provided some examples in the original discussion thread.

https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E

But the concrete use case that we have is GPU-accelerated ETL on Spark,
primarily as data preparation and feature engineering for ML tools like
XGBoost, which by the way exposes a Spark-specific Scala API, not just a
Python one. We built a proof of concept and saw decent performance gains.
Enough gains to more than pay for the added cost of a GPU, with the
potential for even better performance in the future.  With that proof of
concept, we were able to make all of the processing columnar end-to-end for
many queries so there really wasn't any data conversion costs to overcome,
but we did want the design flexible enough to include a cost-based
optimizer.
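
(Illustrative sketch only, not from the SPIP or the POC: roughly what that use case looks like in code, assuming the xgboost4j-spark Scala API of that era; the input path, feature columns, and label column are made up. Everything before the fit() call is ordinary Spark ETL, which is the part the proposal wants to be able to run columnar.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.log
import org.apache.spark.ml.feature.VectorAssembler
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

object EtlThenXGBoost {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-then-xgboost").getOrCreate()

    // Ordinary Spark ETL / feature engineering; this is the part the SPIP
    // wants to be able to run columnar (and, in the POC, on the GPU).
    val raw = spark.read.parquet("/data/transactions.parquet") // made-up path
    val prepared = raw
      .filter("amount > 0")
      .withColumn("log_amount", log("amount"))

    // Assemble numeric columns into the single vector column XGBoost expects.
    val assembled = new VectorAssembler()
      .setInputCols(Array("log_amount", "num_items")) // made-up feature columns
      .setOutputCol("features")
      .transform(prepared)

    // Hand the prepared DataFrame to XGBoost through its Scala API.
    val model = new XGBoostClassifier(
        Map("objective" -> "binary:logistic", "num_round" -> 50))
      .setFeaturesCol("features")
      .setLabelCol("label") // made-up label column
      .fit(assembled)

    model.transform(assembled).select("label", "prediction").show(5)
    spark.stop()
  }
}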

It looks like there is some confusion around this SPIP especially in how it
relates to features in other SPIPs around data exchange between different
systems.  I didn't want to update the text of this SPIP while it was under
an active vote, but if others think it would be helpful, we can cancel this
vote, update the SPIP to clarify exactly what I am proposing, and then
restart the vote after we have gotten more agreement on what APIs should be
exposed.

Thanks,

Bobby

On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <me...@gmail.com> wrote:

> Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I
> think the SPIP should list a concrete ETL use case (from POC?) that can
> benefit from this *public Java/Scala API, *does *vectorization*, and
> significantly *boosts the performance *even with data conversion overhead.
>
> The current mid-term success (Pandas UDF) doesn't match the purpose of
> the SPIP and it can be done without exposing any public APIs.
>
> Depending on how much benefit it brings, we might agree that a public
> Java/Scala API is needed. Then we might want to step slightly into how. I
> saw four options mentioned in the JIRA and discussion threads:
>
> 1. Expose `Array[Byte]` in Arrow format. Let user decode it using an Arrow
> library.
> 2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
> 3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used
> by Spark internals. It makes it hard for us to change Spark internals in the
> future.
> 4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
> maintain conversion between internal `ColumnarBatch` and
> `SparkRecordBatch`. It might cause conversion overhead in the future if our
> internal becomes different from Arrow.
>
> Note that both 3 and 4 will make many APIs public to be Arrow compatible.
> So we should really give concrete ETL cases to prove that it is important
> for us to do so.
>
> On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tg...@yahoo.com> wrote:
>
>>
>> Since there is still discussion and Spark Summit is this week, I'm
>> going to extend the vote til Friday the 26th.
>>
>> Tom
>> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <re...@gmail.com>
>> wrote:
>>
>>
>> Yes, it is technically possible for the layout to change.  No, it is not
>> going to happen.  It is already baked into several different official
>> libraries which are widely used, not just for holding and processing the
>> data, but also for transfer of the data between the various
>> implementations.  There would have to be a really serious reason to force
>> an incompatible change at this point.  So in the worst case, we can version
>> the layout and bake that into the API that exposes the internal layout of
>> the data.  That way code that wants to program against a JAVA API can do so
>> using the API that Spark provides, those who want to interface with
>> something that expects the data in arrow format will already have to know
>> what version of the format it was programmed against and in the worst case
>> if the layout does change we can support the new layout if needed.
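
(Illustrative sketch only; none of these names exist in Spark. It shows one way the "version the layout and bake that into the API" idea could look: the Spark-owned batch carries a format-version tag next to its buffers, and a consumer checks that tag before treating the buffers as Arrow data.)

import java.nio.ByteBuffer

// Purely hypothetical names, invented for illustration.
final case class ColumnarFormatVersion(major: Int, minor: Int)

// A Spark-owned view of one batch of columnar data, with the layout version baked in.
trait VersionedColumnarBatch {
  def formatVersion: ColumnarFormatVersion
  def numRows: Int
  // Raw buffers laid out according to formatVersion.
  def buffers: Seq[ByteBuffer]
}

object ArrowInterop {
  // The Arrow layout version this hypothetical consumer was built against.
  val Expected = ColumnarFormatVersion(0, 14)

  def toArrowOrFail(batch: VersionedColumnarBatch): Seq[ByteBuffer] = {
    require(batch.formatVersion == Expected,
      s"batch layout ${batch.formatVersion} != expected $Expected; re-encode before use")
    batch.buffers // safe to hand to an Arrow reader for this version
  }
}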
>>
>> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:
>>
>> The Arrow data format is not yet stable, meaning there are no guarantees
>> on backwards/forwards compatibility. Once version 1.0 is released, it will
>> have those guarantees but it's hard to say when that will be. The remaining
>> work to get there can be seen at
>> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
>> So yes, it is a risk that exposing Spark data as Arrow could cause an issue
>> if handled by a different version that is not compatible. That being said,
>> changes to format are not taken lightly and are backwards compatible when
>> possible. I think it would be fair to mark the APIs exposing Arrow data as
>> experimental for the time being, and clearly state the version that must be
>> used to be compatible in the docs. Also, adding features like this and
>> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
>> release. Adding the Arrow dev list to CC.
>>
>> Bryan
>>
>> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com>
>> wrote:
>>
>> Okay, that makes sense, but is the Arrow data format stable? If not, we
>> risk breakage when Arrow changes in the future and some libraries using
>> this feature begin to use the new Arrow code.
>>
>> Matei
>>
>> > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
>> >
>> > I want to be clear that this SPIP is not proposing exposing Arrow
>> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
>> because of the overlap between the two SPIPs I scaled this one back to
>> concentrate just on the columnar processing aspects. Sorry for the
>> confusion as I didn't update the JIRA description clearly enough when we
>> adjusted it during the discussion on the JIRA.  As part of the columnar
>> processing, we plan on providing arrow formatted data, but that will be
>> exposed through a Spark owned API.
>> >
>> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com>
>> wrote:
>> > FYI, I’d also be concerned about exposing the Arrow API or format as a
>> public API if it’s not yet stable. Is stabilization of the API and format
>> coming soon on the roadmap there? Maybe someone can work with the Arrow
>> community to make that happen.
>> >
>> > We’ve been bitten lots of times by API changes forced by external
>> libraries even when those were widely popular. For example, we used Guava’s
>> Optional for a while, which changed at some point, and we also had issues
>> with Protobuf and Scala itself (especially how Scala’s APIs appear in
>> Java). API breakage might not be as serious in dynamic languages like
>> Python, where you can often keep compatibility with old behaviors, but it
>> really hurts in Java and Scala.
>> >
>> > The problem is especially bad for us because of two aspects of how
>> Spark is used:
>> >
>> > 1) Spark is used for production data transformation jobs that people
>> need to keep running for a long time. Nobody wants to make changes to a job
>> that’s been working fine and computing something correctly for years just
>> to get a bug fix from the latest Spark release or whatever. It’s much
>> better if they can upgrade Spark without editing every job.
>> >
>> > 2) Spark is often used as “glue” to combine data processing code in
>> other libraries, and these might start to require different versions of our
>> dependencies. For example, the Guava class exposed in Spark became a
>> problem when third-party libraries started requiring a new version of
>> Guava: those new libraries just couldn’t work with Spark. Protobuf was
>> especially bad because some users wanted to read data stored as Protobufs
>> (or in a format that uses Protobuf inside), so they needed a different
>> version of the library in their main data processing code.
>> >
>> > If there was some guarantee that this stuff would remain
>> backward-compatible, we’d be in a much better place. It’s not that hard to
>> keep a storage format backward-compatible: just document the format and
>> extend it only in ways that don’t break the meaning of old data (for
>> example, add new version numbers or field types that are read in a
>> different way). It’s a bit harder for a Java API, but maybe Spark could
>> just expose byte arrays directly and work on those if the API is not
>> guaranteed to stay stable (that is, we’d still use our own classes to
>> manipulate the data internally, and end users could use the Arrow library
>> if they want it).
>> >
>> > Matei
>> >
>> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
>> > >
>> > > I think you misunderstood the point of this SPIP. I responded to your
>> comments in the SPIP JIRA.
>> > >
>> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
>> wrote:
>> > > I posted my comment in the JIRA. Main concerns here:
>> > >
>> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
>> 1.0 release someday.
>> > > 2. ML/DL systems that can benefit from columnar format are mostly in
>> Python.
>> > > 3. Simple operations, though benefits vectorization, might not be
>> worth the data exchange overhead.
>> > >
>> > > So would an improved Pandas UDF API be good enough? For
>> example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
>> > >
>> > > Sorry that I should join the discussion earlier! Hope it is not too
>> late:)
>> > >
>> > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
>> > > +1 (non-binding) for better columnar data processing support.
>> > >
>> > >
>> > >
>> > > From: Jules Damji <dm...@comcast.net>
>> > > Sent: Friday, April 19, 2019 12:21 PM
>> > > To: Bryan Cutler <cu...@gmail.com>
>> > > Cc: Dev <de...@spark.apache.org>
>> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
>> Columnar Processing Support
>> > >
>> > >
>> > >
>> > > + (non-binding)
>> > >
>> > > Sent from my iPhone
>> > >
>> > > Pardon the dumb thumb typos :)
>> > >
>> > >
>> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com> wrote:
>> > >
>> > > +1 (non-binding)
>> > >
>> > >
>> > >
>> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
>> > >
>> > > +1 (non-binding).  Looking forward to seeing better support for
>> processing columnar data.
>> > >
>> > >
>> > >
>> > > Jason
>> > >
>> > >
>> > >
>> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>> <tg...@yahoo.com.invalid> wrote:
>> > >
>> > > Hi everyone,
>> > >
>> > >
>> > >
>> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>> extended Columnar Processing Support.  The proposal is to extend the
>> support to allow for more columnar processing.
>> > >
>> > >
>> > >
>> > > You can find the full proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>> > >
>> > >
>> > >
>> > > Please vote as early as you can, I will leave the vote open until
>> next Monday (the 22nd), 2pm CST to give people plenty of time.
>> > >
>> > >
>> > >
>> > > [ ] +1: Accept the proposal as an official SPIP
>> > >
>> > > [ ] +0
>> > >
>> > > [ ] -1: I don't think this is a good idea because ...
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > Thanks!
>> > >
>> > > Tom Graves
>> > >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Xiangrui Meng <me...@gmail.com>.
Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I
think the SPIP should list a concrete ETL use case (from POC?) that can
benefit from this *public Java/Scala API, *does *vectorization*, and
significantly *boosts the performance *even with data conversion overhead.

The current mid-term success (Pandas UDF) doesn't match the purpose of the SPIP
and it can be done without exposing any public APIs.

Depending on how much benefit it brings, we might agree that a public
Java/Scala API is needed. Then we might want to step slightly into how. I
saw four options mentioned in the JIRA and discussion threads:

1. Expose `Array[Byte]` in Arrow format. Let user decode it using an Arrow
library.
2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used
by Spark internals. It makes it hard for us to change Spark internals in the
future.
4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
maintain conversion between internal `ColumnarBatch` and
`SparkRecordBatch`. It might cause conversion overhead in the future if our
internal becomes different from Arrow.

Note that both 3 and 4 will make many APIs public to be Arrow compatible.
So we should really give concrete ETL cases to prove that it is important
for us to do so.
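
(Illustrative sketch of option 1 above: assuming some Spark API returned an Array[Byte] holding an Arrow IPC stream; no such API exists today, so the byte array here is just a placeholder argument. The caller would decode it with the Arrow Java library roughly like this.)

import java.io.ByteArrayInputStream
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.ipc.ArrowStreamReader

object DecodeArrowBytes {
  // arrowBytes is assumed to contain an Arrow IPC *stream* (not the file format).
  def decode(arrowBytes: Array[Byte]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val reader = new ArrowStreamReader(new ByteArrayInputStream(arrowBytes), allocator)
    try {
      val root = reader.getVectorSchemaRoot          // schema plus reusable vectors
      while (reader.loadNextBatch()) {               // one record batch at a time
        println(s"batch with ${root.getRowCount} rows, schema: ${root.getSchema}")
        // root.getVector(0) or root.getVector("colName") gives typed column access
      }
    } finally {
      reader.close()
      allocator.close()
    }
  }
}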

On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tg...@yahoo.com> wrote:

>
> Since there is still discussion and Spark Summit is this week, I'm
> going to extend the vote til Friday the 26th.
>
> Tom
> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <re...@gmail.com>
> wrote:
>
>
> Yes, it is technically possible for the layout to change.  No, it is not
> going to happen.  It is already baked into several different official
> libraries which are widely used, not just for holding and processing the
> data, but also for transfer of the data between the various
> implementations.  There would have to be a really serious reason to force
> an incompatible change at this point.  So in the worst case, we can version
> the layout and bake that into the API that exposes the internal layout of
> the data.  That way code that wants to program against a JAVA API can do so
> using the API that Spark provides, those who want to interface with
> something that expects the data in arrow format will already have to know
> what version of the format it was programmed against and in the worst case
> if the layout does change we can support the new layout if needed.
>
> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:
>
> The Arrow data format is not yet stable, meaning there are no guarantees
> on backwards/forwards compatibility. Once version 1.0 is released, it will
> have those guarantees but it's hard to say when that will be. The remaining
> work to get there can be seen at
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
> So yes, it is a risk that exposing Spark data as Arrow could cause an issue
> if handled by a different version that is not compatible. That being said,
> changes to format are not taken lightly and are backwards compatible when
> possible. I think it would be fair to mark the APIs exposing Arrow data as
> experimental for the time being, and clearly state the version that must be
> used to be compatible in the docs. Also, adding features like this and
> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
> release. Adding the Arrow dev list to CC.
>
> Bryan
>
> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com>
> wrote:
>
> Okay, that makes sense, but is the Arrow data format stable? If not, we
> risk breakage when Arrow changes in the future and some libraries using
> this feature begin to use the new Arrow code.
>
> Matei
>
> > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
> >
> > I want to be clear that this SPIP is not proposing exposing Arrow
> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
> because of the overlap between the two SPIPs I scaled this one back to
> concentrate just on the columnar processing aspects. Sorry for the
> confusion as I didn't update the JIRA description clearly enough when we
> adjusted it during the discussion on the JIRA.  As part of the columnar
> processing, we plan on providing arrow formatted data, but that will be
> exposed through a Spark owned API.
> >
> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com>
> wrote:
> > FYI, I’d also be concerned about exposing the Arrow API or format as a
> public API if it’s not yet stable. Is stabilization of the API and format
> coming soon on the roadmap there? Maybe someone can work with the Arrow
> community to make that happen.
> >
> > We’ve been bitten lots of times by API changes forced by external
> libraries even when those were widely popular. For example, we used Guava’s
> Optional for a while, which changed at some point, and we also had issues
> with Protobuf and Scala itself (especially how Scala’s APIs appear in
> Java). API breakage might not be as serious in dynamic languages like
> Python, where you can often keep compatibility with old behaviors, but it
> really hurts in Java and Scala.
> >
> > The problem is especially bad for us because of two aspects of how Spark
> is used:
> >
> > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> >
> > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> >
> > If there was some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
> >
> > Matei
> >
> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
> > >
> > > I think you misunderstood the point of this SPIP. I responded to your
> comments in the SPIP JIRA.
> > >
> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
> wrote:
> > > I posted my comment in the JIRA. Main concerns here:
> > >
> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
> a 1.0 release someday.
> > > 2. ML/DL systems that can benefit from columnar format are mostly in
> Python.
> > > 3. Simple operations, though they benefit from vectorization, might not
> be worth the data exchange overhead.
> > >
> > > So would an improved Pandas UDF API be good enough? For example,
> SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > >
> > > Sorry, I should have joined the discussion earlier! Hope it is not too
> late :)
> > >
> > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > >
> > >
> > >
> > > From: Jules Damji <dm...@comcast.net>
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler <cu...@gmail.com>
> > > Cc: Dev <de...@spark.apache.org>
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > >
> > >
> > >
> > > + (non-binding)
> > >
> > > Sent from my iPhone
> > >
> > > Pardon the dumb thumb typos :)
> > >
> > >
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com> wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
> > >
> > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > >
> > >
> > >
> > > Jason
> > >
> > >
> > >
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
> <tg...@yahoo.com.invalid> wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > >
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > >
> > >
> > >
> > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > >
> > >
> > >
> > > Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
> > >
> > >
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > >
> > > [ ] +0
> > >
> > > [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Tom Graves
> > >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Xiangrui Meng <me...@gmail.com>.
Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I
think the SPIP should list a concrete ETL use case (from the POC?) that can
benefit from this *public Java/Scala API*, does *vectorization*, and
significantly *boosts the performance* even with the data conversion overhead.

The current mid-term success (Pandas UDF) doesn't match the purpose of the
SPIP, and it can be done without exposing any public APIs.

Depending on how much benefit it brings, we might agree that a public
Java/Scala API is needed. Then we might want to step slightly into the how. I
saw four options mentioned in the JIRA and discussion threads:

1. Expose `Array[Byte]` in Arrow format. Let users decode it using an Arrow
library.
2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
3. Expose `ColumnarBatch`, which is also used by Spark internals, and make it
Arrow-compatible. It makes it hard for us to change Spark internals in the
future.
4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
maintain conversion between the internal `ColumnarBatch` and
`SparkRecordBatch`. It might cause conversion overhead in the future if our
internal format diverges from Arrow.

Note that both 3 and 4 will make many APIs public to be Arrow compatible.
So we should really give concrete ETL cases to prove that it is important
for us to do so.
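
To make the trade-off concrete, a rough sketch of what option 4 might look like follows; all names are hypothetical and nothing here is an existing or proposed Spark API. The idea is that only Spark-owned types appear in public signatures, while conversion to and from the internal ColumnarBatch stays private.

    // Hypothetical Spark-owned, Arrow-compatible column accessor (option 4).
    trait SparkColumnVector {
      def numElements: Int
      def isNullAt(row: Int): Boolean
      def getInt(row: Int): Int // one accessor per supported type, elided here
    }

    // Hypothetical batch type; Arrow classes never appear in the signature.
    trait SparkRecordBatch {
      def numRows: Int
      def numColumns: Int
      def column(i: Int): SparkColumnVector
      def formatVersion: Int // version of the Arrow-compatible memory layout
    }

    // Option 1, by contrast, would skip the wrappers entirely and hand the
    // caller an Array[Byte] holding an Arrow IPC stream to decode itself.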

On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tg...@yahoo.com> wrote:

>
> Based on the fact that there is still discussion, and Spark Summit is this
> week, I'm going to extend the vote until Friday the 26th.
>
> Tom
> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <re...@gmail.com>
> wrote:
>
>
> Yes, it is technically possible for the layout to change.  No, it is not
> going to happen.  It is already baked into several different official
> libraries which are widely used, not just for holding and processing the
> data, but also for transfer of the data between the various
> implementations.  There would have to be a really serious reason to force
> an incompatible change at this point.  So in the worst case, we can version
> the layout and bake that into the API that exposes the internal layout of
> the data.  That way code that wants to program against a JAVA API can do so
> using the API that Spark provides; those who want to interface with
> something that expects the data in arrow format will already have to know
> what version of the format it was programmed against and in the worst case
> if the layout does change we can support the new layout if needed.
>
> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:
>
> The Arrow data format is not yet stable, meaning there are no guarantees
> on backwards/forwards compatibility. Once version 1.0 is released, it will
> have those guarantees but it's hard to say when that will be. The remaining
> work to get there can be seen at
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
> So yes, it is a risk that exposing Spark data as Arrow could cause an issue
> if handled by a different version that is not compatible. That being said,
> changes to format are not taken lightly and are backwards compatible when
> possible. I think it would be fair to mark the APIs exposing Arrow data as
> experimental for the time being, and clearly state the version that must be
> used to be compatible in the docs. Also, adding features like this and
> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
> release. Adding the Arrow dev list to CC.
>
> Bryan
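
As a rough sketch of that suggestion: the class and method below are hypothetical; only @Experimental is Spark's existing marker for APIs that may change between releases, and the doc comment carries the second caveat about which Arrow format version the bytes correspond to.

    import org.apache.spark.annotation.Experimental

    // Hypothetical accessor on a hypothetical class; not part of Spark.
    class ColumnarData {
      /**
       * Returns this batch serialized as an Arrow IPC stream. Only the Arrow
       * format version documented for this Spark release is supported.
       */
      @Experimental
      def toArrowStreamBytes(): Array[Byte] =
        throw new UnsupportedOperationException("sketch only")
    }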
>
> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com>
> wrote:
>
> Okay, that makes sense, but is the Arrow data format stable? If not, we
> risk breakage when Arrow changes in the future and some libraries using
> this feature begin to use the new Arrow code.
>
> Matei
>
> > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
> >
> > I want to be clear that this SPIP is not proposing exposing Arrow
> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
> because of the overlap between the two SPIPs I scaled this one back to
> concentrate just on the columnar processing aspects. Sorry for the
> confusion as I didn't update the JIRA description clearly enough when we
> adjusted it during the discussion on the JIRA.  As part of the columnar
> processing, we plan on providing arrow formatted data, but that will be
> exposed through a Spark owned API.
> >
> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com>
> wrote:
> > FYI, I’d also be concerned about exposing the Arrow API or format as a
> public API if it’s not yet stable. Is stabilization of the API and format
> coming soon on the roadmap there? Maybe someone can work with the Arrow
> community to make that happen.
> >
> > We’ve been bitten lots of times by API changes forced by external
> libraries even when those were widely popular. For example, we used Guava’s
> Optional for a while, which changed at some point, and we also had issues
> with Protobuf and Scala itself (especially how Scala’s APIs appear in
> Java). API breakage might not be as serious in dynamic languages like
> Python, where you can often keep compatibility with old behaviors, but it
> really hurts in Java and Scala.
> >
> > The problem is especially bad for us because of two aspects of how Spark
> is used:
> >
> > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> >
> > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> >
> > If there was some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
> >
> > Matei
> >
> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
> > >
> > > I think you misunderstood the point of this SPIP. I responded to your
> comments in the SPIP JIRA.
> > >
> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
> wrote:
> > > I posted my comment in the JIRA. Main concerns here:
> > >
> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
> a 1.0 release someday.
> > > 2. ML/DL systems that can benefit from columnar format are mostly in
> Python.
> > > 3. Simple operations, though they benefit from vectorization, might not
> be worth the data exchange overhead.
> > >
> > > So would an improved Pandas UDF API be good enough? For example,
> SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > >
> > > Sorry, I should have joined the discussion earlier! Hope it is not too
> late :)
> > >
> > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > >
> > >
> > >
> > > From: Jules Damji <dm...@comcast.net>
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler <cu...@gmail.com>
> > > Cc: Dev <de...@spark.apache.org>
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > >
> > >
> > >
> > > + (non-binding)
> > >
> > > Sent from my iPhone
> > >
> > > Pardon the dumb thumb typos :)
> > >
> > >
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com> wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
> > >
> > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > >
> > >
> > >
> > > Jason
> > >
> > >
> > >
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
> <tg...@yahoo.com.invalid> wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > >
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > >
> > >
> > >
> > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > >
> > >
> > >
> > > Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
> > >
> > >
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > >
> > > [ ] +0
> > >
> > > [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Tom Graves
> > >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
 
Based on the fact that there is still discussion, and Spark Summit is this week, I'm going to extend the vote until Friday the 26th.

Tom

On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <re...@gmail.com> wrote:
 
 Yes, it is technically possible for the layout to change.  No, it is not going to happen.  It is already baked into several different official libraries which are widely used, not just for holding and processing the data, but also for transfer of the data between the various implementations.  There would have to be a really serious reason to force an incompatible change at this point.  So in the worst case, we can version the layout and bake that into the API that exposes the internal layout of the data.  That way code that wants to program against a JAVA API can do so using the API that Spark provides, those who want to interface with something that expects the data in arrow format will already have to know what version of the format it was programmed against and in the worst case if the layout does change we can support the new layout if needed.
On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:

The Arrow data format is not yet stable, meaning there are no guarantees on backwards/forwards compatibility. Once version 1.0 is released, it will have those guarantees but it's hard to say when that will be. The remaining work to get there can be seen at https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone. So yes, it is a risk that exposing Spark data as Arrow could cause an issue if handled by a different version that is not compatible. That being said, changes to format are not taken lightly and are backwards compatible when possible. I think it would be fair to mark the APIs exposing Arrow data as experimental for the time being, and clearly state the version that must be used to be compatible in the docs. Also, adding features like this and SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 release. Adding the Arrow dev list to CC.
Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com> wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we risk breakage when Arrow changes in the future and some libraries using this feature begin to use the new Arrow code.

Matei

> On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
> 
> I want to be clear that this SPIP is not proposing exposing Arrow APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and because of the overlap between the two SPIPs I scaled this one back to concentrate just on the columnar processing aspects. Sorry for the confusion as I didn't update the JIRA description clearly enough when we adjusted it during the discussion on the JIRA.  As part of the columnar processing, we plan on providing arrow formatted data, but that will be exposed through a Spark owned API.
> 
> On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com> wrote:
> FYI, I’d also be concerned about exposing the Arrow API or format as a public API if it’s not yet stable. Is stabilization of the API and format coming soon on the roadmap there? Maybe someone can work with the Arrow community to make that happen.
> 
> We’ve been bitten lots of times by API changes forced by external libraries even when those were widely popular. For example, we used Guava’s Optional for a while, which changed at some point, and we also had issues with Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API breakage might not be as serious in dynamic languages like Python, where you can often keep compatibility with old behaviors, but it really hurts in Java and Scala.
> 
> The problem is especially bad for us because of two aspects of how Spark is used:
> 
> 1) Spark is used for production data transformation jobs that people need to keep running for a long time. Nobody wants to make changes to a job that’s been working fine and computing something correctly for years just to get a bug fix from the latest Spark release or whatever. It’s much better if they can upgrade Spark without editing every job.
> 
> 2) Spark is often used as “glue” to combine data processing code in other libraries, and these might start to require different versions of our dependencies. For example, the Guava class exposed in Spark became a problem when third-party libraries started requiring a new version of Guava: those new libraries just couldn’t work with Spark. Protobuf was especially bad because some users wanted to read data stored as Protobufs (or in a format that uses Protobuf inside), so they needed a different version of the library in their main data processing code.
> 
> If there was some guarantee that this stuff would remain backward-compatible, we’d be in a much better place. It’s not that hard to keep a storage format backward-compatible: just document the format and extend it only in ways that don’t break the meaning of old data (for example, add new version numbers or field types that are read in a different way). It’s a bit harder for a Java API, but maybe Spark could just expose byte arrays directly and work on those if the API is not guaranteed to stay stable (that is, we’d still use our own classes to manipulate the data internally, and end users could use the Arrow library if they want it).
> 
> Matei
> 
> > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
> > 
> > I think you misunderstood the point of this SPIP. I responded to your comments in the SPIP JIRA.
> > 
> > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com> wrote:
> > I posted my comment in the JIRA. Main concerns here:
> > 
> > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 release someday.
> > 2. ML/DL systems that can benefit from columnar format are mostly in Python.
> > 3. Simple operations, though they benefit from vectorization, might not be worth the data exchange overhead.
> > 
> > So would an improved Pandas UDF API be good enough? For example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > 
> > Sorry, I should have joined the discussion earlier! Hope it is not too late :)
> > 
> > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
> > +1 (non-binding) for better columnar data processing support.
> > 
> >  
> > 
> > From: Jules Damji <dm...@comcast.net> 
> > Sent: Friday, April 19, 2019 12:21 PM
> > To: Bryan Cutler <cu...@gmail.com>
> > Cc: Dev <de...@spark.apache.org>
> > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
> > 
> >  
> > 
> > + (non-binding)
> > 
> > Sent from my iPhone
> > 
> > Pardon the dumb thumb typos :)
> > 
> > 
> > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com> wrote:
> > 
> > +1 (non-binding)
> > 
> >  
> > 
> > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
> > 
> > +1 (non-binding).  Looking forward to seeing better support for processing columnar data.
> > 
> >  
> > 
> > Jason
> > 
> >  
> > 
> > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tg...@yahoo.com.invalid> wrote:
> > 
> > Hi everyone,
> > 
> >  
> > 
> > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support.  The proposal is to extend the support to allow for more columnar processing.
> > 
> >  
> > 
> > You can find the full proposal in the jira at: https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS thread in the dev mailing list.
> > 
> >  
> > 
> > Please vote as early as you can, I will leave the vote open until next Monday (the 22nd), 2pm CST to give people plenty of time.
> > 
> >  
> > 
> > [ ] +1: Accept the proposal as an official SPIP
> > 
> > [ ] +0
> > 
> > [ ] -1: I don't think this is a good idea because ...
> > 
> >  
> > 
> >  
> > 
> > Thanks!
> > 
> > Tom Graves
> > 
> 


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org



  

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Bryan Cutler <cu...@gmail.com>.
I looked at the updated SPIP and I think the reduced scope sounds better.
From the Spark Summit, it seemed like there was a lot of interest in
columnar processing and this would be a good starting point to enable that.
It would be great to hear some other people's input too.

Bryan

On Tue, Apr 30, 2019 at 7:21 AM Bobby Evans <bo...@apache.org> wrote:

> I wanted to give everyone a heads up that I have updated the SPIP at
> https://issues.apache.org/jira/browse/SPARK-27396 Please take a look and
> add any comments you might have to the JIRA.  I reduced the scope of the
> SPIP to just the non-controversial parts.  In the background, I will be
> trying to work with the Arrow community to get some form of guarantees
> about the stability of the standard.  That should hopefully unblock stable
> APIs so end users can write columnar UDFs in scala/java and ideally get
> efficient Arrow based batch data transfers to external tools as well.
>
> Thanks,
>
> Bobby
>
> On Tue, Apr 23, 2019 at 12:32 PM Matei Zaharia <ma...@gmail.com>
> wrote:
>
> > Just as a note here, if the goal is for the format not to change, why not
> > make that explicit in a versioning policy? You can always include a format
> > version number and say that future versions may increment the number, but
> > this specific version will always be readable in some specific way. You
> > could also put a timeline on how long old version numbers will be
> > recognized in the official libraries (e.g. 3 years).
> >
> > Matei
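
A minimal sketch of what such a policy could look like in code, assuming a hypothetical version constant and reader-side check (nothing here is an existing Spark or Arrow API):

    // Hypothetical version window for an exposed columnar layout.
    object ColumnarFormatVersions {
      val Current = 1
      // Oldest layout the official libraries still promise to read, e.g.
      // anything published within the last three years.
      val OldestSupported = 1

      def checkReadable(version: Int): Unit =
        require(
          version >= OldestSupported && version <= Current,
          s"Columnar layout version $version is outside the supported " +
            s"range [$OldestSupported, $Current]")
    }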
> >
> > > On Apr 22, 2019, at 6:36 AM, Bobby Evans <re...@gmail.com> wrote:
> > >
> > > Yes, it is technically possible for the layout to change.  No, it is
> not
> > going to happen.  It is already baked into several different official
> > libraries which are widely used, not just for holding and processing the
> > data, but also for transfer of the data between the various
> > implementations.  There would have to be a really serious reason to force
> > an incompatible change at this point.  So in the worst case, we can
> version
> > the layout and bake that into the API that exposes the internal layout of
> > the data.  That way code that wants to program against a JAVA API can do
> so
> > using the API that Spark provides; those who want to interface with
> > something that expects the data in arrow format will already have to know
> > what version of the format it was programmed against and in the worst
> case
> > if the layout does change we can support the new layout if needed.
> > >
> > > On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com>
> wrote:
> > > The Arrow data format is not yet stable, meaning there are no
> guarantees
> > on backwards/forwards compatibility. Once version 1.0 is released, it
> will
> > have those guarantees but it's hard to say when that will be. The
> remaining
> > work to get there can be seen at
> >
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone
> .
> > So yes, it is a risk that exposing Spark data as Arrow could cause an
> issue
> > if handled by a different version that is not compatible. That being
> said,
> > changes to format are not taken lightly and are backwards compatible when
> > possible. I think it would be fair to mark the APIs exposing Arrow data
> as
> > experimental for the time being, and clearly state the version that must
> be
> > used to be compatible in the docs. Also, adding features like this and
> > SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
> > release. Adding the Arrow dev list to CC.
> > >
> > > Bryan
> > >
> > > On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaharia@gmail.com
> >
> > wrote:
> > > Okay, that makes sense, but is the Arrow data format stable? If not, we
> > risk breakage when Arrow changes in the future and some libraries using
> > this feature begin to use the new Arrow code.
> > >
> > > Matei
> > >
> > > > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
> > > >
> > > > I want to be clear that this SPIP is not proposing exposing Arrow
> > APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
> > because of the overlap between the two SPIPs I scaled this one back to
> > concentrate just on the columnar processing aspects. Sorry for the
> > confusion as I didn't update the JIRA description clearly enough when we
> > adjusted it during the discussion on the JIRA.  As part of the columnar
> > processing, we plan on providing arrow formatted data, but that will be
> > exposed through a Spark owned API.
> > > >
> > > > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <
> matei.zaharia@gmail.com>
> > wrote:
> > > > FYI, I’d also be concerned about exposing the Arrow API or format as
> a
> > public API if it’s not yet stable. Is stabilization of the API and format
> > coming soon on the roadmap there? Maybe someone can work with the Arrow
> > community to make that happen.
> > > >
> > > > We’ve been bitten lots of times by API changes forced by external
> > libraries even when those were widely popular. For example, we used
> Guava’s
> > Optional for a while, which changed at some point, and we also had issues
> > with Protobuf and Scala itself (especially how Scala’s APIs appear in
> > Java). API breakage might not be as serious in dynamic languages like
> > Python, where you can often keep compatibility with old behaviors, but it
> > really hurts in Java and Scala.
> > > >
> > > > The problem is especially bad for us because of two aspects of how
> > Spark is used:
> > > >
> > > > 1) Spark is used for production data transformation jobs that people
> > need to keep running for a long time. Nobody wants to make changes to a
> job
> > that’s been working fine and computing something correctly for years just
> > to get a bug fix from the latest Spark release or whatever. It’s much
> > better if they can upgrade Spark without editing every job.
> > > >
> > > > 2) Spark is often used as “glue” to combine data processing code in
> > other libraries, and these might start to require different versions of
> our
> > dependencies. For example, the Guava class exposed in Spark became a
> > problem when third-party libraries started requiring a new version of
> > Guava: those new libraries just couldn’t work with Spark. Protobuf was
> > especially bad because some users wanted to read data stored as Protobufs
> > (or in a format that uses Protobuf inside), so they needed a different
> > version of the library in their main data processing code.
> > > >
> > > > If there was some guarantee that this stuff would remain
> > backward-compatible, we’d be in a much better place. It’s not that hard
> to
> > keep a storage format backward-compatible: just document the format and
> > extend it only in ways that don’t break the meaning of old data (for
> > example, add new version numbers or field types that are read in a
> > different way). It’s a bit harder for a Java API, but maybe Spark could
> > just expose byte arrays directly and work on those if the API is not
> > guaranteed to stay stable (that is, we’d still use our own classes to
> > manipulate the data internally, and end users could use the Arrow library
> > if they want it).
> > > >
> > > > Matei
> > > >
> > > > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com>
> wrote:
> > > > >
> > > > > I think you misunderstood the point of this SPIP. I responded to
> > your comments in the SPIP JIRA.
> > > > >
> > > > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
> > wrote:
> > > > > I posted my comment in the JIRA. Main concerns here:
> > > > >
> > > > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might
> > have a 1.0 release someday.
> > > > > 2. ML/DL systems that can benefit from columnar format are mostly
> > in Python.
> > > > > 3. Simple operations, though they benefit from vectorization, might
> > not be worth the data exchange overhead.
> > > > >
> > > > > So would an improved Pandas UDF API be good enough? For
> > example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > > > >
> > > > > Sorry, I should have joined the discussion earlier! Hope it is not
> > too late :)
> > > > >
> > > > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
> > > > > +1 (non-binding) for better columnar data processing support.
> > > > >
> > > > >
> > > > >
> > > > > From: Jules Damji <dm...@comcast.net>
> > > > > Sent: Friday, April 19, 2019 12:21 PM
> > > > > To: Bryan Cutler <cu...@gmail.com>
> > > > > Cc: Dev <de...@spark.apache.org>
> > > > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> > Columnar Processing Support
> > > > >
> > > > >
> > > > >
> > > > > + (non-binding)
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > Pardon the dumb thumb typos :)
> > > > >
> > > > >
> > > > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com>
> > wrote:
> > > > >
> > > > > +1 (non-binding)
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org>
> > wrote:
> > > > >
> > > > > +1 (non-binding).  Looking forward to seeing better support for
> > processing columnar data.
> > > > >
> > > > >
> > > > >
> > > > > Jason
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
> > <tg...@yahoo.com.invalid> wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > >
> > > > >
> > > > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> > extended Columnar Processing Support.  The proposal is to extend the
> > support to allow for more columnar processing.
> > > > >
> > > > >
> > > > >
> > > > > You can find the full proposal in the jira at:
> > https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> > DISCUSS thread in the dev mailing list.
> > > > >
> > > > >
> > > > >
> > > > > Please vote as early as you can, I will leave the vote open until
> > next Monday (the 22nd), 2pm CST to give people plenty of time.
> > > > >
> > > > >
> > > > >
> > > > > [ ] +1: Accept the proposal as an official SPIP
> > > > >
> > > > > [ ] +0
> > > > >
> > > > > [ ] -1: I don't think this is a good idea because ...
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Tom Graves
> > > > >
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> > >
> >
> >
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Bobby Evans <bo...@apache.org>.
I wanted to give everyone a heads up that I have updated the SPIP at
https://issues.apache.org/jira/browse/SPARK-27396. Please take a look and
add any comments you might have to the JIRA.  I reduced the scope of the
SPIP to just the non-controversial parts.  In the background, I will be
trying to work with the Arrow community to get some form of guarantees
about the stability of the standard.  That should hopefully unblock stable
APIs so end users can write columnar UDFs in Scala/Java and ideally get
efficient Arrow-based batch data transfers to external tools as well.

Thanks,

Bobby
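
Purely as a sketch of what a stable Java/Scala columnar UDF might eventually look like: all types below are hypothetical placeholders, not a proposed or existing Spark API, and the iterator-of-batches shape simply mirrors the SPARK-26412 idea mentioned earlier in the thread.

    // Placeholder for whatever Spark-owned, Arrow-compatible batch type the
    // SPIP ends up exposing; deliberately left abstract here.
    trait SparkRecordBatch

    // Hypothetical columnar UDF: consume batches lazily and emit batches.
    trait ColumnarUdf {
      def apply(batches: Iterator[SparkRecordBatch]): Iterator[SparkRecordBatch]
    }

    // Trivial pass-through instance; a real UDF would transform each batch
    // column by column with vectorized kernels.
    object PassThroughUdf extends ColumnarUdf {
      def apply(batches: Iterator[SparkRecordBatch]): Iterator[SparkRecordBatch] =
        batches
    }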

On Tue, Apr 23, 2019 at 12:32 PM Matei Zaharia <ma...@gmail.com>
wrote:

> Just as a note here, if the goal is for the format not to change, why not
> make that explicit in a versioning policy? You can always include a format
> version number and say that future versions may increment the number, but
> this specific version will always be readable in some specific way. You
> could also put a timeline on how long old version numbers will be
> recognized in the official libraries (e.g. 3 years).
>
> Matei
>
> > On Apr 22, 2019, at 6:36 AM, Bobby Evans <re...@gmail.com> wrote:
> >
> > Yes, it is technically possible for the layout to change.  No, it is not
> going to happen.  It is already baked into several different official
> libraries which are widely used, not just for holding and processing the
> data, but also for transfer of the data between the various
> implementations.  There would have to be a really serious reason to force
> an incompatible change at this point.  So in the worst case, we can version
> the layout and bake that into the API that exposes the internal layout of
> the data.  That way code that wants to program against a JAVA API can do so
> using the API that Spark provides; those who want to interface with
> something that expects the data in arrow format will already have to know
> what version of the format it was programmed against and in the worst case
> if the layout does change we can support the new layout if needed.
> >
> > On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:
> > The Arrow data format is not yet stable, meaning there are no guarantees
> on backwards/forwards compatibility. Once version 1.0 is released, it will
> have those guarantees but it's hard to say when that will be. The remaining
> work to get there can be seen at
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
> So yes, it is a risk that exposing Spark data as Arrow could cause an issue
> if handled by a different version that is not compatible. That being said,
> changes to format are not taken lightly and are backwards compatible when
> possible. I think it would be fair to mark the APIs exposing Arrow data as
> experimental for the time being, and clearly state the version that must be
> used to be compatible in the docs. Also, adding features like this and
> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
> release. Adding the Arrow dev list to CC.
> >
> > Bryan
> >
> > On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com>
> wrote:
> > Okay, that makes sense, but is the Arrow data format stable? If not, we
> risk breakage when Arrow changes in the future and some libraries using
> this feature begin to use the new Arrow code.
> >
> > Matei
> >
> > > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
> > >
> > > I want to be clear that this SPIP is not proposing exposing Arrow
> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
> because of the overlap between the two SPIPs I scaled this one back to
> concentrate just on the columnar processing aspects. Sorry for the
> confusion as I didn't update the JIRA description clearly enough when we
> adjusted it during the discussion on the JIRA.  As part of the columnar
> processing, we plan on providing arrow formatted data, but that will be
> exposed through a Spark owned API.
> > >
> > > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com>
> wrote:
> > > FYI, I’d also be concerned about exposing the Arrow API or format as a
> public API if it’s not yet stable. Is stabilization of the API and format
> coming soon on the roadmap there? Maybe someone can work with the Arrow
> community to make that happen.
> > >
> > > We’ve been bitten lots of times by API changes forced by external
> libraries even when those were widely popular. For example, we used Guava’s
> Optional for a while, which changed at some point, and we also had issues
> with Protobuf and Scala itself (especially how Scala’s APIs appear in
> Java). API breakage might not be as serious in dynamic languages like
> Python, where you can often keep compatibility with old behaviors, but it
> really hurts in Java and Scala.
> > >
> > > The problem is especially bad for us because of two aspects of how
> Spark is used:
> > >
> > > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> > >
> > > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> > >
> > > If there was some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
> > >
> > > Matei
> > >
> > > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
> > > >
> > > > I think you misunderstood the point of this SPIP. I responded to
> your comments in the SPIP JIRA.
> > > >
> > > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
> wrote:
> > > > I posted my comment in the JIRA. Main concerns here:
> > > >
> > > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might
> have a 1.0 release someday.
> > > > 2. ML/DL systems that can benefit from columnar format are mostly
> in Python.
> > > > 3. Simple operations, though they benefit from vectorization, might
> not be worth the data exchange overhead.
> > > >
> > > > So would an improved Pandas UDF API be good enough? For
> example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > > >
> > > > Sorry, I should have joined the discussion earlier! Hope it is not too
> late :)
> > > >
> > > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
> > > > +1 (non-binding) for better columnar data processing support.
> > > >
> > > >
> > > >
> > > > From: Jules Damji <dm...@comcast.net>
> > > > Sent: Friday, April 19, 2019 12:21 PM
> > > > To: Bryan Cutler <cu...@gmail.com>
> > > > Cc: Dev <de...@spark.apache.org>
> > > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > > >
> > > >
> > > >
> > > > + (non-binding)
> > > >
> > > > Sent from my iPhone
> > > >
> > > > Pardon the dumb thumb typos :)
> > > >
> > > >
> > > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com>
> wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > >
> > > >
> > > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org>
> wrote:
> > > >
> > > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > > >
> > > >
> > > >
> > > > Jason
> > > >
> > > >
> > > >
> > > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
> <tg...@yahoo.com.invalid> wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > >
> > > >
> > > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > > >
> > > >
> > > >
> > > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > > >
> > > >
> > > >
> > > > Please vote as early as you can, I will leave the vote open until
> next Monday (the 22nd), 2pm CST to give people plenty of time.
> > > >
> > > >
> > > >
> > > > [ ] +1: Accept the proposal as an official SPIP
> > > >
> > > > [ ] +0
> > > >
> > > > [ ] -1: I don't think this is a good idea because ...
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Thanks!
> > > >
> > > > Tom Graves
> > > >
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >
>
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Bobby Evans <bo...@apache.org>.
I wanted to give everyone a heads up that I have updated the SPIP at
https://issues.apache.org/jira/browse/SPARK-27396. Please take a look and
add any comments you might have to the JIRA.  I reduced the scope of the
SPIP to just the non-controversial parts.  In the background, I will be
trying to work with the Arrow community to get some form of guarantees
about the stability of the standard.  That should hopefully unblock stable
APIs so end users can write columnar UDFs in Scala/Java and ideally get
efficient Arrow-based batch data transfers to external tools as well.
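
To make that concrete, here is a minimal sketch of the kind of per-batch
processing such a UDF would do. It only uses Spark's existing
ColumnarBatch/ColumnVector classes for illustration; the actual user-facing
hook is exactly what the SPIP still has to define, so treat the shape of
this as an assumption rather than a proposal:

    import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}

    // Illustrative only: sum one double-typed column of a batch without
    // materializing rows. A real columnar UDF would get the column index
    // and type from the schema Spark hands it.
    def sumDoubleColumn(batch: ColumnarBatch, colIndex: Int): Double = {
      val col: ColumnVector = batch.column(colIndex)
      var total = 0.0
      var i = 0
      while (i < batch.numRows()) {
        if (!col.isNullAt(i)) total += col.getDouble(i)
        i += 1
      }
      total
    }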

Thanks,

Bobby

On Tue, Apr 23, 2019 at 12:32 PM Matei Zaharia <ma...@gmail.com>
wrote:

> Just as a note here, if the goal is for the format not to change, why not make
> that explicit in a versioning policy? You can always include a format
> version number and say that future versions may increment the number, but
> this specific version will always be readable in some specific way. You
> could also put a timeline on how long old version numbers will be
> recognized in the official libraries (e.g. 3 years).
>
> Matei
>
> > On Apr 22, 2019, at 6:36 AM, Bobby Evans <re...@gmail.com> wrote:
> >
> > Yes, it is technically possible for the layout to change.  No, it is not
> going to happen.  It is already baked into several different official
> libraries which are widely used, not just for holding and processing the
> data, but also for transfer of the data between the various
> implementations.  There would have to be a really serious reason to force
> an incompatible change at this point.  So in the worst case, we can version
> the layout and bake that into the API that exposes the internal layout of
> the data.  That way, code that wants to program against a Java API can do
> so using the API that Spark provides; those who want to interface with
> something that expects the data in Arrow format will already have to know
> what version of the format it was programmed against, and in the worst
> case, if the layout does change, we can support the new layout if needed.
> >
> > On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:
> > The Arrow data format is not yet stable, meaning there are no guarantees
> on backwards/forwards compatibility. Once version 1.0 is released, it will
> have those guarantees but it's hard to say when that will be. The remaining
> work to get there can be seen at
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
> So yes, it is a risk that exposing Spark data as Arrow could cause an issue
> if handled by a different version that is not compatible. That being said,
> changes to format are not taken lightly and are backwards compatible when
> possible. I think it would be fair to mark the APIs exposing Arrow data as
> experimental for the time being, and clearly state the version that must be
> used to be compatible in the docs. Also, adding features like this and
> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
> release. Adding the Arrow dev list to CC.
> >
> > Bryan
> >
> > On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com>
> wrote:
> > Okay, that makes sense, but is the Arrow data format stable? If not, we
> risk breakage when Arrow changes in the future and some libraries using
> this feature begin to use the new Arrow code.
> >
> > Matei
> >
> > > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
> > >
> > > I want to be clear that this SPIP is not proposing exposing Arrow
> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
> because of the overlap between the two SPIPs I scaled this one back to
> concentrate just on the columnar processing aspects. Sorry for the
> confusion as I didn't update the JIRA description clearly enough when we
> adjusted it during the discussion on the JIRA.  As part of the columnar
> processing, we plan on providing Arrow-formatted data, but that will be
> exposed through a Spark-owned API.
> > >
> > > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com>
> wrote:
> > > FYI, I’d also be concerned about exposing the Arrow API or format as a
> public API if it’s not yet stable. Is stabilization of the API and format
> coming soon on the roadmap there? Maybe someone can work with the Arrow
> community to make that happen.
> > >
> > > We’ve been bitten lots of times by API changes forced by external
> libraries even when those were widely popular. For example, we used Guava’s
> Optional for a while, which changed at some point, and we also had issues
> with Protobuf and Scala itself (especially how Scala’s APIs appear in
> Java). API breakage might not be as serious in dynamic languages like
> Python, where you can often keep compatibility with old behaviors, but it
> really hurts in Java and Scala.
> > >
> > > The problem is especially bad for us because of two aspects of how
> Spark is used:
> > >
> > > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> > >
> > > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> > >
> > > If there were some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
> > >
> > > Matei
> > >
> > > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
> > > >
> > > > I think you misunderstood the point of this SPIP. I responded to
> your comments in the SPIP JIRA.
> > > >
> > > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
> wrote:
> > > > I posted my comment in the JIRA. Main concerns here:
> > > >
> > > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might
> have a 1.0 release someday.
> > > > 2. ML/DL systems that can benefit from columnar format are mostly
> in Python.
> > > > 3. Simple operations, though they benefit from vectorization, might
> not be worth the data exchange overhead.
> > > >
> > > > So would an improved Pandas UDF API be good enough? For
> example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > > >
> > > > Sorry, I should have joined the discussion earlier! Hope it is not too
> late :)
> > > >
> > > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
> > > > +1 (non-binding) for better columnar data processing support.
> > > >
> > > >
> > > >
> > > > From: Jules Damji <dm...@comcast.net>
> > > > Sent: Friday, April 19, 2019 12:21 PM
> > > > To: Bryan Cutler <cu...@gmail.com>
> > > > Cc: Dev <de...@spark.apache.org>
> > > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > > >
> > > >
> > > >
> > > > + (non-binding)
> > > >
> > > > Sent from my iPhone
> > > >
> > > > Pardon the dumb thumb typos :)
> > > >
> > > >
> > > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com>
> wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > >
> > > >
> > > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org>
> wrote:
> > > >
> > > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > > >
> > > >
> > > >
> > > > Jason
> > > >
> > > >
> > > >
> > > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
> <tg...@yahoo.com.invalid> wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > >
> > > >
> > > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > > >
> > > >
> > > >
> > > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > > >
> > > >
> > > >
> > > > Please vote as early as you can, I will leave the vote open until
> next Monday (the 22nd), 2pm CST to give people plenty of time.
> > > >
> > > >
> > > >
> > > > [ ] +1: Accept the proposal as an official SPIP
> > > >
> > > > [ ] +0
> > > >
> > > > [ ] -1: I don't think this is a good idea because ...
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Thanks!
> > > >
> > > > Tom Graves
> > > >
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >
>
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Matei Zaharia <ma...@gmail.com>.
Just as a note here, if the goal is for the format not to change, why not make that explicit in a versioning policy? You can always include a format version number and say that future versions may increment the number, but this specific version will always be readable in some specific way. You could also put a timeline on how long old version numbers will be recognized in the official libraries (e.g. 3 years).
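
For illustration only (the header fields and version numbers below are made
up, not anything from the Arrow spec), such a policy is cheap to honor on
the reader side:

    import java.nio.ByteBuffer

    // Hypothetical versioned batch header, purely to illustrate the policy.
    final case class BatchHeader(formatVersion: Int, numRows: Int)

    def readHeader(buf: ByteBuffer): BatchHeader = {
      val version = buf.getInt()
      version match {
        // Versions the policy promises to keep readable; a newer version
        // may only add fields after the ones the older version defined,
        // so the old fields keep their meaning.
        case 1 | 2 => BatchHeader(version, buf.getInt())
        case v => throw new IllegalArgumentException(s"Unsupported format version $v")
      }
    }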

Matei

> On Apr 22, 2019, at 6:36 AM, Bobby Evans <re...@gmail.com> wrote:
> 
> Yes, it is technically possible for the layout to change.  No, it is not going to happen.  It is already baked into several different official libraries which are widely used, not just for holding and processing the data, but also for transfer of the data between the various implementations.  There would have to be a really serious reason to force an incompatible change at this point.  So in the worst case, we can version the layout and bake that into the API that exposes the internal layout of the data.  That way, code that wants to program against a Java API can do so using the API that Spark provides; those who want to interface with something that expects the data in Arrow format will already have to know what version of the format it was programmed against, and in the worst case, if the layout does change, we can support the new layout if needed.
> 
> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:
> The Arrow data format is not yet stable, meaning there are no guarantees on backwards/forwards compatibility. Once version 1.0 is released, it will have those guarantees but it's hard to say when that will be. The remaining work to get there can be seen at https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone. So yes, it is a risk that exposing Spark data as Arrow could cause an issue if handled by a different version that is not compatible. That being said, changes to format are not taken lightly and are backwards compatible when possible. I think it would be fair to mark the APIs exposing Arrow data as experimental for the time being, and clearly state the version that must be used to be compatible in the docs. Also, adding features like this and SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 release. Adding the Arrow dev list to CC.
> 
> Bryan
> 
> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com> wrote:
> Okay, that makes sense, but is the Arrow data format stable? If not, we risk breakage when Arrow changes in the future and some libraries using this feature begin to use the new Arrow code.
> 
> Matei
> 
> > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
> > 
> > I want to be clear that this SPIP is not proposing exposing Arrow APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and because of the overlap between the two SPIPs I scaled this one back to concentrate just on the columnar processing aspects. Sorry for the confusion as I didn't update the JIRA description clearly enough when we adjusted it during the discussion on the JIRA.  As part of the columnar processing, we plan on providing Arrow-formatted data, but that will be exposed through a Spark-owned API.
> > 
> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com> wrote:
> > FYI, I’d also be concerned about exposing the Arrow API or format as a public API if it’s not yet stable. Is stabilization of the API and format coming soon on the roadmap there? Maybe someone can work with the Arrow community to make that happen.
> > 
> > We’ve been bitten lots of times by API changes forced by external libraries even when those were widely popular. For example, we used Guava’s Optional for a while, which changed at some point, and we also had issues with Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API breakage might not be as serious in dynamic languages like Python, where you can often keep compatibility with old behaviors, but it really hurts in Java and Scala.
> > 
> > The problem is especially bad for us because of two aspects of how Spark is used:
> > 
> > 1) Spark is used for production data transformation jobs that people need to keep running for a long time. Nobody wants to make changes to a job that’s been working fine and computing something correctly for years just to get a bug fix from the latest Spark release or whatever. It’s much better if they can upgrade Spark without editing every job.
> > 
> > 2) Spark is often used as “glue” to combine data processing code in other libraries, and these might start to require different versions of our dependencies. For example, the Guava class exposed in Spark became a problem when third-party libraries started requiring a new version of Guava: those new libraries just couldn’t work with Spark. Protobuf was especially bad because some users wanted to read data stored as Protobufs (or in a format that uses Protobuf inside), so they needed a different version of the library in their main data processing code.
> > 
> > If there were some guarantee that this stuff would remain backward-compatible, we’d be in a much better place. It’s not that hard to keep a storage format backward-compatible: just document the format and extend it only in ways that don’t break the meaning of old data (for example, add new version numbers or field types that are read in a different way). It’s a bit harder for a Java API, but maybe Spark could just expose byte arrays directly and work on those if the API is not guaranteed to stay stable (that is, we’d still use our own classes to manipulate the data internally, and end users could use the Arrow library if they want it).
> > 
> > Matei
> > 
> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
> > > 
> > > I think you misunderstood the point of this SPIP. I responded to your comments in the SPIP JIRA.
> > > 
> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com> wrote:
> > > I posted my comment in the JIRA. Main concerns here:
> > > 
> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 release someday.
> > > 2. ML/DL systems that can benefit from columnar format are mostly in Python.
> > > 3. Simple operations, though they benefit from vectorization, might not be worth the data exchange overhead.
> > > 
> > > So would an improved Pandas UDF API be good enough? For example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > > 
> > > Sorry, I should have joined the discussion earlier! Hope it is not too late :)
> > > 
> > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > > 
> > >  
> > > 
> > > From: Jules Damji <dm...@comcast.net> 
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler <cu...@gmail.com>
> > > Cc: Dev <de...@spark.apache.org>
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
> > > 
> > >  
> > > 
> > > + (non-binding)
> > > 
> > > Sent from my iPhone
> > > 
> > > Pardon the dumb thumb typos :)
> > > 
> > > 
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com> wrote:
> > > 
> > > +1 (non-binding)
> > > 
> > >  
> > > 
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
> > > 
> > > +1 (non-binding).  Looking forward to seeing better support for processing columnar data.
> > > 
> > >  
> > > 
> > > Jason
> > > 
> > >  
> > > 
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tg...@yahoo.com.invalid> wrote:
> > > 
> > > Hi everyone,
> > > 
> > >  
> > > 
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support.  The proposal is to extend the support to allow for more columnar processing.
> > > 
> > >  
> > > 
> > > You can find the full proposal in the jira at: https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS thread in the dev mailing list.
> > > 
> > >  
> > > 
> > > Please vote as early as you can, I will leave the vote open until next Monday (the 22nd), 2pm CST to give people plenty of time.
> > > 
> > >  
> > > 
> > > [ ] +1: Accept the proposal as an official SPIP
> > > 
> > > [ ] +0
> > > 
> > > [ ] -1: I don't think this is a good idea because ...
> > > 
> > >  
> > > 
> > >  
> > > 
> > > Thanks!
> > > 
> > > Tom Graves
> > > 
> > 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> 


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Posted by Bobby Evans <re...@gmail.com>.
Yes, it is technically possible for the layout to change.  No, it is not
going to happen.  It is already baked into several different official
libraries which are widely used, not just for holding and processing the
data, but also for transfer of the data between the various
implementations.  There would have to be a really serious reason to force
an incompatible change at this point.  So in the worst case, we can version
the layout and bake that into the API that exposes the internal layout of
the data.  That way, code that wants to program against a Java API can do
so using the API that Spark provides; those who want to interface with
something that expects the data in Arrow format will already have to know
what version of the format it was programmed against, and in the worst
case, if the layout does change, we can support the new layout if needed.
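
As a rough sketch of what I mean by baking the version into the API (the
names here are hypothetical, not an API from the SPIP):

    import java.nio.ByteBuffer

    // Hypothetical Spark-owned wrapper around columnar data. Code that only
    // needs a JVM API uses whatever accessors Spark provides; code that
    // wants to hand the buffers to an Arrow library checks the layout
    // version first.
    trait SparkColumnarData {
      /** Version of the columnar memory layout the buffers follow. */
      def layoutVersion: Int
      /** The underlying buffers, laid out according to that version. */
      def buffers: Seq[ByteBuffer]
    }

    def buffersForArrow(data: SparkColumnarData,
                        supportedVersions: Set[Int]): Option[Seq[ByteBuffer]] =
      if (supportedVersions.contains(data.layoutVersion)) Some(data.buffers)
      else None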

On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cu...@gmail.com> wrote:

> The Arrow data format is not yet stable, meaning there are no guarantees
> on backwards/forwards compatibility. Once version 1.0 is released, it will
> have those guarantees but it's hard to say when that will be. The remaining
> work to get there can be seen at
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
> So yes, it is a risk that exposing Spark data as Arrow could cause an issue
> if handled by a different version that is not compatible. That being said,
> changes to format are not taken lightly and are backwards compatible when
> possible. I think it would be fair to mark the APIs exposing Arrow data as
> experimental for the time being, and clearly state the version that must be
> used to be compatible in the docs. Also, adding features like this and
> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
> release. Adding the Arrow dev list to CC.
>
> Bryan
>
> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> Okay, that makes sense, but is the Arrow data format stable? If not, we
>> risk breakage when Arrow changes in the future and some libraries using
>> this feature begin to use the new Arrow code.
>>
>> Matei
>>
>> > On Apr 20, 2019, at 1:39 PM, Bobby Evans <re...@gmail.com> wrote:
>> >
>> > I want to be clear that this SPIP is not proposing exposing Arrow
>> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
>> because of the overlap between the two SPIPs I scaled this one back to
>> concentrate just on the columnar processing aspects. Sorry for the
>> confusion as I didn't update the JIRA description clearly enough when we
>> adjusted it during the discussion on the JIRA.  As part of the columnar
>> processing, we plan on providing Arrow-formatted data, but that will be
>> exposed through a Spark-owned API.
>> >
>> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <ma...@gmail.com>
>> wrote:
>> > FYI, I’d also be concerned about exposing the Arrow API or format as a
>> public API if it’s not yet stable. Is stabilization of the API and format
>> coming soon on the roadmap there? Maybe someone can work with the Arrow
>> community to make that happen.
>> >
>> > We’ve been bitten lots of times by API changes forced by external
>> libraries even when those were widely popular. For example, we used Guava’s
>> Optional for a while, which changed at some point, and we also had issues
>> with Protobuf and Scala itself (especially how Scala’s APIs appear in
>> Java). API breakage might not be as serious in dynamic languages like
>> Python, where you can often keep compatibility with old behaviors, but it
>> really hurts in Java and Scala.
>> >
>> > The problem is especially bad for us because of two aspects of how
>> Spark is used:
>> >
>> > 1) Spark is used for production data transformation jobs that people
>> need to keep running for a long time. Nobody wants to make changes to a job
>> that’s been working fine and computing something correctly for years just
>> to get a bug fix from the latest Spark release or whatever. It’s much
>> better if they can upgrade Spark without editing every job.
>> >
>> > 2) Spark is often used as “glue” to combine data processing code in
>> other libraries, and these might start to require different versions of our
>> dependencies. For example, the Guava class exposed in Spark became a
>> problem when third-party libraries started requiring a new version of
>> Guava: those new libraries just couldn’t work with Spark. Protobuf was
>> especially bad because some users wanted to read data stored as Protobufs
>> (or in a format that uses Protobuf inside), so they needed a different
>> version of the library in their main data processing code.
>> >
>> > If there were some guarantee that this stuff would remain
>> backward-compatible, we’d be in a much better place. It’s not that hard to
>> keep a storage format backward-compatible: just document the format and
>> extend it only in ways that don’t break the meaning of old data (for
>> example, add new version numbers or field types that are read in a
>> different way). It’s a bit harder for a Java API, but maybe Spark could
>> just expose byte arrays directly and work on those if the API is not
>> guaranteed to stay stable (that is, we’d still use our own classes to
>> manipulate the data internally, and end users could use the Arrow library
>> if they want it).
>> >
>> > Matei
>> >
>> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <re...@gmail.com> wrote:
>> > >
>> > > I think you misunderstood the point of this SPIP. I responded to your
>> comments in the SPIP JIRA.
>> > >
>> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <me...@gmail.com>
>> wrote:
>> > > I posted my comment in the JIRA. Main concerns here:
>> > >
>> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
>> a 1.0 release someday.
>> > > 2. ML/DL systems that can benefit from columnar format are mostly in
>> Python.
>> > > 3. Simple operations, though they benefit from vectorization, might
>> not be worth the data exchange overhead.
>> > >
>> > > So would an improved Pandas UDF API be good enough? For
>> example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
>> > >
>> > > Sorry, I should have joined the discussion earlier! Hope it is not too
>> late :)
>> > >
>> > > On Fri, Apr 19, 2019 at 1:20 PM <tc...@gmail.com> wrote:
>> > > +1 (non-binding) for better columnar data processing support.
>> > >
>> > >
>> > >
>> > > From: Jules Damji <dm...@comcast.net>
>> > > Sent: Friday, April 19, 2019 12:21 PM
>> > > To: Bryan Cutler <cu...@gmail.com>
>> > > Cc: Dev <de...@spark.apache.org>
>> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
>> Columnar Processing Support
>> > >
>> > >
>> > >
>> > > + (non-binding)
>> > >
>> > > Sent from my iPhone
>> > >
>> > > Pardon the dumb thumb typos :)
>> > >
>> > >
>> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cu...@gmail.com> wrote:
>> > >
>> > > +1 (non-binding)
>> > >
>> > >
>> > >
>> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
>> > >
>> > > +1 (non-binding).  Looking forward to seeing better support for
>> processing columnar data.
>> > >
>> > >
>> > >
>> > > Jason
>> > >
>> > >
>> > >
>> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>> <tg...@yahoo.com.invalid> wrote:
>> > >
>> > > Hi everyone,
>> > >
>> > >
>> > >
>> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>> extended Columnar Processing Support.  The proposal is to extend the
>> support to allow for more columnar processing.
>> > >
>> > >
>> > >
>> > > You can find the full proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>> > >
>> > >
>> > >
>> > > Please vote as early as you can, I will leave the vote open until
>> next Monday (the 22nd), 2pm CST to give people plenty of time.
>> > >
>> > >
>> > >
>> > > [ ] +1: Accept the proposal as an official SPIP
>> > >
>> > > [ ] +0
>> > >
>> > > [ ] -1: I don't think this is a good idea because ...
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > Thanks!
>> > >
>> > > Tom Graves
>> > >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
