Posted to dev@bigtop.apache.org by "MrAsanjar ." <af...@gmail.com> on 2016/04/13 19:41:19 UTC

Adding Apache Arrow to Apache Bigtop project

Hi all,
My name is Amir Sanjar from the OpenPOWER Foundation. My team and I have been
working on porting and optimizing Apache Arrow for Power8. We believe
Apache Arrow would be a good addition to Apache Bigtop. Any thoughts on
that? We will do the required technical work (for both x86 and Power) but
would appreciate having a mentor.
I have also created https://issues.apache.org/jira/browse/BIGTOP-2389

Re: Adding Apache Arrow to Apache Bigtop project

Posted by "MrAsanjar ." <af...@gmail.com>.

Valid points Cos.

Re: Adding Apache Arrow to Apache Bigtop project

Posted by Andrew Musselman <an...@gmail.com>.

One thing we've done at Mahout for managing tentative or user-supported
work in our "examples" section is to just have people contribute a script
that will go to a user-maintained repo for code and installation
instructions.

Could be a way to manage "incubating" new Bigtop components and turn the
integration work outwards to projects who want to be included.

Re: Adding Apache Arrow to Apache Bigtop project

Posted by Konstantin Boudnik <co...@apache.org>.

On Sat, Apr 16, 2016 at 12:10PM, Andrew Purtell wrote:
> We have integrated other projects into the Bigtop stack while they were
> still incubating, and even non-Apache projects (like Hue), and Arrow isn't
> incubating, it went direct to TLP. I think we are only waiting on them to
> release an artifact before they'd be a good contender for inclusion.

That's pretty much what I said. (non)incubation isn't a measure of technical
success/stability. But releases provide more than that - in particular a
certain level of IP clearance.

> But what does integration mean? I think it can mean two things: 1)
> installable packages; 2) configuration and deployment support
> 
> > [Cos] Perhaps we should wait a little bit and not pile up every shiny pebble
> > into the stack?
> 
> This is a valid point, yet we have had justification of getting other stuff
> in without wide uptake (I can give examples if I must) by considering
> ourselves the "Debian of Big Data". Why not take that maximalist view? If
> some basic preconditions are met, the answer should be straightforward. We
> can discuss what would be a good set of preconditions.

Once again, it isn't even about wide uptake, but about having _at least_ one
official release.

> For packages, I tend to like:
> 
> 1. The project is shipping working code
> 2. The project produces at least one executable system artifact.

I want to comment on this point, though perhaps my conclusion will be
essentially the same as your own deliberations below. "Executable" isn't a
requirement in my view, because certain components/layers might simply provide
pluggable functionality to existing applications (Tez is a good example).

I completely agree with the rest of this list.

Cos

> 3. We have an active and interested contributor ready to make the necessary
> patches for inclusion
> 4. It's not a one-off, it has hopefully multiple integration points with
> the rest of our stack
> 
> It's no problem to be liberal in what we accept, so long as we have no
> problem promptly removing any component of Bigtop that becomes persistently
> unmaintained, without volunteers to fix the inevitable issues that crop up.
> 
> I don't think Arrow would meet my criterion #2 for packaging. So I'd be
> skeptical about adding Arrow as a package.
> 
> For comparison let's look at a couple of other interesting cases and my
> criterion #2.
> 
> Phoenix: Phoenix is in part a JDBC client, so a library, so does not meet
> criterion #2. However it is also in part an HBase coprocessor application.
> When you install the Phoenix package on top of HBase, your HBase
> installation gets superpowers, it becomes HBase+Phoenix. HBase is a
> collection of multiple executable artifacts and Phoenix inherits them once
> installed. Phoenix meets the criterion.
> 
> Tez: At first glance Tez might look like just a library consumed by Hive
> and Pig as an execution engine option. However, Tez is also a MapReduce
> application. Definitely a gray area, though. MapReduce applications don't
> look like traditional OS executables. They don't want to be deployed on a
> local filesystem, they want to live in the distributed FS. Hmm, maybe Tez
> isn't the best fit. In contrast Spark can be argued to be a YARN
> application but it does support a standalone model of operation so has
> daemons so meets my criterion #2.
> 
> Looking over our existing set of packages maybe I have just argued we
> shouldn't have DataFu - merely a library of UDFs - but this is just my way
> of looking at things. Kite is an SDK (doesn't meet) but also a collection of
> command line tools (does meet).
> 
> In Arrow's case, it is clearly only a library meant for consumption by
> other projects that produce things you can execute. I don't even see gray
> area things like an MR application or somesuch packaged in the Arrow code.
> 
> The above only relates to packaging. There's another side of Bigtop:
> configuration and deployment. I think a very helpful Arrow-related
> contribution to Bigtop would come when and if other components (perhaps
> HBase, Phoenix, Hive, Pig, etc.) are consuming Arrow: the configuration
> and deployment sides of Bigtop should then support wiring up the Arrow
> in-memory communication channels in the respective component configurations.
> 
> 
> HOWEVER
> 
> > [Amir] My name is Amir Sanjar from OpenPOWER foundation. My team and I
> > have been working on porting and optimizing Apache Arrow for Power8.
> 
> Amir, please note that Bigtop is a framework for integrating various
> upstream projects. We simply consume artifacts from upstream projects here.
> If you'd like to see Power8 support in Arrow in Bigtop, then Arrow must
> first produce a release including Power8 support. Then we could look at it.
> You probably know this, I just want to be clear.

Re: Adding Apache Arrow to Apache Bigtop project

Posted by Andrew Purtell <ap...@apache.org>.

We have integrated other projects into the Bigtop stack while they were
still incubating, and even non-Apache projects (like Hue), and Arrow isn't
incubating, it went direct to TLP. I think we are only waiting on them to
release an artifact before they'd be a good contender for inclusion.

But what does integration mean? I think it can mean two things: 1)
installable packages; 2) configuration and deployment support

> [Cos] Perhaps we should wait a little bit and not pile up every shiny pebble
> into the stack?

This is a valid point, yet we have had justification of getting other stuff
in without wide uptake (I can give examples if I must) by considering
ourselves the "Debian of Big Data". Why not take that maximalist view? If
some basic preconditions are met, the answer should be straightforward. We
can discuss what would be a good set of preconditions.

For packages, I tend to like:

1. The project is shipping working code
2. The project produces at least one executable system artifact.
3. We have an active and interested contributor ready to make the necessary
patches for inclusion
4. It's not a one-off, it has hopefully multiple integration points with
the rest of our stack

It's no problem to be liberal in what we accept, so long as we have no
problem promptly removing any component of Bigtop that becomes persistently
unmaintained, without volunteers to fix the inevitable issues that crop up.

I don't think Arrow would meet my criterion #2 for packaging. So I'd be
skeptical about adding Arrow as a package.

For comparison let's look at a couple of other interesting cases and my
criterion #2.

Phoenix: Phoenix is in part a JDBC client, so a library, so does not meet
criterion #2. However it is also in part an HBase coprocessor application.
When you install the Phoenix package on top of HBase, your HBase
installation gets superpowers, it becomes HBase+Phoenix. HBase is a
collection of multiple executable artifacts and Phoenix inherits them once
installed. Phoenix meets the criterion.

Tez: At first glance Tez might look like just a library consumed by Hive
and Pig as an execution engine option. However, Tez is also a MapReduce
application. Definitely a gray area, though. MapReduce applications don't
look like traditional OS executables. They don't want to be deployed on a
local filesystem, they want to live in the distributed FS. Hmm, maybe Tez
isn't the best fit. In contrast Spark can be argued to be a YARN
application but it does support a standalone model of operation so has
daemons so meets my criterion #2.

Looking over our existing set of packages maybe I have just argued we
shouldn't have DataFu - merely a library of UDFs - but this is just my way
of looking at things. Kite is an SDK (doesn't meet) but also a collection of
command line tools (does meet).

In Arrow's case, it is clearly only a library meant for consumption by
other projects that produce things you can execute. I don't even see gray
area things like an MR application or somesuch packaged in the Arrow code.

The above only relates to packaging. There's another side of Bigtop:
configuration and deployment. I think a very helpful Arrow-related
contribution to Bigtop would come when and if other components (perhaps
HBase, Phoenix, Hive, Pig, etc.) are consuming Arrow: the configuration
and deployment sides of Bigtop should then support wiring up the Arrow
in-memory communication channels in the respective component configurations.

HOWEVER

> [Amir] My name is Amir Sanjar from OpenPOWER foundation. My team and I
> have been working on porting and optimizing Apache Arrow for Power8.

Amir, please note that Bigtop is a framework for integrating various
upstream projects. We simply consume artifacts from upstream projects here.
If you'd like to see Power8 support in Arrow in Bigtop, then Arrow must
first produce a release including Power8 support. Then we could look at it.
You probably know this, I just want to be clear.

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Adding Apache Arrow to Apache Bigtop project

Posted by Konstantin Boudnik <co...@apache.org>.

Just to reiterate what I've said in the JIRA:

Arrow has yet to have a single release. And even when it comes out we don't know
how useful it will be, or whether it will live up to its already hyped-up
expectations. Perhaps we should wait a little bit and not pile up every shiny
pebble into the stack?

The point of being cautious: once you add something to the stack, you'll have
to support it for life. It isn't even about the initial effort, but how laborious
the ongoing support will be. Unless we are sure it is a good value-add, let's
not make the commitment.

Cos
