Posted to user@hbase.apache.org by Andrew Purtell <ap...@apache.org> on 2012/09/17 21:22:51 UTC

Re: Announcement of Project Panthera: Better Analytics with SQL, MapReduce and HBase

Hi Jason,

On Mon, Sep 17, 2012 at 6:55 AM, Dai, Jason <ja...@intel.com> wrote:
> I'd like to announce Project Panthera, our open source efforts that showcase better data analytics capabilities on Hadoop/HBase (through both SW and HW improvements), available at https://github.com/intel-hadoop/project-panthera.
[...]
> 2)      A document store (built on top of HBase) for better query processing
>    Under Project Panthera, we will gradually make our implementation of the document store available as an extension to HBase (https://github.com/intel-hadoop/hbase-0.94-panthera). Specifically, today's release provides document store support in HBase by utilizing co-processors, which brings up-to 3x reduction in storage usage and up-to 1.8x speedup in query processing. Going forward, we will also use HBase-6800<https://issues.apache.org/jira/browse/HBASE-6800> as the umbrella JIRA to track our efforts to get the document store idea reviewed and hopefully incorporated into Apache HBase.

Thank you for your interest in contributing to the HBase project. I
have two initial comments/suggestions. These are also at
https://issues.apache.org/jira/browse/HBASE-6800#comment-13457242

1) From the attached document, it appears that the existing
coprocessor framework was sufficient for the implementation of the DOT
system on top, which is great to see. There has been some discussion
in the HBase PMC, documented in the archives of the
dev@hbase.apache.org mailing list, that coprocessor based applications
should begin as independent code contributions, perhaps hosted in a
GitHub repository. In your announcement on general@ I see you have
sort-of done this already at:
https://github.com/intel-hadoop/hbase-0.94-panthera , except this is a
full fork of the HBase source tree with all history of individual
changes lost (a single commit of a source drop). It would be helpful
if only the changes on top of stock HBase code appear here. Otherwise,
what you have done is in effect forked the HBase project, which is not
ideally conducive to contribution.

2) From the design document: "The co-processor framework needs to be
extended to provide observers for the filter operations, similar to
the observers of the data access operations." We would be delighted to
work with you on the necessary coprocessor framework extensions. I'd
recommend a separate JIRA specifically for this. Let's discuss what
Coprocessor API extensions or additions are necessary. Do you have a
proposal?
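
A minimal sketch of what "observers for the filter operations" could look
like, assuming hooks that fire before and after a filter decides on a row.
The names RowFilterObserver, ObservedRowFilter, preFilterRow and
postFilterRow are hypothetical and are not part of any HBase API; a plain
Predicate on the row key stands in for a real filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Entry point: wraps a simple row-key filter with one recording observer.
class FilterObserverSketch {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        RowFilterObserver recorder = new RowFilterObserver() {
            @Override
            public void preFilterRow(String rowKey) {
                log.add("pre:" + rowKey);
            }

            @Override
            public boolean postFilterRow(String rowKey, boolean included) {
                log.add("post:" + rowKey + ":" + included);
                return included; // leave the filter's decision unchanged
            }
        };
        ObservedRowFilter filter =
            new ObservedRowFilter(key -> key.startsWith("doc"), Arrays.asList(recorder));

        for (String key : Arrays.asList("doc1", "tmp1")) {
            System.out.println(key + " included? " + filter.include(key));
        }
        System.out.println(log);
    }
}

// Hypothetical hook interface (NOT an HBase API), modeled loosely on the
// pre/post style of the existing data access observers.
interface RowFilterObserver {
    // Called before the filter looks at the row.
    default void preFilterRow(String rowKey) {}

    // Called after the filter decides; the return value replaces the decision.
    default boolean postFilterRow(String rowKey, boolean included) {
        return included;
    }
}

// Runs every observer's pre hook, then the filter, then every post hook.
class ObservedRowFilter {
    private final Predicate<String> filter;
    private final List<RowFilterObserver> observers;

    ObservedRowFilter(Predicate<String> filter, List<RowFilterObserver> observers) {
        this.filter = filter;
        this.observers = new ArrayList<>(observers);
    }

    boolean include(String rowKey) {
        for (RowFilterObserver o : observers) {
            o.preFilterRow(rowKey);
        }
        boolean included = filter.test(rowKey);
        for (RowFilterObserver o : observers) {
            included = o.postFilterRow(rowKey, included);
        }
        return included;
    }
}
```

Letting the post hook return the final decision is one possible design
choice; a real proposal would need to settle whether observers may
override a filter or only watch it.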

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Announcement of Project Panthera: Better Analytics with SQL, MapReduce and HBase

Posted by Andrew Purtell <ap...@apache.org>.
Hi Jason,

Please see my replies below inline.

On Monday, September 17, 2012, Dai, Jason wrote:

> Hi Andrew,
>
> See my comments below (I have also replied at
> https://issues.apache.org/jira/browse/HBASE-6800#comment-13457508).
>
> Thanks,
> -Jason
>
> >>>> coprocessor based applications should begin as independent code
> contributions, perhaps hosted in a GitHub repository
> >>>> It would be helpful if only the changes on top of stock HBase code
> appear here.
>
> This could work, though I think we need to figure out how to address
> several implications brought by the proposal, such as:
> (1) How do the users figure out what co-processor applications are stable,
> so that they can use in their production deployment?


This is exactly the motivation for starting all coprocessor based
applications/contributions as external projects. We will have no registry
of "approved" or "stable" coprocessor applications. I'd imagine users would
expect all such apps in the HBase distribution proper to be in such a
state. Beyond that, I don't think the project can have the bandwidth to
track a number of ideas in development. We can't know in advance what
support, interest, or stability any given contribution would have, so
starting as an external project establishes this on its own merit. A
popular and well cared for contribution would eventually be a candidate for
inclusion into the HBase source distribution proper. This is my
characterization of what has been discussed and the consensus reached by
the PMC. If others feel this is in error, or if we should do something
differently here, please speak up.


> (2) How do we ensure the co-processor applications continue to be
> compatible with the changes in the HBase project, and compatible with each
> other?


We don't. The onus is on the contributor. If at some point the consensus of
the project is to bring in a particular contribution into the ASF HBase
source distribution, then at that point we must ensure these things... But
only with what is in the source distribution.


> (3) How do the users get the co-processor applications? They can no longer
> get these from the Apache HBase release, and may need to perform manual
> integrations - not something average business users will do, and the main
> reason that we put the full HBase source tree out


HBase is a mavenized project and your DOT system is a coprocessor
application. Barring issues with the CP framework itself, I can see no
technical reason why you have to include and maintain a full fork of
HBase. Simply depend on HBase project artifacts, and the complete DOT
application can be compiled as a jar to drop on the classpath of an
HBase installation. Where the CP framework may be insufficient, we can
address that. Or, like Stack says, if there is some other technical reason
(like a patch to core HBase), please list those so we can look at
addressing them. We would definitely like to support your DOT on stock ASF
HBase.
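
A rough sketch of why a separately built jar can be enough: a host only
needs the extension's class name and the class on its classpath. HBase
reads region coprocessor class names from the hbase.coprocessor.region.classes
property in hbase-site.xml; the loader below is NOT HBase's loader, only an
illustration of the same idea. RegionPlugin, DotRegionPlugin and
PluginLoader are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Entry point: loads one plugin from a configured class-name list.
class ClasspathPluginSketch {
    public static void main(String[] args) {
        // Stand-in for a comma-separated class list read from a site config file.
        String configured = DotRegionPlugin.class.getName();
        for (RegionPlugin plugin : PluginLoader.load(configured)) {
            System.out.println("loaded " + plugin.name());
        }
    }
}

// Hypothetical extension point that a host would define.
interface RegionPlugin {
    String name();
}

// Hypothetical extension; in practice it would live in its own jar.
class DotRegionPlugin implements RegionPlugin {
    @Override
    public String name() {
        return "dot-document-store";
    }
}

// Resolves each configured class name against the classpath and
// instantiates it through its no-argument constructor.
class PluginLoader {
    static List<RegionPlugin> load(String commaSeparatedClassNames) {
        List<RegionPlugin> plugins = new ArrayList<>();
        for (String raw : commaSeparatedClassNames.split(",")) {
            String className = raw.trim();
            if (className.isEmpty()) {
                continue; // tolerate stray commas and whitespace
            }
            try {
                Class<? extends RegionPlugin> type =
                    Class.forName(className).asSubclass(RegionPlugin.class);
                plugins.add(type.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot load plugin " + className, e);
            }
        }
        return plugins;
    }
}
```

Because the host resolves the class by name at startup, the extension can
be versioned and released independently of the host's source tree.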


>
> >>>> We would be delighted to work with you on the necessary coprocessor
> framework extensions. I'd recommend a separate JIRA specifically for this.
>
> Yes, we do plan to submit the proposal for observers for the filter
> operations as a separate JIRA (the original plan was to make it a sub task
> of this JIRA).


Sure, that would be great.


>
> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org]
> Sent: Tuesday, September 18, 2012 3:23 AM
> To: dev@hbase.apache.org; user@hbase.apache.org;
> Dai, Jason
> Subject: Re: Announcement of Project Panthera: Better Analytics with SQL,
> MapReduce and HBase
>
> Hi Jason,
>
> On Mon, Sep 17, 2012 at 6:55 AM, Dai, Jason <jason.dai@intel.com>
> wrote:
> > I'd like to announce Project Panthera, our open source efforts that
> showcase better data analytics capabilities on Hadoop/HBase (through both
> SW and HW improvements), available at
> https://github.com/intel-hadoop/project-panthera.
> [...]
> > 2)      A document store (built on top of HBase) for better query
> processing
> >    Under Project Panthera, we will gradually make our implementation of
> the document store available as an extension to HBase (
> https://github.com/intel-hadoop/hbase-0.94-panthera). Specifically,
> today's release provides document store support in HBase by utilizing
> co-processors, which brings up-to 3x reduction in storage usage and up-to
> 1.8x speedup in query processing. Going forward, we will also use
> HBase-6800<https://issues.apache.org/jira/browse/HBASE-6800> as the
> umbrella JIRA to track our efforts to get the document store idea reviewed
> and hopefully incorporated into Apache HBase.
>
> Thank you for your interest in contributing to the HBase project. I have
> two initial comments/suggestions. These are also at
> https://issues.apache.org/jira/browse/HBASE-6800#comment-13457242
>
> 1) From the attached document, it appears that the existing coprocessor
> framework was sufficient for the implementation of the DOT system on top,
> which is great to see. There has been some discussion in the HBase PMC,
> documented in the archives of the dev@hbase.apache.org mailing list, that coprocessor based applications should begin as
> independent code contributions, perhaps hosted in a GitHub repository. In
> your announcement on general@ I see you have sort-of done this already at:
> https://github.com/intel-hadoop/hbase-0.94-panthera , except this is a
> full fork of the HBase source tree with all history of individual changes
> lost (a single commit of a source drop). It would be helpful if only the
> changes on top of stock HBase code appear here. Otherwise, what you have
> done is in effect forked the HBase project, which is not ideally conducive
> to contribution.
>
> 2) From the design document: "The co-processor framework needs to be
> extended to provide observers for the filter operations, similar to the
> observers of the data access operations." We would be delighted to work
> with you on the necessary coprocessor framework extensions. I'd recommend a
> separate JIRA specifically for this. Let's discuss what Coprocessor API
> extensions or additions are necessary. Do you have a proposal?
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Announcement of Project Panthera: Better Analytics with SQL, MapReduce and HBase

Posted by Andrew Purtell <ap...@apache.org>.
On Monday, September 17, 2012, Stack wrote:

> > (3) How do users get the co-processor applications?
>
> Not sure.  We should work on this.  Should we make it so you point your
> cluster at a repository, select a CP, and it then downloads and
> installs it like an Eclipse plugin, only hopefully the deploy does not
> require a cluster restart -- or if a restart, it's a rolling restart.
> That'd be kinda sweet


That would be pretty cool. We could add that kind of tooling without making
promises about any particular coprocessor (same as the situation with
Eclipse plugins).

     - Andy


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Announcement of Project Panthera: Better Analytics with SQL, MapReduce and HBase

Posted by Stack <st...@duboce.net>.
On Mon, Sep 17, 2012 at 5:44 PM, Dai, Jason <ja...@intel.com> wrote:
> This could work, though I think we need to figure out how to address several implications brought by the proposal, such as:
> (1) How do the users figure out what co-processor applications are stable, so that they can use in their production deployment?

As they would any other piece of software?

Or what are you thinking here Jason?  That you need to deliver the
whole stack -- from Document App down through Coprocessor and on down
through HBase too -- to be able to say your document store is stable?

> (2) How do we ensure the co-processor applications continue to be compatible with the changes in the HBase project, and compatible with each other?

Testing would be the short answer.  Taking on a new HBase version,
you'd run your tests to ensure core works as your Document
application expects.

Regards compatibility, the project is very careful regards our public
APIs.  They change only rarely, and only for an extremely good reason.  If
they do change, they are first deprecated for a release and only
removed in the subsequent release.
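
The deprecate-then-remove cycle described above can be sketched in Java as
follows. DocumentClient, fetch and fetchDocument are hypothetical names
chosen for illustration, not HBase APIs: the old method stays for one
release, is marked deprecated, and delegates to its replacement so callers
get the same behavior while they migrate.

```java
// Entry point: both the old and the new method return the same result.
class DeprecationCycleSketch {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        DocumentClient client = new DocumentClient();
        System.out.println(client.fetch("row1"));
        System.out.println(client.fetchDocument("row1"));
    }
}

// Hypothetical client API, used only to illustrate the policy.
class DocumentClient {
    /**
     * Old entry point, kept for one release so callers can migrate.
     *
     * @deprecated use {@link #fetchDocument(String)}; slated for removal
     *             in the following release.
     */
    @Deprecated
    String fetch(String rowKey) {
        return fetchDocument(rowKey);
    }

    // Replacement entry point; the deprecated method delegates here.
    String fetchDocument(String rowKey) {
        return "document:" + rowKey;
    }
}
```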

Regards Coprocessors in particular, they are not yet part of our
public API.  They are, by agreement, more developer-facing at the
moment.  This makes sense for something we are still evolving -- e.g.
sounds like you found that we are missing CP hooks in filters -- and
for a tech that gives you enough rope to hang your cluster.

So, your CPs, given the caveat above, should remain relatively stable
across HBase versions.  You may have to adjust some as you go across
major versions but even this requirement, post-0.96, should lessen as
all moves up on to protobufs.

Regards intra-CP compatibility, that's beyond core's concern.


> (3) How do the users get the co-processor applications?

Not sure.  We should work on this.  Should we make it so you point your
cluster at a repository, select a CP, and it then downloads and
installs it like an Eclipse plugin, only hopefully the deploy does not
require a cluster restart -- or if a restart, it's a rolling restart.
That'd be kinda sweet (we'd have to first figure out the CPs that are
vetted and not going to kill your cluster and/or move CP execution out
of the regionserver process to run beside it so they don't bring down the
RS if they go rogue, etc.)

> They can no longer get these from the Apache HBase release, and may need to perform manual integrations - not something average business users will do, and the main reason that we put the full HBase source tree out (several of our users and customers want to get a prototype of DOT to try it out).
>

We don't intend to ship all CPs as part of core.  It's untenable (I can
explain why that would not work but my guess is that you can figure it
out for yourself).

A DOT package that bundles HBase is fine for folks to try.   But do
you intend to keep your own fork of hbase or is the intent to move
toward DOT running on a released HBase?  If you'd like to do the
latter, we'd like to help.

Thanks,
St.Ack

RE: Announcement of Project Panthera: Better Analytics with SQL, MapReduce and HBase

Posted by "Dai, Jason" <ja...@intel.com>.
Hi Andrew,

See my comments below (I have also replied at https://issues.apache.org/jira/browse/HBASE-6800#comment-13457508).

Thanks,
-Jason

>>>> coprocessor based applications should begin as independent code contributions, perhaps hosted in a GitHub repository
>>>> It would be helpful if only the changes on top of stock HBase code appear here.

This could work, though I think we need to figure out how to address several implications brought by the proposal, such as:
(1) How do the users figure out what co-processor applications are stable, so that they can use in their production deployment?
(2) How do we ensure the co-processor applications continue to be compatible with the changes in the HBase project, and compatible with each other?
(3) How do the users get the co-processor applications? They can no longer get these from the Apache HBase release, and may need to perform manual integrations - not something average business users will do, and the main reason that we put the full HBase source tree out (several of our users and customers want to get a prototype of DOT to try it out).

>>>> We would be delighted to work with you on the necessary coprocessor framework extensions. I'd recommend a separate JIRA specifically for this.

Yes, we do plan to submit the proposal for observers for the filter operations as a separate JIRA (the original plan was to make it a sub task of this JIRA).

-----Original Message-----
From: Andrew Purtell [mailto:apurtell@apache.org] 
Sent: Tuesday, September 18, 2012 3:23 AM
To: dev@hbase.apache.org; user@hbase.apache.org; Dai, Jason
Subject: Re: Announcement of Project Panthera: Better Analytics with SQL, MapReduce and HBase

Hi Jason,

On Mon, Sep 17, 2012 at 6:55 AM, Dai, Jason <ja...@intel.com> wrote:
> I'd like to announce Project Panthera, our open source efforts that showcase better data analytics capabilities on Hadoop/HBase (through both SW and HW improvements), available at https://github.com/intel-hadoop/project-panthera.
[...]
> 2)      A document store (built on top of HBase) for better query processing
>    Under Project Panthera, we will gradually make our implementation of the document store available as an extension to HBase (https://github.com/intel-hadoop/hbase-0.94-panthera). Specifically, today's release provides document store support in HBase by utilizing co-processors, which brings up-to 3x reduction in storage usage and up-to 1.8x speedup in query processing. Going forward, we will also use HBase-6800<https://issues.apache.org/jira/browse/HBASE-6800> as the umbrella JIRA to track our efforts to get the document store idea reviewed and hopefully incorporated into Apache HBase.

Thank you for your interest in contributing to the HBase project. I have two initial comments/suggestions. These are also at
https://issues.apache.org/jira/browse/HBASE-6800#comment-13457242

1) From the attached document, it appears that the existing coprocessor framework was sufficient for the implementation of the DOT system on top, which is great to see. There has been some discussion in the HBase PMC, documented in the archives of the dev@hbase.apache.org mailing list, that coprocessor based applications should begin as independent code contributions, perhaps hosted in a GitHub repository. In your announcement on general@ I see you have sort-of done this already at:
https://github.com/intel-hadoop/hbase-0.94-panthera , except this is a full fork of the HBase source tree with all history of individual changes lost (a single commit of a source drop). It would be helpful if only the changes on top of stock HBase code appear here. Otherwise, what you have done is in effect forked the HBase project, which is not ideally conducive to contribution.

2) From the design document: "The co-processor framework needs to be extended to provide observers for the filter operations, similar to the observers of the data access operations." We would be delighted to work with you on the necessary coprocessor framework extensions. I'd recommend a separate JIRA specifically for this. Let's discuss what Coprocessor API extensions or additions are necessary. Do you have a proposal?

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
