Posted to dev@hbase.apache.org by Josh Elser <el...@apache.org> on 2018/08/27 18:03:30 UTC

[DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

(bcc: dev@hbase, in case folks there have been waiting for me to send 
this email to dev@phoenix)

Hi,

In case you missed it, there was an HBaseCon event held in Asia 
recently. Stack took some great notes and shared them with the HBase 
community. A few of them touched on Phoenix, directly or in a related 
manner. I think they are good "criticisms" that are beneficial for us to 
hear.

1. The phoenix-$version-client.jar size is prohibitively large

In this day and age, I'm surprised that this is a big issue for people. 
I know we have a lot of cruft, most of which comes from Hadoop. We have 
gotten better here over recent releases, but I would guess that there is 
more we can do.
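One lever that may still have headroom is the shade configuration: `minimizeJar` drops classes nothing in the client references, and excluding Hadoop leans on the cluster's provided jars instead of bundling them. A rough, illustrative sketch, not the actual phoenix-client pom (and note `minimizeJar` can also drop classes that are only loaded reflectively, so it needs care):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <!-- drop classes the client never references (beware reflection) -->
    <minimizeJar>true</minimizeJar>
    <artifactSet>
      <excludes>
        <!-- rely on the cluster's Hadoop instead of bundling it -->
        <exclude>org.apache.hadoop:*</exclude>
      </excludes>
    </artifactSet>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.phoenix.shaded.com.google.common</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```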

2. Can Phoenix be the de-facto schema for SQL on HBase?

We've long asserted "if you have to ask how Phoenix serializes data, you 
shouldn't be doing it" (a nod that you have to write lots of code). What if 
we turn that on its head? Could we extract our PDataType serialization, 
composite row-key, column encoding, etc into a minimal API that folks 
with their own itches can use?

With the growing integrations into Phoenix, we could embrace them by 
providing an API to make what they're doing easier. In the same vein, we 
cement ourselves as a cornerstone of doing it "correctly".
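For a flavor of what such a minimal API might expose, here is a purely hypothetical sketch (not Phoenix's actual PDataType code; all names invented) of the one property everything hinges on: values encoded so that unsigned byte-wise comparison matches value order, composable into a row key:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the kind of primitive a minimal serialization API
// could expose. This is NOT Phoenix's actual PDataType implementation, just
// the core idea: byte-wise order must match value order.
public class RowKeyCodec {

    // Flip the sign bit so negative ints sort before positive ones when the
    // bytes are compared lexicographically (as HBase compares row keys).
    public static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();
    }

    public static int decodeInt(byte[] b) {
        return ByteBuffer.wrap(b).getInt() ^ Integer.MIN_VALUE;
    }

    // A composite row key over fixed-width fields is just concatenation.
    public static byte[] compositeKey(int... fields) {
        ByteBuffer buf = ByteBuffer.allocate(4 * fields.length);
        for (int f : fields) {
            buf.put(encodeInt(f));
        }
        return buf.array();
    }

    // Unsigned lexicographic comparison, like HBase's row key ordering.
    public static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // -5 must sort before 3 in byte order, matching numeric order.
        if (compareUnsigned(encodeInt(-5), encodeInt(3)) >= 0) throw new AssertionError();
        if (decodeInt(encodeInt(-5)) != -5) throw new AssertionError();
    }
}
```

An external writer that only ever touched a surface like this could produce Phoenix-readable keys without linking in the whole query engine.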

3. Better recommendations to users to not attempt certain queries.

We definitively know that there are certain types of queries that 
Phoenix cannot support well (compared to optimal Phoenix use-cases). 
Users very commonly fall into such pitfalls on their own and this leaves 
a bad taste in their mouth (thinking that the product "stinks").

Can we do a better job of telling the user when and why it happened? 
What would such a user-interaction model look like? Can we supplement 
the "why" with instructions of what to do differently (even if in the 
abstract)?

4. Phoenix-Calcite

This was mentioned as a "nice to have". From what I understand, there 
was nothing explicitly wrong with the implementation or approach, just 
that it was a massive undertaking to continue with little immediate 
gain. Would this be a boon for us to try to continue in some form? Are 
there steps we can take that would help push us along the right path?

Anyways, I'd love to hear everyone's thoughts. While the concerns were 
raised at HBaseCon Asia, the suggestions that accompany them here are 
largely mine ;). Feel free to break them out into their own threads if 
you think that would be better (or say that you disagree with me -- 
that's cool too)!

- Josh

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by "larsh@apache.org" <la...@apache.org>.
100% agreement.
A bit worried about "boiling the ocean" and risking not getting anything done.
Speaking of modules: I would *love* it if we had a simple HBase abstraction API and then a module for each version of HBase, rather than a different branch for each. Most differences are presumably in the coprocessor APIs, which should be possible to "wrap away" with some indirection layer.
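A minimal sketch of what that indirection layer could look like (all names here are invented for illustration; each shim would live in its own per-version module):

```java
// Hypothetical sketch of the indirection layer described above: core code
// programs against one small shim interface, and each supported HBase line
// gets its own implementation module. All names here are invented.
interface CompatShim {
    String hbaseLine();
    // Example of a call whose underlying API drifted across HBase versions;
    // each shim module would delegate to the real per-version API.
    long regionMaxFileSize();
}

class Hbase1Shim implements CompatShim {       // would live in a phoenix-hbase1 module
    public String hbaseLine() { return "1.x"; }
    public long regionMaxFileSize() { return 10L * 1024 * 1024 * 1024; }
}

class Hbase2Shim implements CompatShim {       // would live in a phoenix-hbase2 module
    public String hbaseLine() { return "2.x"; }
    public long regionMaxFileSize() { return 10L * 1024 * 1024 * 1024; }
}

public class ShimDemo {
    // Core code never mentions a concrete HBase version.
    static String describe(CompatShim shim) {
        return "running against HBase " + shim.hbaseLine();
    }

    public static void main(String[] args) {
        System.out.println(describe(new Hbase2Shim()));
    }
}
```

The build (or a runtime factory) would then pick the shim module matching the cluster's HBase, instead of maintaining a branch per version.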

-- Lars

    On Monday, September 17, 2018, 8:52:58 AM PDT, Josh Elser <el...@apache.org> wrote:  
 
 Maybe an implementation detail, but I'm a fan of having a devoted Maven 
module for "client-facing" API as opposed to an annotation-based 
approach. I find a separate module helps to catch problematic API design 
faster, and makes it crystal clear what users should (and should not) be 
relying upon.


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by Josh Elser <el...@apache.org>.
Maybe an implementation detail, but I'm a fan of having a devoted Maven 
module for "client-facing" API as opposed to an annotation-based 
approach. I find a separate module helps to catch problematic API design 
faster, and makes it crystal clear what users should (and should not) be 
relying upon.

On 9/17/18 1:00 AM, larsh@apache.org wrote:
> I think we can start by implementing a tighter integration with Spark through DataSource V2. That would make it quickly apparent which parts of Phoenix need direct access.
> Some parts just need an interface audience declaration (like Phoenix's basic type system) and our agreement that we will change those only according to semantic versioning. Other parts (like the query plan) will need a bit more thinking. Maybe that's the path to hook in Calcite - just making that part up as I write this...
> Perhaps turning the HBase interface into an API might not be so difficult either. That would perhaps be a new - strictly additional - client API.
> 
> A good Spark interface is in everybody's interest and I think is the best avenue to figure out what's missing/needed.
> -- Lars

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by "larsh@apache.org" <la...@apache.org>.
I think we can start by implementing a tighter integration with Spark through DataSource V2. That would make it quickly apparent which parts of Phoenix need direct access.
Some parts just need an interface audience declaration (like Phoenix's basic type system) and our agreement that we will change those only according to semantic versioning. Other parts (like the query plan) will need a bit more thinking. Maybe that's the path to hook in Calcite - just making that part up as I write this...
Perhaps turning the HBase interface into an API might not be so difficult either. That would perhaps be a new - strictly additional - client API.
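For the "interface audience declaration" idea, a self-contained sketch of the mechanism - a hypothetical marker in the spirit of the Hadoop/HBase InterfaceAudience annotations (the real ones live in org.apache.yetus.audience; everything below is invented for illustration):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical audience marker: a type carrying it is public API and only
// changes according to semantic versioning.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PublicApi {
    // The release from which semantic versioning is honored for this type.
    String since() default "";
}

// Phoenix's basic type system is the kind of thing that would get marked.
@PublicApi(since = "5.1")
interface ValueCodec {
    byte[] toBytes(Object value);
    Object toObject(byte[] bytes);
}

public class AudienceDemo {
    public static void main(String[] args) {
        PublicApi marker = ValueCodec.class.getAnnotation(PublicApi.class);
        System.out.println("public since " + marker.since());
    }
}
```

Tooling (or a checkstyle/enforcer rule) could then flag incompatible changes to anything carrying the marker.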

A good Spark interface is in everybody's interest and I think is the best avenue to figure out what's missing/needed.
-- Lars

    On Wednesday, September 12, 2018, 12:47:21 PM PDT, Josh Elser <el...@apache.org> wrote:  
 
 I like it, Lars. I like it very much.

Just the easy part of doing it... ;)

On 9/11/18 4:53 PM, larsh@apache.org wrote:
>  Sorry for coming a bit late to this. I've been thinking about some of lines for a bit.
> It seems Phoenix serves 4 distinct purposes:
> 1. Query parsing and compiling.2. A type system3. Query execution4. Efficient HBase interface
> Each of these is useful by itself, but we do not expose these as stable interfaces.We have seen a lot of need to tie HBase into "higher level" service, such as Spark (and Presto, etc).
> I think we can get a long way if we separate at least #1 (SQL) from the rest #2, #3, and #4 (Typed HBase Interface - THI).
> Phoenix is used via SQL (#1), other tools such as Presto, Impala, Drill, Spark, etc, can interface efficiently with HBase via THI (#2, #3, and #4).
> Thoughts?
> -- Lars
>      On Monday, August 27, 2018, 11:03:33 AM PDT, Josh Elser <el...@apache.org> wrote:
>  
>  (bcc: dev@hbase, in case folks there have been waiting for me to send
> this email to dev@phoenix)
> 
> Hi,
> 
> In case you missed it, there was an HBaseCon event held in Asia
> recently. Stack took some great notes and shared them with the HBase
> community. A few of them touched on Phoenix, directly or in a related
> manner. I think they are good "criticisms" that are beneficial for us to
> hear.
> 
> 1. The phoenix-$version-client.jar size is prohibitively large
> 
> In this day and age, I'm surprised that this is a big issue for people.
> I know have a lot of cruft, most of which coming from hadoop. We have
> gotten better here over recent releases, but I would guess that there is
> more we can do.
> 
> 2. Can Phoenix be the de-facto schema for SQL on HBase?
> 
> We've long asserted "if you have to ask how Phoenix serializes data, you
> shouldn't be do it" (a nod that you have to write lots of code). What if
> we turn that on its head? Could we extract our PDataType serialization,
> composite row-key, column encoding, etc into a minimal API that folks
> with their own itches can use?
> 
> With the growing integrations into Phoenix, we could embrace them by
> providing an API to make what they're doing easier. In the same vein, we
> cement ourselves as a cornerstone of doing it "correctly".
> 
> 3. Better recommendations to users to not attempt certain queries.
> 
> We definitively know that there are certain types of queries that
> Phoenix cannot support well (compared to optimal Phoenix use-cases).
> Users very commonly fall into such pitfalls on their own and this leaves
> a bad taste in their mouth (thinking that the product "stinks").
> 
> Can we do a better job of telling the user when and why it happened?
> What would such a user-interaction model look like? Can we supplement
> the "why" with instructions of what to do differently (even if in the
> abstract)?
> 
> 4. Phoenix-Calcite
> 
> This was mentioned as a "nice to have". From what I understand, there
> was nothing explicitly from with the implementation or approach, just
> that it was a massive undertaking to continue with little immediate
> gain. Would this be a boon for us to try to continue in some form? Are
> there steps we can take that would help push us along the right path?
> 
> Anyways, I'd love to hear everyone's thoughts. While the concerns were
> raised at HBaseCon Asia, the suggestions that accompany them here are
> largely mine ;). Feel free to break them out into their own threads if
> you think that would be better (or say that you disagree with me --
> that's cool too)!
> 
> - Josh
>    
> 
  

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by Josh Elser <el...@apache.org>.
I like it, Lars. I like it very much.

Just the easy part of doing it... ;)

On 9/11/18 4:53 PM, larsh@apache.org wrote:
> Sorry for coming a bit late to this. I've been thinking along these lines for a bit.
> It seems Phoenix serves 4 distinct purposes:
> 1. Query parsing and compiling
> 2. A type system
> 3. Query execution
> 4. An efficient HBase interface
> Each of these is useful by itself, but we do not expose them as stable interfaces. We have seen a lot of need to tie HBase into "higher level" services, such as Spark (and Presto, etc).
> I think we can get a long way if we separate at least #1 (SQL) from the rest: #2, #3, and #4 (Typed HBase Interface - THI).
> Phoenix is used via SQL (#1); other tools such as Presto, Impala, Drill, Spark, etc. can interface efficiently with HBase via THI (#2, #3, and #4).
> Thoughts?
> -- Lars

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by "larsh@apache.org" <la...@apache.org>.
Sorry for coming a bit late to this. I've been thinking along these lines for a bit.
It seems Phoenix serves 4 distinct purposes:
1. Query parsing and compiling
2. A type system
3. Query execution
4. An efficient HBase interface
Each of these is useful by itself, but we do not expose them as stable interfaces. We have seen a lot of need to tie HBase into "higher level" services, such as Spark (and Presto, etc).
I think we can get a long way if we separate at least #1 (SQL) from the rest: #2, #3, and #4 (Typed HBase Interface - THI).
Phoenix is used via SQL (#1); other tools such as Presto, Impala, Drill, Spark, etc. can interface efficiently with HBase via THI (#2, #3, and #4).
Thoughts?
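To make the split concrete, a purely hypothetical sketch of the four roles as separate, stable interfaces (all names are invented for illustration; none of this is existing Phoenix API):

```java
import java.util.Collections;
import java.util.Iterator;

// Trivial placeholder so the sketch compiles; a real plan would be richer.
class QueryPlan {
    final String sql;
    QueryPlan(String sql) { this.sql = sql; }
}

interface SqlCompiler {               // #1: query parsing and compiling
    QueryPlan compile(String sql);
}

interface TypeCodec<T> {              // #2: the type system
    byte[] toBytes(T value);
    T fromBytes(byte[] bytes);
}

interface QueryExecutor {             // #3: query execution
    Iterator<Object[]> execute(QueryPlan plan);
}

interface TypedHBaseInterface {       // #4: efficient HBase access (THI)
    void put(byte[] row, byte[] family, byte[] qualifier, byte[] value);
}

public class ThiSketch {
    public static void main(String[] args) {
        // SQL users come in through #1; Presto/Spark/Drill-style callers
        // would use #2-#4 directly and skip the SQL layer entirely.
        SqlCompiler compiler = sql -> new QueryPlan(sql);
        QueryExecutor executor = plan -> Collections.<Object[]>emptyIterator();
        System.out.println(compiler.compile("SELECT 1").sql);
        System.out.println(executor.execute(compiler.compile("SELECT 1")).hasNext());
    }
}
```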
-- Lars

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by Andrew Purtell <ap...@apache.org>.
On Tue, Aug 28, 2018 at 2:01 PM James Taylor <ja...@apache.org> wrote:

> Glad to hear this was discussed at HBaseCon. The most common request I've
> seen is to be able to write Phoenix-compatible data from other,
> non-Phoenix services/projects, mainly because row-by-row updates (even when
> batched) can be a bottleneck. This is not feasible using low-level
> constructs because of all the features provided by Phoenix: secondary
> indexes, composite row keys, encoded columns, storage formats, salting,
> ascending/descending row keys, array support, etc. The most feasible way to
> accomplish writes outside of Phoenix is to use UPSERT VALUES followed by
> PhoenixRuntime#getUncommittedDataIterator to get the Cells that would be
> committed (followed by rolling back the uncommitted data). This maintains
> Phoenix's abstraction and minimizes any overhead (the cost of parsing is
> negligible). You can control how often the schema is
> pulled from the server through the UPDATE_CACHE_FREQUENCY declaration.
>
> I haven't seen much demand for bypassing Phoenix JDBC on the read side. If
> you don't want to use Phoenix to query, what's the point in using it?
>

You might have Phoenix clients and HBase clients sharing common data
sources, for whatever reason; we cannot assume what constraints or legacy
issues may present themselves in a given Phoenix or HBase user's
environment. I agree that, as a question of prioritization, it may not
get done until a volunteer does it to scratch a real itch, but at that
point it could be useful to accept the contribution.


> As far as Calcite/Phoenix, it'd be great to see this work picked up. I
> don't think this solves the API problem, though. A good home for this
> adapter would be Apache Drill IMHO. They're up to a new enough version of
> Calcite (and off of their fork), so this would be feasible and would
> provide immediate benefits on the query side.
>
> Thanks,
> James
>
> On Tue, Aug 28, 2018 at 1:38 PM Andrew Purtell <ap...@apache.org>
> wrote:
>
> > On Mon, Aug 27, 2018 at 11:03 AM Josh Elser <el...@apache.org> wrote:
> >
> > > 2. Can Phoenix be the de-facto schema for SQL on HBase?
> > >
> > > We've long asserted "if you have to ask how Phoenix serializes data, you
> > > shouldn't be doing it" (a nod that you have to write lots of code). What if
> > > we turn that on its head? Could we extract our PDataType serialization,
> > > composite row-key, column encoding, etc into a minimal API that folks
> > > with their own itches can use?
> > >
> > > With the growing integrations into Phoenix, we could embrace them by
> > > providing an API to make what they're doing easier. In the same vein, we
> > > cement ourselves as a cornerstone of doing it "correctly"
> > >
> >
> > There have been discussions where I work where it seems this would be a
> > great idea. If data types, row key constructors, and other key and data
> > serialization concerns were a public API, these could be used by
> > connectors to Spark or other systems to generate and consume
> > Phoenix-compatible data. It improves the integration story all around.
> >
> > Another thought for refactoring I've heard is exposing an API for
> > generating query plans without needing the SQL parser. A public API for
> > programmatically building query plans could be used by connectors to
> > Spark or other systems when pushing down parts of a parallelized or
> > federated query to Phoenix data sources, avoiding unnecessary hacks
> > around SQL generation, string mangling, or (re)parsing overheads. This
> > kind of describes Calcite's raison d'être. If Phoenix is not embedding
> > Calcite as query planner, as it does not currently, it is independently
> > useful to have a public API for programmatic query plan construction
> > given the current implementation, regardless. If Phoenix were to embed
> > Calcite as query planner, you'd probably get a ton of re-use among
> > internal and external users of the Calcite APIs. I'd think whatever
> > option you might choose would be informed by the suitability (or not) of
> > embedding Calcite as Phoenix's query planner, and how soon that might be
> > expected to be feature complete. For what it's worth. Again, this
> > extends possibilities for integration.
> >
> >
> > > 3. Better recommendations to users to not attempt certain queries.
> > >
> > > We definitively know that there are certain types of queries that
> > > Phoenix cannot support well (compared to optimal Phoenix use-cases).
> > > Users very commonly fall into such pitfalls on their own and this
> leaves
> > > a bad taste in their mouth (thinking that the product "stinks").
> > >
> > > Can we do a better job of telling the user when and why it happened?
> > > What would such a user-interaction model look like? Can we supplement
> > > the "why" with instructions of what to do differently (even if in the
> > > abstract)?
> > >
> > > 4. Phoenix-Calcite
> > >
> > > This was mentioned as a "nice to have". From what I understand, there
> > > was nothing explicitly from with the implementation or approach, just
> > > that it was a massive undertaking to continue with little immediate
> > > gain. Would this be a boon for us to try to continue in some form? Are
> > > there steps we can take that would help push us along the right path?
> > >
> > > Anyways, I'd love to hear everyone's thoughts. While the concerns were
> > > raised at HBaseCon Asia, the suggestions that accompany them here are
> > > largely mine ;). Feel free to break them out into their own threads if
> > > you think that would be better (or say that you disagree with me --
> > > that's cool too)!
> > >
> > > - Josh
> > >
> >
> >
> > --
> > Best regards,
> > Andrew
> >
> > Words like orphans lost among the crosstalk, meaning torn from truth's
> > decrepit hands
> >    - A23, Crosstalk
> >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by James Taylor <ja...@apache.org>.
Glad to hear this was discussed at HBaseCon. The most common request I've
seen is to be able to write Phoenix-compatible data from other,
non-Phoenix services/projects, mainly because row-by-row updates (even when
batched) can be a bottleneck. This is not feasible using low-level
constructs because of all the features provided by Phoenix: secondary
indexes, composite row keys, encoded columns, storage formats, salting,
ascending/descending row keys, array support, etc. The most feasible way to
accomplish writes outside of Phoenix is to use UPSERT VALUES followed by
PhoenixRuntime#getUncommittedDataIterator to get the Cells that would be
committed (followed by rolling back the uncommitted data). This maintains
Phoenix's abstraction and minimizes any overhead (the cost of parsing is
negligible). You can control how often the schema is
pulled from the server through the UPDATE_CACHE_FREQUENCY declaration.
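That recipe, sketched in code (this assumes phoenix-core on the classpath and a running cluster, so it is illustrative rather than runnable as-is; "zk-quorum" and MY_TABLE are placeholders, and the iterator's exact element type varies across Phoenix versions):

```java
// Sketch of: UPSERT VALUES -> getUncommittedDataIterator -> rollback.
try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-quorum")) {
    conn.setAutoCommit(false);                       // keep mutations client-side
    try (PreparedStatement ps =
             conn.prepareStatement("UPSERT INTO MY_TABLE VALUES (?, ?)")) {
        ps.setInt(1, 1);
        ps.setString(2, "value");
        ps.executeUpdate();
    }
    // The Cells Phoenix *would* commit, index updates and all:
    Iterator<Pair<byte[], List<Cell>>> uncommitted =
        PhoenixRuntime.getUncommittedDataIterator(conn);
    // ... hand them to your own writer (e.g. an HFile bulk load) ...
    conn.rollback();                                 // never let Phoenix commit them
}
```

The rollback at the end is what keeps the actual write path entirely outside Phoenix.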

I haven't seen much demand for bypassing Phoenix JDBC on the read side. If
you don't want to use Phoenix to query, what's the point in using it?

As far as Calcite/Phoenix, it'd be great to see this work picked up. I
don't think this solves the API problem, though. A good home for this
adapter would be Apache Drill IMHO. They're up to a new enough version of
Calcite (and off of their fork) so that this would be feasible and would
provide immediate benefits on the query side.

Thanks,
James

On Tue, Aug 28, 2018 at 1:38 PM Andrew Purtell <ap...@apache.org> wrote:

> On Mon, Aug 27, 2018 at 11:03 AM Josh Elser <el...@apache.org> wrote:
>
> > 2. Can Phoenix be the de-facto schema for SQL on HBase?
> >
> > We've long asserted "if you have to ask how Phoenix serializes data, you
> > shouldn't be doing it" (a nod that you have to write lots of code). What if
> > we turn that on its head? Could we extract our PDataType serialization,
> > composite row-key, column encoding, etc into a minimal API that folks
> > with their own itches can use?
> >
> > With the growing integrations into Phoenix, we could embrace them by
> > providing an API to make what they're doing easier. In the same vein, we
> > cement ourselves as a cornerstone of doing it "correctly"
> >
>
> There have been discussions where I work suggesting this would be a
> great idea. If data types, row key constructors, and other key and data
> serialization concerns were a public API, these could be used by connectors
> to Spark or other systems to generate and consume Phoenix compatible data.
> It improves the integration story all around.
>
> Another thought for refactoring I've heard is exposing an API for
> generating query plans without needing the SQL parser. A public API for
> programmatically building query plans could be used by connectors to Spark or
> other systems when pushing down parts of a parallelized or federated query
> to Phoenix data sources, avoiding unnecessary SQL-generation hacks,
> string mangling, or (re)parsing overheads. This kind of
> describes Calcite's raison d'être. If Phoenix does not embed Calcite as its
> query planner (as it currently does not), it is independently useful to have
> a public API for programmatic query plan construction given the current
> implementation regardless. If Phoenix were to embed Calcite as query
> planner, you'd probably get a ton of re-use among internal and external
> users of the Calcite APIs. I'd think whatever option you might choose would
> be informed by the suitability (or not) of embedding Calcite as Phoenix's
> query planner, and how soon that might be expected to be feature complete.
> For what it's worth, this again extends the possibilities for integration.
>
>
> > 3. Better recommendations to users to not attempt certain queries.
> >
> > We definitively know that there are certain types of queries that
> > Phoenix cannot support well (compared to optimal Phoenix use-cases).
> > Users very commonly fall into such pitfalls on their own and this leaves
> > a bad taste in their mouth (thinking that the product "stinks").
> >
> > Can we do a better job of telling the user when and why it happened?
> > What would such a user-interaction model look like? Can we supplement
> > the "why" with instructions of what to do differently (even if in the
> > abstract)?
> >
> > 4. Phoenix-Calcite
> >
> > This was mentioned as a "nice to have". From what I understand, there
> > was nothing explicitly wrong with the implementation or approach, just
> > that it was a massive undertaking to continue with little immediate
> > gain. Would this be a boon for us to try to continue in some form? Are
> > there steps we can take that would help push us along the right path?
> >
> > Anyways, I'd love to hear everyone's thoughts. While the concerns were
> > raised at HBaseCon Asia, the suggestions that accompany them here are
> > largely mine ;). Feel free to break them out into their own threads if
> > you think that would be better (or say that you disagree with me --
> > that's cool too)!
> >
> > - Josh
> >
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
>

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by Andrew Purtell <ap...@apache.org>.
On Mon, Aug 27, 2018 at 11:03 AM Josh Elser <el...@apache.org> wrote:

> 2. Can Phoenix be the de-facto schema for SQL on HBase?
>
> We've long asserted "if you have to ask how Phoenix serializes data, you
> shouldn't be doing it" (a nod that you have to write lots of code). What if
> we turn that on its head? Could we extract our PDataType serialization,
> composite row-key, column encoding, etc into a minimal API that folks
> with their own itches can use?
>
> With the growing integrations into Phoenix, we could embrace them by
> providing an API to make what they're doing easier. In the same vein, we
> cement ourselves as a cornerstone of doing it "correctly"
>

There have been discussions where I work suggesting this would be a
great idea. If data types, row key constructors, and other key and data
serialization concerns were a public API, these could be used by connectors
to Spark or other systems to generate and consume Phoenix compatible data.
It improves the integration story all around.

Another thought for refactoring I've heard is exposing an API for
generating query plans without needing the SQL parser. A public API for
programmatically building query plans could be used by connectors to Spark or
other systems when pushing down parts of a parallelized or federated query
to Phoenix data sources, avoiding unnecessary SQL-generation hacks,
string mangling, or (re)parsing overheads. This kind of
describes Calcite's raison d'être. If Phoenix does not embed Calcite as its
query planner (as it currently does not), it is independently useful to have
a public API for programmatic query plan construction given the current
implementation regardless. If Phoenix were to embed Calcite as query
planner, you'd probably get a ton of re-use among internal and external
users of the Calcite APIs. I'd think whatever option you might choose would
be informed by the suitability (or not) of embedding Calcite as Phoenix's
query planner, and how soon that might be expected to be feature complete.
For what it's worth, this again extends the possibilities for integration.


> 3. Better recommendations to users to not attempt certain queries.
>
> We definitively know that there are certain types of queries that
> Phoenix cannot support well (compared to optimal Phoenix use-cases).
> Users very commonly fall into such pitfalls on their own and this leaves
> a bad taste in their mouth (thinking that the product "stinks").
>
> Can we do a better job of telling the user when and why it happened?
> What would such a user-interaction model look like? Can we supplement
> the "why" with instructions of what to do differently (even if in the
> abstract)?
>
> 4. Phoenix-Calcite
>
> This was mentioned as a "nice to have". From what I understand, there
> was nothing explicitly wrong with the implementation or approach, just
> that it was a massive undertaking to continue with little immediate
> gain. Would this be a boon for us to try to continue in some form? Are
> there steps we can take that would help push us along the right path?
>
> Anyways, I'd love to hear everyone's thoughts. While the concerns were
> raised at HBaseCon Asia, the suggestions that accompany them here are
> largely mine ;). Feel free to break them out into their own threads if
> you think that would be better (or say that you disagree with me --
> that's cool too)!
>
> - Josh
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: [DISCUSS] EXPLAIN'ing what we do well (was Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes)

Posted by Thomas D'Silva <td...@salesforce.com>.
I created PHOENIX-4881 to add a guardrail config property based on the
bytes scanned.
We already have PHOENIX-1481 to improve the explain plan documentation.

On Tue, Aug 28, 2018 at 1:40 PM, James Taylor <ja...@apache.org>
wrote:

> Thomas' idea is a good one. From the EXPLAIN plan ResultSet, you can
> directly get an estimate of the number of bytes that will be scanned. Take
> a look at this [1] documentation. We need to implement PHOENIX-4735 too (so
> that things are set up well out-of-the-box). We could have a kind of
> guardrail config property that would define the maximum bytes allowed
> to be read and fail a query that goes over this limit. That would cover 80%
> of the issues IMHO. Other guardrail config properties could cover other
> corner cases.
>
> [1] http://phoenix.apache.org/explainplan.html
>
> On Mon, Aug 27, 2018 at 3:01 PM Josh Elser <el...@apache.org> wrote:
>
> > On 8/27/18 5:03 PM, Thomas D'Silva wrote:
> > >> 3. Better recommendations to users to not attempt certain queries.
> > >>
> > >> We definitively know that there are certain types of queries that
> > Phoenix
> > >> cannot support well (compared to optimal Phoenix use-cases). Users
> very
> > >> commonly fall into such pitfalls on their own and this leaves a bad
> > taste
> > >> in their mouth (thinking that the product "stinks").
> > >>
> > >> Can we do a better job of telling the user when and why it happened?
> > What
> > >> would such a user-interaction model look like? Can we supplement the
> > "why"
> > >> with instructions of what to do differently (even if in the abstract)?
> > >>
> > > Providing relevant feedback before/after a query is run in general is
> > very
> > > hard to do. If stats are enabled we have an estimate of how many
> > rows/bytes
> > > will be scanned.
> > > We could have an optional feature that prevents users from running
> queries
> > > if the rows/bytes scanned are above a certain threshold. We should also
> > > enhance our explain
> > > plan documentation http://phoenix.apache.org/explainplan.html with
> > examples
> > > of queries so users know what kinds of queries Phoenix handles well.
> >
> > Breaking this out..
> >
> > Totally agree -- this is by no means "easy". I struggle very often
> > trying to express just _why_ a query that someone is running in Phoenix
> > doesn't run as well as they think it should.
> >
> > Centralizing on the EXPLAIN plan is good. Making sure it's
> > consumable/thorough is probably the lowest hanging fruit. If we can give
> > concrete examples to the kinds of explain plans a user might see, I
> > think that might see use from users/admins.
> >
> > Throwing a random idea out there: with stats and the query plan, can we
> > give a thumbs-up/thumbs-down? If we can, is that useful?
> >
>

Re: [DISCUSS] EXPLAIN'ing what we do well (was Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes)

Posted by James Taylor <ja...@apache.org>.
Thomas' idea is a good one. From the EXPLAIN plan ResultSet, you can
directly get an estimate of the number of bytes that will be scanned. Take
a look at this [1] documentation. We need to implement PHOENIX-4735 too (so
that things are set up well out-of-the-box). We could have a kind of
guardrail config property that would define the maximum bytes allowed
to be read and fail a query that goes over this limit. That would cover 80%
of the issues IMHO. Other guardrail config properties could cover other
corner cases.

[1] http://phoenix.apache.org/explainplan.html
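
A guardrail like that could be a thin check between planning and execution.
A minimal, self-contained sketch of the decision logic (the property name
and the negative "unlimited" convention are invented for illustration; the
real property would come out of PHOENIX-4881):

```java
public class ScanGuardrail {
    // Hypothetical property name, not an existing Phoenix config key.
    static final String MAX_SCAN_BYTES_PROP =
            "phoenix.query.guardrail.maxScanBytes";

    /**
     * Decide whether a query may run, given the EXPLAIN-plan byte estimate.
     * A negative ceiling means "no limit" (guardrail disabled, or stats
     * unavailable so no estimate exists to compare against).
     */
    static boolean allow(long estimatedBytes, long maxAllowedBytes) {
        return maxAllowedBytes < 0 || estimatedBytes <= maxAllowedBytes;
    }

    public static void main(String[] args) {
        long ceiling = 10_000_000L; // 10 MB guardrail
        System.out.println(allow(1_000_000L, ceiling));  // small scan: allowed
        System.out.println(allow(50_000_000L, ceiling)); // big scan: rejected
        System.out.println(allow(50_000_000L, -1L));     // guardrail disabled
    }
}
```

The estimate itself would come from the EXPLAIN ResultSet as described
above; a query failing the check would get an error pointing at the
documentation rather than silently running forever.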

On Mon, Aug 27, 2018 at 3:01 PM Josh Elser <el...@apache.org> wrote:

> On 8/27/18 5:03 PM, Thomas D'Silva wrote:
> >> 3. Better recommendations to users to not attempt certain queries.
> >>
> >> We definitively know that there are certain types of queries that
> Phoenix
> >> cannot support well (compared to optimal Phoenix use-cases). Users very
> >> commonly fall into such pitfalls on their own and this leaves a bad
> taste
> >> in their mouth (thinking that the product "stinks").
> >>
> >> Can we do a better job of telling the user when and why it happened?
> What
> >> would such a user-interaction model look like? Can we supplement the
> "why"
> >> with instructions of what to do differently (even if in the abstract)?
> >>
> > Providing relevant feedback before/after a query is run in general is
> very
> > hard to do. If stats are enabled we have an estimate of how many
> rows/bytes
> > will be scanned.
> > We could have an optional feature that prevents users from running queries
> > if the rows/bytes scanned are above a certain threshold. We should also
> > enhance our explain
> > plan documentation http://phoenix.apache.org/explainplan.html with
> examples
> > of queries so users know what kinds of queries Phoenix handles well.
>
> Breaking this out..
>
> Totally agree -- this is by no means "easy". I struggle very often
> trying to express just _why_ a query that someone is running in Phoenix
> doesn't run as well as they think it should.
>
> Centralizing on the EXPLAIN plan is good. Making sure it's
> consumable/thorough is probably the lowest hanging fruit. If we can give
> concrete examples to the kinds of explain plans a user might see, I
> think that might see use from users/admins.
>
> Throwing a random idea out there: with stats and the query plan, can we
> give a thumbs-up/thumbs-down? If we can, is that useful?
>

[DISCUSS] EXPLAIN'ing what we do well (was Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes)

Posted by Josh Elser <el...@apache.org>.
On 8/27/18 5:03 PM, Thomas D'Silva wrote:
>> 3. Better recommendations to users to not attempt certain queries.
>>
>> We definitively know that there are certain types of queries that Phoenix
>> cannot support well (compared to optimal Phoenix use-cases). Users very
>> commonly fall into such pitfalls on their own and this leaves a bad taste
>> in their mouth (thinking that the product "stinks").
>>
>> Can we do a better job of telling the user when and why it happened? What
>> would such a user-interaction model look like? Can we supplement the "why"
>> with instructions of what to do differently (even if in the abstract)?
>>
> Providing relevant feedback before/after a query is run in general is very
> hard to do. If stats are enabled we have an estimate of how many rows/bytes
> will be scanned.
> We could have an optional feature that prevents users from running queries
> if the rows/bytes scanned are above a certain threshold. We should also
> enhance our explain
> plan documentation http://phoenix.apache.org/explainplan.html with examples
> of queries so users know what kinds of queries Phoenix handles well.

Breaking this out..

Totally agree -- this is by no means "easy". I struggle very often 
trying to express just _why_ a query that someone is running in Phoenix 
doesn't run as well as they think it should.

Centralizing on the EXPLAIN plan is good. Making sure it's 
consumable/thorough is probably the lowest hanging fruit. If we can give 
concrete examples to the kinds of explain plans a user might see, I 
think that might see use from users/admins. 

Throwing a random idea out there: with stats and the query plan, can we 
give a thumbs-up/thumbs-down? If we can, is that useful?

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by Nick Dimiduk <nd...@gmail.com>.
On Mon, Aug 27, 2018 at 2:03 PM, Thomas D'Silva <td...@salesforce.com>
wrote:

> >
> >
> > 2. Can Phoenix be the de-facto schema for SQL on HBase?
> >
> > We've long asserted "if you have to ask how Phoenix serializes data, you
> > shouldn't be doing it" (a nod that you have to write lots of code). What if
> we
> > turn that on its head? Could we extract our PDataType serialization,
> > composite row-key, column encoding, etc into a minimal API that folks
> with
> > their own itches can use?
> >
> > With the growing integrations into Phoenix, we could embrace them by
> > providing an API to make what they're doing easier. In the same vein, we
> > cement ourselves as a cornerstone of doing it "correctly".
> >
>
> +1 on standardizing the data type and storage format API so that it would
> be easier for other projects to use.
>

Adding my $0.02, since I've thought a good bit about this over the years.

The `DataType` [0] interface in HBase is built with precisely this idea in
mind -- sharing data encoding formats across HBase projects. Phoenix's
`PDataType` implements this interface. Exposing the encoders to 3rd
parties, then, is a matter of those 3rd parties using this interface and
consuming the phoenix-core jar. Maybe we want to break them out into their
own jar to minimize dependencies? That said, Phoenix's smarts about
compound rowkeys and packed column values are beyond simple column
encodings. These may not be as easily exposed to external tools...
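
For a sense of what "exposing the encoders" already looks like today, here
is a sketch against phoenix-core (class names from memory, so treat it as
illustrative):

```java
import org.apache.phoenix.schema.types.PInteger;
import org.apache.phoenix.schema.types.PVarchar;

public class PDataTypeSketch {
    public static void main(String[] args) {
        // Encode single values the way Phoenix would store them...
        byte[] idBytes = PInteger.INSTANCE.toBytes(42);
        byte[] nameBytes = PVarchar.INSTANCE.toBytes("example");

        // ...and decode them back.
        int id = (Integer) PInteger.INSTANCE.toObject(idBytes);
        String name = (String) PVarchar.INSTANCE.toObject(nameBytes);

        // What this does NOT cover is exactly the point above: composite
        // row keys (separators, descending-sort inversion), salt bytes, and
        // packed/encoded column values all live above these primitives.
        System.out.println(id + " " + name);
    }
}
```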

I think, realistically, Phoenix would need to expose a number of
schema-related tools together in a package in order to provide "true
interoperability" with other tools. Pick a use case -- I'm fond of
"offline" use-cases, something like building a Phoenix-compatible table
from a MapReduce (or Spark, or Hive, or...) application on a cluster that
doesn't even have HBase available. Then plumb it out the other way, reading
an exported snapshot of a Phoenix table from the same "offline"
environment. It's a pretty extreme case that I think is worthwhile because it
enables a lot of flexibility for users, and would shake out a bunch of
these related issues. I suspect this requires going below the JDBC
interface, but I could be wrong...

-n

[0]:
https://hbase.apache.org/1.2/apidocs/org/apache/hadoop/hbase/types/DataType.html

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

Posted by Thomas D'Silva <td...@salesforce.com>.
>
>
> 2. Can Phoenix be the de-facto schema for SQL on HBase?
>
> We've long asserted "if you have to ask how Phoenix serializes data, you
> shouldn't be doing it" (a nod that you have to write lots of code). What if we
> turn that on its head? Could we extract our PDataType serialization,
> composite row-key, column encoding, etc into a minimal API that folks with
> their own itches can use?
>
> With the growing integrations into Phoenix, we could embrace them by
> providing an API to make what they're doing easier. In the same vein, we
> cement ourselves as a cornerstone of doing it "correctly".
>

+1 on standardizing the data type and storage format API so that it would
be easier for other projects to use.


> 3. Better recommendations to users to not attempt certain queries.
>
> We definitively know that there are certain types of queries that Phoenix
> cannot support well (compared to optimal Phoenix use-cases). Users very
> commonly fall into such pitfalls on their own and this leaves a bad taste
> in their mouth (thinking that the product "stinks").
>
> Can we do a better job of telling the user when and why it happened? What
> would such a user-interaction model look like? Can we supplement the "why"
> with instructions of what to do differently (even if in the abstract)?
>

Providing relevant feedback before/after a query is run in general is very
hard to do. If stats are enabled we have an estimate of how many rows/bytes
will be scanned.
We could have an optional feature that prevents users from running queries
if the rows/bytes scanned are above a certain threshold. We should also
enhance our explain
plan documentation http://phoenix.apache.org/explainplan.html with examples
of queries so users know what kinds of queries Phoenix handles well.


> 4. Phoenix-Calcite
>
> This was mentioned as a "nice to have". From what I understand, there was
> nothing explicitly wrong with the implementation or approach, just that it
> was a massive undertaking to continue with little immediate gain. Would
> this be a boon for us to try to continue in some form? Are there steps we
> can take that would help push us along the right path?
>

Maybe Maryanne, Rajeshbabu or Ankit can comment on the feasibility of
proceeding with the Calcite integration.
It would be good to standardize our query plan APIs so that we can generate
a query plan from a Spark Catalyst plan, for example.