You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@phoenix.apache.org by "larsh@apache.org" <la...@apache.org> on 2021/03/26 23:20:18 UTC

On duplicate column names

As you may or may not know, Phoenix allows for duplicate column names as long as they are placed in different column families.

You can create a table such as CREATE TABLE t (pk1 ..., x.v1, y.v1, ...).
Now each time you want to refer to v1 you need to qualify it with its column family or you get a AmbiguousColumnException.

Worse, you also do CREATE TABLE t (pk1 ..., v1, x.v1, ...). Now a using v1 in a query will silently resolve to v1 in the default column family and x.v1 does have to be qualified.

As I reason through how this should work in Trino's (formerly Presto) Phoenix connector, it occurs to me that should probably just disallow this.

CREATE TABLE t (pk1 ..., x.v1, y.v2, ...) works just fine and you can refer to both v1 and v2 without the need to qualify them with the column family.

So... In essence: Should we require all columns to be uniquely named, even with multiple column families?

Thanks.

-- Lars

Re: On duplicate column names

Posted by "larsh@apache.org" <la...@apache.org>.

Please comment on the Jira if you have an opinion :)






On Monday, March 29, 2021, 10:48:34 AM PDT, larsh@apache.org <la...@apache.org> wrote: 





Filed https://issues.apache.org/jira/browse/PHOENIX-6433

We can discuss there.
In the end, IMHO, placing columns in distinct column families is a physical optimization, not a logical construct...
Very much like indexes. And just like indexes these physical optimizations should not leak into the actual queries.

-- Lars


On Monday, March 29, 2021, 3:04:08 AM PDT, Istvan Toth <st...@apache.org> wrote: 





I agree that having identical column names is generally a pain, and I have
no problem with disallowing them in statically defined Phoenix columns, as
long as the views on HBase tables and the dynamic column features are not
affected.

We may or may not want to add a property to override it (with the
default being disallowing the duplicate column names)

Istvan



On Sat, Mar 27, 2021 at 5:59 PM larsh@apache.org <la...@apache.org> wrote:

> Viraj, yeah similar discussion.
>
> Istvan, good point. Of course you cannot guard against what folks do at
> the HBase level, and we should still allow that. Just like in PHOENIX-6343.
> But we could disallow creating new tables like this in Phoenix, also like
> PHOENIX-6343.
>
>
> I just think that causes more problems than it solves. While HBase looks
> at things key-value by key-value, Phoenix takes a row-by-row view.
> From that angle, why would you ever want a table with ambiguous column
> names? And neither qualified nor duplicate column names are in the SQL
> standard - AFAIK at least.
>
> In the Trino case I described, it's using the standard JDBC metadata, then
> using the COLUMN_NAM column to read the name of the column. That is the
> JDBC standard.
> In the Phoenix case you now also have to read the COLUMN_FAMILY column,
> then know how to build a qualified column name, and that be able to pass
> this through all the query planning, etc. And there is no possible way to
> hand a qualified column name to Phoenix. Phoenix does *not* allow "cf.cq",
> it has to be cf.cq or "cf"."cq".
> Apparently there's a way in Trino to model this as a structured column.
> But that too does not hit the mark, then you *have* to use structs to
> access any column in Phoenix.
>
> So in Trino (and probably other JDBC clients) we could opt to simply fail
> on tables like these - that's what's happening now... Or come up with
> unnatural constructs to fit this into the SQL model. Trino is just an
> example here.
>
> At least we can document that qualifier names *should* be unique, and it
> otherwise might cause problems with downstream BigData integrations.
>
> -- Lars
>
> On Saturday, March 27, 2021, 2:51:04 AM PDT, Viraj Jasani <
> vjasani@apache.org> wrote:
>
>
>
>
>
> Somewhat similar discussion we had on PHOENIX-6343 and why the duplicate
> column check was restricted to default CF only.
>
>
> On Sat, 27 Mar 2021 at 10:42 AM, Istvan Toth <st...@cloudera.com.invalid>
> wrote:
>
> > The first thing that comes to my mind is that this would limit
> > functionality when defining views on existing raw HBase tables (though
> > aliasing the columns may solve that)
> > Another thing to consider is how this would affect dynamic column use
> cases
> > for either native Phoenix tables of views on HBase tables.
> >
> > Istvan
> >
> > On Sat, Mar 27, 2021 at 12:20 AM larsh@apache.org <la...@apache.org>
> > wrote:
> >
> > > As you may or may not know, Phoenix allows for duplicate column names
> as
> > > long as they are placed in different column families.
> > >
> > > You can create a table such as CREATE TABLE t (pk1 ..., x.v1, y.v1,
> ...).
> > > Now each time you want to refer to v1 you need to qualify it with its
> > > column family or you get a AmbiguousColumnException.
> > >
> > > Worse, you also do CREATE TABLE t (pk1 ..., v1, x.v1, ...). Now a using
> > v1
> > > in a query will silently resolve to v1 in the default column family and
> > > x.v1 does have to be qualified.
> > >
> > > As I reason through how this should work in Trino's (formerly Presto)
> > > Phoenix connector, it occurs to me that should probably just disallow
> > this.
> > >
> > > CREATE TABLE t (pk1 ..., x.v1, y.v2, ...) works just fine and you can
> > > refer to both v1 and v2 without the need to qualify them with the
> column
> > > family.
> > >
> > > So... In essence: Should we require all columns to be uniquely named,
> > even
> > > with multiple column families?
> > >
> > > Thanks.
> > >
> > > -- Lars
> > >
> >
> >
> > --
> > *István Tóth* | Staff Software Engineer
> > stoty@cloudera.com <https://www.cloudera.com>
> > [image: Cloudera] <https://www.cloudera.com/>
> > [image: Cloudera on Twitter] <https://twitter.com/cloudera> [image:
> > Cloudera on Facebook] <https://www.facebook.com/cloudera> [image:
> Cloudera
> > on LinkedIn] <https://www.linkedin.com/company/cloudera>
> > <https://www.cloudera.com/>
> > ------------------------------
> >
>

Re: On duplicate column names

Posted by "larsh@apache.org" <la...@apache.org>.

Filed https://issues.apache.org/jira/browse/PHOENIX-6433

We can discuss there.
In the end, IMHO, placing columns in distinct column families is a physical optimization, not a logical construct...
Very much like indexes. And just like indexes these physical optimizations should not leak into the actual queries.

-- Lars


On Monday, March 29, 2021, 3:04:08 AM PDT, Istvan Toth <st...@apache.org> wrote: 





I agree that having identical column names is generally a pain, and I have
no problem with disallowing them in statically defined Phoenix columns, as
long as the views on HBase tables and the dynamic column features are not
affected.

We may or may not want to add a property to override it (with the
default being disallowing the duplicate column names)

Istvan



On Sat, Mar 27, 2021 at 5:59 PM larsh@apache.org <la...@apache.org> wrote:

> Viraj, yeah similar discussion.
>
> Istvan, good point. Of course you cannot guard against what folks do at
> the HBase level, and we should still allow that. Just like in PHOENIX-6343.
> But we could disallow creating new tables like this in Phoenix, also like
> PHOENIX-6343.
>
>
> I just think that causes more problems than it solves. While HBase looks
> at things key-value by key-value, Phoenix takes a row-by-row view.
> From that angle, why would you ever want a table with ambiguous column
> names? And neither qualified nor duplicate column names are in the SQL
> standard - AFAIK at least.
>
> In the Trino case I described, it's using the standard JDBC metadata, then
> using the COLUMN_NAM column to read the name of the column. That is the
> JDBC standard.
> In the Phoenix case you now also have to read the COLUMN_FAMILY column,
> then know how to build a qualified column name, and that be able to pass
> this through all the query planning, etc. And there is no possible way to
> hand a qualified column name to Phoenix. Phoenix does *not* allow "cf.cq",
> it has to be cf.cq or "cf"."cq".
> Apparently there's a way in Trino to model this as a structured column.
> But that too does not hit the mark, then you *have* to use structs to
> access any column in Phoenix.
>
> So in Trino (and probably other JDBC clients) we could opt to simply fail
> on tables like these - that's what's happening now... Or come up with
> unnatural constructs to fit this into the SQL model. Trino is just an
> example here.
>
> At least we can document that qualifier names *should* be unique, and it
> otherwise might cause problems with downstream BigData integrations.
>
> -- Lars
>
> On Saturday, March 27, 2021, 2:51:04 AM PDT, Viraj Jasani <
> vjasani@apache.org> wrote:
>
>
>
>
>
> Somewhat similar discussion we had on PHOENIX-6343 and why the duplicate
> column check was restricted to default CF only.
>
>
> On Sat, 27 Mar 2021 at 10:42 AM, Istvan Toth <st...@cloudera.com.invalid>
> wrote:
>
> > The first thing that comes to my mind is that this would limit
> > functionality when defining views on existing raw HBase tables (though
> > aliasing the columns may solve that)
> > Another thing to consider is how this would affect dynamic column use
> cases
> > for either native Phoenix tables of views on HBase tables.
> >
> > Istvan
> >
> > On Sat, Mar 27, 2021 at 12:20 AM larsh@apache.org <la...@apache.org>
> > wrote:
> >
> > > As you may or may not know, Phoenix allows for duplicate column names
> as
> > > long as they are placed in different column families.
> > >
> > > You can create a table such as CREATE TABLE t (pk1 ..., x.v1, y.v1,
> ...).
> > > Now each time you want to refer to v1 you need to qualify it with its
> > > column family or you get a AmbiguousColumnException.
> > >
> > > Worse, you also do CREATE TABLE t (pk1 ..., v1, x.v1, ...). Now a using
> > v1
> > > in a query will silently resolve to v1 in the default column family and
> > > x.v1 does have to be qualified.
> > >
> > > As I reason through how this should work in Trino's (formerly Presto)
> > > Phoenix connector, it occurs to me that should probably just disallow
> > this.
> > >
> > > CREATE TABLE t (pk1 ..., x.v1, y.v2, ...) works just fine and you can
> > > refer to both v1 and v2 without the need to qualify them with the
> column
> > > family.
> > >
> > > So... In essence: Should we require all columns to be uniquely named,
> > even
> > > with multiple column families?
> > >
> > > Thanks.
> > >
> > > -- Lars
> > >
> >
> >
> > --
> > *István Tóth* | Staff Software Engineer
> > stoty@cloudera.com <https://www.cloudera.com>
> > [image: Cloudera] <https://www.cloudera.com/>
> > [image: Cloudera on Twitter] <https://twitter.com/cloudera> [image:
> > Cloudera on Facebook] <https://www.facebook.com/cloudera> [image:
> Cloudera
> > on LinkedIn] <https://www.linkedin.com/company/cloudera>
> > <https://www.cloudera.com/>
> > ------------------------------
> >
>

Re: On duplicate column names

Posted by Istvan Toth <st...@apache.org>.

I agree that having identical column names is generally a pain, and I have
no problem with disallowing them in statically defined Phoenix columns, as
long as the views on HBase tables and the dynamic column features are not
affected.

We may or may not want to add a property to override it (with the
default being disallowing the duplicate column names)

Istvan



On Sat, Mar 27, 2021 at 5:59 PM larsh@apache.org <la...@apache.org> wrote:

> Viraj, yeah similar discussion.
>
> Istvan, good point. Of course you cannot guard against what folks do at
> the HBase level, and we should still allow that. Just like in PHOENIX-6343.
> But we could disallow creating new tables like this in Phoenix, also like
> PHOENIX-6343.
>
>
> I just think that causes more problems than it solves. While HBase looks
> at things key-value by key-value, Phoenix takes a row-by-row view.
> From that angle, why would you ever want a table with ambiguous column
> names? And neither qualified nor duplicate column names are in the SQL
> standard - AFAIK at least.
>
> In the Trino case I described, it's using the standard JDBC metadata, then
> using the COLUMN_NAM column to read the name of the column. That is the
> JDBC standard.
> In the Phoenix case you now also have to read the COLUMN_FAMILY column,
> then know how to build a qualified column name, and that be able to pass
> this through all the query planning, etc. And there is no possible way to
> hand a qualified column name to Phoenix. Phoenix does *not* allow "cf.cq",
> it has to be cf.cq or "cf"."cq".
> Apparently there's a way in Trino to model this as a structured column.
> But that too does not hit the mark, then you *have* to use structs to
> access any column in Phoenix.
>
> So in Trino (and probably other JDBC clients) we could opt to simply fail
> on tables like these - that's what's happening now... Or come up with
> unnatural constructs to fit this into the SQL model. Trino is just an
> example here.
>
> At least we can document that qualifier names *should* be unique, and it
> otherwise might cause problems with downstream BigData integrations.
>
> -- Lars
>
> On Saturday, March 27, 2021, 2:51:04 AM PDT, Viraj Jasani <
> vjasani@apache.org> wrote:
>
>
>
>
>
> Somewhat similar discussion we had on PHOENIX-6343 and why the duplicate
> column check was restricted to default CF only.
>
>
> On Sat, 27 Mar 2021 at 10:42 AM, Istvan Toth <st...@cloudera.com.invalid>
> wrote:
>
> > The first thing that comes to my mind is that this would limit
> > functionality when defining views on existing raw HBase tables (though
> > aliasing the columns may solve that)
> > Another thing to consider is how this would affect dynamic column use
> cases
> > for either native Phoenix tables of views on HBase tables.
> >
> > Istvan
> >
> > On Sat, Mar 27, 2021 at 12:20 AM larsh@apache.org <la...@apache.org>
> > wrote:
> >
> > > As you may or may not know, Phoenix allows for duplicate column names
> as
> > > long as they are placed in different column families.
> > >
> > > You can create a table such as CREATE TABLE t (pk1 ..., x.v1, y.v1,
> ...).
> > > Now each time you want to refer to v1 you need to qualify it with its
> > > column family or you get a AmbiguousColumnException.
> > >
> > > Worse, you also do CREATE TABLE t (pk1 ..., v1, x.v1, ...). Now a using
> > v1
> > > in a query will silently resolve to v1 in the default column family and
> > > x.v1 does have to be qualified.
> > >
> > > As I reason through how this should work in Trino's (formerly Presto)
> > > Phoenix connector, it occurs to me that should probably just disallow
> > this.
> > >
> > > CREATE TABLE t (pk1 ..., x.v1, y.v2, ...) works just fine and you can
> > > refer to both v1 and v2 without the need to qualify them with the
> column
> > > family.
> > >
> > > So... In essence: Should we require all columns to be uniquely named,
> > even
> > > with multiple column families?
> > >
> > > Thanks.
> > >
> > > -- Lars
> > >
> >
> >
> > --
> > *István Tóth* | Staff Software Engineer
> > stoty@cloudera.com <https://www.cloudera.com>
> > [image: Cloudera] <https://www.cloudera.com/>
> > [image: Cloudera on Twitter] <https://twitter.com/cloudera> [image:
> > Cloudera on Facebook] <https://www.facebook.com/cloudera> [image:
> Cloudera
> > on LinkedIn] <https://www.linkedin.com/company/cloudera>
> > <https://www.cloudera.com/>
> > ------------------------------
> >
>

Re: On duplicate column names

Posted by "larsh@apache.org" <la...@apache.org>.

Viraj, yeah similar discussion.

Istvan, good point. Of course you cannot guard against what folks do at the HBase level, and we should still allow that. Just like in PHOENIX-6343.
But we could disallow creating new tables like this in Phoenix, also like PHOENIX-6343.

I just think that causes more problems than it solves. While HBase looks at things key-value by key-value, Phoenix takes a row-by-row view.
From that angle, why would you ever want a table with ambiguous column names? And neither qualified nor duplicate column names are in the SQL standard - AFAIK at least.

In the Trino case I described, it's using the standard JDBC metadata, then using the COLUMN_NAM column to read the name of the column. That is the JDBC standard.
In the Phoenix case you now also have to read the COLUMN_FAMILY column, then know how to build a qualified column name, and that be able to pass this through all the query planning, etc. And there is no possible way to hand a qualified column name to Phoenix. Phoenix does *not* allow "cf.cq", it has to be cf.cq or "cf"."cq".
Apparently there's a way in Trino to model this as a structured column. But that too does not hit the mark, then you *have* to use structs to access any column in Phoenix.

So in Trino (and probably other JDBC clients) we could opt to simply fail on tables like these - that's what's happening now... Or come up with unnatural constructs to fit this into the SQL model. Trino is just an example here.

At least we can document that qualifier names *should* be unique, and it otherwise might cause problems with downstream BigData integrations.

-- Lars

On Saturday, March 27, 2021, 2:51:04 AM PDT, Viraj Jasani <vj...@apache.org> wrote: 

Somewhat similar discussion we had on PHOENIX-6343 and why the duplicate
column check was restricted to default CF only.

On Sat, 27 Mar 2021 at 10:42 AM, Istvan Toth <st...@cloudera.com.invalid>
wrote:

> The first thing that comes to my mind is that this would limit
> functionality when defining views on existing raw HBase tables (though
> aliasing the columns may solve that)
> Another thing to consider is how this would affect dynamic column use cases
> for either native Phoenix tables of views on HBase tables.
>
> Istvan
>
> On Sat, Mar 27, 2021 at 12:20 AM larsh@apache.org <la...@apache.org>
> wrote:
>
> > As you may or may not know, Phoenix allows for duplicate column names as
> > long as they are placed in different column families.
> >
> > You can create a table such as CREATE TABLE t (pk1 ..., x.v1, y.v1, ...).
> > Now each time you want to refer to v1 you need to qualify it with its
> > column family or you get a AmbiguousColumnException.
> >
> > Worse, you also do CREATE TABLE t (pk1 ..., v1, x.v1, ...). Now a using
> v1
> > in a query will silently resolve to v1 in the default column family and
> > x.v1 does have to be qualified.
> >
> > As I reason through how this should work in Trino's (formerly Presto)
> > Phoenix connector, it occurs to me that should probably just disallow
> this.
> >
> > CREATE TABLE t (pk1 ..., x.v1, y.v2, ...) works just fine and you can
> > refer to both v1 and v2 without the need to qualify them with the column
> > family.
> >
> > So... In essence: Should we require all columns to be uniquely named,
> even
> > with multiple column families?
> >
> > Thanks.
> >
> > -- Lars
> >
>
>
> --
> *István Tóth* | Staff Software Engineer
> stoty@cloudera.com <https://www.cloudera.com>
> [image: Cloudera] <https://www.cloudera.com/>
> [image: Cloudera on Twitter] <https://twitter.com/cloudera> [image:
> Cloudera on Facebook] <https://www.facebook.com/cloudera> [image: Cloudera
> on LinkedIn] <https://www.linkedin.com/company/cloudera>
> <https://www.cloudera.com/>
> ------------------------------
>

Re: On duplicate column names

Posted by Viraj Jasani <vj...@apache.org>.

Somewhat similar discussion we had on PHOENIX-6343 and why the duplicate
column check was restricted to default CF only.


On Sat, 27 Mar 2021 at 10:42 AM, Istvan Toth <st...@cloudera.com.invalid>
wrote:

> The first thing that comes to my mind is that this would limit
> functionality when defining views on existing raw HBase tables (though
> aliasing the columns may solve that)
> Another thing to consider is how this would affect dynamic column use cases
> for either native Phoenix tables of views on HBase tables.
>
> Istvan
>
> On Sat, Mar 27, 2021 at 12:20 AM larsh@apache.org <la...@apache.org>
> wrote:
>
> > As you may or may not know, Phoenix allows for duplicate column names as
> > long as they are placed in different column families.
> >
> > You can create a table such as CREATE TABLE t (pk1 ..., x.v1, y.v1, ...).
> > Now each time you want to refer to v1 you need to qualify it with its
> > column family or you get a AmbiguousColumnException.
> >
> > Worse, you also do CREATE TABLE t (pk1 ..., v1, x.v1, ...). Now a using
> v1
> > in a query will silently resolve to v1 in the default column family and
> > x.v1 does have to be qualified.
> >
> > As I reason through how this should work in Trino's (formerly Presto)
> > Phoenix connector, it occurs to me that should probably just disallow
> this.
> >
> > CREATE TABLE t (pk1 ..., x.v1, y.v2, ...) works just fine and you can
> > refer to both v1 and v2 without the need to qualify them with the column
> > family.
> >
> > So... In essence: Should we require all columns to be uniquely named,
> even
> > with multiple column families?
> >
> > Thanks.
> >
> > -- Lars
> >
>
>
> --
> *István Tóth* | Staff Software Engineer
> stoty@cloudera.com <https://www.cloudera.com>
> [image: Cloudera] <https://www.cloudera.com/>
> [image: Cloudera on Twitter] <https://twitter.com/cloudera> [image:
> Cloudera on Facebook] <https://www.facebook.com/cloudera> [image: Cloudera
> on LinkedIn] <https://www.linkedin.com/company/cloudera>
> <https://www.cloudera.com/>
> ------------------------------
>

Re: On duplicate column names

Posted by Istvan Toth <st...@cloudera.com.INVALID>.

The first thing that comes to my mind is that this would limit
functionality when defining views on existing raw HBase tables (though
aliasing the columns may solve that)
Another thing to consider is how this would affect dynamic column use cases
for either native Phoenix tables of views on HBase tables.

Istvan

On Sat, Mar 27, 2021 at 12:20 AM larsh@apache.org <la...@apache.org> wrote:

> As you may or may not know, Phoenix allows for duplicate column names as
> long as they are placed in different column families.
>
> You can create a table such as CREATE TABLE t (pk1 ..., x.v1, y.v1, ...).
> Now each time you want to refer to v1 you need to qualify it with its
> column family or you get a AmbiguousColumnException.
>
> Worse, you also do CREATE TABLE t (pk1 ..., v1, x.v1, ...). Now a using v1
> in a query will silently resolve to v1 in the default column family and
> x.v1 does have to be qualified.
>
> As I reason through how this should work in Trino's (formerly Presto)
> Phoenix connector, it occurs to me that should probably just disallow this.
>
> CREATE TABLE t (pk1 ..., x.v1, y.v2, ...) works just fine and you can
> refer to both v1 and v2 without the need to qualify them with the column
> family.
>
> So... In essence: Should we require all columns to be uniquely named, even
> with multiple column families?
>
> Thanks.
>
> -- Lars
>


-- 
*István Tóth* | Staff Software Engineer
stoty@cloudera.com <https://www.cloudera.com>
[image: Cloudera] <https://www.cloudera.com/>
[image: Cloudera on Twitter] <https://twitter.com/cloudera> [image:
Cloudera on Facebook] <https://www.facebook.com/cloudera> [image: Cloudera
on LinkedIn] <https://www.linkedin.com/company/cloudera>
<https://www.cloudera.com/>
------------------------------