You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@phoenix.apache.org by 杨晗 <yh...@163.com> on 2015/12/21 14:58:20 UTC

Why do we need an empty column when doing upsert?

hi all:
     I'm reading phoenix src code recently, and i found PRowImpl.toRowMutations() always adds an empty column named '_0' for non-delete upsert. Why?
     I read the comment but i didn't quite understand it. Might someone give me an example that illustates why an empty column is ALWAYS necessary?


    Further more, I have to access a phoenix table by both phoenix client and hbase API in some cases. If i do not add this empty column explicitly by hbase API, it is ok  if i read this row by phoenix? 


Thanks
-William

Re: Re: Re: Why do we need an empty column when doing upsert?

Posted by James Taylor <ja...@apache.org>.
Yes, thanks, Gabriel - very well thought out answer and another good
candidate to add to our FAQs if anyone is up for it.

On Tue, Dec 22, 2015 at 12:39 AM, William <yh...@163.com> wrote:

> Hi Gabriel,
>
>
> Thanks a lot for your reply.
> I was confused. I misunderstood 'not null' constraint as 'required at each
> upsert'.
>
>
> -William
>
>
> At 2015-12-22 15:25:40, "Gabriel Reid" <ga...@gmail.com> wrote:
> >Hi William,
> >
> >Yes, all your tests there look correct, and it looks like you've got a
> >good understanding on the underlying principles of how scans are
> >working in Phoenix vs HBase.
> >
> >About the use of "not null" constraints, you're correct about data
> >being written by non-Phoenix clients being an issue. However, the
> >bigger issue is doing upsert statements in general. Phoenix doesn't
> >know if an UPSERT statement is updating an existing row, or adding a
> >new row. If an update of an existing row is being done, then it would
> >be possible to only upsert a subset of the total columns for the row,
> >but an upsert of a new row would have to include all columns.
> >
> >It would be possible to determine if each upsert was a new row by
> >reading each row before writing it, but this would be very costly in
> >terms of performance.
> >
> >- Gabriel
> >
> >On Tue, Dec 22, 2015 at 3:22 AM, William <yh...@163.com> wrote:
> >> Hi Gabriel:
> >>
> >>
> >> Now I think I understand why an empty column is necessary. I did the
> following tests:
> >>
> >>
> >> 1. by HBASE native API: after I deleted the last attribute column of a
> specified row, the whole row was deleted.
> >>     and this is exactly what you said 'the row key is not stored at all
> unless there is at least one column stored.'
> >>
> >>
> >> 2. by Phoenix client:
> >>     * create table tt ( pk varchar not null primary key, cf.b varchar,
> cf.c varchar);
> >>     * upsert into tt values ( 'this is pk');
> >>       now, do "scan 'tt' " over hbase shell, I got a row with an empty
> column '_0', and this row would not have been existed if phoenix did not
> insert this empty column and of course hbase native API will not allow u to
> do this.
> >>
> >> 3. by phoenix client:
> >>     even if we do insert not null non pk column in an upsert, an empty
> column is ALWAYS necessary because columns can be deleted.
> >>    *  upsert into tt values ( 'this is pk', 'bbbbb', 'ccccc');
> >>    *  upsert into tt values ( 'this is pk', null, null);
> >>    * or alter table tt drop column cf.b, cf.c
> >>       if we do not insert the empty column at 1st upsert, then after
> 2nd upsert, the whole row will disappear. But for SQL, we expect a row only
> with the PK value.
> >>
> >>
> >> 4. why we cannot add not null constraint on non pk column ?
> >>     I tried this : create table tt (pk varchar not null primary key,
> cf.b varchar not null);
> >>     and it raised an exception : Invalid not null constraint on non
> primary key column columnName=TT.B
> >>     I figured out one reason : the data of a phoenix table might be
> written by hbase native api, which does not support not null constraint on
> non pk column.
> >>    Is it right? Is it the primary reason?
> >>
> >>
> >> thanks
> >> - William
> >>
> >> At 2015-12-21 22:46:30, "Gabriel Reid" <ga...@gmail.com> wrote:
> >>>Hi William,
> >>>
> >>>The empty column is needed to ensure that a given column is available
> >>>for all rows.
> >>>
> >>>As you may know, data is stored in HBase as KeyValues, meaning that
> >>>the full row key is stored for each column value. This also implies
> >>>that the row key is not stored at all unless there is at least one
> >>>column stored.
> >>>
> >>>Now consider JDBC row which has an integer primary key, and several
> >>>columns which are all null. In order to be able to store the primary
> >>>key, a KeyValue needs to be stored to show that the row is present at
> >>>all. This column is represented by the empty column that you've
> >>>noticed. This allows doing a "SELECT * FROM TABLE" and receiving
> >>>records for all rows, even those whose non-pk columns are null.
> >>>
> >>>The same issue comes up even if only one column is null for some (or
> >>>all) records. A scan over Phoenix will include the empty column to
> >>>ensure that rows that only consist of the primary key (and have null
> >>>for all non-key columns) will be included in a scan result.
> >>>
> >>>- Gabriel
> >>>
> >>>On Mon, Dec 21, 2015 at 2:58 PM, 杨晗 <yh...@163.com> wrote:
> >>>> hi all:
> >>>>      I'm reading phoenix src code recently, and i found
> PRowImpl.toRowMutations() always adds an empty column named '_0' for
> non-delete upsert. Why?
> >>>>      I read the comment but i didn't quite understand it. Might
> someone give me an example that illustates why an empty column is ALWAYS
> necessary?
> >>>>
> >>>>
> >>>>     Further more, I have to access a phoenix table by both phoenix
> client and hbase API in some cases. If i do not add this empty column
> explicitly by hbase API, it is ok  if i read this row by phoenix?
> >>>>
> >>>>
> >>>> Thanks
> >>>> -William
>

Re:Re: Re: Why do we need an empty column when doing upsert?

Posted by William <yh...@163.com>.
Hi Gabriel,


Thanks a lot for your reply. 
I was confused. I misunderstood 'not null' constraint as 'required at each upsert'. 


-William


At 2015-12-22 15:25:40, "Gabriel Reid" <ga...@gmail.com> wrote:
>Hi William,
>
>Yes, all your tests there look correct, and it looks like you've got a
>good understanding on the underlying principles of how scans are
>working in Phoenix vs HBase.
>
>About the use of "not null" constraints, you're correct about data
>being written by non-Phoenix clients being an issue. However, the
>bigger issue is doing upsert statements in general. Phoenix doesn't
>know if an UPSERT statement is updating an existing row, or adding a
>new row. If an update of an existing row is being done, then it would
>be possible to only upsert a subset of the total columns for the row,
>but an upsert of a new row would have to include all columns.
>
>It would be possible to determine if each upsert was a new row by
>reading each row before writing it, but this would be very costly in
>terms of performance.
>
>- Gabriel
>
>On Tue, Dec 22, 2015 at 3:22 AM, William <yh...@163.com> wrote:
>> Hi Gabriel:
>>
>>
>> Now I think I understand why an empty column is necessary. I did the following tests:
>>
>>
>> 1. by HBASE native API: after I deleted the last attribute column of a specified row, the whole row was deleted.
>>     and this is exactly what you said 'the row key is not stored at all unless there is at least one column stored.'
>>
>>
>> 2. by Phoenix client:
>>     * create table tt ( pk varchar not null primary key, cf.b varchar, cf.c varchar);
>>     * upsert into tt values ( 'this is pk');
>>       now, do "scan 'tt' " over hbase shell, I got a row with an empty column '_0', and this row would not have been existed if phoenix did not insert this empty column and of course hbase native API will not allow u to do this.
>>
>> 3. by phoenix client:
>>     even if we do insert not null non pk column in an upsert, an empty column is ALWAYS necessary because columns can be deleted.
>>    *  upsert into tt values ( 'this is pk', 'bbbbb', 'ccccc');
>>    *  upsert into tt values ( 'this is pk', null, null);
>>    * or alter table tt drop column cf.b, cf.c
>>       if we do not insert the empty column at 1st upsert, then after 2nd upsert, the whole row will disappear. But for SQL, we expect a row only with the PK value.
>>
>>
>> 4. why we cannot add not null constraint on non pk column ?
>>     I tried this : create table tt (pk varchar not null primary key, cf.b varchar not null);
>>     and it raised an exception : Invalid not null constraint on non primary key column columnName=TT.B
>>     I figured out one reason : the data of a phoenix table might be written by hbase native api, which does not support not null constraint on non pk column.
>>    Is it right? Is it the primary reason?
>>
>>
>> thanks
>> - William
>>
>> At 2015-12-21 22:46:30, "Gabriel Reid" <ga...@gmail.com> wrote:
>>>Hi William,
>>>
>>>The empty column is needed to ensure that a given column is available
>>>for all rows.
>>>
>>>As you may know, data is stored in HBase as KeyValues, meaning that
>>>the full row key is stored for each column value. This also implies
>>>that the row key is not stored at all unless there is at least one
>>>column stored.
>>>
>>>Now consider JDBC row which has an integer primary key, and several
>>>columns which are all null. In order to be able to store the primary
>>>key, a KeyValue needs to be stored to show that the row is present at
>>>all. This column is represented by the empty column that you've
>>>noticed. This allows doing a "SELECT * FROM TABLE" and receiving
>>>records for all rows, even those whose non-pk columns are null.
>>>
>>>The same issue comes up even if only one column is null for some (or
>>>all) records. A scan over Phoenix will include the empty column to
>>>ensure that rows that only consist of the primary key (and have null
>>>for all non-key columns) will be included in a scan result.
>>>
>>>- Gabriel
>>>
>>>On Mon, Dec 21, 2015 at 2:58 PM, 杨晗 <yh...@163.com> wrote:
>>>> hi all:
>>>>      I'm reading phoenix src code recently, and i found PRowImpl.toRowMutations() always adds an empty column named '_0' for non-delete upsert. Why?
>>>>      I read the comment but i didn't quite understand it. Might someone give me an example that illustates why an empty column is ALWAYS necessary?
>>>>
>>>>
>>>>     Further more, I have to access a phoenix table by both phoenix client and hbase API in some cases. If i do not add this empty column explicitly by hbase API, it is ok  if i read this row by phoenix?
>>>>
>>>>
>>>> Thanks
>>>> -William

Re: Re: Why do we need an empty column when doing upsert?

Posted by Gabriel Reid <ga...@gmail.com>.
Hi William,

Yes, all your tests there look correct, and it looks like you've got a
good understanding on the underlying principles of how scans are
working in Phoenix vs HBase.

About the use of "not null" constraints, you're correct about data
being written by non-Phoenix clients being an issue. However, the
bigger issue is doing upsert statements in general. Phoenix doesn't
know if an UPSERT statement is updating an existing row, or adding a
new row. If an update of an existing row is being done, then it would
be possible to only upsert a subset of the total columns for the row,
but an upsert of a new row would have to include all columns.

It would be possible to determine if each upsert was a new row by
reading each row before writing it, but this would be very costly in
terms of performance.

- Gabriel

On Tue, Dec 22, 2015 at 3:22 AM, William <yh...@163.com> wrote:
> Hi Gabriel:
>
>
> Now I think I understand why an empty column is necessary. I did the following tests:
>
>
> 1. by HBASE native API: after I deleted the last attribute column of a specified row, the whole row was deleted.
>     and this is exactly what you said 'the row key is not stored at all unless there is at least one column stored.'
>
>
> 2. by Phoenix client:
>     * create table tt ( pk varchar not null primary key, cf.b varchar, cf.c varchar);
>     * upsert into tt values ( 'this is pk');
>       now, do "scan 'tt' " over hbase shell, I got a row with an empty column '_0', and this row would not have been existed if phoenix did not insert this empty column and of course hbase native API will not allow u to do this.
>
> 3. by phoenix client:
>     even if we do insert not null non pk column in an upsert, an empty column is ALWAYS necessary because columns can be deleted.
>    *  upsert into tt values ( 'this is pk', 'bbbbb', 'ccccc');
>    *  upsert into tt values ( 'this is pk', null, null);
>    * or alter table tt drop column cf.b, cf.c
>       if we do not insert the empty column at 1st upsert, then after 2nd upsert, the whole row will disappear. But for SQL, we expect a row only with the PK value.
>
>
> 4. why we cannot add not null constraint on non pk column ?
>     I tried this : create table tt (pk varchar not null primary key, cf.b varchar not null);
>     and it raised an exception : Invalid not null constraint on non primary key column columnName=TT.B
>     I figured out one reason : the data of a phoenix table might be written by hbase native api, which does not support not null constraint on non pk column.
>    Is it right? Is it the primary reason?
>
>
> thanks
> - William
>
> At 2015-12-21 22:46:30, "Gabriel Reid" <ga...@gmail.com> wrote:
>>Hi William,
>>
>>The empty column is needed to ensure that a given column is available
>>for all rows.
>>
>>As you may know, data is stored in HBase as KeyValues, meaning that
>>the full row key is stored for each column value. This also implies
>>that the row key is not stored at all unless there is at least one
>>column stored.
>>
>>Now consider JDBC row which has an integer primary key, and several
>>columns which are all null. In order to be able to store the primary
>>key, a KeyValue needs to be stored to show that the row is present at
>>all. This column is represented by the empty column that you've
>>noticed. This allows doing a "SELECT * FROM TABLE" and receiving
>>records for all rows, even those whose non-pk columns are null.
>>
>>The same issue comes up even if only one column is null for some (or
>>all) records. A scan over Phoenix will include the empty column to
>>ensure that rows that only consist of the primary key (and have null
>>for all non-key columns) will be included in a scan result.
>>
>>- Gabriel
>>
>>On Mon, Dec 21, 2015 at 2:58 PM, 杨晗 <yh...@163.com> wrote:
>>> hi all:
>>>      I'm reading phoenix src code recently, and i found PRowImpl.toRowMutations() always adds an empty column named '_0' for non-delete upsert. Why?
>>>      I read the comment but i didn't quite understand it. Might someone give me an example that illustates why an empty column is ALWAYS necessary?
>>>
>>>
>>>     Further more, I have to access a phoenix table by both phoenix client and hbase API in some cases. If i do not add this empty column explicitly by hbase API, it is ok  if i read this row by phoenix?
>>>
>>>
>>> Thanks
>>> -William

Re:Re: Why do we need an empty column when doing upsert?

Posted by William <yh...@163.com>.
Hi Gabriel:


Now I think I understand why an empty column is necessary. I did the following tests:


1. by HBASE native API: after I deleted the last attribute column of a specified row, the whole row was deleted. 
    and this is exactly what you said 'the row key is not stored at all unless there is at least one column stored.'


2. by Phoenix client: 
    * create table tt ( pk varchar not null primary key, cf.b varchar, cf.c varchar);
    * upsert into tt values ( 'this is pk');
      now, do "scan 'tt' " over hbase shell, I got a row with an empty column '_0', and this row would not have been existed if phoenix did not insert this empty column and of course hbase native API will not allow u to do this.
      
3. by phoenix client: 
    even if we do insert not null non pk column in an upsert, an empty column is ALWAYS necessary because columns can be deleted. 
   *  upsert into tt values ( 'this is pk', 'bbbbb', 'ccccc');
   *  upsert into tt values ( 'this is pk', null, null);
   * or alter table tt drop column cf.b, cf.c
      if we do not insert the empty column at 1st upsert, then after 2nd upsert, the whole row will disappear. But for SQL, we expect a row only with the PK value.


4. why we cannot add not null constraint on non pk column ?
    I tried this : create table tt (pk varchar not null primary key, cf.b varchar not null);
    and it raised an exception : Invalid not null constraint on non primary key column columnName=TT.B
    I figured out one reason : the data of a phoenix table might be written by hbase native api, which does not support not null constraint on non pk column.
   Is it right? Is it the primary reason? 


thanks
- William

At 2015-12-21 22:46:30, "Gabriel Reid" <ga...@gmail.com> wrote:
>Hi William,
>
>The empty column is needed to ensure that a given column is available
>for all rows.
>
>As you may know, data is stored in HBase as KeyValues, meaning that
>the full row key is stored for each column value. This also implies
>that the row key is not stored at all unless there is at least one
>column stored.
>
>Now consider JDBC row which has an integer primary key, and several
>columns which are all null. In order to be able to store the primary
>key, a KeyValue needs to be stored to show that the row is present at
>all. This column is represented by the empty column that you've
>noticed. This allows doing a "SELECT * FROM TABLE" and receiving
>records for all rows, even those whose non-pk columns are null.
>
>The same issue comes up even if only one column is null for some (or
>all) records. A scan over Phoenix will include the empty column to
>ensure that rows that only consist of the primary key (and have null
>for all non-key columns) will be included in a scan result.
>
>- Gabriel
>
>On Mon, Dec 21, 2015 at 2:58 PM, 杨晗 <yh...@163.com> wrote:
>> hi all:
>>      I'm reading phoenix src code recently, and i found PRowImpl.toRowMutations() always adds an empty column named '_0' for non-delete upsert. Why?
>>      I read the comment but i didn't quite understand it. Might someone give me an example that illustates why an empty column is ALWAYS necessary?
>>
>>
>>     Further more, I have to access a phoenix table by both phoenix client and hbase API in some cases. If i do not add this empty column explicitly by hbase API, it is ok  if i read this row by phoenix?
>>
>>
>> Thanks
>> -William

Re: Why do we need an empty column when doing upsert?

Posted by Gabriel Reid <ga...@gmail.com>.
Hi William,

The empty column is needed to ensure that a given column is available
for all rows.

As you may know, data is stored in HBase as KeyValues, meaning that
the full row key is stored for each column value. This also implies
that the row key is not stored at all unless there is at least one
column stored.

Now consider JDBC row which has an integer primary key, and several
columns which are all null. In order to be able to store the primary
key, a KeyValue needs to be stored to show that the row is present at
all. This column is represented by the empty column that you've
noticed. This allows doing a "SELECT * FROM TABLE" and receiving
records for all rows, even those whose non-pk columns are null.

The same issue comes up even if only one column is null for some (or
all) records. A scan over Phoenix will include the empty column to
ensure that rows that only consist of the primary key (and have null
for all non-key columns) will be included in a scan result.

- Gabriel

On Mon, Dec 21, 2015 at 2:58 PM, 杨晗 <yh...@163.com> wrote:
> hi all:
>      I'm reading phoenix src code recently, and i found PRowImpl.toRowMutations() always adds an empty column named '_0' for non-delete upsert. Why?
>      I read the comment but i didn't quite understand it. Might someone give me an example that illustates why an empty column is ALWAYS necessary?
>
>
>     Further more, I have to access a phoenix table by both phoenix client and hbase API in some cases. If i do not add this empty column explicitly by hbase API, it is ok  if i read this row by phoenix?
>
>
> Thanks
> -William