
Posted to dev@phoenix.apache.org by William <yh...@163.com> on 2016/08/17 05:15:10 UTC

Some questions about secondary index

Hi all,
  I've been reading the secondary index code recently and I find it very hard to understand. Here are some questions:
  1. There are 5 classes defined in the package 'org.apache.phoenix.hbase.index.covered.example', but it seems that these classes are only referenced in tests.
      If that's true, why not put them in the IT/test directory?
      If not, what are they used for?


  2. Class IndexMemStore.
      I have read the comment at the head of this class many times but I still cannot get the point. What is the 'out-of-order' scenario?
      I read the comment on CoveredColumnIndexer too, which may have shown me an example of this scenario. The comment:


 Taking the simple case, assume we do a single column in a group. Then if we get an out of order
 update, we need to check the current state of that column in the current row. If the current row
 is older, we can issue a delete as normal. If the current row is newer, however, we then have to
 issue a delete for the index update at the time of the current row. This ensures that the index
 update made for the 'future' time still covers the existing row.


      So, if I delete an existing row of the data table with ts = 10 while the existing row has a ts of 20, which is 'newer' than the current operation, do we call the current Delete operation 'back-in-time' or 'out-of-order'? What confuses me is the solution for this scenario: just issue the delete with the ts of the existing row, which means issuing a Delete with ts = 20? Am I right?
      In my opinion, if a Delete is back in time, we can either ignore it or simply issue an index Delete with the same ts. Why are we generating the index update in such a complex way?
      I also cannot see the purpose of the 'roll back' operation in NonTxIndexBuilder or of IndexUpdateManager#fixUpCurrentUpdates(). I think I must have missed something very important, perhaps some core concept or design. Could someone point me to an easier way to understand this code?
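
To make the scenario concrete, here is a minimal sketch of how a back-in-time delete interacts with timestamped cells. It is a toy model, not Phoenix's index builder: it only assumes HBase-style column-delete semantics, where a delete marker at time T masks puts with ts <= T and nothing newer. The class and method names, and the extra put at ts = 5, are hypothetical and added for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class OutOfOrderDeleteSketch {
    /** A toy cell: a put carrying a value, or a delete marker (value == null). */
    static final class Cell {
        final long ts;
        final String value; // null marks a column delete
        Cell(long ts, String value) { this.ts = ts; this.value = value; }
        boolean isDelete() { return value == null; }
    }

    /**
     * Returns the value a reader sees at time readTs, or null if nothing is visible.
     * Assumed rule: a delete marker at time T masks every put with ts <= T.
     */
    static String visibleAt(List<Cell> cells, long readTs) {
        long newestDelete = Long.MIN_VALUE;
        Cell newestPut = null;
        for (Cell c : cells) {
            if (c.ts > readTs) continue; // written "after" the read time
            if (c.isDelete()) {
                newestDelete = Math.max(newestDelete, c.ts);
            } else if (newestPut == null || c.ts > newestPut.ts) {
                newestPut = c;
            }
        }
        if (newestPut == null || newestPut.ts <= newestDelete) return null;
        return newestPut.value;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell(5, "old"));  // hypothetical older version
        cells.add(new Cell(20, "new")); // the existing row with ts = 20
        System.out.println("before delete, read@15: " + visibleAt(cells, 15));

        cells.add(new Cell(10, null));  // the delete arrives late, with ts = 10
        System.out.println("after delete,  read@15: " + visibleAt(cells, 15));
        System.out.println("after delete,  read@25: " + visibleAt(cells, 25));
    }
}
```

In this model the late delete leaves the latest state (ts = 20) untouched but changes what a reader sees for times between 10 and 19. Whether the index has to agree with the data table at those intermediate timestamps is the open question in this thread; the sketch only shows that the two simple options (ignore the delete, or copy its ts) are not obviously equivalent.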


Thanks.
William

Re:Re: Some questions about secondary index (reformat the tables)

Posted by William <yh...@163.com>.

Hi James,
Thanks a lot for the slides. I studied them over the weekend and I am starting to understand the internals, but I still have some questions about the organization of the data.
The data table in the example has 3 rows:

 |  ROW     |  FAMILY  |   QUALIFIER  | TS |   VALUE |
 |----------| ---------|--------------|----|---------|
 | Row1     | Fam      | Qual         | 10 | val1    |
 | Row1     | Fam2     | Qual2        | 12 | val2    |
 | Row1     | Fam      | Qual         | 13 | val3    |

Q1: 
There are two cells in Row1 that share the same column name (Qual) but have different TS. I believe that Phoenix tables have VERSIONS set to one at creation,
so only the version with TS = 13 is visible to a scanner. The only way for Qual with TS = 10 to be visible to a scanner is to run the scan with Scan#setRaw(true).
I checked the code and that is exactly what we do when reading the data table. Why? Why not just read the latest version? Moreover, Deletes will be scanned too; what should we do with the Deletes in the data table?
Back in 2013, when the secondary index was developed, Phoenix did not yet support transactions with Tephra, so I believe this has nothing to do with transactional tables, does it?
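
The difference between the two kinds of read in Q1 can be sketched with a small in-memory model of the data table above. This is not the HBase client API: `latestOnly` stands in for a normal scan with VERSIONS = 1 and `raw` stands in for a scan with Scan#setRaw(true). Both method names are hypothetical, and delete markers are left out to keep the model short.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RawScanSketch {
    /** One stored cell of the data table. */
    static final class Cell {
        final String row, family, qualifier, value;
        final long ts;
        Cell(String row, String family, String qualifier, long ts, String value) {
            this.row = row; this.family = family; this.qualifier = qualifier;
            this.ts = ts; this.value = value;
        }
        String column() { return family + ":" + qualifier; }
        @Override public String toString() { return row + " " + column() + " @" + ts + " = " + value; }
    }

    /** Stand-in for a normal scan with VERSIONS = 1: only the newest cell per column. */
    static List<Cell> latestOnly(List<Cell> cells) {
        Map<String, Cell> newest = new TreeMap<>();
        for (Cell c : cells) {
            Cell seen = newest.get(c.column());
            if (seen == null || c.ts > seen.ts) newest.put(c.column(), c);
        }
        return new ArrayList<>(newest.values());
    }

    /** Stand-in for a raw scan: every stored cell, newest first within each column. */
    static List<Cell> raw(List<Cell> cells) {
        List<Cell> out = new ArrayList<>(cells);
        out.sort(Comparator.comparing(Cell::column)
                .thenComparing(Comparator.comparingLong((Cell c) -> c.ts).reversed()));
        return out;
    }

    /** The three cells from the data table in this message. */
    static List<Cell> dataTable() {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("Row1", "Fam", "Qual", 10, "val1"));
        cells.add(new Cell("Row1", "Fam2", "Qual2", 12, "val2"));
        cells.add(new Cell("Row1", "Fam", "Qual", 13, "val3"));
        return cells;
    }

    public static void main(String[] args) {
        System.out.println("latest only:");
        latestOnly(dataTable()).forEach(c -> System.out.println("  " + c));
        System.out.println("raw:");
        raw(dataTable()).forEach(c -> System.out.println("  " + c));
    }
}
```

Only the raw view still contains val1 at TS = 10, which is the older state of Fam:Qual.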

Q2:
I don't understand where some of the rows in the index table come from.
The rows in the index table:
 |  ROW               |  FAMILY  |   QUALIFIER             | TS   |
 |--------------------|----------|-------------------------|------|
 | Val1|Row1          |  Index   | Fam:Qual                | 10   |
 | Val1|Val2|Row1     |  Index   | Fam:Qual, Fam2:Qual2    | 12   |
 | Val3|Val2|Row1     |  Index   | Fam:Qual, Fam2:Qual2    | 13   |

When an index table is created, some group of columns in the data table must be indexed by it. In this example, which columns of the data table are indexed? Fam:Qual, or Fam2:Qual2? Or both?
None of these answers fits the current behavior, for either a local or a global index, with or without INCLUDEd columns.
So let's assume that the indexed column is Fam:Qual; then the 1st row in the index table corresponds to the 1st row in the data table.
But what are the 2nd and 3rd rows used for? Why do we put multiple values there? We don't store the INCLUDEd values in the row key, but in the cell values.
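
One possible reading of the index table, offered only as a sketch: if both Fam:Qual and Fam2:Qual2 belong to the indexed group (an assumption, the thread does not settle this), then replaying the data-table puts in timestamp order and building a key from the values known at each timestamp reproduces the three index rows. The key layout (values joined by '|', then the row key, with missing columns skipped) is inferred from the table above, and all names in the code are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexTimelineSketch {
    /** One put against the data table. */
    static final class Put {
        final String column;
        final long ts;
        final String value;
        Put(String column, long ts, String value) {
            this.column = column; this.ts = ts; this.value = value;
        }
    }

    /**
     * Replays the puts in timestamp order. After each put, emits an index key
     * built from the current values of the indexed columns, followed by the
     * data row key, tagged with the timestamp of that put.
     */
    static List<String> indexKeys(String rowKey, List<Put> puts, List<String> indexedColumns) {
        List<Put> ordered = new ArrayList<>(puts);
        ordered.sort(Comparator.comparingLong((Put p) -> p.ts));
        Map<String, String> current = new HashMap<>();
        List<String> keys = new ArrayList<>();
        for (Put p : ordered) {
            current.put(p.column, p.value);
            StringBuilder key = new StringBuilder();
            for (String column : indexedColumns) {
                String v = current.get(column);
                if (v != null) key.append(v).append('|'); // assumed: skip columns with no value yet
            }
            key.append(rowKey);
            keys.add(key + " @" + p.ts);
        }
        return keys;
    }

    /** The data table from this message, with both columns treated as indexed. */
    static List<String> example() {
        List<Put> puts = new ArrayList<>();
        puts.add(new Put("Fam:Qual", 10, "val1"));
        puts.add(new Put("Fam2:Qual2", 12, "val2"));
        puts.add(new Put("Fam:Qual", 13, "val3"));
        List<String> indexed = new ArrayList<>();
        indexed.add("Fam:Qual");
        indexed.add("Fam2:Qual2");
        return indexKeys("Row1", puts, indexed);
    }

    public static void main(String[] args) {
        example().forEach(System.out::println);
    }
}
```

Under these assumptions the 2nd and 3rd index rows are simply the index entries for the row's state at TS = 12 and TS = 13, one entry per point in time at which an indexed column changed.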

Thanks for your patience in answering such detailed questions.
William.

At 2016-08-19 07:54:52, "James Taylor" <ja...@apache.org> wrote:
>Hi William,
>I think those classes demonstrate how to use mutable secondary indexes
>directly with HBase (i.e. outside of Phoenix). I agree, they could be moved
>into the IT directory.
>
>You might take a look at this[1] presentation (also linked way down on the
>bottom of our secondary index page), starting from slide 31. It has some
>examples of out of order handling. It's not an easy problem.
>
>Thanks,
>James
>
>[1]
>http://www.slideshare.net/jesse_yates/phoenix-secondary-indexing-la-hug-sept-9th-2013
>


Re: Some questions about secondary index

Posted by James Taylor <ja...@apache.org>.

Hi William,
I think those classes demonstrate how to use mutable secondary indexes
directly with HBase (i.e. outside of Phoenix). I agree, they could be moved
into the IT directory.

You might take a look at this[1] presentation (also linked way down on the
bottom of our secondary index page), starting from slide 31. It has some
examples of out of order handling. It's not an easy problem.

Thanks,
James

[1]
http://www.slideshare.net/jesse_yates/phoenix-secondary-indexing-la-hug-sept-9th-2013
