Posted to dev@phoenix.apache.org by "Ankit Singhal (JIRA)" <ji...@apache.org> on 2016/01/06 08:25:39 UTC

[jira] [Updated] (PHOENIX-2143) Use guidepost bytes instead of region name in stats primary key

     [ https://issues.apache.org/jira/browse/PHOENIX-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Singhal updated PHOENIX-2143:
-----------------------------------
    Attachment: PHOENIX-2143_wip_2.patch

[~giacomotaylor], I've attached a WIP patch that addresses the review comments.

I have two questions:

1. How should we upgrade the SYSTEM.STATS table? The approaches I was considering:
   a) Drop the stats table and create a new one. However, mutation is not allowed on system tables.
   b) Delete the SYSTEM.STATS entries from SYSTEM.CATALOG, delete all rows from SYSTEM.STATS, and add new KVs for the new SYSTEM.STATS schema to SYSTEM.CATALOG.
   c) Delete the SYSTEM.STATS entries from SYSTEM.CATALOG, drop the SYSTEM.STATS table using HBaseAdmin, and re-create it with a CREATE TABLE command.

2. How should we store row counts for regions where no stats are collected because the regions may be smaller than the guidePostSize? Or do we not want these row counts at all?



> Use guidepost bytes instead of region name in stats primary key
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-2143
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2143
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: Ankit Singhal
>         Attachments: PHOENIX-2143_wip.patch, PHOENIX-2143_wip_2.patch
>
>
> Our current SYSTEM.STATS table uses the region name as the last column in the primary key constraint. Instead, we should use the MIN_KEY column (which corresponds to the region start key). The advantage would be that the stats would then be ordered by region start key, allowing us to approximate the number of guideposts that would be traversed given the start/stop row of a scan:
> {code}
> SELECT SUM(guide_posts_count) FROM SYSTEM.STATS WHERE min_key > :1 AND min_key < :2
> {code}
> where :1 is the start row and :2 is the stop row of the scan. With an UNNEST operator for ARRAYs, we could get a better approximation.
> As part of the upgrade to the new Phoenix version containing this fix, stats could simply be dropped and they'd be recalculated with the new schema.
> An alternative, even more granular approach would be to *not* use arrays to store the guide posts, but instead store them as individual rows with a schema like this.
> |PHYSICAL_NAME|VARCHAR|
> |COLUMN_FAMILY|VARCHAR|
> |GUIDE_POST_KEY|VARBINARY|
> In this alternative, the maintenance during compaction is higher, though, as you'd need to run a separate query to do the deletion of the old guideposts, followed by a commit of the new guideposts. The other disadvantage (besides requiring multiple queries) is that this couldn't be done transactionally.
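
To make the estimate in the description concrete, here is a minimal sketch in Python (not Phoenix code) of counting the guideposts that fall strictly between a scan's start and stop rows, assuming the guideposts are available as a sorted list of byte strings, as they would be if each were stored as its own row ordered by key. The function name guideposts_in_range and the sample keys are hypothetical and used only for illustration.

```python
from bisect import bisect_left, bisect_right


def guideposts_in_range(guide_posts, start_row, stop_row):
    """Count guideposts g with start_row < g < stop_row.

    guide_posts must be a list of bytes sorted in ascending order.
    Both bounds are exclusive, mirroring the
    `min_key > :1 AND min_key < :2` predicate in the description.
    """
    lo = bisect_right(guide_posts, start_row)  # first index with g > start_row
    hi = bisect_left(guide_posts, stop_row)    # first index with g >= stop_row
    return max(0, hi - lo)


# Hypothetical guidepost keys for one table and column family.
guide_posts = [b"d", b"h", b"m", b"r", b"w"]
count = guideposts_in_range(guide_posts, b"e", b"s")
```

Because the keys are sorted, the count needs only two binary searches, which is the property the issue is after by ordering the stats rows by key rather than by region name.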



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)