Posted to issues@hbase.apache.org by "Duane Moore (JIRA)" <ji...@apache.org> on 2010/07/18 19:29:52 UTC

[jira] Commented: (HBASE-32) [hbase] Add row count estimator

    [ https://issues.apache.org/jira/browse/HBASE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889639#action_12889639 ] 

Duane Moore commented on HBASE-32:
----------------------------------

I am wondering about the state of this issue.  Could this feature be generalized to provide other aggregation functions such as min(), max(), and sum()?  Presumably these aggregators would run during data ingest and be specifiable per column or column family.  We currently work with a proprietary NoSQL system that provides this functionality, and it would be highly desirable in HBase.  I would be interested to know whether there is any active development toward this end.  If not, I can look into providing a possible implementation.
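
To make the idea concrete, here is a rough sketch of per-column aggregation maintained at ingest time. It is not an HBase API: the class IngestAggregator, its put() and stats() methods, and the "family:qualifier" string keys are all hypothetical names chosen for illustration, and it assumes values are plain longs.

```java
import java.util.HashMap;
import java.util.LongSummaryStatistics;
import java.util.Map;

// Hypothetical illustration only; not part of HBase.
final class IngestAggregator {
    // One running summary (count, sum, min, max) per column name.
    private final Map<String, LongSummaryStatistics> byColumn = new HashMap<>();

    // Called once per ingested cell; updates that column's aggregates in O(1).
    void put(String column, long value) {
        byColumn.computeIfAbsent(column, c -> new LongSummaryStatistics()).accept(value);
    }

    // Returns the running aggregates for a column, or null if nothing was ingested.
    LongSummaryStatistics stats(String column) {
        return byColumn.get(column);
    }
}
```

With something like this, min(), max(), and sum() become lookups (stats("cf:price").getMin() and so on) rather than scans. A real implementation would also have to deal with deletes, overwrites, and persistence, which this sketch ignores.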

> [hbase] Add row count estimator
> -------------------------------
>
>                 Key: HBASE-32
>                 URL: https://issues.apache.org/jira/browse/HBASE-32
>             Project: HBase
>          Issue Type: New Feature
>          Components: client
>            Reporter: stack
>            Priority: Minor
>         Attachments: 2291_v01.patch, Keying.java
>
>
> Internally we have a little tool that does a rough estimate of how many rows there are in a database.  It keeps running scanners over larger and larger partitions until it turns up > N occupied rows.  Once it has a count > N, it multiplies by the partition size to get an approximate row count.
> This issue is about generalizing this feature so it could sit in the general hbase install.  It would look something like:
> {code}
> long getApproximateRowCount(final Text startRow, final Text endRow, final long minimumCountPerPartition, final long maximumPartitionSize)
> {code}
> Larger minimumCountPerPartition and maximumPartitionSize values would make the count more accurate but would mean the method ran longer.
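
One possible reading of the approach described above, as a minimal sketch rather than the actual internal tool: it assumes row keys are longs held in an in-memory NavigableSet (standing in for a table scanner), doubles the sampled partition until it holds more than minimumCountPerPartition rows or reaches maximumPartitionSize, and then scales the observed density over the whole key range. The class name, the long keys, and the doubling schedule are all assumptions.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only; uses long keys instead of Text rows.
final class RowCountEstimator {
    static long getApproximateRowCount(NavigableSet<Long> rows,
                                       long startRow, long endRow,
                                       long minimumCountPerPartition,
                                       long maximumPartitionSize) {
        long range = endRow - startRow;
        if (range <= 0) {
            return 0;
        }
        // Never sample more than the requested range or the caller's limit.
        long cap = Math.min(range, Math.max(1, maximumPartitionSize));
        long partitionSize = 1;
        long occupied = count(rows, startRow, partitionSize);
        // Grow the sampled partition until it holds more than the minimum count.
        while (occupied <= minimumCountPerPartition && partitionSize < cap) {
            partitionSize = (partitionSize > cap / 2) ? cap : partitionSize * 2;
            occupied = count(rows, startRow, partitionSize);
        }
        // Extrapolate the sampled density to the full key range.
        return Math.round((double) occupied * range / partitionSize);
    }

    // Counts keys in [startRow, startRow + size); stands in for running a scanner.
    private static long count(NavigableSet<Long> rows, long startRow, long size) {
        return rows.subSet(startRow, true, startRow + size, false).size();
    }
}
```

The estimate is only as good as the assumption that the sampled prefix is representative; with skewed key distributions, sampling several partitions spread across the range would likely do better.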
