Posted to issues@hbase.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2011/01/29 01:40:43 UTC

[jira] Created: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Allow RowCounter to retrieve multiple versions of rows
------------------------------------------------------

                 Key: HBASE-3488
                 URL: https://issues.apache.org/jira/browse/HBASE-3488
             Project: HBase
          Issue Type: Bug
          Components: util
    Affects Versions: 0.90.0
            Reporter: Ted Yu
             Fix For: 0.92.0


Currently RowCounter retrieves only the latest version of each row.
Some applications store multiple versions of the same row.

RowCounter should accept a new parameter for the number of versions to return.
The Scan object would be configured with this version parameter.
The following API should then be called:
{code}
  public NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> getMap() {
{code}
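
For illustration only, here is a minimal sketch of how the nested map returned by getMap() (family -> qualifier -> timestamp -> value) could be walked to count stored versions. It uses plain java.util collections rather than a real Result; the class VersionMapSketch, its helpers and the sample data are hypothetical and are not part of HBase or of any patch on this issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper, not HBase code: walks a map shaped like getMap()'s return type.
final class VersionMapSketch {
  // byte[] has no natural ordering, so the sorted maps need an explicit comparator.
  static final Comparator<byte[]> BYTES = (a, b) -> Arrays.compare(a, b);

  // Counts every (family, qualifier, timestamp) entry, i.e. every stored version.
  static int countVersions(
      NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> families) {
    int total = 0;
    for (NavigableMap<byte[], NavigableMap<Long, byte[]>> qualifiers : families.values()) {
      for (NavigableMap<Long, byte[]> versions : qualifiers.values()) {
        total += versions.size();
      }
    }
    return total;
  }

  // Adds one cell with timestamps 1..versionCount (made-up sample values).
  static void addCell(
      NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> families,
      String family, String qualifier, int versionCount) {
    NavigableMap<Long, byte[]> versions = families
        .computeIfAbsent(family.getBytes(StandardCharsets.UTF_8), f -> new TreeMap<>(BYTES))
        .computeIfAbsent(qualifier.getBytes(StandardCharsets.UTF_8), q -> new TreeMap<>());
    for (long ts = 1; ts <= versionCount; ts++) {
      versions.put(ts, ("v" + ts).getBytes(StandardCharsets.UTF_8));
    }
  }

  // Sample row: cf1:col9 holds 9 versions, cf2:col21 holds 1 version.
  static NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> sampleRow() {
    NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> families =
        new TreeMap<>(BYTES);
    addCell(families, "cf1", "col9", 9);
    addCell(families, "cf2", "col21", 1);
    return families;
  }

  public static void main(String[] args) {
    System.out.println(countVersions(sampleRow())); // 9 + 1 versions
  }
}
```

With a real scan, only as many versions as the Scan was configured to return would appear in the innermost map.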


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Updated] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Subbu M Iyer updated HBASE-3488:
--------------------------------

    Attachment: 3488-Allow_RowCounter_to_retrieve_multiple_versions_of_rows.patch

> Allow RowCounter to retrieve multiple versions of rows
> ------------------------------------------------------
>
>                 Key: HBASE-3488
>                 URL: https://issues.apache.org/jira/browse/HBASE-3488
>             Project: HBase
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 0.90.0
>            Reporter: Ted Yu
>             Fix For: 0.92.0
>
>         Attachments: 3488-Allow_RowCounter_to_retrieve_multiple_versions_of_rows.patch
>
>
> Currently RowCounter retrieves only the latest version of each row.
> Some applications store multiple versions of the same row.
> RowCounter should accept a new parameter for the number of versions to return.
> The Scan object would be configured with this version parameter (for scan.maxVersions).
> The following API should then be called:
> {code}
>   public KeyValue[] raw() {
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012150#comment-13012150 ] 

Subbu M Iyer commented on HBASE-3488:
-------------------------------------

First draft of CellCounter submitted for review. Warning: Very first draft. :-):-)


Test Setup
===========

1. Table planets with 5 CFs.
2. All CFs are set up with max versions of 10.
3. 5 rows in total. The first row has all 5 CFs and the remaining 4 rows have only CF1.
4. R1->CF1[c1-c9] (9 quals), CF2[c21] (1 qual), CF3[c31] (1 qual),
   CF4[c41] (1 qual), CF5[c51] (1 qual)
5. R2->CF1[c1-c9], R3->CF1[c1-c8], R4->CF1[c9 -> 9 versions], R5->CF1[c1-c5]

========================================================================

CellCounter lists the following stats:
======================================
1. Total number of rows in the table
2. Total number of CFs across all rows
3. Total qualifiers across all rows
4. Total occurrences of each CF
5. Total occurrences of each qualifier
6. Total number of versions of each qualifier.

==========================================================================
Running the CellCounter on the above setup produces the following report:
==========================================================================

Total Families Across all Rows  9
Total Qualifiers across all Rows        36
Total ROWS      5
cf1     5
cf1:col1        4
cf1:col2        4
cf1:col3        4
cf1:col4        4
cf1:col5        4
cf1:col6        3
cf1:col7        3
cf1:col8        3
cf1:col9        3
cf2     1
cf2:col21       1
cf3     1
cf3:col31       1
cf4     1
cf4:col41       1
cf5     1
cf5:col51       1
row_11:cf1:col1_Versions        1
row_11:cf1:col2_Versions        1
row_11:cf1:col3_Versions        1
row_11:cf1:col4_Versions        1
row_11:cf1:col5_Versions        1
row_11:cf1:col6_Versions        1
row_11:cf1:col7_Versions        1
row_11:cf1:col8_Versions        1
row_11:cf1:col9_Versions        1
row_11:cf2:col21_Versions       1
row_11:cf3:col31_Versions       1
row_11:cf4:col41_Versions       1
row_11:cf5:col51_Versions       1
row_22:cf1:col1_Versions        1
row_22:cf1:col2_Versions        1
row_22:cf1:col3_Versions        1
row_22:cf1:col4_Versions        1
row_22:cf1:col5_Versions        1
row_22:cf1:col6_Versions        1
row_22:cf1:col7_Versions        1
row_22:cf1:col8_Versions        1
row_22:cf1:col9_Versions        1
row_33:cf1:col1_Versions        1
row_33:cf1:col2_Versions        1
row_33:cf1:col3_Versions        1
row_33:cf1:col4_Versions        1
row_33:cf1:col5_Versions        1
row_33:cf1:col6_Versions        1
row_33:cf1:col7_Versions        1
row_33:cf1:col8_Versions        1
row_44:cf1:col9_Versions        9
row_55:cf1:col1_Versions        1
row_55:cf1:col2_Versions        1
row_55:cf1:col3_Versions        1
row_55:cf1:col4_Versions        1
row_55:cf1:col5_Versions        1
==============================================================================
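
As a cross-check of the totals in the report above, the following sketch rebuilds the test setup in memory and derives the same kinds of counters. It is only a model under stated assumptions: CellStatsSketch and its methods are hypothetical names, the table is a plain nested map (row -> family -> qualifier -> version count), and none of this is the MapReduce code in the patch.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the report; not the CellCounter MapReduce job.
final class CellStatsSketch {
  // Adds qualifiers col<first>..col<last> to row/family, each with the given version count.
  static void addColumns(Map<String, Map<String, Map<String, Integer>>> table,
      String row, String family, int first, int last, int versions) {
    Map<String, Integer> qualifiers = table
        .computeIfAbsent(row, r -> new TreeMap<>())
        .computeIfAbsent(family, f -> new TreeMap<>());
    for (int i = first; i <= last; i++) {
      qualifiers.put("col" + i, versions);
    }
  }

  // Stand-in for the test table, using the row and column names printed in the report.
  static Map<String, Map<String, Map<String, Integer>>> fixture() {
    Map<String, Map<String, Map<String, Integer>>> table = new TreeMap<>();
    addColumns(table, "row_11", "cf1", 1, 9, 1);
    addColumns(table, "row_11", "cf2", 21, 21, 1);
    addColumns(table, "row_11", "cf3", 31, 31, 1);
    addColumns(table, "row_11", "cf4", 41, 41, 1);
    addColumns(table, "row_11", "cf5", 51, 51, 1);
    addColumns(table, "row_22", "cf1", 1, 9, 1);
    addColumns(table, "row_33", "cf1", 1, 8, 1);
    addColumns(table, "row_44", "cf1", 9, 9, 9);
    addColumns(table, "row_55", "cf1", 1, 5, 1);
    return table;
  }

  // Builds the counters, keyed the way the report prints them.
  static Map<String, Integer> report(Map<String, Map<String, Map<String, Integer>>> table) {
    Map<String, Integer> counters = new TreeMap<>();
    for (Map.Entry<String, Map<String, Map<String, Integer>>> row : table.entrySet()) {
      counters.merge("Total ROWS", 1, Integer::sum);
      for (Map.Entry<String, Map<String, Integer>> family : row.getValue().entrySet()) {
        counters.merge("Total Families Across all Rows", 1, Integer::sum);
        counters.merge(family.getKey(), 1, Integer::sum);
        for (Map.Entry<String, Integer> qualifier : family.getValue().entrySet()) {
          String column = family.getKey() + ":" + qualifier.getKey();
          counters.merge("Total Qualifiers across all Rows", 1, Integer::sum);
          counters.merge(column, 1, Integer::sum);
          counters.put(row.getKey() + ":" + column + "_Versions", qualifier.getValue());
        }
      }
    }
    return counters;
  }

  public static void main(String[] args) {
    report(fixture()).forEach((name, count) -> System.out.println(name + "\t" + count));
  }
}
```

The per-family and per-qualifier counters count rows in which the family or qualifier occurs, which is why cf1:col9 comes out as 3 (rows 11, 22 and 44) even though row 44 stores 9 versions of it.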
 


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988475#action_12988475 ] 

Jonathan Gray commented on HBASE-3488:
--------------------------------------

I still don't understand.

Regardless of how many versions there are, setting maxVersions to 1 is fine.  That says just return the latest.

You don't need more than one version to be returned for the row to exist.  All you need is the latest version of a single column (FirstKeyOnlyFilter + maxVersions=1).



[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988476#action_12988476 ] 

Jonathan Gray commented on HBASE-3488:
--------------------------------------

If a row has one column with 1000 versions, we only need the latest version of that column to know that the row exists and thus increment the row count by one, no?
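
The distinction can be sketched with a small in-memory model (an assumption for illustration, not HBase code): counting rows needs only one cell per row, whereas counting versions has to visit every stored version. The class RowVersusVersionCountSketch and the nested-map layout (row -> column -> timestamp -> value) are hypothetical.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model: row -> column -> timestamp -> value.
final class RowVersusVersionCountSketch {
  // Adds one column to a row with timestamps 1..versionCount.
  static void addColumn(
      NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table,
      String row, String column, int versionCount) {
    NavigableMap<Long, String> versions = table
        .computeIfAbsent(row, r -> new TreeMap<>())
        .computeIfAbsent(column, c -> new TreeMap<>());
    for (long ts = 1; ts <= versionCount; ts++) {
      versions.put(ts, "v" + ts);
    }
  }

  // A row exists if its first column has a newest version; nothing else is inspected.
  static int rowCount(
      NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table) {
    int rows = 0;
    for (NavigableMap<String, NavigableMap<Long, String>> columns : table.values()) {
      Map.Entry<String, NavigableMap<Long, String>> firstColumn = columns.firstEntry();
      if (firstColumn != null && firstColumn.getValue().lastEntry() != null) {
        rows++;
      }
    }
    return rows;
  }

  // Counting versions, by contrast, has to visit every column of every row.
  static int versionCount(
      NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table) {
    int total = 0;
    for (NavigableMap<String, NavigableMap<Long, String>> columns : table.values()) {
      for (NavigableMap<Long, String> versions : columns.values()) {
        total += versions.size();
      }
    }
    return total;
  }

  // One row with a 1000-version column and one row with a single version.
  static NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> fixture() {
    NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
        new TreeMap<>();
    addColumn(table, "r1", "cf:c", 1000);
    addColumn(table, "r2", "cf:c", 1);
    return table;
  }

  public static void main(String[] args) {
    System.out.println(rowCount(fixture()) + " rows, " + versionCount(fixture()) + " versions");
  }
}
```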



[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015525#comment-13015525 ] 

Subbu M Iyer commented on HBASE-3488:
-------------------------------------

1. To run the cellcounter with default row/cf/qualifier separator ':' for table 'planet'

java -cp ./conf:./hbase-0.91.0-SNAPSHOT.jar:./lib/hadoop-0.20.1-core.jar:./lib/commons-logging-1.1.1.jar:./lib/commons-cli-1.2.jar:./lib/zookeeper-3.3.2.jar:./lib/log4j-1.2.16.jar:./lib/commons-httpclient-3.1.jar org.apache.hadoop.hbase.mapreduce.CellCounter planet /work/HBaseExport/cellcounter30


2. To run the cellcounter with row/cf/qualifier separator '%' for table 'planet'

java -cp ./conf:./hbase-0.91.0-SNAPSHOT.jar:./lib/hadoop-0.20.1-core.jar:./lib/commons-logging-1.1.1.jar:./lib/commons-cli-1.2.jar:./lib/zookeeper-3.3.2.jar:./lib/log4j-1.2.16.jar:./lib/commons-httpclient-3.1.jar org.apache.hadoop.hbase.mapreduce.CellCounter planet /work/HBaseExport/cellcounter31 % 

3. To run the cellcounter with row/cf/qualifier separator '%' for table 'planet' with a prefix filter row_55

java -cp ./conf:./hbase-0.91.0-SNAPSHOT.jar:./lib/hadoop-0.20.1-core.jar:./lib/commons-logging-1.1.1.jar:./lib/commons-cli-1.2.jar:./lib/zookeeper-3.3.2.jar:./lib/log4j-1.2.16.jar:./lib/commons-httpclient-3.1.jar org.apache.hadoop.hbase.mapreduce.CellCounter planet /work/HBaseExport/cellcounter31 % row_55 

4. To run the cellcounter with row/cf/qualifier separator '%' for table 'planet' with a regex filter ^11

java -cp ./conf:./hbase-0.91.0-SNAPSHOT.jar:./lib/hadoop-0.20.1-core.jar:./lib/commons-logging-1.1.1.jar:./lib/commons-cli-1.2.jar:./lib/zookeeper-3.3.2.jar:./lib/log4j-1.2.16.jar:./lib/commons-httpclient-3.1.jar org.apache.hadoop.hbase.mapreduce.CellCounter planet /work/HBaseExport/cellcounter31 % ^11
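
The four invocations above suggest a positional layout of <table> <outputDir> [separator] [rowFilter]. A minimal sketch of how such arguments might be interpreted follows. It is an assumption drawn only from these examples: the class CellCounterArgsSketch is hypothetical, and treating a leading '^' as the marker for a regex filter (anything else being a prefix filter) is inferred from examples 3 and 4, not taken from the patch.

```java
// Hypothetical argument holder; the real tool's parsing is not shown in this thread.
final class CellCounterArgsSketch {
  enum FilterKind { NONE, PREFIX, REGEX }

  final String table;
  final String outputDir;
  final String separator;
  final FilterKind filterKind;
  final String filterText; // null when no filter argument was given

  private CellCounterArgsSketch(String table, String outputDir, String separator,
      FilterKind filterKind, String filterText) {
    this.table = table;
    this.outputDir = outputDir;
    this.separator = separator;
    this.filterKind = filterKind;
    this.filterText = filterText;
  }

  static CellCounterArgsSketch parse(String... args) {
    if (args.length < 2 || args.length > 4) {
      throw new IllegalArgumentException(
          "usage: <table> <outputDir> [separator] [rowFilter]");
    }
    // ':' is the default separator, as in example 1.
    String separator = args.length > 2 ? args[2] : ":";
    FilterKind kind = FilterKind.NONE;
    String text = null;
    if (args.length > 3) {
      text = args[3];
      // Assumption: a leading '^' selects a regex filter, otherwise a prefix filter.
      kind = text.startsWith("^") ? FilterKind.REGEX : FilterKind.PREFIX;
    }
    return new CellCounterArgsSketch(args[0], args[1], separator, kind, text);
  }

  public static void main(String[] args) {
    CellCounterArgsSketch parsed =
        parse("planet", "/work/HBaseExport/cellcounter31", "%", "row_55");
    System.out.println(parsed.table + " " + parsed.separator + " " + parsed.filterKind);
  }
}
```

How the filter text is then matched against row keys is left out on purpose, since the thread does not describe it.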


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988391#action_12988391 ] 

Jonathan Gray commented on HBASE-3488:
--------------------------------------

And is this a bug as filed?



[jira] [Updated] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3488:
--------------------------

    Summary: Add CellCounter to count multiple versions of rows  (was: Allow RowCounter to retrieve multiple versions of rows)


[jira] [Commented] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015589#comment-13015589 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

Sure I can.


[jira] [Commented] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015587#comment-13015587 ] 

stack commented on HBASE-3488:
------------------------------

It's a feature, Ted.  Hard to argue that features should be backported.  Can you apply it to your local hbase version?


[jira] [Commented] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056724#comment-13056724 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

Integrated to 0.90 branch.


        

[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015534#comment-13015534 ] 

stack commented on HBASE-3488:
------------------------------

I'm going to apply this (will wait a little in case Ted wants to do a last review).

I will also add it to our 'driver' so it shows as one of the MR jobs hbase ships with; see src/main/java/org/apache/hadoop/hbase/mapreduce/Driver.java

Subbu, for the future, it's 2011 when it comes to copyright notices in src files (smile); also, by convention we wrap lines at 80 characters.  I can fix this on commit, np, but just going forward.




[jira] [Updated] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Subbu M Iyer updated HBASE-3488:
--------------------------------

    Attachment: 3488-Allow_RowCounter_to_retrieve_multiple_versions_of_rows-version2.patch


[jira] [Resolved] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-3488.
--------------------------

      Resolution: Fixed
        Assignee: Subbu M Iyer
    Hadoop Flags: [Reviewed]

Committed to TRUNK.  Thanks for the nice patch, Subbu.  I added mention of this task to our mapreduce Driver too.


[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014974#comment-13014974 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

Since there may be many rows in the table, we should provide an optional command-line argument to filter row keys.



[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014756#comment-13014756 ] 

stack commented on HBASE-3488:
------------------------------

License says 2008.

Class comment seems wrong.  Says mapper-only job but I see a Reducer declared and specified in the job setup.

Otherwise, patch looks great to me.  Nice functionality.  Could make for crazy report on big table but could be just what the doctor ordered diagnosing state of an hbase table.

Do you want to fix the above, Subbu, or should I on commit?


[jira] [Commented] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016230#comment-13016230 ] 

Hudson commented on HBASE-3488:
-------------------------------

Integrated in HBase-TRUNK #1831 (See [https://hudson.apache.org/hudson/job/HBase-TRUNK/1831/])
    


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007332#comment-13007332 ] 

Subbu M Iyer commented on HBASE-3488:
-------------------------------------

Ted/Jonathan:

Just want to better understand the solution we are proposing so that I can work on this issue.

1. We shall create a CellCounter that kind of extends the RowCounter and provides additional functionality to count all the versions of all rows instead of just the latest version?

2. From Jonathan's comment we also want "stats like # of rows, total # of columns / avg columns per row, total # of versions / avg versions per column / avg versions per row, etc" ?

Please let me know. 
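
To make the averages in point 2 concrete, here is a small arithmetic sketch. The totals are taken from the CellCounter test report elsewhere in this thread (5 rows, 36 qualifiers, 35 single-version columns plus one column with 9 versions); the class AverageStatsSketch is a hypothetical name, and the averages themselves are derived here rather than produced by the patch.

```java
// Hypothetical helper showing how the averages follow from the raw totals.
final class AverageStatsSketch {
  // Average of total over count; defined as 0 for an empty table to avoid dividing by zero.
  static double average(long total, long count) {
    return count == 0 ? 0.0 : (double) total / count;
  }

  public static void main(String[] args) {
    long rows = 5;
    long columns = 36;        // total qualifiers across all rows
    long versions = 35 + 9;   // 35 columns with 1 version, one column with 9
    System.out.printf("avg columns per row: %.2f%n", average(columns, rows));
    System.out.printf("avg versions per column: %.2f%n", average(versions, columns));
    System.out.printf("avg versions per row: %.2f%n", average(versions, rows));
  }
}
```

Since each average is just a ratio of two totals, a job that already emits the totals would not need extra passes over the table to report them.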


[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015544#comment-13015544 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

I think version 2 looks good.
Thanks for the hard work Subbu.


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008529#comment-13008529 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

@Subbu:
>>     CF2 with columns c1,c2  (2 cells. c1=2 versions and c2=3 versions)
Can you rephrase the setup so that ck (k = 1..20) is uniquely associated with CF's ?

>> Number of distinct Cells = c1-c20 = 20
The term Cell isn't strictly defined in HBase. Here it means column. However, a column is always associated with some column family. I don't think it makes much sense to list CF1:c1 and CF2:c1 as one 'Cell'.



[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988413#action_12988413 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

This JIRA was filed as an improvement. We expect different values to co-exist (through versions) for the same row key in our table.

In RowCounter, I see:
{code}
    Scan scan = new Scan();
{code}
This ctor assigns 1 to maxVersions.

It is desirable to assign another value to maxVersions so that we know the correct number of rows in the table.
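
To make the effect of maxVersions concrete, here is a toy sketch in plain Java with no HBase classes. It assumes only what is stated above: the default limit is 1, and a larger limit lets older versions of the same column come back. The names MaxVersionsSketch and newestVersions are hypothetical and are not part of HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only (no HBase classes): a column is just the list of
// timestamps of its stored versions.
final class MaxVersionsSketch {

    // Returns at most maxVersions timestamps, newest first.
    static List<Long> newestVersions(List<Long> timestamps, int maxVersions) {
        List<Long> sorted = new ArrayList<>(timestamps);
        sorted.sort(Collections.reverseOrder());
        int limit = Math.max(0, Math.min(maxVersions, sorted.size()));
        return new ArrayList<>(sorted.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Long> stored = Arrays.asList(100L, 300L, 200L);
        // With a limit of 1, only the newest version is visible.
        System.out.println(newestVersions(stored, 1));
        // With a larger limit, the older versions are returned as well.
        System.out.println(newestVersions(stored, 3));
    }
}
```

A counter that wants to see every stored version would therefore need a limit at least as large as the number of versions kept per column.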

Description of FirstKeyOnlyFilter usage above holds.



[jira] Updated: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3488:
--------------------------

    Description: 
Currently RowCounter only retrieves latest version for each row.
Some applications would store multiple versions for the same row.

RowCounter should accept a new parameter for the number of versions to return.
Scan object would be configured with version parameter (for scan.maxVersions).
Then the following API should be called:
{code}
  public KeyValue[] raw() {
{code}


  was:
Currently RowCounter only retrieves latest version for each row.
Some applications would store multiple versions for the same row.

RowCounter should accept a new parameter for the number of versions to return.
Scan object would be configured with version parameter.
Then the following API should be called:
{code}
  public NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> getMap() {
{code}





[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988526#action_12988526 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

CellCounter is a good name.
SingleColumnValueFilter is able to test values of previous versions (timestamps). The user must supply a value in the ctor against which the test is performed.
Should a SingleColumnFilter be created that treats all column values as matching?




[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015491#comment-13015491 ] 

Subbu M Iyer commented on HBASE-3488:
-------------------------------------

Submitted version 2 of the patch.

1. Fixed the comments per Stack's suggestion.
2. Added a Regex/Prefix based row filter to limit the counter to work on a smaller subset of rows
3. Changed the Bytes.toString to Bytes.toStringBinary.
4. Parameterized the row/cf/qualifier separator string.  
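
As a rough illustration of items 2 to 4, the sketch below formats a row/cf/qualifier key with a caller-supplied separator, escapes non-printable bytes, and checks a row prefix. It is plain Java and does not reproduce the patch. The class CellKeyFormatSketch and its methods are hypothetical, and toPrintable is only a simplified stand-in in the spirit of Bytes.toStringBinary, whose exact escaping rules may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the patch.
final class CellKeyFormatSketch {

    // Keeps printable ASCII as-is and escapes every other byte
    // (and the backslash itself) as \xNN.
    static String toPrintable(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int ch = b & 0xFF;
            if (ch >= 0x20 && ch <= 0x7E && ch != '\\') {
                sb.append((char) ch);
            } else {
                sb.append(String.format("\\x%02X", ch));
            }
        }
        return sb.toString();
    }

    // Joins row, column family and qualifier with a configurable separator.
    static String formatKey(byte[] row, byte[] family, byte[] qualifier, String separator) {
        return toPrintable(row) + separator + toPrintable(family) + separator + toPrintable(qualifier);
    }

    // Returns true if the row key starts with the given prefix.
    static boolean hasPrefix(byte[] row, byte[] prefix) {
        if (prefix.length > row.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (row[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] row = new byte[] {'r', '1', 0};  // row key ending in a NUL byte
        byte[] cf = "CF1".getBytes(StandardCharsets.US_ASCII);
        byte[] q = "c1".getBytes(StandardCharsets.US_ASCII);
        System.out.println(formatKey(row, cf, q, "|"));
        System.out.println(hasPrefix(row, "r".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Escaping matters because row keys are arbitrary bytes, and a separator chosen by the user avoids clashes with keys that already contain ":".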


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988480#action_12988480 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

There may be confusion arising from combining versions with the current row count concept.

I am open to naming a new utility which counts versions for the same row.



[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008522#comment-13008522 ] 

Subbu M Iyer commented on HBASE-3488:
-------------------------------------

Test Setup

For simplicity, let's assume that we have a table with following data setup.

R1 -> CF1 with columns c1-c5. (5 cells all with 3 versions)

R2 -> CF1 with columns c1     (1 cell one version)
      CF2 with columns c1,c2  (2 cells. c1=2 versions and c2=3 versions)
      CF3 with columns c1-c3  (3 cells one version)
      CF4 with columns c1-c4  (4 cells one version)

R3 -> CF1 with columns c1-c9  (9 cells one version)
      CF2 with columns c10-c20 (10 cells one version)
      
Running the CellCounter program will print the following stats:

Total Number of rows = 3
Number of distinct CF = CF1-CF4 = 4
Number of distinct Cells = c1-c20 = 20
Total number of Cells (across all rows and CFs) = 34
Avg Number of CFs per Row = 4/3 = 1.33
Avg number of Cells per CF = 34/4 = 8.5

Versions:
CF1:c1 = 3 versions
CF1:c2 = 3 versions
CF1:c3 = 3 versions
CF1:c4 = 3 versions
CF1:c5 = 3 versions

CF2:c1 = 2 versions
CF2:c2 = 3 versions

all other CF:c combination = 1 version.

Ted: Is this your expectation? Please let me know.
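
For reference, the kind of bookkeeping behind such stats can be sketched in plain Java. This is not the CellCounter implementation. CellStatsSketch and its fields are hypothetical, it uses only rows R1 and R2 from the setup above, and it treats a "cell" as one row/CF/column combination and a "column" as one CF:column pair, which is an assumption about the intended definitions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory tally, not the CellCounter from the patch.
final class CellStatsSketch {
    final Set<String> rows = new HashSet<>();
    final Set<String> families = new HashSet<>();
    final Set<String> columns = new HashSet<>();  // CF:column
    final Set<String> cells = new HashSet<>();    // row:CF:column
    long versions;                                // every stored version

    // Records one stored version of one cell.
    void add(String row, String family, String qualifier) {
        rows.add(row);
        families.add(family);
        columns.add(family + ":" + qualifier);
        cells.add(row + ":" + family + ":" + qualifier);
        versions++;
    }

    // Adds columns c<from>..c<to> of one family, each with the same number of versions.
    void addRange(String row, String family, int from, int to, int versionsPerCell) {
        for (int c = from; c <= to; c++) {
            for (int v = 0; v < versionsPerCell; v++) {
                add(row, family, "c" + c);
            }
        }
    }

    // Rows R1 and R2 of the setup above.
    static CellStatsSketch sample() {
        CellStatsSketch s = new CellStatsSketch();
        s.addRange("R1", "CF1", 1, 5, 3);
        s.addRange("R2", "CF1", 1, 1, 1);
        s.addRange("R2", "CF2", 1, 1, 2);
        s.addRange("R2", "CF2", 2, 2, 3);
        s.addRange("R2", "CF3", 1, 3, 1);
        s.addRange("R2", "CF4", 1, 4, 1);
        return s;
    }

    public static void main(String[] args) {
        CellStatsSketch s = sample();
        System.out.println("rows=" + s.rows.size() + " CFs=" + s.families.size()
            + " columns=" + s.columns.size() + " cells=" + s.cells.size()
            + " versions=" + s.versions);
    }
}
```

Averages such as cells per row or versions per cell follow by dividing these totals.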




[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007547#comment-13007547 ] 

stack commented on HBASE-3488:
------------------------------

What Ted says, or refactor RowCounter#createSubmittableJob, breaking it up into smaller pieces that can be reused by CellCounter's createSubmittableJob. Then perhaps you could subclass?


[jira] [Commented] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015569#comment-13015569 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

Is it possible to port this to 0.90.3 ?
We need this feature. It wouldn't introduce regression.

> Add CellCounter to count multiple versions of rows
> --------------------------------------------------
>
>                 Key: HBASE-3488
>                 URL: https://issues.apache.org/jira/browse/HBASE-3488
>             Project: HBase
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 0.90.0
>            Reporter: Ted Yu
>            Assignee: Subbu M Iyer
>             Fix For: 0.92.0
>
>         Attachments: 3488-Allow_RowCounter_to_retrieve_multiple_versions_of_rows-version2.patch, 3488-Allow_RowCounter_to_retrieve_multiple_versions_of_rows.patch
>
>
> Currently RowCounter only retrieves latest version for each row.
> Some applications would store multiple versions for the same row.
> RowCounter should accept a new parameter for the number of versions to return.
> Scan object would be configured with version parameter (for scan.maxVersions).
> Then the following API should be called:
> {code}
>   public KeyValue[] raw() {
> {code}


[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015189#comment-13015189 ] 

Subbu M Iyer commented on HBASE-3488:
-------------------------------------

Ted/Stack:

Thanks a lot for your review comments. I will surely address everything called out by you guys.

thanks again.


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007569#comment-13007569 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

I agree with the refactoring mentioned by Stack.
But I would tackle the difficult part of this JIRA first, namely the filter.


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008668#comment-13008668 ] 

Subbu M Iyer commented on HBASE-3488:
-------------------------------------

Ted:

Would you consider R1:CF1:name and R2:CF1:name as two distinct columns or the same one for the purpose of counting? I agree that R1:CF1:name and R1:CF2:name should be considered two distinct columns, as they belong to different CFs.

>>Can you rephrase the setup so that ck (k = 1..20) is uniquely associated with CF's ?
Do you mean to assume that all rows (R1..Rn) will have a CF, CF1 with ck where (k = 1..20)?

Also, how do you want the individual column versions listed, as they are distinct per R:CF:column? Should we list all versions?



[jira] [Commented] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "Subbu M Iyer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016212#comment-13016212 ] 

Subbu M Iyer commented on HBASE-3488:
-------------------------------------

Ted/Stack:

thanks a lot for your valuable review comments.

I don't think I understand what you mean by porting it to 0.90.3. What does that require exactly? I will be more than happy to address that.



[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988390#action_12988390 ] 

Jonathan Gray commented on HBASE-3488:
--------------------------------------

So would the idea be to not actually count rows but to count either columns or versions of columns?  As I recall, most of the row counting stuff uses FirstKeyOnlyFilter and is optimized to count unique rows regardless of whether they have one version of one column or a million versions of a million columns.

Also, I don't recommend the {{Result.getMap()}} API.  It's a convenience method but it's not especially performant (it iterates all the keys, parses stuff, allocates new byte[]s, and builds up the map).  Instead you should just use {{Result.raw()}} and operate on the list of KeyValues returned.
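
The point about working on the flat list can be sketched without HBase classes. The example below assumes that the cells of one row arrive sorted by family, then qualifier, then newest timestamp first. Under that assumption, columns and versions can be counted in a single pass with no nested map. FlatCellScanSketch and Kv are hypothetical stand-ins for illustration, not the real KeyValue API.

```java
// Hypothetical stand-in for a flat, sorted list of cells from one row.
final class FlatCellScanSketch {

    static final class Kv {
        final String family;
        final String qualifier;
        final long timestamp;

        Kv(String family, String qualifier, long timestamp) {
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // Returns {distinct columns, total versions} for one row.
    // Relies on the input being sorted so that all versions of a
    // column are adjacent; no intermediate map is allocated.
    static int[] countColumnsAndVersions(Kv[] sortedCells) {
        int columns = 0;
        int versions = 0;
        Kv previous = null;
        for (Kv kv : sortedCells) {
            boolean newColumn = previous == null
                || !previous.family.equals(kv.family)
                || !previous.qualifier.equals(kv.qualifier);
            if (newColumn) {
                columns++;
            }
            versions++;
            previous = kv;
        }
        return new int[] {columns, versions};
    }

    public static void main(String[] args) {
        Kv[] row = {
            new Kv("CF2", "c1", 20L), new Kv("CF2", "c1", 10L),
            new Kv("CF2", "c2", 30L), new Kv("CF2", "c2", 20L), new Kv("CF2", "c2", 10L)
        };
        int[] counts = countColumnsAndVersions(row);
        System.out.println("columns=" + counts[0] + " versions=" + counts[1]);
    }
}
```

The trade-off is that this only works when the sort order holds, whereas a nested map tolerates any order at the cost of extra allocation.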



[jira] [Commented] (HBASE-3488) Add CellCounter to count multiple versions of rows

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016220#comment-13016220 ] 

stack commented on HBASE-3488:
------------------------------

@Subbu Your attached patch should be fine if we want to bring it over into 0.90.  Don't worry about it.


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007339#comment-13007339 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

CellCounter should be created. It doesn't extend RowCounter because FirstKeyOnlyFilter won't be used.


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008687#comment-13008687 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

R1:CF1:name and R2:CF1:name are the same column.

>> Do you mean to assume that all rows (R1..Rn) will have a CF, CF1 with ck where (k = 1..20)
I meant you can associate columns ck (k=1..m) with CF1 and columns cl (l=m+1..n) with CF2 for clarity. In reality columns of different column families may have the same name.

Listing count of versions should be enough for individual columns.


[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988483#action_12988483 ] 

Jonathan Gray commented on HBASE-3488:
--------------------------------------

So, you want to count columns/versions not unique rows, correct?  (Yes, this would probably need a new name.  It's not a row counter any longer.  Also, FirstKeyOnlyFilter would not work in this case)



[jira] Commented: (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988484#action_12988484 ] 

Jonathan Gray commented on HBASE-3488:
--------------------------------------

You want some kind of {{CellCounter}} or some such?  I think something like that could be useful, generating stats such as # of rows, total # of columns and avg columns per row, and total # of versions with avg versions per column and avg versions per row.
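
For illustration only, the statistics listed above could be computed roughly as in the sketch below. It is not the HBase implementation and does not use the HBase API: the class name CellStatsSketch and the in-memory table (row key -> column -> list of version timestamps) are assumptions made up for this example, standing in for what a scan with multiple versions would return.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: aggregates row, column, and version counts over an
// in-memory table. A real job would read cells from a Scan instead.
final class CellStatsSketch {
    final long rows;
    final long columns;
    final long versions;

    CellStatsSketch(long rows, long columns, long versions) {
        this.rows = rows;
        this.columns = columns;
        this.versions = versions;
    }

    // table: row key -> (column name -> timestamps of the stored versions)
    static CellStatsSketch of(Map<String, Map<String, List<Long>>> table) {
        long rows = 0;
        long columns = 0;
        long versions = 0;
        for (Map<String, List<Long>> row : table.values()) {
            rows++;
            for (List<Long> timestamps : row.values()) {
                columns++;
                versions += timestamps.size();
            }
        }
        return new CellStatsSketch(rows, columns, versions);
    }

    double avgColumnsPerRow() {
        return rows == 0 ? 0.0 : (double) columns / rows;
    }

    double avgVersionsPerColumn() {
        return columns == 0 ? 0.0 : (double) versions / columns;
    }

    double avgVersionsPerRow() {
        return rows == 0 ? 0.0 : (double) versions / rows;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Long>>> table = new TreeMap<>();
        Map<String, List<Long>> row1 = new TreeMap<>();
        row1.put("f:a", Arrays.asList(3L, 2L, 1L)); // three versions of one column
        row1.put("f:b", Arrays.asList(5L));
        table.put("row1", row1);

        CellStatsSketch stats = CellStatsSketch.of(table);
        System.out.println("rows=" + stats.rows
                + " columns=" + stats.columns
                + " versions=" + stats.versions
                + " avgVersionsPerColumn=" + stats.avgVersionsPerColumn());
    }
}
```

A plain row count only needs one cell per row, which is why a first-key-only scan is enough for it; the per-column and per-version totals here need every cell.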



[jira] [Commented] (HBASE-3488) Allow RowCounter to retrieve multiple versions of rows

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014971#comment-13014971 ] 

Ted Yu commented on HBASE-3488:
-------------------------------

I think Bytes.toStringBinary() should be used in place of Bytes.toString() so that non-printable bytes are formatted readably in the output.
The separator ":" is hardcoded. We should provide a command-line argument so that the user can customize the separator.
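
A rough sketch of both suggestions, for illustration only. It does not call HBase: toPrintable() is a simplified stand-in that approximates the kind of escaping Bytes.toStringBinary() performs (printable ASCII kept, other bytes written as \xNN), and the class name KeyFormatSketch, the join() helper, and the choice of the first command-line argument for the separator are all assumptions, not taken from the attached patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: binary-safe formatting of key parts joined with a
// separator that the user can override on the command line.
final class KeyFormatSketch {
    static final String DEFAULT_SEPARATOR = ":";

    // Keeps printable ASCII as-is and escapes every other byte as \xNN.
    static String toPrintable(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int ch = b & 0xFF;
            if (ch >= ' ' && ch <= '~' && ch != '\\') {
                sb.append((char) ch);
            } else {
                sb.append(String.format("\\x%02X", ch));
            }
        }
        return sb.toString();
    }

    // Joins the escaped parts (for example row, family, qualifier).
    static String join(String separator, byte[]... parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(separator);
            }
            sb.append(toPrintable(parts[i]));
        }
        return sb.toString();
    }

    // Assumed convention: the first argument, if present, is the separator.
    static String separatorFrom(String[] args) {
        return args.length > 0 ? args[0] : DEFAULT_SEPARATOR;
    }

    public static void main(String[] args) {
        String separator = separatorFrom(args);
        byte[] row = {'r', 0, (byte) 0xFF};
        byte[] family = "f".getBytes(StandardCharsets.UTF_8);
        System.out.println(join(separator, row, family));
    }
}
```

Escaping matters here because row keys and qualifiers are arbitrary byte arrays, and decoding them as UTF-8 text can produce unreadable or ambiguous output.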

