
Posted to dev@hbase.apache.org by "Lars George (JIRA)" <ji...@apache.org> on 2009/06/04 21:29:07 UTC

[jira] Created: (HBASE-1481) Add fast row key only scanning

Add fast row key only scanning
------------------------------

                 Key: HBASE-1481
                 URL: https://issues.apache.org/jira/browse/HBASE-1481
             Project: Hadoop HBase
          Issue Type: Improvement
    Affects Versions: 0.19.3
            Reporter: Lars George
            Priority: Minor
             Fix For: 0.20.0, 0.21.0


Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column values, we should have a fast and lightweight scanner that, for example, takes a "null" for the column list and then simply returns only the matching keys of all rows that are neither empty nor deleted. Filters should still be applicable.
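
To make the request concrete, here is a minimal sketch of the difference between the workaround and a key-only scan. It uses a toy in-memory table, not the HBase client API; the names Table, scan_full_rows and scan_row_keys are hypothetical and exist only for this illustration.

```python
# Toy in-memory model of a table, NOT the HBase client API.
# Each row key maps to its columns; an empty dict stands for a row
# with no live cells (empty or deleted).
from typing import Dict, Iterator, Tuple

Table = Dict[bytes, Dict[bytes, bytes]]


def scan_full_rows(table: Table) -> Iterator[Tuple[bytes, Dict[bytes, bytes]]]:
    """Workaround: return every non-empty row together with its column values."""
    for row_key in sorted(table):
        columns = table[row_key]
        if columns:
            yield row_key, columns


def scan_row_keys(table: Table) -> Iterator[bytes]:
    """Requested behaviour: return only the keys of non-empty rows."""
    for row_key in sorted(table):
        if table[row_key]:
            yield row_key


table: Table = {
    b"row1": {b"cf:a": b"x" * 1000},
    b"row2": {},  # no live cells, so it should not be reported
    b"row3": {b"cf:a": b"y", b"cf:b": b"z"},
}

# The workaround carries the values along only to throw them away.
keys_via_workaround = [key for key, _values in scan_full_rows(table)]
keys_only = list(scan_row_keys(table))
```

Both scans report the same keys; the second one simply never hands the values back to the caller, which is the saving this issue asks for.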

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-1481) Add fast row key only scanning

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762359#action_12762359 ] 

stack commented on HBASE-1481:
------------------------------

+1

The filter is a little odd... but the patch looks good, and if it works, go commit, I'd say.

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Assignee: Jonathan Gray
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>         Attachments: HBASE-1481-v1.patch, HBASE-1481-v2.patch
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Assigned: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray reassigned HBASE-1481:
------------------------------------

    Assignee: Jonathan Gray

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Assignee: Jonathan Gray
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HBASE-1481-v1.patch
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Updated: (HBASE-1481) Add fast row key only scanning

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1481:
-------------------------

    Fix Version/s:     (was: 0.20.1)
     Release Note: Moving this new 'feature' to 0.21.0.

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Priority: Minor
>             Fix For: 0.21.0
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Updated: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1481:
---------------------------------

    Status: Patch Available  (was: Open)

Seems to work.  Is this enough for this issue for now?  We should revisit additional optimizations once we have better hfile seeking.

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Assignee: Jonathan Gray
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>         Attachments: HBASE-1481-v1.patch, HBASE-1481-v2.patch
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Updated: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1481:
---------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed to branch and trunk

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Assignee: Jonathan Gray
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>         Attachments: HBASE-1481-v1.patch, HBASE-1481-v2.patch
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Updated: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1481:
---------------------------------

    Fix Version/s:     (was: 0.20.0)
                   0.20.1

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Commented: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725941#action_12725941 ] 

Jonathan Gray commented on HBASE-1481:
--------------------------------------

You say that filters should still be applicable.  Running the filter checks means disassembling the entire KV and would eliminate many of the potential optimizations.

I saw the core use case of this being row counting.  Other uses for something like this might be satisfied elsewhere moving forward.

What use cases are you thinking of?  Basically, COUNT() operations with filters to allow counting various things?

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Commented: (HBASE-1481) Add fast row key only scanning

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726588#action_12726588 ] 

Lars George commented on HBASE-1481:
------------------------------------

This was filed back then after a discussion on IRC between, I think, you and me and maybe Stack. For me this was about faster row counting, especially in the shell, where it is quite slow because it is a sequential scan.

For the faster MR variant we have the RowCounter driver class in the hbase.jar, which now uses as many mappers as the table has regions. The legacy "mapred" one seemed to default to the single-mapper setting from the Hadoop configs.

With that in place, I am not sure we still need to keep this issue open.


> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Commented: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726597#action_12726597 ] 

Jonathan Gray commented on HBASE-1481:
--------------------------------------

Nothing has really been added, though.  Row counting via MR has always been possible; we have been doing it here for a long time, and the same performance issue still exists (even with mappers = regions).

The idea behind this issue is to not have to return everything.

Let's just keep it open for now and see what comes of it for 0.20.1.  This might get swept into another larger issue.

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Updated: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1481:
---------------------------------

    Attachment: HBASE-1481-v2.patch

Adds usage of FirstKeyOnlyFilter to RowCounter mapreduce job and count function in shell.

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Assignee: Jonathan Gray
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>         Attachments: HBASE-1481-v1.patch, HBASE-1481-v2.patch
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Updated: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1481:
---------------------------------

    Fix Version/s: 0.20.1
     Release Note:   (was: Moving this new 'feature' to 0.21.0.)

This patch is actually against 0.20.1.  It looks like that release is going to get pushed back a bit, so I'm for putting this into 0.20.1 since it's such a simple addition (and quite useful).

The patch also includes a unit test.  I'm going to look at making another patch that actually adds this into any built-in row counting mechanisms.

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Assignee: Jonathan Gray
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>         Attachments: HBASE-1481-v1.patch
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Commented: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761067#action_12761067 ] 

Jonathan Gray commented on HBASE-1481:
--------------------------------------

I have an idea for this.  Easy enough for 0.20.1 and 0.21.  Patch coming soon.

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Priority: Minor
>             Fix For: 0.21.0
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.


[jira] Updated: (HBASE-1481) Add fast row key only scanning

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1481:
---------------------------------

    Attachment: HBASE-1481-v1.patch

Patch adds a new filter called FirstKeyOnlyFilter.  It's extremely simple, but this does generally accomplish what we want.
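
As a rough illustration of the idea (a sketch of the concept in Python, not the code in the attached patch), a filter of this kind lets through only the first KV it sees for each row and drops the rest. The KeyValue tuple layout and the first_key_only function are assumptions made for this example.

```python
# Conceptual model of a "first key only" filter over a row-sorted KV stream.
# This is NOT the Java implementation from the patch; names are illustrative.
from typing import Iterable, Iterator, Tuple

KeyValue = Tuple[bytes, bytes, bytes]  # (row, column, value)


def first_key_only(kvs: Iterable[KeyValue]) -> Iterator[KeyValue]:
    """Yield only the first KeyValue of each row; input must be sorted by row."""
    previous_row = None
    for kv in kvs:
        row = kv[0]
        if row != previous_row:
            previous_row = row
            yield kv


kvs = [
    (b"row1", b"cf:a", b"1"),
    (b"row1", b"cf:b", b"2"),
    (b"row1", b"cf:c", b"3"),
    (b"row2", b"cf:a", b"4"),
]

first_kvs = list(first_key_only(kvs))
row_count = len(first_kvs)  # one surviving KV per row, so this is the row count
```

Note that this model still walks over every KV in the row; it only reduces what is returned to the caller.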

The only further optimizations to row counting I can think of:

- prevent sending back even an entire KV per row (all we really need is the count, but this breaks the API)
- once we work on issues like HBASE-1517, we should seek to the next row after we look at the first KV (if we have a million columns in a row, we don't need to iterate over all of them to do a row count)

The latter issue gets me thinking about what filters could do to push that kind of information to the QueryMatcher....
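
The second optimization above can be sketched as follows. This is a toy model that assumes the KVs sit in a sorted list which can be binary-searched, standing in for whatever seeking the storage layer ends up offering; it is not how the region server is implemented, and count_rows_with_seek is a hypothetical name.

```python
# Toy model of "seek to the next row after the first KV".
# The sorted list of row keys stands in for an index we can seek in;
# building it is not counted as work in this model.
from bisect import bisect_right
from typing import List, Tuple


def count_rows_with_seek(kvs: List[Tuple[bytes, bytes]]) -> Tuple[int, int]:
    """Return (row_count, kvs_examined) for a row-sorted list of (row, column)."""
    rows = [row for row, _column in kvs]
    position = 0
    row_count = 0
    examined = 0
    while position < len(rows):
        examined += 1   # look at the first KV of the current row
        row_count += 1
        # Jump straight past every remaining KV that belongs to this row.
        position = bisect_right(rows, rows[position], lo=position)
    return row_count, examined


# One very wide row followed by a narrow one.
wide = [(b"row1", b"cf:%04d" % i) for i in range(1000)]
wide += [(b"row2", b"cf:a"), (b"row2", b"cf:b")]

result = count_rows_with_seek(wide)
```

In this model the wide row costs one examined KV plus a seek, instead of a thousand iterations.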

> Add fast row key only scanning
> ------------------------------
>
>                 Key: HBASE-1481
>                 URL: https://issues.apache.org/jira/browse/HBASE-1481
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.19.3
>            Reporter: Lars George
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HBASE-1481-v1.patch
>
>
> Instead of requiring a user to set up a scanner with any column and scan the table to gather all row keys while ignoring the column value we should have a fast and lightweight scanner that for example takes a "null" for the column list and then simply returns only the matching keys of all non-empty or deleted rows. Filters should still be applicable.
