Posted to dev@hbase.apache.org by "Chris K Wensel (JIRA)" <ji...@apache.org> on 2009/07/02 19:01:47 UTC

[jira] Created: (HBASE-1605) TableInputFormat should support 'limit'

TableInputFormat should support 'limit'
---------------------------------------

                 Key: HBASE-1605
                 URL: https://issues.apache.org/jira/browse/HBASE-1605
             Project: Hadoop HBase
          Issue Type: Improvement
          Components: mapred
            Reporter: Chris K Wensel


It would be useful if TableInputFormat could be passed a 'limit' property that capped the total result set at that number of rows.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1605) TableInputFormat should support 'limit'

Posted by "Chris K Wensel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726607#action_12726607 ] 

Chris K Wensel commented on HBASE-1605:
---------------------------------------

Good questions.

In SQL, LIMIT returns the first N rows of the result set, and is typically used with OFFSET to allow pagination.

In Cascading, the Limit operation only allows each of the M tasks to see N/M rows (accounting for remainders). There is no notion of OFFSET, as Limit in this case is really used for unit/integration testing or sampling.

Re HBase, you guys should choose a model that makes the most sense for typical HBase consumer applications. What I'm after is an even load across many mappers while, orthogonally, limiting the total number of rows processed.

Having this work with a Filter would also be very nice, i.e. give me the 1k rows that satisfy this condition. But I guess if I want the first 1k rows that satisfy the filter, we might be limited to a single region (and a single mapper, as I see the code now).

So maybe there are two modes: sample and result. Sample returns 'random' N rows (the top N/M from each of the M regions). Result returns ordered N rows (ordered by virtue of coming from a single region).

Anyway, just throwing that out there. The current use case would be happy with either, though 'result' is probably the most useful when coupled with HBASE-1172.
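
The two modes described above ('sample' and 'result') can be illustrated with a minimal sketch over in-memory lists standing in for regions. This is not HBase code: the names sample_mode and result_mode are hypothetical, and the 'result' mode here assumes a single sequential pass over the regions in key order.

```python
from itertools import chain, islice
from typing import List, Sequence


def sample_mode(regions: Sequence[Sequence[str]], limit: int) -> List[str]:
    """Hypothetical 'sample' mode: top N/M rows from each of the M regions."""
    base, extra = divmod(limit, len(regions))
    rows: List[str] = []
    for index, region in enumerate(regions):
        # The first `extra` regions absorb the remainder, one row each.
        share = base + 1 if index < extra else base
        rows.extend(region[:share])
    return rows


def result_mode(regions: Sequence[Sequence[str]], limit: int) -> List[str]:
    """Hypothetical 'result' mode: first N rows in order, one sequential pass."""
    return list(islice(chain.from_iterable(regions), limit))


# Three stand-in regions, each already sorted by row key.
regions = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
print(sample_mode(regions, 4))  # ['a1', 'a2', 'b1', 'c1']
print(result_mode(regions, 4))  # ['a1', 'a2', 'a3', 'b1']
```

Note that in this sketch 'sample' can return fewer than N rows when a region holds fewer rows than its share, whereas 'result' only falls short when the whole table does.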



[jira] Commented: (HBASE-1605) TableInputFormat should support 'limit'

Posted by "Lars George (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726579#action_12726579 ] 

Lars George commented on HBASE-1605:
------------------------------------

Hey Chris, is this a dupe of HBASE-1172? Could you please check?



[jira] Commented: (HBASE-1605) TableInputFormat should support 'limit'

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726598#action_12726598 ] 

Jonathan Gray commented on HBASE-1605:
--------------------------------------

I think I understand what you're saying now, Chris.

So what you want is a distributed Filter?  To be able to properly limit (from the start, I assume?) you need to scan sequentially.  So a single mapper would make this work.

Can you clarify exactly what behavior you're looking for?  Do you want the first N rows from the table, or the first N rows from each region of the table?



[jira] Commented: (HBASE-1605) TableInputFormat should support 'limit'

Posted by "Chris K Wensel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726557#action_12726557 ] 

Chris K Wensel commented on HBASE-1605:
---------------------------------------

Since 0.20, the number of splits equals the number of regions in a table. So the 'limit' value will need to be divided up among the splits, with the remainder distributed as well.
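
As a rough illustration of dividing the limit among the splits, the sketch below lets each split derive its own share from three job-wide values, so no coordination between mappers is needed. The function name share_for_split and its parameters are hypothetical and not part of TableInputFormat.

```python
def share_for_split(limit: int, num_splits: int, split_index: int) -> int:
    """Hypothetical helper: rows split `split_index` may emit out of `limit`."""
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    if not 0 <= split_index < num_splits:
        raise ValueError("split_index out of range")
    base, remainder = divmod(limit, num_splits)
    # The first `remainder` splits each take one extra row.
    return base + (1 if split_index < remainder else 0)


# A limit of 10 over 3 splits (one per region).
shares = [share_for_split(10, 3, i) for i in range(3)]
print(shares)       # [4, 3, 3]
print(sum(shares))  # 10
```

The shares always sum to the limit, but the rows actually returned can still fall short if some region holds fewer rows than its share.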



[jira] Commented: (HBASE-1605) TableInputFormat should support 'limit'

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726609#action_12726609 ] 

Jonathan Gray commented on HBASE-1605:
--------------------------------------

The 'random' N rows mode is possible, and I think it is what you would get now if you just used the limit filter (you would specify N/M for the filter, not N, if you wanted the total to be N).  If you use one mapper, then you get the ordered N rows.

We do lots of OFFSET/LIMIT queries on HBase.  The issue is that you must always run that style of query sequentially, especially if other filters are called before the paging/limiting filter.
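
To make the sequential constraint concrete, here is a small sketch of OFFSET/LIMIT applied after another filter. It uses a plain Python iterable in place of a scanner; the function name page and its parameters are hypothetical, not an HBase API.

```python
from itertools import islice
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def page(rows: Iterable[T], predicate: Callable[[T], bool],
         offset: int, limit: int) -> List[T]:
    """Skip `offset` matching rows, then return the next `limit` matches."""
    # The other filter runs first; paging counts only the rows that pass it.
    matching = (row for row in rows if predicate(row))
    return list(islice(matching, offset, offset + limit))


# Rows 1..20, keep even rows, skip the first 3 matches, return the next 4.
print(page(range(1, 21), lambda row: row % 2 == 0, offset=3, limit=4))
# [8, 10, 12, 14]
```

Because the offset counts rows that passed the earlier filter, a scan starting in a later region cannot know how many matches precede it without the earlier regions having been scanned first.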



[jira] Commented: (HBASE-1605) TableInputFormat should support 'limit'

Posted by "Chris K Wensel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726587#action_12726587 ] 

Chris K Wensel commented on HBASE-1605:
---------------------------------------

Nope, doesn't look like it. I'm not calling for a configurable number of mappers/splits.

I'm asking that the total number of rows returned across all splits be exactly equal to an optional 'limit' value.

I'm also pointing out that the current impl, as I see it in trunk, sets the number of splits to the number of regions in a table, which implies that a single global Filter across all splits won't work, as the limit may not divide evenly among the splits.

It would be nice to be able to specify the number of mappers/splits and have it honored, though.
