
Posted to common-dev@hadoop.apache.org by "stack (JIRA)" <ji...@apache.org> on 2007/05/29 19:12:15 UTC

[jira] Created: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

Add endRow parameter to HClient#obtainScanner
---------------------------------------------

                 Key: HADOOP-1439
                 URL: https://issues.apache.org/jira/browse/HADOOP-1439
             Project: Hadoop
          Issue Type: Improvement
          Components: contrib/hbase
            Reporter: stack
            Assignee: stack
            Priority: Minor


Currently the HClient#obtainScanner looks like this:

{code}
public synchronized HScannerInterface obtainScanner(Text[] columns, Text startRow) throws IOException;
{code}

Add an overload that allows specification of endRow:

{code}
public synchronized HScannerInterface obtainScanner(Text[] columns, Text startRow, Text endRow) throws IOException;
{code}

Use case: the table contains the whole web and a client just wants to scan Google's pages.  Currently the client can cut off the scanner itself as soon as the row key leaves the Google domain, but it would be cleaner if {{HScannerInterface#next()}} returned false at that point.
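
A minimal sketch of the requested behaviour, using plain Java collections rather than the HBase classes. The class {{BoundedScanSketch}}, its methods, the reversed-domain row keys and the treatment of endRow as an exclusive bound are all assumptions made for illustration; the issue itself does not specify them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for a scanner that takes an end row; not the HBase API.
class BoundedScanSketch {
    private final Iterator<String> keys;
    private final String endRow; // exclusive upper bound (assumed); null means unbounded
    private String current;

    BoundedScanSketch(TreeMap<String, String> table, String startRow, String endRow) {
        // tailMap(startRow, true) starts iteration at the first key >= startRow.
        this.keys = table.tailMap(startRow, true).keySet().iterator();
        this.endRow = endRow;
    }

    // Advances to the next row; returns false at the end of the table or at endRow.
    boolean next() {
        if (!keys.hasNext()) {
            return false;
        }
        String key = keys.next();
        if (endRow != null && key.compareTo(endRow) >= 0) {
            return false;
        }
        current = key;
        return true;
    }

    String currentRow() {
        return current;
    }

    // Drains a scanner into a list, the way a client loop over next() would.
    static List<String> collect(TreeMap<String, String> table, String startRow, String endRow) {
        BoundedScanSketch scanner = new BoundedScanSketch(table, startRow, endRow);
        List<String> rows = new ArrayList<>();
        while (scanner.next()) {
            rows.add(scanner.currentRow());
        }
        return rows;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("com.example/index", "a");
        table.put("com.google/about", "b");
        table.put("com.google/search", "c");
        table.put("com.yahoo/index", "d");
        // "com.google0" sorts just after every "com.google/..." key because '0' follows '/'.
        System.out.println(collect(table, "com.google/", "com.google0"));
    }
}
```

With an end row the client loop needs no domain check of its own: it simply runs until next() returns false.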





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

Posted by "James Kennedy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508929 ] 

James Kennedy commented on HADOOP-1439:
---------------------------------------

Oh, one thing I forgot to add in the limitations above:

Column criteria can only apply to columns included in the results. You cannot retrieve COL1, COL2 where COL3 = 'XYZ'.
This is because the filtering happens at the HScanner level: the lower-level scanner for COL3, for example, is not employed, so all of COL3's values appear as null.
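
The limitation can be pictured with a small sketch. It uses plain maps instead of the HBase scanner classes, and the names {{ColumnCriteriaSketch}}, {{project}} and {{matches}} are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only; these names are hypothetical and not part of HBase.
class ColumnCriteriaSketch {
    // Builds the row as the filter would see it: only requested columns are read.
    static Map<String, String> project(Map<String, String> storedRow, List<String> requested) {
        Map<String, String> visible = new LinkedHashMap<>();
        for (String column : requested) {
            if (storedRow.containsKey(column)) {
                visible.put(column, storedRow.get(column));
            }
        }
        return visible;
    }

    // A column criterion evaluated against the projected row.
    static boolean matches(Map<String, String> visibleRow, String column, String expected) {
        // get() yields null for a column that was never read, so the criterion fails.
        return expected.equals(visibleRow.get(column));
    }

    public static void main(String[] args) {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("COL1", "a");
        stored.put("COL2", "b");
        stored.put("COL3", "XYZ");

        Map<String, String> narrow = project(stored, Arrays.asList("COL1", "COL2"));
        Map<String, String> wide = project(stored, Arrays.asList("COL1", "COL2", "COL3"));

        System.out.println(matches(narrow, "COL3", "XYZ")); // false: COL3 was never read
        System.out.println(matches(wide, "COL3", "XYZ"));   // true: COL3 is part of the results
    }
}
```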



[jira] Commented: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

Posted by "James Kennedy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508928 ] 

James Kennedy commented on HADOOP-1439:
---------------------------------------

Deal. I don't think I'll get to it today, but certainly before Monday, likely tomorrow.


[jira] Commented: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508698 ] 

Jim Kellerman commented on HADOOP-1439:
---------------------------------------

It seems to me that if I specify a filter that is a row key filter, then once the filter finds a match, next() keeps returning values for as long as the row filter matches. Once it stops matching, the filter should close out the scanner, since there will be no additional rows that match that filter.

In this particular case, I am talking about row key filters based on >, =, < and not regexp filters, because a regexp can potentially match any row.
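
A rough sketch of why the distinction matters, assuming rows are visited in sorted key order. {{EarlyStopSketch}} and its {{scan}} helper are hypothetical names, not part of the RowFilter patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Illustration only: closes the scan when a run of matching keys ends.
class EarlyStopSketch {
    // Visits keys in sorted order. If closeAfterRun is true, the scan ends at the
    // first non-matching key seen after at least one match.
    static List<String> scan(TreeSet<String> keys, Predicate<String> matches, boolean closeAfterRun) {
        List<String> out = new ArrayList<>();
        boolean seenMatch = false;
        for (String key : keys) {
            if (matches.test(key)) {
                seenMatch = true;
                out.add(key);
            } else if (seenMatch && closeAfterRun) {
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList("a1", "b2", "c1", "d2"));
        Predicate<String> upTo = key -> key.compareTo("b9") <= 0;      // a "<=" style row key filter
        Predicate<String> regexp = key -> Pattern.matches(".*1", key); // matches scattered rows

        // Comparison filter: closing early loses nothing.
        System.out.println(scan(keys, upTo, true));    // [a1, b2]
        System.out.println(scan(keys, upTo, false));   // [a1, b2]
        // Regexp filter: closing early drops the later match c1.
        System.out.println(scan(keys, regexp, true));  // [a1]
        System.out.println(scan(keys, regexp, false)); // [a1, c1]
    }
}
```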



[jira] Commented: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

Posted by "James Kennedy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508701 ] 

James Kennedy commented on HADOOP-1439:
---------------------------------------

Right, so in the case of >, =, < type RowFilters you're quite right. More generally, a RowFilter, whether implementing those functions or otherwise, may need to signal the scanner to stop altogether for whatever reason, even when the target rows are not located in a single consecutive chunk as with >, =, <; e.g. it has reached a maximum number of nonconsecutive matched rows.

I'll implement this mechanism, clean up, and re-post the HADOOP-1531 patch when I get a chance.

That will make RowFilter more conducive to the EndRow filtering needed for this task. But as I said there will still be a little overhead vs. implementing an explicit endRow param to the scanner. 
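
One way such a stop signal could look, as a sketch only. The interface {{StoppableRowFilter}} and its method {{shouldStop()}} are invented for this example and are not taken from the HADOOP-1531 patch; the convention that {{filter()}} returns true to drop a row is likewise assumed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter contract; the method names are assumptions, not the HADOOP-1531 API.
interface StoppableRowFilter {
    boolean filter(String rowKey); // true means "filter this row out"
    boolean shouldStop();          // true means "no further rows are wanted"
}

// Stops the scan after a maximum number of (possibly nonconsecutive) matched rows.
class MaxMatchesFilter implements StoppableRowFilter {
    private final Pattern pattern;
    private final int maxMatches;
    private int matched; // scan-lifetime state

    MaxMatchesFilter(String regexp, int maxMatches) {
        this.pattern = Pattern.compile(regexp);
        this.maxMatches = maxMatches;
    }

    @Override
    public boolean filter(String rowKey) {
        if (!pattern.matcher(rowKey).matches()) {
            return true;
        }
        matched++;
        return false;
    }

    @Override
    public boolean shouldStop() {
        return matched >= maxMatches;
    }
}

class StopSignalSketch {
    // The scanner consults the stop signal before looking at each row.
    static List<String> scan(List<String> sortedKeys, StoppableRowFilter filter) {
        List<String> out = new ArrayList<>();
        for (String key : sortedKeys) {
            if (filter.shouldStop()) {
                break;
            }
            if (!filter.filter(key)) {
                out.add(key);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a1", "b2", "c1", "d2", "e1");
        System.out.println(scan(keys, new MaxMatchesFilter(".*1", 2))); // [a1, c1]
    }
}
```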


[jira] Commented: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

Posted by "James Kennedy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508658 ] 

James Kennedy commented on HADOOP-1439:
---------------------------------------

Michael suggested that, when finished, HADOOP-1531 (RowFilters) may be used to achieve the above functionality.

As the RowFilter impl is right now, using a regexp on each key encountered may be an expensive way to do it.

In the above example, even if the endRow functionality works, how do you know where the end row is? How do you know when you leave the Google domain?

It seems to me that there may be several restrictions a user may want to apply to row-keys:
1) Specify a range. Use start/end keys assuming you know what they are.
2) Specify a range using a start key and a "page size".  This is useful for retrieving data in pages, e.g. displaying to UI as user clicks next/last page.
3) Specify a criterion, e.g. regular expressions or more basic string comparison.

Fortunately my RowFilterInterface design can be used to generalize the above.  In the Google example, I could create a custom RowFilter implementation that can do domain name comparison more efficiently than general regular expression matching.  Pass that via the client as you would any other RowFilter impl.  Only thing to make sure of is that the custom impl is in the classpath of the HRegionServer too.
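
As a sketch of the kind of custom filter meant here, the following compares a plain prefix check with an equivalent regular expression. The class {{DomainFilterSketch}}, the reversed-domain key layout ("com.google/...") and the convention that a filter method returns true to drop a row are assumptions for illustration, not the actual RowFilterInterface.

```java
import java.util.regex.Pattern;

// Illustration only; not the RowFilterInterface from the HADOOP-1531 patch.
class DomainFilterSketch {
    private static final Pattern GOOGLE = Pattern.compile("^com\\.google/.*");

    // Domain comparison as a plain prefix test; returns true to filter the row out.
    static boolean filterByPrefix(String rowKey, String domainPrefix) {
        return !rowKey.startsWith(domainPrefix);
    }

    // The same decision made through the regular expression engine.
    static boolean filterByRegexp(String rowKey) {
        return !GOOGLE.matcher(rowKey).matches();
    }

    public static void main(String[] args) {
        String[] keys = {"com.google/search", "com.googleplex/x", "com.yahoo/index"};
        for (String key : keys) {
            // Both columns of output agree for every key; only the mechanism differs.
            System.out.println(key + " " + filterByPrefix(key, "com.google/") + " " + filterByRegexp(key));
        }
    }
}
```

The prefix test does a single bounded string comparison per key, which is the sort of saving over general regexp matching that the paragraph above has in mind.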

For start/end range, you could have a custom RowFilter that checks for an exact match on the end key. But this won't be as efficient as an explicit endRow parameter because:
A) when RowFilter is not null, HRegion#HScanner is always going to have a little more overhead even if the filter() implementation itself always just returns false.
B) The filter isn't currently designed to stop the scanner when a certain criterion is met. When it encounters the endRow, it will just loop through the rest of the rows, filtering them all out, until it reaches the end of the HRegion.
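
Point B can be made concrete by counting rows visited. This is a sketch under assumptions: {{EndKeyCostSketch}} and both methods are hypothetical, a list of sorted keys stands in for an HRegion, and the end key is treated as excluded from the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: compares rows visited by a filter-only end key and an explicit end row.
class EndKeyCostSketch {
    // Filter-only approach: once the end key is seen, every remaining row is
    // still visited and filtered out. Returns the number of rows examined.
    static int scanWithFilterOnly(List<String> sortedKeys, String endKey, List<String> kept) {
        int examined = 0;
        boolean pastEnd = false;
        for (String key : sortedKeys) {
            examined++;
            if (key.equals(endKey)) {
                pastEnd = true;
            }
            if (!pastEnd) {
                kept.add(key);
            }
        }
        return examined;
    }

    // Explicit end row: the scan stops at the first key >= endRow.
    static int scanWithEndRow(List<String> sortedKeys, String endRow, List<String> kept) {
        int examined = 0;
        for (String key : sortedKeys) {
            examined++;
            if (key.compareTo(endRow) >= 0) {
                break;
            }
            kept.add(key);
        }
        return examined;
    }

    public static void main(String[] args) {
        List<String> region = Arrays.asList("r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9");
        List<String> keptA = new ArrayList<>();
        List<String> keptB = new ArrayList<>();
        System.out.println(scanWithFilterOnly(region, "r3", keptA) + " " + keptA); // 10 [r0, r1, r2]
        System.out.println(scanWithEndRow(region, "r3", keptB) + " " + keptB);     // 4 [r0, r1, r2]
    }
}
```

Both approaches return the same rows; the difference is only in how much of the region is walked after the end key.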

I think start/page range has the same issues.  Only difference is that it requires scan-lifetime state to count number of (unfiltered?) rows encountered.  Still requires stop condition trigger.

If I add that stop condition trigger functionality to the RowFilterInterface and update HScanner to use it, we could have a number of built-in RowFilter implementations that deal with restrictions like those above.

WRT simple restrictions like start/end/page parameters, there will still be a (perhaps small) trade-off between performance and generality, depending on whether we implement them independently or via RowFilterInterface.












[jira] Resolved: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HADOOP-1439.
---------------------------

    Resolution: Won't Fix

Resolving as "won't fix".  Will add endRow using the new mechanism just added by 'HADOOP-1531 Add RowFilter to HRegion.HScanner'.


[jira] Commented: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508864 ] 

stack commented on HADOOP-1439:
-------------------------------

This comment applies to this issue and to HADOOP-1531.

After the exposition above, I'm now of the opinion that the endRow parameter will be little used.  Better for now to have a set of filters available for the client to choose from.  If 'performance' becomes an issue, we can backfill the endRow parameter later.  

We can divide the work if you'd like. I need the endRow functionality right away.   If you add the 'stop condition trigger' to the interface, I can work on a couple of filter implementations and their tests.

