Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2010/11/18 20:44:13 UTC

[jira] Created: (HBASE-3247) Changes API: API for pulling edits from HBase

Changes API: API for pulling edits from HBase
---------------------------------------------

                 Key: HBASE-3247
                 URL: https://issues.apache.org/jira/browse/HBASE-3247
             Project: HBase
          Issue Type: Task
            Reporter: stack


Talking to Shay from Elastic Search, he was asking where the Changes API is in HBase.  Talking more -- there was a bit of beer involved, so I apologize up front -- he wants to be able to bootstrap an index and thereafter ask HBase for changes since time t.  We thought he could tie into the replication stream, but he wants to pull rather than have edits pushed to him (if he crashes, then on recovery he can start pulling again from the last good edit received).  He could do the bootstrap with a Scan.  Thereafter, requests to pull from HBase would pass a marker of some sort.  HBase would then give out edits that came in after this marker, in batches, along with an updated marker.
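
A minimal sketch of the pull protocol described above, assuming the marker is a plain offset into a single ordered edit log. The `ChangeLog` class and its `pull` method are hypothetical names for illustration only; nothing like them exists in HBase, and a real marker would likely need to account for multiple regions and servers.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class ChangeLog:
    """Hypothetical in-memory stand-in for a server-side log of edits."""
    edits: List[Any] = field(default_factory=list)

    def append(self, edit: Any) -> None:
        self.edits.append(edit)

    def pull(self, marker: int, batch_size: int) -> Tuple[List[Any], int]:
        """Return up to batch_size edits after marker, plus the updated marker."""
        batch = self.edits[marker:marker + batch_size]
        return batch, marker + len(batch)


log = ChangeLog()
for i in range(5):
    log.append(("put", f"row{i}"))

marker = 0  # assumption: the marker handed out when the bootstrap Scan started
first, marker = log.pull(marker, batch_size=2)
# The client would persist `marker` only after applying `first`, so that after
# a crash it resumes from the last marker it persisted and re-pulls the rest.
second, marker = log.pull(marker, batch_size=10)
```

Because the client owns the marker, a crashed client loses nothing: it simply asks again from the last marker it persisted.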


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935574#action_12935574 ] 

Jonathan Gray commented on HBASE-3247:
--------------------------------------

Scanning requires you to look at all the data (or at least, more than just the data you need).  I think that would prove far too inefficient for something like keeping a search index up to date, which you expect to be as "realtime" as possible.

This is about only needing to see the deltas.

[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase

Posted by "Steven Noels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934192#action_12934192 ] 

Steven Noels commented on HBASE-3247:
-------------------------------------

If this is really about robust (and distributed) pulling, wouldn't the RowLog mechanism as implemented in Lily be a more solid approach - to the point that RowLog would leave in-row-process-status data in a non-user-visible column?

I'm wondering, as I'm seeing a proliferation of alternative yet overlapping approaches to a certain number of issues (secondary indexes, change listening) which in the end could confuse new users.

[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase

Posted by "Steven Noels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935552#action_12935552 ] 

Steven Noels commented on HBASE-3247:
-------------------------------------

Well, we're doing MapReduce for initial Solr population, which might be a bit too involved compared with something like a Changes API. I reckon our Indexer could be made configurable to connect to ES as well. I'll have Evert look into this issue and comment on it; he just did a writeup on the RowLog on our blog today: http://outerthought.org/blog/449-ot.html

The thing I would object to, if I were a non-Lily person, would be that we need tracking/status data in user-visible columns.

[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase

Posted by "Evert Arckens (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935712#action_12935712 ] 

Evert Arckens commented on HBASE-3247:
--------------------------------------

@stack
With the Rowlog you can register a subscription, and then all messages that are put on the rowlog will be kept for that subscription. If you then also register a listener (cf. RowLogMessageListener) on that subscription, the rowlog processor will start feeding the messages to the listener.
If you can make a bulk load that only processes data that was changed before a certain point in time, you can let that run and in the meantime let the rowlog record all changes that are made after that point.
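
The cutover described above can be sketched as follows. This is an illustration under stated assumptions, not Lily's RowLog API: `SketchRowLog`, `register_subscription`, `put_message`, and `drain` are hypothetical names, and the table is a plain dict of row key to (value, timestamp).

```python
from collections import deque


class SketchRowLog:
    """Hypothetical stand-in: keeps every message for each registered subscription."""

    def __init__(self):
        self.queues = {}

    def register_subscription(self, name):
        self.queues.setdefault(name, deque())

    def put_message(self, row_key):
        for queue in self.queues.values():
            queue.append(row_key)

    def drain(self, name, listener):
        queue = self.queues[name]
        while queue:
            listener(queue.popleft())


table = {}  # row key -> (value, timestamp)
rowlog = SketchRowLog()


def write(row_key, value, ts):
    table[row_key] = (value, ts)
    rowlog.put_message(row_key)


write("a", 1, ts=10)
write("b", 2, ts=20)

cutoff = 25
rowlog.register_subscription("indexer")  # start retaining changes at the cutoff

write("c", 3, ts=30)  # arrives while the bulk load is still running

index = {}
# Bulk load: only rows last changed before the cutoff.
for row_key, (value, ts) in table.items():
    if ts < cutoff:
        index[row_key] = value


def index_row(row_key):
    index[row_key] = table[row_key][0]


# Incremental phase: replay what the subscription retained since the cutoff.
rowlog.drain("indexer", index_row)
```

The point of registering the subscription before the bulk load starts is that no write can fall between the two phases: anything after the cutoff is retained in the subscription's queue.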

Looking a bit further at how the Indexer in Lily uses the rowlog (http://docs.outerthought.org/lily-docs-current/415-lily.html):
When the indexer receives a message, it uses the record's current data and puts that data in the index (IndexUpdater is the listener that is registered on the rowlog).
An index rebuild uses MapReduce to go over all the data again and update the index.
The bulk index rebuild and the index updater driven by the rowlog are allowed to run in parallel. Both look at the current data of the record and put that in the index, so there is no need for a transition point from bulk to incremental.
The indexer is written specifically to put Lily records into a Solr index. It is not yet designed to plug in another index, but it should be doable to use this same framework with something non-Lily on the one hand and a non-Solr index on the other. Looking at the classes in the framework: the IndexUpdater is the implementation of the RowLogMessageListener, which has knowledge of Lily records and decides 'what' to index. The Indexer class is responsible for mapping the Lily schema onto the Solr schema and maintains the communication with Solr.
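
A small sketch of why no transition point is needed when both paths index the record's current data. The names here (`records`, `index_current`) are hypothetical and the model is deliberately simplified: the record data does not change while the steps run, and each step is applied atomically.

```python
import itertools

# Current record data; assume r1 was recently updated from "v1" to "v2".
records = {"r1": "v2", "r2": "v1"}


def index_current(index, row_key):
    """Both the rebuild and the updater copy the record's current data into the index."""
    index[row_key] = records[row_key]


rebuild_steps = ["r1", "r2"]  # the bulk rebuild visits every record
update_steps = ["r1"]         # one rowlog message for the update to r1

# Try every interleaving of the rebuild and the incremental update.
results = set()
for order in itertools.permutations(rebuild_steps + update_steps):
    index = {}
    for row_key in order:
        index_current(index, row_key)
    results.add(tuple(sorted(index.items())))
```

Every interleaving ends in the same index contents, because each step overwrites the index entry with the same current value rather than applying a delta.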

[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935553#action_12935553 ] 

ryan rawson commented on HBASE-3247:
------------------------------------

Why can't timestamp-based scanning do this?  Is it because of the
missing deletes?  Could there be a scan option to give more raw data?
Not really a new API, but still kind of a half API.
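
A simplified model of the "missing deletes" problem, assuming cells are (row, timestamp, value) triples and a delete is stored as a marker with no value. The `scan` function and its `raw` flag are hypothetical; they stand in for the kind of scan option asked about above and are not HBase's actual implementation.

```python
from typing import List, NamedTuple, Optional


class Cell(NamedTuple):
    row: str
    ts: int
    value: Optional[str]  # None marks a delete


cells = [
    Cell("a", 5, "old"),
    Cell("a", 12, "new"),
    Cell("b", 7, "x"),
    Cell("b", 15, None),  # row b deleted at ts=15
]


def scan(cells: List[Cell], min_ts: int, max_ts: int, raw: bool = False) -> List[Cell]:
    """Return cells with min_ts <= ts < max_ts; delete markers only when raw=True."""
    in_range = [c for c in cells if min_ts <= c.ts < max_ts]
    if raw:
        return in_range
    return [c for c in in_range if c.value is not None]


normal = scan(cells, 10, 20)
raw = scan(cells, 10, 20, raw=True)
```

A follower that only sees `normal` learns about the new value for row a but never learns that row b was deleted, so its index would keep a stale entry. The `raw` result carries the delete marker as well.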


[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935055#action_12935055 ] 

stack commented on HBASE-3247:
------------------------------

@Steven Yes, we should start with RowLog (http://www.lilyproject.org/maven-site/0.1/apidocs/org/lilycms/rowlog/api/RowLog.html).

bq. I'm wondering, as I'm seeing a proliferation of alternative yet overlapping approaches to a certain number of issues (secondary indexes, change listening) which in the end could confuse new users.

-1 to proliferation of alternate yet overlapping... things

What do you fellas suggest for bootstrapping the system -- doing a fat bulk load into the search index -- and then cutting over to rowlog for incremental updates?  Doesn't there have to be an exact transition so followers do not miss edits?  Do you fellas have ideas for how to do that?

[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964400#action_12964400 ] 

stack commented on HBASE-3247:
------------------------------

Thanks @Evert.  BTW, how do we get documentation that is as fancy as yours?
