Posted to dev@bookkeeper.apache.org by "Sijie Guo (Created) (JIRA)" <ji...@apache.org> on 2012/01/19 03:48:40 UTC

[jira] [Created] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Garbage collect messages for those subscribers inactive/offline for a long time. 
---------------------------------------------------------------------------------

                 Key: BOOKKEEPER-154
                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-154
             Project: Bookkeeper
          Issue Type: New Feature
          Components: hedwig-client, hedwig-server
    Affects Versions: 4.0.0
            Reporter: Sijie Guo


Currently Hedwig tracks each subscriber's progress in order to garbage collect published messages. If a subscriber subscribes and then goes offline for a long time without unsubscribing, the messages published to its topic have no chance of being garbage collected.

A time-based garbage collection policy would be suitable for this case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Sijie Guo (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196902#comment-13196902 ] 

Sijie Guo commented on BOOKKEEPER-154:
--------------------------------------

Yes, I agree.
BTW, I wrote a console client earlier in BOOKKEEPER-77, which performs some metadata-related operations such as getTopicList, showing the metadata of a specified topic, and reading the messages of a specified topic. Can we consider extracting the metadata operation code and putting it into a HedwigAdmin client?
                

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Gavin Li (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197639#comment-13197639 ] 

Gavin Li commented on BOOKKEEPER-154:
-------------------------------------

Recycling messages that are old enough is the ideal case we want to achieve, but it seems very complicated to implement and would take too long. We may not be able to wait for that. What is actually enough for us is to consume all the old messages on behalf of users that have been inactive for long enough. How long a user must have been inactive can be specified as a parameter.

This way we can actually get a kind of size-based recycling. The SE can first specify a large enough inactive time, say 20 days, which recycles some space. If that is not enough, the SE can run it again with a smaller time to free up more space.
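
To make the "run again with a smaller time" idea concrete, here is a minimal dry-run sketch in Python. It is not Hedwig code: the dictionary layout, the function names `reclaimable` and `pick_threshold`, and the use of message counts as a stand-in for space are all assumptions made for illustration. It also assumes every topic has at least one subscriber, and that a topic's messages become collectable only once its slowest subscriber has consumed them.

```python
def reclaimable(topics, threshold_days):
    """Count messages that become collectable if every subscriber inactive
    for at least threshold_days is advanced to the latest message."""
    total = 0
    for topic in topics:
        subscribers = topic["subscribers"]
        # Collection is bounded by the slowest subscriber of the topic.
        before = min(s["consumed"] for s in subscribers)
        after = min(
            topic["latest"] if s["inactive_days"] >= threshold_days else s["consumed"]
            for s in subscribers
        )
        total += after - before
    return total


def pick_threshold(topics, needed, thresholds_days):
    """Try the largest inactive time first, then smaller ones, and return the
    first threshold that would reclaim at least `needed` messages (or None)."""
    for threshold in sorted(thresholds_days, reverse=True):
        if reclaimable(topics, threshold) >= needed:
            return threshold
    return None
```

Trying the largest threshold first mirrors the suggestion above: only the subscribers that have been away longest are touched unless more space is really needed.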

 
                

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Ivan Kelly (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197946#comment-13197946 ] 

Ivan Kelly commented on BOOKKEEPER-154:
---------------------------------------

Hubs in Hedwig have all the information they need to do this, so ZooKeeper can be left alone. How I would see this working is that you'd have a client:
{code}
$ hedwig console gc 10d
{code}

This would connect to all hubs[1] and send a GarbageCollect message to each hub with 10 days as the parameter. The hub can then go through its list of topics and perform garbage collection on each of them. The hub must have its list of topics in memory, as well as the list of subscribers for each topic. This is part of the basic design of Hedwig. Additionally, each hub would not have much to do, as topics should be spread pretty evenly.

The open questions here are:
  * how do we deal with time?
  * in the event of a crashed hub, what do we do with the topics which have not been taken over by another hub (since no one has tried to access them since the crash)?

[1] I'm not sure how to get the list of all hubs without contacting zk, but that's an auxiliary problem.
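
As a rough sketch of the flow described above (parse the age argument, send it to every hub, let each hub scan its in-memory topics), the following Python models the hubs as plain objects. Everything here is hypothetical: the `Hub` class, `handle_garbage_collect`, and `gc_all_hubs` are illustrative names rather than Hedwig APIs. Only the `d` suffix appears in the command above, so the other units are assumptions. The current time is passed in explicitly, which sidesteps the open question of which clock to trust.

```python
import re

# Milliseconds per unit; only "d" comes from the example command above.
_UNITS_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def parse_age(text):
    """Turn an age such as '10d' into milliseconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised age: {text!r}")
    return int(match.group(1)) * _UNITS_MS[match.group(2)]


class Hub:
    """In-memory stand-in for a hub: topic -> {subscriber: last-active ms}."""

    def __init__(self, topics):
        self.topics = topics

    def handle_garbage_collect(self, max_age_ms, now_ms):
        """Return the (topic, subscriber) pairs this hub would collect for."""
        stale = []
        for topic, subscribers in self.topics.items():
            for subscriber, last_active_ms in subscribers.items():
                if now_ms - last_active_ms >= max_age_ms:
                    stale.append((topic, subscriber))
        return sorted(stale)


def gc_all_hubs(hubs, age_text, now_ms):
    """Send the same age limit to every hub and gather each hub's result."""
    max_age_ms = parse_age(age_text)
    return [hub.handle_garbage_collect(max_age_ms, now_ms) for hub in hubs]
```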
                

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Ivan Kelly (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197763#comment-13197763 ] 

Ivan Kelly commented on BOOKKEEPER-154:
---------------------------------------

The one thing I'm unclear about in the above design is that "long time" is undefined. What is a long time in this context? If it is wall time, what happens if the zk ensemble has its clocks completely out of sync? If it is znode time, what if something creates a lot of writes in a very short period, and we start GCing very recent messages?
                

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Sijie Guo (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188902#comment-13188902 ] 

Sijie Guo commented on BOOKKEEPER-154:
--------------------------------------

Currently we don't have a publish timestamp for each message, so it would not be easy to implement such a time-based garbage collection policy in the hub server itself.

So a proposal is to provide an offline tool that checks subscribers' state to do time-based gc. If a subscriber has been inactive for a long time, the offline tool sends a CONSUME request on behalf of this subscriber to consume up to the latest message.

The tool works as below:

Loop over all topics, and for each topic:
1) Find the subscribers who have been inactive for a long time: by reading the subscriber znodes, we can get the modify time of these znodes. If these znodes have not been modified for a long time, it means that these subscribers have not been active for a long time.
2) Read the latest message id: we can parse the ledgers znode to get it. We did a similar thing in BOOKKEEPER-77.
3) #subscribe to the topic as each inactive subscriber found. (If a subscriber is online, the subscription will fail, and we should not do CONSUME for it.) Then send a CONSUME request to the hub server for them, to consume up to the latest message.
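
The three steps above can be sketched as a small Python model. This is only an illustration under stated assumptions: the `Subscription` and `Topic` classes are in-memory stand-ins for the subscriber and ledgers znodes, the `online` flag stands in for the subscribe attempt failing, and `gc_inactive_subscribers` returns the CONSUME requests it would send instead of talking to a hub. None of these names come from Hedwig.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    subscriber_id: str
    last_modified_ms: int  # stand-in for the subscription znode's modify time
    consumed_seq: int      # last message id this subscriber has consumed
    online: bool = False   # if True, the tool's own subscribe would fail


@dataclass
class Topic:
    name: str
    latest_seq: int        # stand-in for the id parsed from the ledgers znode
    subscriptions: list = field(default_factory=list)


def gc_inactive_subscribers(topics, now_ms, max_inactive_ms):
    """Return the (topic, subscriber, seq) CONSUME requests the tool would send."""
    requests = []
    for topic in topics:
        for sub in topic.subscriptions:
            # Step 1: skip subscribers whose znode was modified recently.
            if now_ms - sub.last_modified_ms < max_inactive_ms:
                continue
            # Step 3: an online subscriber makes the subscribe fail, so leave it alone.
            if sub.online:
                continue
            # Steps 2 and 3: consume up to the latest message id of the topic.
            if sub.consumed_seq < topic.latest_seq:
                sub.consumed_seq = topic.latest_seq
                requests.append((topic.name, sub.subscriber_id, topic.latest_seq))
    return requests
```

Returning the requests rather than sending them keeps the sketch testable and leaves the actual hub protocol out of scope.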


                

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Flavio Junqueira (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196947#comment-13196947 ] 

Flavio Junqueira commented on BOOKKEEPER-154:
---------------------------------------------

Would the new api call be to consume up to a given timestamp? My understanding of the requirement is that the application needs to be able to garbage collect old messages and needs a way of determining how old messages are, otherwise it doesn't know how far it should consume. A different way would be to consume based on size, in case an application needs the ability to reduce the amount of state stored on a given topic to some value.
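
To illustrate the two options, here is a small Python sketch that computes how far to consume, either up to a timestamp or so that the retained state fits a size budget. The function names and inputs are hypothetical. In particular, the timestamp variant assumes a sorted list of per-message publish times, which an earlier comment in this thread says Hedwig does not currently record.

```python
from bisect import bisect_right


def consume_point_for_time(publish_times_ms, first_seq, cutoff_ms):
    """Highest message id published at or before cutoff_ms.

    publish_times_ms must be sorted ascending; the result is first_seq - 1
    when no message is old enough to consume.
    """
    return first_seq + bisect_right(publish_times_ms, cutoff_ms) - 1


def consume_point_for_size(message_sizes, first_seq, max_retained_bytes):
    """Highest message id to consume so the retained messages fit the budget."""
    retained = 0
    seq = first_seq + len(message_sizes) - 1
    # Walk from newest to oldest, keeping messages while they still fit.
    for size in reversed(message_sizes):
        if retained + size > max_retained_bytes:
            break
        retained += size
        seq -= 1
    return seq
```

For example, with sizes `[10, 20, 30, 40]` starting at message id 1 and a 70-byte budget, the two newest messages are retained and the consume point is message id 2.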
                

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Sijie Guo (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197842#comment-13197842 ] 

Sijie Guo commented on BOOKKEEPER-154:
--------------------------------------

Thanks, Flavio & Ivan.

I think the application thread you mentioned would be a tool like the bookkeeper recovery tool. It would not be run very frequently. It could be executed as a cron job, running every several days.

> the last time each of those subscribers has consumed a message.

I think using the modify time of the subscription znode as the last time is the easiest way. As for the clock out of sync issue, neither using the time in zk nor the time in the hub solves it. As for the consistency issue, since the tool just uses the modify time to judge whether a subscriber has been offline for a long time, it would not modify ZooKeeper metadata directly.

> the subscribers it needs to watch for

This is a similar issue to the bookkeeper recovery tool, which needs to loop over all ledgers to check them and do recovery. It uses zk#getChildren to fetch all ledgers. (In BOOKKEEPER-39, we added a hierarchical ledger manager to avoid fetching too many children in a single zk#getChildren.)

The problem here is that we put all topics' metadata in a single znode. It is not easy for an application to retrieve the topic list when there is a huge number of topics. A possible solution is to support hierarchical topics to let applications organize their topics, but that may be another jira.

The easiest way is similar to what the previous comment described: the gc tool doesn't need to care about it, and the application passes a gc list to it. ($ gc_tool --topics topic_list ; or $ gc_tool -f topic_list_file) I think it would be easier for the application to get such a list.
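
A minimal sketch of how such a command line could be parsed, using Python's standard `argparse`. The `--topics` and `-f` options come from the example above. The comma-separated format for `--topics`, the `--file` long name, and the one-topic-per-line file format are assumptions made for illustration.

```python
import argparse


def parse_args(argv):
    """Accept exactly one of --topics or -f/--file."""
    parser = argparse.ArgumentParser(prog="gc_tool")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--topics", help="comma-separated topic names (assumed format)")
    group.add_argument("-f", "--file", help="file with one topic name per line (assumed format)")
    return parser.parse_args(argv)


def load_topics(args):
    """Return the list of topics to garbage collect, skipping blank entries."""
    if args.topics is not None:
        raw = args.topics.split(",")
    else:
        with open(args.file, encoding="utf-8") as handle:
            raw = handle.read().splitlines()
    return [name.strip() for name in raw if name.strip()]
```

Making the two options mutually exclusive keeps it unambiguous which list the tool acts on.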
                

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Flavio Junqueira (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197812#comment-13197812 ] 

Flavio Junqueira commented on BOOKKEEPER-154:
---------------------------------------------

Ivan and I had an offline discussion about this issue, and here is a summary. We have not reached agreement on a solution, btw, this is just a bit of brainstorming.

We find it better to have an application thread responsible for garbage-collecting messages from subscribers by consuming such messages. It is better in the sense that it avoids introducing functionality that is application specific.

Assuming that this feature is implemented at the application level, we need a way for the application thread to determine:

# the subscribers it needs to watch for;
# the last time each of those subscribers has consumed a message.

This information is in principle available through ZooKeeper, so one way of implementing this feature is to make the information in ZooKeeper available. Having the application access ZooKeeper directly sounds messy, because having the application manipulate the ZooKeeper metadata directly is prone to consistency problems, and it is operationally more difficult (e.g., for open ports). One option is to expose it through Hubs.

Exposing the ZooKeeper metadata via hubs doesn't solve the whole problem. Assuming millions of subscribers, such an application thread would have to loop through the subscribers frequently, inducing a high load. If we could use the watch functionality of ZooKeeper, then perhaps we could have the application thread build a local table of subscribers and update the table when anything changes. This way it has to loop through the same subscribers, but locally.
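
The local-table idea could look roughly like the Python sketch below. It does not use a ZooKeeper client: `on_change` and `on_remove` are hypothetical callbacks that a watch handler would invoke, and the class and method names are made up for illustration. The point is only that the periodic scan for inactive subscribers then runs against local memory rather than against ZooKeeper or the hubs.

```python
class SubscriberTable:
    """Local view of subscribers, kept current by change notifications."""

    def __init__(self):
        # (topic, subscriber) -> last time the subscription changed, in ms
        self._last_active_ms = {}

    def on_change(self, topic, subscriber, modified_ms):
        """Called when a watch reports a created or updated subscription."""
        self._last_active_ms[(topic, subscriber)] = modified_ms

    def on_remove(self, topic, subscriber):
        """Called when a watch reports that a subscription went away."""
        self._last_active_ms.pop((topic, subscriber), None)

    def inactive(self, now_ms, max_inactive_ms):
        """Scan the local table only; no remote reads are needed."""
        return sorted(
            key
            for key, seen_ms in self._last_active_ms.items()
            if now_ms - seen_ms >= max_inactive_ms
        )
```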


                

        

[jira] [Commented] (BOOKKEEPER-154) Garbage collect messages for those subscribers inactive/offline for a long time.

Posted by "Ivan Kelly (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196026#comment-13196026 ] 

Ivan Kelly commented on BOOKKEEPER-154:
---------------------------------------

Yes, I don't think the server should automatically GC messages, as that breaks the guarantees that Hedwig gives. However, it is acceptable to provide an API to do this, as it may be something which the application requires. I think it would be best to put it into a HedwigAdmin client. I think Ben was talking about writing one. See his message on the mailing list on December 19th.
                