Posted to dev@bookkeeper.apache.org by Sijie Guo <gu...@gmail.com> on 2013/01/14 04:53:17 UTC

[Proposal] Support "re-open, append" semantic in BookKeeper.

Hello all,

Currently Hedwig uses *ledgers* to store messages for a topic. This requires
lots of metadata operations when a hub server acquires a topic. These metadata
operations are:

   1. read topic persistence info for a topic. (1 metadata read operation)
   2. close the last opened ledger. (1 metadata read operation, 2 metadata
   write operations)
   3. create a new ledger to write. (1 metadata write operation)
   4. update topic persistence info for the topic to track the new ledger.
   (1 metadata write operation)

So there are at least 2 metadata read operations and 4 metadata write
operations when acquiring a topic. If a hub server that owns lots of topics
restarts, it introduces a spike of metadata accesses to the metadata
storage (e.g. ZooKeeper).
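
For illustration, a rough sketch of this acquire path in Java (TopicPersistenceInfo
and the topicStore calls are hypothetical stand-ins rather than actual Hedwig
classes; only the BookKeeper client calls correspond to existing APIs):

// Hypothetical sketch of the current acquire-topic path.
TopicPersistenceInfo info = topicStore.read(topic);          // 1 metadata read

if (info.lastLedgerId >= 0) {
    // Recovery-open fences and closes the previously written ledger:
    // roughly 1 metadata read + 2 metadata writes.
    LedgerHandle last = bk.openLedger(info.lastLedgerId, DigestType.CRC32, passwd);
    last.close();
}

// Create a fresh ledger for new messages: 1 metadata write.
LedgerHandle current = bk.createLedger(DigestType.CRC32, passwd);

// Record the new ledger in the topic persistence info: 1 metadata write.
info.lastLedgerId = current.getId();
topicStore.write(topic, info);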

Hedwig's current design originates from the ledger's *"write once, read
many"* semantic.

   1. Ledger ids are generated by BookKeeper. Hedwig needs to record the
   ledger id in an extra place, which introduces extra metadata accesses.
   2. A ledger cannot accept any more entries after it is closed => so
   Hedwig has to create a new ledger to write new entries after the ownership
   of a topic changes (e.g. hub server failure, topic release).
   3. A ledger's entries can only be *deleted* by deleting the whole ledger
   => so Hedwig has to roll ledgers, so that entries can be reclaimed by
   *deleting* a ledger after all subscribers have consumed it.

I propose two new APIs, together with a "re-open, append" semantic in
BookKeeper, for high-performance metadata access and easy metadata
management for applications.

public void openLedger(String ledgerName, DigestType digestType, byte[] passwd, Mode mode);

*Mode* indicates the access mode of a ledger, which would be *O_CREATE*,
*O_APPEND*, or *O_RDONLY*.

   - O_CREATE: create a new ledger with the given ledger name. If a ledger
   with that name already exists, fail the creation. Similar to createLedger
   now.
   - O_APPEND: open the ledger with the given ledger name and continue
   writing entries to it.
   - O_RDONLY: open the ledger with the given ledger name w/o changing any
   state, just reading the entries already persisted. Similar to
   openLedgerNoRecovery now.

*ledgerName* indicates the name of a ledger. The user can pick any name he
likes, so he can manage his ledgers in his own way, e.g. introducing a
namespace over them, instead of having BookKeeper generate ledger ids for
him. (In most cases the application needs to find another place to store
the generated ledger id, which is a really bad practice.)
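
To make the intended usage concrete, a minimal sketch against the proposed API
(Mode and the name-based openLedger do not exist in BookKeeper today, and the
call is assumed here to return a LedgerHandle):

// Application-chosen, namespaced ledger name -- no generated id to store elsewhere.
String name = "hedwig/topics/payments/log";

// First acquisition of the topic: create the ledger by name.
LedgerHandle lh = bk.openLedger(name, DigestType.CRC32, passwd, Mode.O_CREATE);
lh.addEntry("message-1".getBytes());

// After a hub failure or topic release, the new owner re-opens the *same*
// ledger and keeps appending -- no new ledger, no extra metadata to track.
LedgerHandle reopened = bk.openLedger(name, DigestType.CRC32, passwd, Mode.O_APPEND);
reopened.addEntry("message-2".getBytes());

// A reader that should not disturb the writer opens the ledger read-only.
LedgerHandle reader = bk.openLedger(name, DigestType.CRC32, passwd, Mode.O_RDONLY);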

public void shrink(long endEntryId, boolean force) throws BKException;

*Shrink* means cutting the entries from *startEntryId* to *endEntryId*
(*endEntryId* is non-inclusive). *startEntryId* is implicit in the ledger
metadata: it is 0 for a ledger that has never been shrunk, and the
*endEntryId* of the previous valid shrink otherwise.

The *force* flag indicates whether to issue a garbage collection request
after we move *startEntryId* to *endEntryId*. If the flag is true, we issue
a garbage collection request to notify the bookie servers to garbage-collect
the trimmed entries; otherwise, we just move *startEntryId* to *endEntryId*.
This feature might be useful for some applications. Take Hedwig for example:
we could leverage this feature to avoid storing subscriber state for topics
that have only one subscriber. Each time a specific number of messages has
been consumed, we move the start point with *shrink(entryId, false)*; after
more messages have been consumed, we garbage-collect them with
*shrink(entryId, true)*.

Using *shrink*, an application can reclaim the disk space occupied by a
ledger w/o creating a new ledger and deleting the old one.
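
For example, the single-subscriber Hedwig pattern above might look like the
following sketch (again against the proposed shrink API, with a hypothetical
lastConsumedEntryId variable):

// Cheaply advance the logical start point after each consumed batch;
// no garbage collection is requested yet.
lh.shrink(lastConsumedEntryId, false);

// Less frequently, ask the bookies to actually reclaim the disk space of
// everything before the current start point.
lh.shrink(lastConsumedEntryId, true);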

These two operations are based on two mechanisms: one is 'session fencing',
and the other is 'improved garbage collection (BOOKKEEPER-464)'.
Details are in the gist https://gist.github.com/4520260. I will try to
start working on some drafts based on this idea to demonstrate its
correctness.

Comments and discussions are welcome.
-Sijie

Re: Discussion Request on Ledger ID Generation

Posted by Flavio Junqueira <fp...@yahoo.com>.
On Jan 19, 2013, at 5:59 PM, Jiannan Wang <ji...@yahoo-inc.com> wrote:

> 
> 3) could ZooKeeper#getChildren carry 8192 children? Flavio, do you have any
> number about it? or maybe we need some experiments to ensure it works.
> -----------------------------------
> 
> 
> I'll make experiments after the coding.
> 
> 
> 


In my understanding, the limit is on the number of bytes, not the number of children; check this old JIRA:

	https://issues.apache.org/jira/browse/ZOOKEEPER-272

-Flavio

> 
> On 1/18/13 4:26 PM, "Sijie Guo" <gu...@gmail.com> wrote:
> 
>> Thanks Jiannan for providing and organizing the discussions.
>> 
>> currently bookkeeper could not handle id confliction perfectly so far
>> especially for the case 3 Flavio mentioned. So we heavily replied on
>> an implementation of id generator need to generate unique ledger id, which
>> means the id generation requires a centralized place to track or
>> co-ordinate. (e.g zookeeper). It is not perfect, but better as Jiannan
>> proposed. But since Jiannan has provided the patch to separate id
>> generation from ledger manager, it would be helpful to the community to
>> evaluate better and fast id generation algorithm. So let's improve it by
>> iterations.
>> 
>>> ledger metadata management.
>> 
>> thinking more about radix tree Jiannan proposed. several comments on it:
>> 
>> 1) how to delete a ledger? so you need to take care of telling from a
>> leave
>> znode from a inner znode in your implementation.
>> 2) taking care of iterating the tree in order, since Scan And Compare gc
>> algorithm relies on the order. But if we evolved to improved gc algorithm,
>> this is not be a problem. maybe we need to start pushing the changes for
>> improved gc algorithm into trunk after 4.2.0 released.
>> 3) could ZooKeeper#getChildren carry 8192 children? Flavio, do you have
>> any
>> number about it? or maybe we need some experiments to ensure it works.
>> 4) if this change wants to be applied on HierarchicalLedgerManager, it
>> would be better to consider how to make the change smoothly, since it is
>> different organized format.
>> 
>> from the above comments, I would suggest that you could start this work as
>> a new ledger manager w/o affecting other ledger managers. And apply the
>> idea back to HierarchicalLedgerManager later if possible.
>> 
>> -Sijie
> 


Re: Discussion Request on Ledger ID Generation

Posted by Jiannan Wang <ji...@yahoo-inc.com>.
Thanks Sijie for discussing the implementation details.

1) how to delete a ledger? so you need to take care of telling from a leave
znode from a inner znode in your implementation.
2) taking care of iterating the tree in order, since Scan And Compare gc
algorithm relies on the order. But if we evolved to improved gc algorithm,
this is not be a problem. maybe we need to start pushing the changes for
improved gc algorithm into trunk after 4.2.0 released.
-----------------------------------

You are right, we can simply distinguish whether an inner znode is deleted
or not by its data size. And iterating the tree in order only requires a BFS
process, though it shows that the previous ScanAndCompare GC may require one
more read for each inner znode.

3) could ZooKeeper#getChildren carry 8192 children? Flavio, do you have any
number about it? or maybe we need some experiments to ensure it works.
-----------------------------------


I'll run experiments once the coding is done.

I would suggest that you could start this work as
a new ledger manager w/o affecting other ledger managers. And apply the
idea back to HierarchicalLedgerManager later if possible.
-----------------------------------


Ok, I'll make a new LedgerManager.


Thanks,
Jiannan



On 1/18/13 4:26 PM, "Sijie Guo" <gu...@gmail.com> wrote:

>Thanks Jiannan for providing and organizing the discussions.
>
>currently bookkeeper could not handle id confliction perfectly so far
>especially for the case 3 Flavio mentioned. So we heavily replied on
>an implementation of id generator need to generate unique ledger id, which
>means the id generation requires a centralized place to track or
>co-ordinate. (e.g zookeeper). It is not perfect, but better as Jiannan
>proposed. But since Jiannan has provided the patch to separate id
>generation from ledger manager, it would be helpful to the community to
>evaluate better and fast id generation algorithm. So let's improve it by
>iterations.
>
>> ledger metadata management.
>
>thinking more about radix tree Jiannan proposed. several comments on it:
>
>1) how to delete a ledger? so you need to take care of telling from a
>leave
>znode from a inner znode in your implementation.
>2) taking care of iterating the tree in order, since Scan And Compare gc
>algorithm relies on the order. But if we evolved to improved gc algorithm,
>this is not be a problem. maybe we need to start pushing the changes for
>improved gc algorithm into trunk after 4.2.0 released.
>3) could ZooKeeper#getChildren carry 8192 children? Flavio, do you have
>any
>number about it? or maybe we need some experiments to ensure it works.
>4) if this change wants to be applied on HierarchicalLedgerManager, it
>would be better to consider how to make the change smoothly, since it is
>different organized format.
>
>from the above comments, I would suggest that you could start this work as
>a new ledger manager w/o affecting other ledger managers. And apply the
>idea back to HierarchicalLedgerManager later if possible.
>
>-Sijie


Re: Discussion Request on Ledger ID Generation

Posted by Sijie Guo <gu...@gmail.com>.
Thanks Jiannan for providing and organizing the discussions.

Currently BookKeeper cannot handle id conflicts perfectly, especially for
case 3 that Flavio mentioned. So we heavily rely on the id generator
implementation to generate unique ledger ids, which means id generation
requires a centralized place to track or coordinate (e.g. ZooKeeper). It is
not perfect, but it is better, as Jiannan proposed. And since Jiannan has
provided a patch to separate id generation from the ledger manager, it will
be easier for the community to evaluate better and faster id generation
algorithms. So let's improve it in iterations.

> ledger metadata management.

Thinking more about the radix tree Jiannan proposed, I have several comments on it:

1) How do we delete a ledger? You need to take care of telling a leaf
znode from an inner znode in your implementation.
2) Take care of iterating the tree in order, since the Scan And Compare GC
algorithm relies on the order. If we evolve to the improved GC algorithm,
this will not be a problem; maybe we need to start pushing the changes for
the improved GC algorithm into trunk after 4.2.0 is released.
3) Could ZooKeeper#getChildren carry 8192 children? Flavio, do you have
any numbers on that? Or maybe we need some experiments to ensure it works.
4) If this change is to be applied to HierarchicalLedgerManager, it
would be better to consider how to make the change smooth, since it uses a
differently organized format.

Based on the above comments, I would suggest that you start this work as
a new ledger manager w/o affecting other ledger managers, and apply the
idea back to HierarchicalLedgerManager later if possible.

-Sijie


On Thu, Jan 17, 2013 at 10:17 AM, Jiannan Wang <ji...@yahoo-inc.com> wrote:

> Hello all,
>    Currently, ledger id generation is implemented with zookeeper
> (persistent/ephemeral) sequential nodes to produce a globally unique id. In code
> detail,
>       - FlatLedgerManager requires a write to zookeeper
>       - HierarchicalLedgerManager and MSLedgerManagerFactory use the same
> approach, which includes a write and a delete operation on zookeeper.
>    Obviously, this ledger id generation process is too heavy, since what
> we want is only a globally unique id. Also, JIRA
> BOOKKEEPER-421<https://issues.apache.org/jira/browse/BOOKKEEPER-421>
> shows that the current ledger id space is limited to 32 bits by the cversion
> (int type) of the zookeeper node. So we need to enlarge the ledger id space to
> 64 bits.
>
>    Then there are two questions:
>       1. How to generate a 64-bit globally unique id?
>       2. How to maintain the metadata for 64-bit ledger ids in zookeeper?
> (Obviously, the current 2-4-4 split for ledger ids is not suitable; see
> HierarchicalLedgerManager.)
>
> --------------I'm a split line for 64 bits ledger id
> generation-----------------------------
>
> For 64-bit globally unique id generation, Flavio, Ivan, Sijie and I had a
> discussion by mail; here are two proposals:
>    1. Let the client generate the id itself (Ivan proposed): leverage the
> zookeeper session id as a unique part while the client maintains a counter in
> memory, so the id would be {session_id}{counter}.
>    2. Batch id generation (Jiannan proposed): use a zookeeper znode as a
> counter to track generated ids. In this implementation, the client asks
> zookeeper for a counter range; after that, id generation proceeds locally
> w/o contacting zookeeper.
>
>    For proposal 1, the performance would be very great since it's local
> generation totally. But Sijie has one concern: "in reality, it seems that
> it doesn't work. zookeeper session id is long, while ledger id is long, you
> could not put session id as part of ledger id. otherwise, it would cause id
> conflict..".
>    Flavio and Ivan then suggested that perhaps we could simply use a
> procedure similar to the one used in ZooKeeper to generate and increment
> session ids. But Sijie pointed out that this process in
> zookeeper includes the current system timestamp, which may exhaust the 64-bit
> id space quickly. Also, Flavio is thinking of reusing ledger identifiers,
> but he noted that there are three scenarios if we reuse a ledger
> identifier:
>       1- The previous ledger still exists and its metadata is stored. In
> this case, we can detect it when trying to create the metadata for the new
> ledger;
>       2- The previous ledger has been fully deleted (metadata +  ledger
> fragments);
>       3- Metadata for the previous ledger has been deleted, but the ledger
> fragments haven't.
>    Flavio: "Case 1 can be easily detected, while case 2 causes no problem
> at all. Case 3 is the problematic one, but I can't remember whether it can
> happen or not given the way we do garbage collection currently. I need to
> review how we do it, but in the case scenario 3 can happen, we could have
> the ledger writers using different master keys, which would cause the
> bookie to return an error when trying to write to a ledger that already
> exists."
>
>    For proposal 2, it still requires access to zookeeper, but the write
> frequency can be quite small once we set a large batch size (like 10000).
>
>    In summary, proposal 1 aims to generate a UUID/GUID-like id in a 64-bit
> space, but the possibility of conflicts has to be taken into account, and if
> the generated ids are not monotonic we have to take care of case 3 listed
> above. Proposal 2 has no problem generating monotonic ids quickly, but the
> process involves zookeeper.
>    By the way, I've submitted a patch in BOOKKEEPER-438<
> https://issues.apache.org/jira/browse/BOOKKEEPER-438> to move ledger id
> generation out of LedgerManager, and I'll add a conf setting in another
> JIRA to give the bookkeeper client a chance to customize its own id
> generation. I'd appreciate it if anyone can help review the patch (thanks
> Sijie first).
>
> --------------I'm a split line for 64 bits ledger id metadata
> management-----------------------------
>
>    HierarchicalLedgerManager uses a 2-4-4 split of the current 10-character
> ledger id. E.g. ledger 0000000001 is split into 3 parts, 00, 0000, 0001, and
> stored at the zookeeper path "(ledgersRootPath)/00/0000/L0001". So each znode
> can hold at most 10000 ledgers, which avoids errors during garbage
> collection due to lists of children that are too long.
>    After we enlarge the ledger id space to 64 bits, managing such large
> ledger ids becomes a big problem.
>
>    My idea is to split the ledger id under radix 2^13=8192 and then
> organize the ids in a radix tree. For example, for ledger ids 2, 5, and
> 41093 (== 5*8192+133), the tree in zookeeper would be:
>          (ledger id root)
>             /      \
>         2 (meta)   5 (meta)
>                      \
>                   133 (meta)
>    So there will be at most 8192 children under each znode, and the depth
> is at most 5 (64/13, rounded up).
>    Note that inner znodes also record metadata, so if ledger ids are
> generated incrementally, the depth of this radix tree only grows as needed.
> And I guess it can ideally handle all 2^64 ledger ids.
>
>    Speaking of metadata, I would like to share a test result we obtained
> over the past two days. For HierarchicalLedgerManager, we observe that a
> ledger's metadata consumes 700+ bytes in zookeeper; this is possibly because
> LedgerMetadata.serialize() uses a pure text format. But the data size in the
> ledger id node is only 300+ bytes, and I guess the extra space is occupied
> by the overhead of the inner hierarchical nodes. What's more, the memory a
> topic consumes is 2KB with only 1 subscriber and no publishes: there is no
> metadata for topic ownership (since we now use consistent hashing for topic
> ownership), and the metadata sizes for subscription and persistence are both
> 8 bytes. I'll investigate more and then start a new thread on it.
>
>
> Best,
> Jiannan
>
>
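
To illustrate the radix-tree layout described in the quoted mail, a small
sketch of the id-to-znode-path mapping (the path root and helper method are
hypothetical; only the base-8192 arithmetic reflects the proposal):

// Split a 64-bit ledger id into base-8192 (13-bit) components, most
// significant first, so the tree only grows as deep as the ids require.
static String ledgerPath(String root, long ledgerId) {
    java.util.Deque<Long> parts = new java.util.ArrayDeque<>();
    do {
        parts.push(Long.remainderUnsigned(ledgerId, 8192));
        ledgerId = Long.divideUnsigned(ledgerId, 8192);
    } while (ledgerId != 0);
    StringBuilder path = new StringBuilder(root);
    for (long part : parts) {
        path.append('/').append(part);
    }
    return path.toString();
}

// ledgerPath("/ledgers", 2)     -> "/ledgers/2"
// ledgerPath("/ledgers", 5)     -> "/ledgers/5"
// ledgerPath("/ledgers", 41093) -> "/ledgers/5/133"   (41093 == 5*8192 + 133)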

Re: Discussion Request on Ledger ID Generation

Posted by Jiannan Wang <ji...@yahoo-inc.com>.
So you also involve zookeeper...

And in my view, this approach is the same as the batch ledger id generation
proposal (with a batch size of 2^24).

- Jiannan

>On Fri, Jan 18, 2013 at 03:17:47AM +0900, Jiannan Wang wrote:
>>    For proposal 1, the performance would be very great since it's
>>    local generation totally. But Sijie has one concern: "in reality,
>>    it seems that it doesn't work. zookeeper session id is long,
>>    while ledger id is long, you could not put session id as part of
>>    ledger id. otherwise, it would cause id conflict..".
>The session id part doesn't have to be 64 bit. It's 64bit if you use
>the zookeeper session ID, but we could have the bookkeeper client
>generate it when it connects to zookeeper by having a znode, whose
>contents each client increments. It means 1 more read and 1 more write
>per client, but I don't think thats a big hit.
>
>So if we give the session id part, 40 bits, and the client local bit
>24, then each client can create up for 2^24 ledgers, and there can be
>2^40 clients over the lifetime of the system. This numbers can be
>tuned.
>
>-Ivan


Re: Discussion Request on Ledger ID Generation

Posted by Ivan Kelly <iv...@apache.org>.
On Fri, Jan 18, 2013 at 03:17:47AM +0900, Jiannan Wang wrote:
>    For proposal 1, the performance would be very great since it's
>    local generation totally. But Sijie has one concern: "in reality,
>    it seems that it doesn't work. zookeeper session id is long,
>    while ledger id is long, you could not put session id as part of
>    ledger id. otherwise, it would cause id conflict..". 
The session id part doesn't have to be 64 bits. It's 64 bits if you use
the zookeeper session ID, but we could have the bookkeeper client
generate it when it connects to zookeeper, by having a znode whose
contents each client increments. It means 1 more read and 1 more write
per client, but I don't think that's a big hit.

So if we give the session id part 40 bits, and the client-local part
24 bits, then each client can create up to 2^24 ledgers, and there can be
2^40 clients over the lifetime of the system. These numbers can be
tuned.

-Ivan
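
For what it's worth, a tiny sketch of the composition Ivan describes (the 40/24
split is from his mail; how the session part gets assigned from a znode is
assumed and not shown):

// 40-bit per-client "session" part plus a 24-bit locally incremented counter.
static final int COUNTER_BITS = 24;
static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;   // 0xFFFFFF
static final long SESSION_MASK = (1L << 40) - 1;

static long ledgerId(long sessionPart, long localCounter) {
    // Note: using all 40 session bits touches the sign bit of the long;
    // a real scheme would probably reserve the top bit.
    return ((sessionPart & SESSION_MASK) << COUNTER_BITS)
            | (localCounter & COUNTER_MASK);
}

// Each client can mint 2^24 ids locally before it needs a new session part,
// which is why this is equivalent to batch allocation with batch size 2^24.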

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
See comments inline.

On Fri, Jan 18, 2013 at 12:54 AM, Flavio Junqueira <fp...@yahoo.com> wrote:

> When you say that 'create a ledger and record the ledger id in some other
> places' is bad, I assume that you're referring to the number of metadata
> accesses. The pattern itself is not bad; BookKeeper provides a ledger
> abstraction, the application designer decides the best way of using those
> ledgers. Storing the ledgers elsewhere is a common pattern indeed, but
> again, managed ledgers have been proposed in my understanding to cover
> common patterns like this one.
>

Metadata accesses are one thing; leaking ledger ids and causing zombie
metadata is another.

For managed ledgers, I explained it in my previous email.


>
> About the use of names instead of longs, I believe one reason we did it is
> that we didn't initially expect users to manipulate ledgers directly, so
> programs manipulate the ledgers. Consequently, doing things like comparing
> two ledger ids is more efficient with longs. Granted that for
> administration perhaps human readable names could be easier.
>
> > Removing metadata accessing means
> > you could not change ledgers, a better way to do that is to 'shrink' a
> > ledger.
>
> I think this is a great point, but can we actually review why publishes
> are causing metadata accesses? This should only happen upon topic creation,
> acquisition, or when rolling the log, no? For the first one and for the
> particular application I believe we are discussing, we shouldn't be
> creating and destroying topics as frequently as we are. Topic acquisition
> is a problem in the case of crashes, but for topic acquisition, we'll need
> at least one access to the metadata anyway, although we could amortize an
> access across many ledgers if we have groups. For the last one, I believe
> we can increase the frequency of rolling.
>

The reasons are explained in my previous email.

One more thing about 'ledger id': with the 'ledger id' style there are two
pieces of metadata, one is the ledger metadata and the other is the
application metadata (e.g. hedwig persistence info). Even if you group the
ledger metadata, if you don't group the application metadata it is still
'linear'. If we could consolidate the two pieces of metadata into one, and
then apply some kind of grouping idea, it might really get rid of the
'linear' behavior.


>
> -Flavio
>
> On Jan 18, 2013, at 7:56 AM, Sijie Guo <gu...@gmail.com> wrote:
>
> > As I pointed, the style 'create a ledger and record the ledger id in some
> > other places'  is bad, especially changing ledger would cause 2 metadata
> > writes and it is in the publish path, which caused latency spike (the
> > latency spike depends on the two metadata writes latency). The reason
> why I
> > proposed using 'ledger name', 're-open' and 'shrink', is trying to remove
> > any metadata accesses during publishes. Removing metadata accessing means
> > you could not change ledgers, a better way to do that is to 'shrink' a
> > ledger.
> >
> > -Sijie
> >
> > On Thu, Jan 17, 2013 at 7:26 AM, Ivan Kelly <iv...@apache.org> wrote:
> >
> >> Also, I don't think the shrink operation is necessary. The aim here is
> >> to avoid the metadata spike, yes? If so, we can still roll the ledger
> >> when it gets to a certain capacity, and this is unlikely to happen on
> >> a spike.
> >>
>
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Flavio Junqueira <fp...@yahoo.com>.
When you say that 'create a ledger and record the ledger id in some other places' is bad, I assume that you're referring to the number of metadata accesses. The pattern itself is not bad; BookKeeper provides a ledger abstraction, the application designer decides the best way of using those ledgers. Storing the ledgers elsewhere is a common pattern indeed, but again, managed ledgers have been proposed in my understanding to cover common patterns like this one. 

About the use of names instead of longs, I believe one reason we did it is that we didn't initially expect users to manipulate ledgers directly, so programs manipulate the ledgers. Consequently, doing things like comparing two ledger ids is more efficient with longs. Granted that for administration perhaps human readable names could be easier. 

> Removing metadata accessing means
> you could not change ledgers, a better way to do that is to 'shrink' a
> ledger.

I think this is a great point, but can we actually review why publishes are causing metadata accesses? This should only happen upon topic creation, acquisition, or when rolling the log, no? For the first one and for the particular application I believe we are discussing, we shouldn't be creating and destroying topics as frequently as we are. Topic acquisition is a problem in the case of crashes, but for topic acquisition, we'll need at least one access to the metadata anyway, although we could amortize an access across many ledgers if we have groups. For the last one, I believe we can increase the frequency of rolling.

-Flavio 

On Jan 18, 2013, at 7:56 AM, Sijie Guo <gu...@gmail.com> wrote:

> As I pointed, the style 'create a ledger and record the ledger id in some
> other places'  is bad, especially changing ledger would cause 2 metadata
> writes and it is in the publish path, which caused latency spike (the
> latency spike depends on the two metadata writes latency). The reason why I
> proposed using 'ledger name', 're-open' and 'shrink', is trying to remove
> any metadata accesses during publishes. Removing metadata accessing means
> you could not change ledgers, a better way to do that is to 'shrink' a
> ledger.
> 
> -Sijie
> 
> On Thu, Jan 17, 2013 at 7:26 AM, Ivan Kelly <iv...@apache.org> wrote:
> 
>> Also, I don't think the shrink operation is necessary. The aim here is
>> to avoid the metadata spike, yes? If so, we can still roll the ledger
>> when it gets to a certain capacity, and this is unlikely to happen on
>> a spike.
>> 


Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Ivan Kelly <iv...@apache.org>.
On Fri, Jan 18, 2013 at 08:07:14PM -0800, Sijie Guo wrote:
> in most of case, there are subscriptions, since Hedwig's use case is
> reliable pub/sub system. so most of applications subscribed and never
> unsubscribe. so the case you described 'If there's no subscribers, we can
> just discard the message' is rare in Hedwig.
Yes, so my point is that it's quite straightforward to avoid publish
latency in this scenario.

> It affects the incoming publish requests at the 'ledger change' time
> period. Especially, for a continuous publish stream, how you decide the
> point to avoid the hit?
> 
> One possible way is Hub server could pre-create ledger before the point
> changing ledger. pre-creation could run in background, not affecting
> publish path. but pre-creation increase the possibility to have zombie
> ledgers and exhausted ledger id space. And also it would make ledger change
> logic pretty complex. so it is not good.
You have to deal with zombies at some level anyhow. If you don't have
zombie ledgers, you can still have zombie topics. If you fix the
zombie topics, you can still have zombie users. In any case, I don't
think the zombie ledgers are a huge issue. Let's say that for each
crash you zombify 100 ledgers; over a year of running a 1000-machine
cluster you may have 100 crashes, that's only 10,000 zombies, which
really isn't very much at that scale.

I don't think it will make the code complex either. We just need a
preallocator thread that feeds into a blocking queue; the code that
wants a ledger can then just take from the queue.

-Ivan
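
A minimal sketch of the preallocator described above (java.util.concurrent and
BookKeeper client imports omitted; the ensemble/quorum sizes and digest type
are arbitrary choices here, not something specified in this thread):

// Background thread keeps a small pool of pre-created ledgers so the
// publish path never waits on ledger-creation metadata writes.
final BlockingQueue<LedgerHandle> pool = new ArrayBlockingQueue<>(4);

Thread preallocator = new Thread(() -> {
    try {
        while (!Thread.currentThread().isInterrupted()) {
            // put() blocks once the pool is full, bounding potential zombies.
            pool.put(bk.createLedger(3, 2, DigestType.CRC32, passwd));
        }
    } catch (InterruptedException | BKException e) {
        // shutting down; a real implementation would log and maybe retry
    }
});
preallocator.setDaemon(true);
preallocator.start();

// On the ledger-roll path, just take a ready ledger off the queue
// (take() also throws InterruptedException; handling omitted here).
LedgerHandle next = pool.take();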

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
On Fri, Jan 18, 2013 at 8:06 AM, Ivan Kelly <iv...@apache.org> wrote:

> You touch on a key point here (it's in the the publish path). But it
> doesn't have to be. There's two cases where this is in the publish
> path, first publish and rolling the ledger.
>
> For the first publish to a topic, a ledger only needed to be created
> if we are persisting the message, i.e. there are already subscribers
> for that topic. If there's no subscribers, we can just discard the
> message.
>


In most cases there are subscriptions, since Hedwig's use case is a
reliable pub/sub system, so most applications subscribe and never
unsubscribe. So the case you described, 'If there's no subscribers, we can
just discard the message', is rare in Hedwig.



>
> In the second case, for rolling, we need to create a new ledger and
> write it to the topic persistence info. This doesn't need to be in the
> critical path though. They can occur after a message has been
> persisted, to avoid the hit.
>

It affects the incoming publish requests around the 'ledger change' time
period. Especially for a continuous publish stream, how do you decide the
point so as to avoid the hit?

One possible way is that the hub server could pre-create a ledger before the
ledger-change point. Pre-creation could run in the background, not affecting
the publish path, but pre-creation increases the possibility of zombie
ledgers and of exhausting the ledger id space. It would also make the ledger
change logic pretty complex, so it is not good.

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Ivan Kelly <iv...@apache.org>.
On Thu, Jan 17, 2013 at 10:56:12PM -0800, Sijie Guo wrote:
> 'create a ledger and record the ledger id in some other places' is bad
I disagree. This approach allows us to keep the bookkeeper interface
completely decoupled from whatever is using it. It keeps the data
model of bookkeeper simple. 

> especially changing ledger would cause 2 metadata writes and it is
> in the publish path, which caused latency spike (the latency spike
> depends on the two metadata writes latency). 
You touch on a key point here (it's in the publish path). But it
doesn't have to be. There are two cases where this is in the publish
path: first publish and rolling the ledger.

For the first publish to a topic, a ledger only needs to be created
if we are persisting the message, i.e. there are already subscribers
for that topic. If there are no subscribers, we can just discard the
message.

In the second case, for rolling, we need to create a new ledger and
write it to the topic persistence info. This doesn't need to be in the
critical path though. It can occur after a message has been
persisted, to avoid the hit.

> The reason why I  proposed using 'ledger name', 're-open' and
> 'shrink', is trying to remove any metadata accesses during
> publishes. Removing metadata accessing means you could not change
> ledgers, a better way to do that is to 'shrink' a ledger.
I think this can be achieved with the 're-open' operation alone. The
reason I'm being so resistant to 'ledger name' and 'shrink' is that I
don't like changing lower layers until it's absolutely clear that
nothing can be done at the higher layers to solve the problem,
i.e. until it shows that there's a fundamental shortcoming in the lower
layer. I don't think that is the case for ledger name and shrink.

-Ivan



Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
As I pointed out, the style 'create a ledger and record the ledger id in
some other place' is bad; in particular, changing ledgers causes 2 metadata
writes, and it is in the publish path, which causes a latency spike (the
latency spike depends on the latency of the two metadata writes). The reason
why I proposed 'ledger name', 're-open' and 'shrink' is to try to remove
any metadata accesses during publishes. Removing metadata accesses means
you cannot change ledgers; a better way to do that is to 'shrink' a
ledger.

-Sijie

On Thu, Jan 17, 2013 at 7:26 AM, Ivan Kelly <iv...@apache.org> wrote:

> Also, I don't think the shrink operation is necessary. The aim here is
> to avoid the metadata spike, yes? If so, we can still roll the ledger
> when it gets to a certain capacity, and this is unlikely to happen on
> a spike.
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Ivan Kelly <iv...@apache.org>.
> WC got a normal I/O, entering ensemble change logic. But it could not do
> ensemble change, since ledger state is already changed from OPEN to
> IN_RECOVERY. so it would fail. If it doesn't stop and fail, it is a bug for
> fencing, right? Please keep in mind, fencing is guarantee by bother
> metadata CAS and fence state in bookie servers. We changed the metadata
> before proceeding any actions. It would guarantee one is succeed and the
> other one is failed.
Ah, so the write I was referring to is actually the one write that you
propose keeping? I think that could work then.

I would define it as
BookKeeper#resumeLedger(int ledgerId, DigestType type, byte[] passwd)
though. Overloading openLedger is a bit ugly.

Also, I don't think the shrink operation is necessary. The aim here is
to avoid the metadata spike, yes? If so, we can still roll the ledger
when it gets to a certain capacity, and this is unlikely to happen on
a spike. Not having the shrink operation will also help preserve the
"immutable after close" property that ledgers have.


-Ivan

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
First of all, this proposal is not bound to any special case. It would be a
performance improvement (both when restarting hub servers and when changing
ledgers) and would solve some problems caused by the current working style,
no matter how many topics a hub server owns. It is generic and benefits
every BookKeeper user. I added some references in that gist; to clarify, I
add them here again. In particular, the latency spike was already found in
BOOKKEEPER-448 during some previous load testing. And the proposal here
really gets rid of metadata accesses in the publish request path, which is
critical to most applications.

BOOKKEEPER-448: reduce publish latency when changing ledgers (
https://issues.apache.org/jira/browse/BOOKKEEPER-448)
BOOKKEEPER-449: Zombie ledgers existed in Hedwig (
https://issues.apache.org/jira/browse/BOOKKEEPER-449)

Secondly, metadata improvement is not easy work; it can proceed in several
dimensions and be improved in iterations. This proposal is just the first
step to optimize it; I don't claim it resolves everything. And so far I
don't see any side effects that would affect other solutions made in the
future.

The last thing is about your comment on fencing. I am not very clear on
your point here.

> What happens if the fenced server fails
> between being fenced and WC trying to write? It will get a normal i/o
> error.

WC gets a normal I/O error and enters the ensemble change logic. But it
cannot do the ensemble change, since the ledger state has already been
changed from OPEN to IN_RECOVERY, so it will fail. If it doesn't stop and
fail, it is a bug in fencing, right? Please keep in mind that fencing is
guaranteed by both the metadata CAS and the fence state on the bookie
servers. We change the metadata before proceeding with any actions, which
guarantees that one side succeeds and the other fails.

Session fencing uses a similar mechanism: only one party succeeds in
incrementing the session id and gains the permission to proceed.

If fencing works correctly, session fencing works correctly too.

-Sijie


On Wed, Jan 16, 2013 at 11:00 AM, Ivan Kelly <iv...@apache.org> wrote:

> On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
> > > Originally, it was meant to have a number of
> > > long lived subscriptions, over which a lot of data travelled. Now the
> > > load has flipped to a large number of short lived subscriptions, over
> > > which relatively little data travels.
> >
> > The topic discussed here doesn't relate to hedwig subscriptions, it just
> > about how hedwig use ledgers to store its messages.  Even there are no
> > subscriptions, the problem is still there. The restart of a hub server
> > carrying large number of topics would hit the metadata storage with many
> > accesses. The hit is a hub server acquiring a topic, no matter the
> > subscription is long lived or short lived. after topic is acquired,
> > following accesses are in memory, which doesn't cause any
> > performance issue.
> I was using topics and subscriptions to mean the same thing here due
> to the usecase we have in Yahoo where they're effectively the same
> thing. But yes, I should have said topic. But my point still
> stands. Hedwig was designed to deal with fewer topics, which had a lot
> of data passing through them, rather than more topics, with very
> little data passing though them. This is why zk was consider suffient
> at that point, as tens of thousands of topics being recovered really
> isn't an issue for zk. The point I was driving at is that, the usecase
> has changed in a big way, so it may require a big change to handle it.
>
> > But we should separate the capacity problem from the software problem. A
> > high performance and scalable metadata storage would help for resolving
> > capacity problem. but either implementing a new one or leveraging a high
> > performance one doesn't change the fact that it still need so many
> metadata
> > accesses to acquire topic. A bad implementation causing such many
> metadata
> > accesses is a software problem. If we had chance to improve it, why
> > not?
> I don't think the implementation is bad, but rather the assumptions,
> as I said earlier. The data:metadata ratio has completely changed
> completely. hedwig/bk were designed with a data:metadata ratio of
> something like 100000:1. What we're talking about now is more like 1:1
> and therefore we need to be able to handle an order of magnitude more
> of metadata than previously. Bringing down the number of writes by an
> order of 2 or 3, while a nice optimisation, is just putting duct tape
> on the problem.
>
> >
> > > The ledger can still be read many times, but you have removed the
> > guarantee that what is read each time will be the same thing.
> >
> > How we guarantee a reader's behavior when a ledger is removed at the same
> > time? We don't guarantee it right now, right? It is similar thing for a
> > 'shrink' operation which remove part of entries, while 'delete' operation
> > removes whole entries?
> >
> > And if I remembered correctly, readers only see the same thing when a
> > ledger is closed. What I proposed doesn't volatile this contract.  If a
> > ledger is closed (state is in CLOSED), an application can't re-open it.
> If
> > a ledger isn't closed yet, an application can recover previous state and
> > continue writing entries using this ledger. for applications, they could
> > still use 'create-close-create' style to use ledgers, or evolve to new
> api
> > for efficiency smoothly, w/o breaking any backward compatibility.
> Ah, yes, I misread your proposal originally, I thought the reopen was
> working with an already closed ledger.
>
> On a side note, the reason we have an initial write for fencing, is
> that when the reading client(RC) fences, the servers in the ensemble
> start returning an error to the writing client (WC). At the moment we
> don't distinguish between a fencing error and a i/o error for
> example. So WC will try to rebuild a the ensemble by replacing the
> erroring servers. Before writing to the new ensemble, it has to update
> the metadata, and at this point it will see that it has been
> fenced. With a specific FENCED error, we could avoid this write. This
> makes me uncomfortable though. What happens if the fenced server fails
> between being fenced and WC trying to write? It will get a normal i/o
> error. And will try to replace the server. Since the metadata has not
> been changed, nothing will stop it, and it may be able to continue
> writing. I think this is also the case for the session fencing solution.
>
> -Ivan
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
Re-clarifying the points in the proposal to make them clear:

1) change ledger id to ledger name.

Giving a name is a more natural way for human beings; id generation just
eases the work of naming a ledger. I don't see any optimization or
efficiency gained in BookKeeper so far that is based on a 'long' ledger id.

2) session fencing rather than fencing.

same working mechanism as fencing. same correctness guarantee.

3) remove entries range rather than delete a ledger.

same working mechanism as improved gc proposed in BOOKKEEPER-464.
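
A minimal sketch of the compare-and-swap idea behind point 2, using ZooKeeper's
versioned setData as the CAS (org.apache.zookeeper imports omitted); this only
illustrates "exactly one re-opener wins", not the actual ledger metadata format
or the bookie-side fence state:

// Try to become the ledger's current session by bumping a session id stored
// in the ledger's metadata znode. For a given znode version, only one
// contender's setData succeeds; the others get BadVersionException.
static boolean tryBumpSession(ZooKeeper zk, String metadataZnode)
        throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    long next = Long.parseLong(new String(zk.getData(metadataZnode, false, stat))) + 1;
    try {
        // CAS: succeeds only if nobody changed the metadata since our read.
        zk.setData(metadataZnode, Long.toString(next).getBytes(), stat.getVersion());
        return true;   // we own session `next`; older sessions are now fenced out
    } catch (KeeperException.BadVersionException e) {
        return false;  // another client re-opened the ledger first
    }
}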


On Wed, Jan 16, 2013 at 10:58 PM, Sijie Guo <gu...@gmail.com> wrote:

> > If this observation is correct, can we acquire those topics lazily
> instead of eagerly?
>
> could you add entries to a ledger without opening ledger?
>
> > Another approach would be to group topics into buckets and have hubs
> acquiring buckets of topics, all topics of a bucket writing to the same
> ledgers. There is an obvious drawback to this approach when reading
> messages for a given topic.
>
> for topic ownership, we could proceed it in groups as I proposed in
> BOOKKEEPER-508.
> for topic persistence info, we should not do that, otherwise it violates
> the contract that Hedwig made and adds complexity into Hedwig itself.
> Grouping could be considered on the application side, but if the application
> cannot do that, or finds it difficult, we still have to face that many topics
> to make Hedwig a scalable platform.
>
> I don't like what you said 'the number of accesses to the metadata is
> still linear on the number of topics'. Metadata accesses problem is an
> 'entity locating' problem. Two categories of solutions are used to locate
> an entity, one is mapping, the other one is computation.
>
> Mapping means you need to find a place to store this entity or store the
> location of this entity. Like HDFS NameNode storing the locations of blocks
> in Namenode. If an entity existed in this way, you had at least one access
> to find it. no matter it was stored in ZooKeeper, in memory, embedded in
> bookie servers or grouped in somewhere. You could not make any changes to
> the 'linear' relationship between number of accesses and number of
> entities. To make it more clearly, in HDFS, 'entity' means a data block, if
> you had an entity location, read data is fast, but cross boundary of
> blocks, a metadata lookup is required. Similarly, in Hedwig, 'entity' means
> a topic, if the topic is owned, read messages in this topic is fast. But in
> either case, the number of accesses is still linear on the number of
> entities.
>
> Computation means I could compute the location of an entity based on its
> name, without looking up somewhere. E.g, Consistent Hashing.
>
> Currently, BookKeeper only gives a ledger id, so Hedwig has to find a
> place to record this ledger id. This is a 'Mapping'. So it is 2 levels
> mapping. one is topic name to Hedwig's metadata, the other is Hedwig
> metadata to Ledger's metadata. You could not get rid of 'linear' limitation
> w/o breaking the mapping between Hedwig metadata and Ledger metadata.
>
> What I proposed is still 'linear'. But it consolidates these two metadata
> into one, and breaks the mapping between Hedwig metadata and Ledger
> metadata. Although Hedwig still requires one lookup from topic name to
> metadata, it gives the opportunity to get rid of this lookup in future
> since we could use 'ledger name' to access its entries now.
>
> An idea like 'placement group' in ceph could be deployed in BookKeeper in
> future, which grouping ledgers to share same ensemble configuration. It
> would reduce the number of mappings between ledger name to ledger metadata.
> but it should be a different topic to discuss. All what I said here is just
> to demonstrate the proposal I made here could make the improvement proceed
> in right direction.
>
>
>
>
> On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <fp...@yahoo.com>wrote:
>
>> Thanks for the great discussion so far. I have a couple of comments below:
>>
>> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <iv...@apache.org> wrote:
>>
>> > On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
>> >>> Originally, it was meant to have a number of
>> >>> long lived subscriptions, over which a lot of data travelled. Now the
>> >>> load has flipped to a large number of short lived subscriptions, over
>> >>> which relatively little data travels.
>> >>
>> >> The topic discussed here doesn't relate to hedwig subscriptions, it
>> just
>> >> about how hedwig use ledgers to store its messages.  Even there are no
>> >> subscriptions, the problem is still there. The restart of a hub server
>> >> carrying large number of topics would hit the metadata storage with
>> many
>> >> accesses. The hit is a hub server acquiring a topic, no matter the
>> >> subscription is long lived or short lived. after topic is acquired,
>> >> following accesses are in memory, which doesn't cause any
>> >> performance issue.
>> > I was using topics and subscriptions to mean the same thing here due
>> > to the usecase we have in Yahoo where they're effectively the same
>> > thing. But yes, I should have said topic. But my point still
>> > stands. Hedwig was designed to deal with fewer topics, which had a lot
>> > of data passing through them, rather than more topics, with very
>> > little data passing though them. This is why zk was consider suffient
>> > at that point, as tens of thousands of topics being recovered really
>> > isn't an issue for zk. The point I was driving at is that, the usecase
>> > has changed in a big way, so it may require a big change to handle it.
>> >
>>
>> About the restart of a hub, if I understand the problem, many topics will
>> be acquired concurrently, inducing the load spike you're mentioning. If
>> this observation is correct, can we acquire those topics lazily instead of
>> eagerly?
>>
>> >> But we should separate the capacity problem from the software problem.
>> A
>> >> high performance and scalable metadata storage would help for resolving
>> >> capacity problem. but either implementing a new one or leveraging a
>> high
>> >> performance one doesn't change the fact that it still need so many
>> metadata
>> >> accesses to acquire topic. A bad implementation causing such many
>> metadata
>> >> accesses is a software problem. If we had chance to improve it, why
>> >> not?
>> > I don't think the implementation is bad, but rather the assumptions,
>> > as I said earlier. The data:metadata ratio has completely changed
>> > completely. hedwig/bk were designed with a data:metadata ratio of
>> > something like 100000:1. What we're talking about now is more like 1:1
>> > and therefore we need to be able to handle an order of magnitude more
>> > of metadata than previously. Bringing down the number of writes by an
>> > order of 2 or 3, while a nice optimisation, is just putting duct tape
>> > on the problem.
>>
>>
>> I love the duct tape analogy. The number of accesses to the metadata is
>> still linear on the number of topics, and reducing the number of accesses
>> per acquisition does not change the complexity. Perhaps distributing the
>> accesses over time, optimized or not, might be a good way to proceed
>> overall. Another approach would be to group topics into buckets and have
>> hubs acquiring buckets of topics, all topics of a bucket writing to the
>> same ledgers. There is an obvious drawback to this approach when reading
>> messages for a given topic.
>>
>> >
>> >>
>> >>> The ledger can still be read many times, but you have removed the
>> >> guarantee that what is read each time will be the same thing.
>> >>
>> >> How we guarantee a reader's behavior when a ledger is removed at the
>> same
>> >> time? We don't guarantee it right now, right? It is similar thing for a
>> >> 'shrink' operation which remove part of entries, while 'delete'
>> operation
>> >> removes whole entries?
>> >>
>> >> And if I remembered correctly, readers only see the same thing when a
>> >> ledger is closed. What I proposed doesn't volatile this contract.  If a
>> >> ledger is closed (state is in CLOSED), an application can't re-open
>> it. If
>> >> a ledger isn't closed yet, an application can recover previous state
>> and
>> >> continue writing entries using this ledger. for applications, they
>> could
>> >> still use 'create-close-create' style to use ledgers, or evolve to new
>> api
>> >> for efficiency smoothly, w/o breaking any backward compatibility.
>> > Ah, yes, I misread your proposal originally, I thought the reopen was
>> > working with an already closed ledger.
>> >
>> > On a side note, the reason we have an initial write for fencing, is
>> > that when the reading client(RC) fences, the servers in the ensemble
>> > start returning an error to the writing client (WC). At the moment we
>> > don't distinguish between a fencing error and a i/o error for
>> > example. So WC will try to rebuild a the ensemble by replacing the
>> > erroring servers. Before writing to the new ensemble, it has to update
>> > the metadata, and at this point it will see that it has been
>> > fenced. With a specific FENCED error, we could avoid this write. This
>> > makes me uncomfortable though. What happens if the fenced server fails
>> > between being fenced and WC trying to write? It will get a normal i/o
>> > error. And will try to replace the server. Since the metadata has not
>> > been changed, nothing will stop it, and it may be able to continue
>> > writing. I think this is also the case for the session fencing solution.
>> >
>> > -Ivan
>>
>>
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
Check comments inline...


On Fri, Jan 18, 2013 at 6:08 AM, Flavio Junqueira <fp...@yahoo.com> wrote:

> One problem for me with this thread is that there are parallel discussions
> it refers to and the information seems to be spread across different jiras,
> other threads, and private communication. I personally find it hard to be
> constructive and converge like that, which I think it should be our goal.
> Can we find a better way of making progress, perhaps discussing issues
> separately and prioritizing them? One possibility is that we have open
> calls (e.g., skype, google hangout) to discuss concrete issues. We should
> of course post our findings to the list or wiki, whichever is more
> convenient.
>

I tried my best to keep this thread from getting out of control. But I am
not sure whether my gist is so bad that you don't understand it, or whether
you haven't taken a look at it and its references. I found that I just end
up clarifying, again and again, points that are already made in the
document. I was tired of that (it required lots of context switches outside
working time). I don't think any open call would be efficient before we are
on the same page, so I will step back and discuss the ledger id and its
working style in another thread first.


>
> For the moment, I'll add some more comments trying to focus on
> clarifications.
>
>
> >> On Jan 17, 2013, at 7:58 AM, Sijie Guo <gu...@gmail.com> wrote:
> >>
> >>>> If this observation is correct, can we acquire those topics lazily
> >>> instead of eagerly?
> >>>
> >>> could you add entries to a ledger without opening ledger?
> >>>
> >>
> >> By acquiring topics lazily, I meant to say that we delay topic
> acquisition
> >> until there is an operation for such a topic. This is a pretty standard
> >> technique to avoid load spikes.
> >>
> >> I'm sorry if you know the concept already, but to make sure we
> understand
> >> each other, this is the technique what I'm referring to:
> >>
> >>        http://en.wikipedia.org/wiki/Lazy_loading
> >>
> >> Jiannan pointed out one issue with this approach, though. Acquiring
> topics
> >> lazily implies that some operation triggers the acquisition of the
> topic.
> >> Such a trigger could be say a publish operation. In the case of
> >> subscriptions, however, there is no operation that could trigger the
> >> acquisition.
> >>
> >
> > You need to first checkout the operations in Hedwig before talking about
> > lazy loading. for Hedwig, most of the operations are publish and
> subscribe.
> > For publish request, it requires ledger to be created to add entries, so
> > you could not load persistence info (ledger metadata) lazily; for
> subscribe
> > request, it means some one needs to read entries to deliver, it requires
> > ledger to be opened to read entries, so you could still not load
> > persistence info (ledger metadata) lazily. so for persistence info, you
> > don't have any chance to do lazily loading. That is why I asked the
> > question.
> >
> > The issue Jiannan created to work on this idea is actually what I
> suggested
> > him to do that.
>
> I don't know which issue you're referring to. Is this a jira issue?
>

You said "Jiannan pointed out one issue with this approach, though.", I
assumed you already new the issue. But seems that you didn't...

This issue is https://issues.apache.org/jira/browse/BOOKKEEPER-518. And
this issue is also a reference in my gist.


>
> > You already talked about lazy loading, I would clarify the
> > idea on lazily loading:
> >
> > 1) for publish request triggering topic acquire, we don't need to load
> the
> > subscriptions of this topic. it is lazy loading for subscription
> metadata.
> > 2) for subscribe request triggering topic acquire, we could not load the
> > persistence info for a topic, but we could defer the ledger creation and
> > persistence info update until next publish request. this is lazy creating
> > new ledger.
> >
> > these are only two points I could see based on current implementation, we
> > could do for lazily loading. If I missed something, you could point out.
> >
> >
>
> Thanks for the clarification. I don't have anything to add to the list.
>
> >
> >>
> >>
> >>>> Another approach would be to group topics into buckets and have hubs
> >>> acquiring buckets of topics, all topics of a bucket writing to the same
> >>> ledgers. There is an obvious drawback to this approach when reading
> >>> messages for a given topic.
> >>>
> >>> for topic ownership, we could proceed it in groups as I proposed in
> >>> BOOKKEEPER-508.
> >>> for topic persistence info, we should not do that, otherwise it
> volatile
> >>> the contract that Hedwig made and you add complexity into Hedwig
> itself.
> >>> Grouping could be considered in application side, but if application
> >> could
> >>> not or is difficult to do that, we still have to face such many topics
> to
> >>> make Hedwig as a scalable platform.
> >>
> >> I didn't mean to re-propose groups, and I'm happy you did it already. My
> >> point was simply that the complexity is still linear, and I think that
> to
> >> have lower complexity we would need some technique like grouping.
> >>
> >>>
> >>> I don't like what you said 'the number of accesses to the metadata is
> >> still
> >>> linear on the number of topics'. Metadata accesses problem is an
> 'entity
> >>> locating' problem. Two categories of solutions are used to locate an
> >>> entity, one is mapping, the other one is computation.
> >>>
> >>
> >>
> >> We may not like it, but it is still linear. :-)
> >>
> >>
> >>> Mapping means you need to find a place to store this entity or store
> the
> >>> location of this entity. Like HDFS NameNode storing the locations of
> >> blocks
> >>> in Namenode. If an entity existed in this way, you had at least one
> >> access
> >>> to find it. no matter it was stored in ZooKeeper, in memory, embedded
> in
> >>> bookie servers or grouped in somewhere. You could not make any changes
> to
> >>> the 'linear' relationship between number of accesses and number of
> >>> entities. To make it more clearly, in HDFS, 'entity' means a data
> block,
> >> if
> >>> you had an entity location, read data is fast, but cross boundary of
> >>> blocks, a metadata lookup is required. Similarly, in Hedwig, 'entity'
> >> means
> >>> a topic, if the topic is owned, read messages in this topic is fast.
> But
> >> in
> >>> either case, the number of accesses is still linear on the number of
> >>> entities.
> >>
> >>
> >> Agreed, you need one access per entity. The two techniques I know of and
> >> that are often used to deal efficiently with the large number of
> accesses
> >> to such entities are lazy loading, which spreads more the accesses over
> >> time and allows the system to make progress without waiting for all of
> the
> >> accesses to occur, and grouping, which makes the granularity of entities
> >> coarser as you say.
> >>
> >
> > Agreed on lazy loading and grouping idea.
> >
> > But I also clarified in somewhere previous emails. reducing the
> complexity
> > could be multiple dimensions. the proposal is one dimension, although not
> > so good, but a starting point. we had to improve it by iterations.
> >
> >
> >>
> >>>
> >>> Computation means I could compute the location of an entity based on
> its
> >>> name, without looking up somewhere. E.g, Consistent Hashing.
> >>>
> >>> Currently, BookKeeper only gives a ledger id, so Hedwig has to find a
> >> place
> >>> to record this ledger id. This is a 'Mapping'. So it is 2 levels
> mapping.
> >>> one is topic name to Hedwig's metadata, the other is Hedwig metadata to
> >>> Ledger's metadata. You could not get rid of 'linear' limitation w/o
> >>> breaking the mapping between Hedwig metadata and Ledger metadata.
> >>
> >> This is one kind of task that we have designed ZK for, storing metadata
> on
> >> behalf of the application, although you seem to have reached the limits
> of
> >> ZK ;-). I don't think it is bad to expect the application to store the
> >> sequence of ledgers it has written elsewhere. But again, this is one of
> the
> >> things that managed ledgers could do for you. It won't solve the
> metadata
> >> issue, but it does solve the abstraction issue.
> >>
> >
> >> I don't think it is bad to expect the application to store the sequence
> > of ledgers it has written elsewhere.
> >
> > It depends. some applications are happy to do that. but some applications
> > might be bad using such style. Clarify again, I don't say get rid of such
> > ledger id style. But I clarified, ledger id could be based on a ledger id
> > generation plus a ledger name solution. The proposal just provides a
> lower
> > level api for applications to optimize its performance.
> >
> > secondly, I clarified before and again here, storing ledger id and
> rolling
> > a new ledger to write is not just introducing metadata accesses spike
> > during startup but also introducing latency spike in publish path.
> > applications could not get rid of this problem if they sticked on using
> > such style. the reason is same as above, 'could you add entries before
> > opening a ledger'.
>
> Rolling a ledger doesn't have be in the critical path of publish request.
>  When we roll a ledger, is it synchronous with respect to the publish
> request that triggered it?
>
> >
> > one more thing you might ask 'why we need to change ledgers in the life
> of
> > topic'. I would clarify here before you asked, changing ledger feature is
> > added to let topic have chance to consume ('delete') its entries. you
> could
> > configure a higher number to defer the ledger changing, but it doesn't
> > change the fact that latency spike when ledger change happened.
>
> Don't you think that we can get rid of this latency spike by rolling
> ledgers asynchronously? I'm not too familiar with that part of the code, so
> you possibly have a better feeling if it can be done.
>
> >
> > what I tried to propose is tried to remove any potential metadata
> accesses
> > in critical path, like publish. The proposal is friendly for consuming
> > (deleting) entries w/o affecting publish latency in critical path.
>
> Overall trying to remove or reduce access to metadata from the critical
> path sounds good.
>
> >
> >> But again, this is one of the things that managed ledgers could do for
> > you.
> >
> > managed ledgers bring too much concepts like cursor. the interface could
> > like managed ledgers, I agreed. If the interface is one concern for you,
> I
> > am feeling OK to make a change.
>
> I'm not suggesting we change the interface to look like managed ledgers,
> I'm saying that this functionality can be built on top, while keeping the
> abstraction BookKeeper exposes simple.
>
> >
> >>>
> >>> What I proposed is still 'linear'. But it consolidates these two
> metadata
> >>> into one, and breaks the mapping between Hedwig metadata and Ledger
> >>> metadata. Although Hedwig still requires one lookup from topic name to
> >>> metadata, it gives the opportunity to get rid of this lookup in future
> >>> since we could use 'ledger name' to access its entries now.
> >>>
> >>
> >> Closing a ledger has one fundamental property, which is having consensus
> >> upon the content of a ledger. I was actually wondering how much this
> >> property matters to Hedwig, it certainly does for HDFS.
> >>
> >
> > Closed is only matter for those application having multiple readers. They
> > need agreement on the entries set for a ledger. But so far, I don't see
> any
> > applications using multiple readers for BookKeeper. Hedwig doesn't while
> > Namenode also doesn't (If I understand correctly. please point out if I
> am
> > wrong). For most applications, the reader and writer are actually same
> one,
> > so why they need to close a ledger? why they can't just re-open and
> append
> > entries? it is a natural way to do that rather than "close one, create a
> > new one and recorde ledger ids to track them".
> >
> >> 'it certainly does for HDFS'
> >
> > I am doubting your certain on 'close' for HDFS. As my knowledge, Namenode
> > use bookkeeper for fail over. So the entries doesn't be read until fail
> > over happened. Actually, it doesn't need to close the ledger as my
> > explanation as above.
> >
>
> HDFS as far as I can tell allow multiple standbys. I believe the
> documentation says that typically there is one standby, but it doesn't
> limit to one and the configuration allows for more. For Hedwig, it matters
> if you have different hubs reading the same ledger over time. Say that:
>
> - hub 1 creates a ledger for a topic
> - hub 2 becomes the owner of the topic, reads the (open) ledger, delivers
> to some subscriber, and crashes
> - hub 3 becomes the new owner of the topic, reads the (open) ledger, and
> also delivers to a different subscriber.
>
> If we don't have agreement, hubs 2 and 3 might disagree on the messages
> they see. I suppose a similar argument holds for Cloud Messaging at Yahoo!
> and Hubspot, but I don't know enough about Hubspot to say for sure.
>
> About re-opening being more natural, systems like ZooKeeper and HDFS have
> used this pattern of rolling logs; even personal computers use this
> pattern. At this point, it might be a matter of preference, I'm not sure,
> but I really don't feel it is more natural to re-open.
>
> >
> >>
> >>
> >>> An idea like 'placement group' in ceph could be deployed in BookKeeper
> in
> >>> future, which grouping ledgers to share same ensemble configuration. It
> >>> would reduce the number of mappings between ledger name to ledger
> >> metadata.
> >>> but it should be a different topic to discuss. All what I said here is
> >> just
> >>> to demonstrate the proposal I made here could make the improvement
> >> proceed
> >>> in right direction.
> >>>
> >>>
> >>
> >> That's a good idea, although I wonder how much each group would actually
> >> share. For example, if I delete one ledger in the group, would I delete
> the
> >> ledger only when all other ledgers in the group are also deleted?
> >>
> >
> > I don't have a whole view about this idea. But this initial idea is:
> >
> > for a 'placement group', there is only one ensemble. it is a kind of
> super
> > ensemble. there is no ensemble change in this 'placement group'. so:
> >
> > 1) how to face bookie failure w/o ensemble change. we already have write
> > quorum and ack quorum, it would help facing bookie failure. this is why I
> > called it a super ensemble, it means that the number of bookies might
> need
> > to be enough to tolerance enough ack quorum for some granularity
> > of availability.
> > 2) bookie failure forever and re-replication. a placement group is also
> > friendly for re-replication. re-replication could only happen in the
> > bookies in this placement group, it is self-organized and make the
> > re-replication purely distributed.  Facing a bookie failure forever,
> > re-replication could pickup another node to replace it. The re-build of
> the
> > failed bookie, we just need to contact the other bookies in the placement
> > group to ask the entries they owned for this placement group. we don't
> need
> > to contact any metadata.
> > 3) garbage collection. as there is no ensemble change, it means there is
> no
> > zombie entries. improved gc algorithm would work very well for it.
> > 4) actually I am thinking we might not need to record ledger metadata any
> > more. we could use some kind of hashing or prefixing mechanism to
> identify
> > which placement group a ledger is in. Only one point need to consider is
> > how to tell a ledger is not existed or a ledger just have empty entries.
> if
> > it doesn't matter then it might work.
> >
> > This is just some initial ideas. If you are interested on it, we could
> open
> > another thread for discussion.
> >
>
> It would be good to discuss this, but we definitely need to separate and
> prioritize the discussions, otherwise it becomes unmanageable.
>
> -Flavio
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Flavio Junqueira <fp...@yahoo.com>.
One problem for me with this thread is that there are parallel discussions it refers to, and the information seems to be spread across different jiras, other threads, and private communication. I personally find it hard to be constructive and converge like that, which I think should be our goal. Can we find a better way of making progress, perhaps discussing issues separately and prioritizing them? One possibility is to have open calls (e.g., Skype, Google Hangout) to discuss concrete issues. We should of course post our findings to the list or wiki, whichever is more convenient.

For the moment, I'll add some more comments trying to focus on clarifications.  


>> On Jan 17, 2013, at 7:58 AM, Sijie Guo <gu...@gmail.com> wrote:
>> 
>>>> If this observation is correct, can we acquire those topics lazily
>>> instead of eagerly?
>>> 
>>> could you add entries to a ledger without opening ledger?
>>> 
>> 
>> By acquiring topics lazily, I meant to say that we delay topic acquisition
>> until there is an operation for such a topic. This is a pretty standard
>> technique to avoid load spikes.
>> 
>> I'm sorry if you know the concept already, but to make sure we understand
>> each other, this is the technique what I'm referring to:
>> 
>>        http://en.wikipedia.org/wiki/Lazy_loading
>> 
>> Jiannan pointed out one issue with this approach, though. Acquiring topics
>> lazily implies that some operation triggers the acquisition of the topic.
>> Such a trigger could be say a publish operation. In the case of
>> subscriptions, however, there is no operation that could trigger the
>> acquisition.
>> 
> 
> You need to first checkout the operations in Hedwig before talking about
> lazy loading. for Hedwig, most of the operations are publish and subscribe.
> For publish request, it requires ledger to be created to add entries, so
> you could not load persistence info (ledger metadata) lazily; for subscribe
> request, it means some one needs to read entries to deliver, it requires
> ledger to be opened to read entries, so you could still not load
> persistence info (ledger metadata) lazily. so for persistence info, you
> don't have any chance to do lazily loading. That is why I asked the
> question.
> 
> The issue Jiannan created to work on this idea is actually what I suggested
> him to do that.

I don't know which issue you're referring to. Is this a jira issue?

> You already talked about lazy loading, I would clarify the
> idea on lazily loading:
> 
> 1) for publish request triggering topic acquire, we don't need to load the
> subscriptions of this topic. it is lazy loading for subscription metadata.
> 2) for subscribe request triggering topic acquire, we could not load the
> persistence info for a topic, but we could defer the ledger creation and
> persistence info update until next publish request. this is lazy creating
> new ledger.
> 
> these are only two points I could see based on current implementation, we
> could do for lazily loading. If I missed something, you could point out.
> 
> 

Thanks for the clarification. I don't have anything to add to the list.

> 
>> 
>> 
>>>> Another approach would be to group topics into buckets and have hubs
>>> acquiring buckets of topics, all topics of a bucket writing to the same
>>> ledgers. There is an obvious drawback to this approach when reading
>>> messages for a given topic.
>>> 
>>> for topic ownership, we could proceed it in groups as I proposed in
>>> BOOKKEEPER-508.
>>> for topic persistence info, we should not do that, otherwise it volatile
>>> the contract that Hedwig made and you add complexity into Hedwig itself.
>>> Grouping could be considered in application side, but if application
>> could
>>> not or is difficult to do that, we still have to face such many topics to
>>> make Hedwig as a scalable platform.
>> 
>> I didn't mean to re-propose groups, and I'm happy you did it already. My
>> point was simply that the complexity is still linear, and I think that to
>> have lower complexity we would need some technique like grouping.
>> 
>>> 
>>> I don't like what you said 'the number of accesses to the metadata is
>> still
>>> linear on the number of topics'. Metadata accesses problem is an 'entity
>>> locating' problem. Two categories of solutions are used to locate an
>>> entity, one is mapping, the other one is computation.
>>> 
>> 
>> 
>> We may not like it, but it is still linear. :-)
>> 
>> 
>>> Mapping means you need to find a place to store this entity or store the
>>> location of this entity. Like HDFS NameNode storing the locations of
>> blocks
>>> in Namenode. If an entity existed in this way, you had at least one
>> access
>>> to find it. no matter it was stored in ZooKeeper, in memory, embedded in
>>> bookie servers or grouped in somewhere. You could not make any changes to
>>> the 'linear' relationship between number of accesses and number of
>>> entities. To make it more clearly, in HDFS, 'entity' means a data block,
>> if
>>> you had an entity location, read data is fast, but cross boundary of
>>> blocks, a metadata lookup is required. Similarly, in Hedwig, 'entity'
>> means
>>> a topic, if the topic is owned, read messages in this topic is fast. But
>> in
>>> either case, the number of accesses is still linear on the number of
>>> entities.
>> 
>> 
>> Agreed, you need one access per entity. The two techniques I know of and
>> that are often used to deal efficiently with the large number of accesses
>> to such entities are lazy loading, which spreads more the accesses over
>> time and allows the system to make progress without waiting for all of the
>> accesses to occur, and grouping, which makes the granularity of entities
>> coarser as you say.
>> 
> 
> Agreed on lazy loading and grouping idea.
> 
> But I also clarified in somewhere previous emails. reducing the complexity
> could be multiple dimensions. the proposal is one dimension, although not
> so good, but a starting point. we had to improve it by iterations.
> 
> 
>> 
>>> 
>>> Computation means I could compute the location of an entity based on its
>>> name, without looking up somewhere. E.g, Consistent Hashing.
>>> 
>>> Currently, BookKeeper only gives a ledger id, so Hedwig has to find a
>> place
>>> to record this ledger id. This is a 'Mapping'. So it is 2 levels mapping.
>>> one is topic name to Hedwig's metadata, the other is Hedwig metadata to
>>> Ledger's metadata. You could not get rid of 'linear' limitation w/o
>>> breaking the mapping between Hedwig metadata and Ledger metadata.
>> 
>> This is one kind of task that we have designed ZK for, storing metadata on
>> behalf of the application, although you seem to have reached the limits of
>> ZK ;-). I don't think it is bad to expect the application to store the
>> sequence of ledgers it has written elsewhere. But again, this is one of the
>> things that managed ledgers could do for you. It won't solve the metadata
>> issue, but it does solve the abstraction issue.
>> 
> 
>> I don't think it is bad to expect the application to store the sequence
> of ledgers it has written elsewhere.
> 
> It depends. some applications are happy to do that. but some applications
> might be bad using such style. Clarify again, I don't say get rid of such
> ledger id style. But I clarified, ledger id could be based on a ledger id
> generation plus a ledger name solution. The proposal just provides a lower
> level api for applications to optimize its performance.
> 
> secondly, I clarified before and again here, storing ledger id and rolling
> a new ledger to write is not just introducing metadata accesses spike
> during startup but also introducing latency spike in publish path.
> applications could not get rid of this problem if they sticked on using
> such style. the reason is same as above, 'could you add entries before
> opening a ledger'.

Rolling a ledger doesn't have to be in the critical path of a publish request. When we roll a ledger, is it synchronous with respect to the publish request that triggered it?

> 
> one more thing you might ask 'why we need to change ledgers in the life of
> topic'. I would clarify here before you asked, changing ledger feature is
> added to let topic have chance to consume ('delete') its entries. you could
> configure a higher number to defer the ledger changing, but it doesn't
> change the fact that latency spike when ledger change happened.

Don't you think that we can get rid of this latency spike by rolling ledgers asynchronously? I'm not too familiar with that part of the code, so you probably have a better feeling for whether it can be done.
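
For illustration only, one way asynchronous rolling could look is to pre-create the next ledger in the background so that the publish path only swaps handles. This is a sketch under that assumption, not the current Hedwig implementation; the AsyncRoller class and the decision of when to roll are made up, and only the BookKeeper client calls are the real API.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

class AsyncRoller {
    private final BookKeeper bk;
    private final ExecutorService rollExecutor = Executors.newSingleThreadExecutor();
    private LedgerHandle current;
    private CompletableFuture<LedgerHandle> next; // pre-created off the publish path

    AsyncRoller(BookKeeper bk, LedgerHandle initial) {
        this.bk = bk;
        this.current = initial;
        preCreateNext();
    }

    private void preCreateNext() {
        next = CompletableFuture.supplyAsync(() -> {
            try {
                return bk.createLedger(DigestType.CRC32, "pwd".getBytes());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }, rollExecutor);
    }

    // When it is time to roll, the publish only swaps in the pre-created
    // ledger; closing the old one (a metadata write) is pushed off the
    // publish path onto the background executor.
    synchronized long publish(byte[] message, boolean timeToRoll) throws Exception {
        if (timeToRoll) {
            final LedgerHandle old = current;
            current = next.get(); // normally already completed in the background
            preCreateNext();
            rollExecutor.submit(() -> {
                try { old.close(); } catch (Exception e) { /* log and move on */ }
            });
        }
        return current.addEntry(message);
    }
}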

> 
> what I tried to propose is tried to remove any potential metadata accesses
> in critical path, like publish. The proposal is friendly for consuming
> (deleting) entries w/o affecting publish latency in critical path.

Overall trying to remove or reduce access to metadata from the critical path sounds good.

> 
>> But again, this is one of the things that managed ledgers could do for
> you.
> 
> managed ledgers bring too much concepts like cursor. the interface could
> like managed ledgers, I agreed. If the interface is one concern for you, I
> am feeling OK to make a change.

I'm not suggesting we change the interface to look like managed ledgers, I'm saying that this functionality can be built on top, while keeping the abstraction BookKeeper exposes simple.

> 
>>> 
>>> What I proposed is still 'linear'. But it consolidates these two metadata
>>> into one, and breaks the mapping between Hedwig metadata and Ledger
>>> metadata. Although Hedwig still requires one lookup from topic name to
>>> metadata, it gives the opportunity to get rid of this lookup in future
>>> since we could use 'ledger name' to access its entries now.
>>> 
>> 
>> Closing a ledger has one fundamental property, which is having consensus
>> upon the content of a ledger. I was actually wondering how much this
>> property matters to Hedwig, it certainly does for HDFS.
>> 
> 
> Closed is only matter for those application having multiple readers. They
> need agreement on the entries set for a ledger. But so far, I don't see any
> applications using multiple readers for BookKeeper. Hedwig doesn't while
> Namenode also doesn't (If I understand correctly. please point out if I am
> wrong). For most applications, the reader and writer are actually same one,
> so why they need to close a ledger? why they can't just re-open and append
> entries? it is a natural way to do that rather than "close one, create a
> new one and recorde ledger ids to track them".
> 
>> 'it certainly does for HDFS'
> 
> I am doubting your certain on 'close' for HDFS. As my knowledge, Namenode
> use bookkeeper for fail over. So the entries doesn't be read until fail
> over happened. Actually, it doesn't need to close the ledger as my
> explanation as above.
> 

HDFS, as far as I can tell, allows multiple standbys. I believe the documentation says that typically there is one standby, but it doesn't limit it to one and the configuration allows for more. For Hedwig, it matters if you have different hubs reading the same ledger over time. Say that:

- hub 1 creates a ledger for a topic
- hub 2 becomes the owner of the topic, reads the (open) ledger, delivers to some subscriber, and crashes
- hub 3 becomes the new owner of the topic, reads the (open) ledger, and also delivers to a different subscriber.

If we don't have agreement, hubs 2 and 3 might disagree on the messages they see. I suppose a similar argument holds for Cloud Messaging at Yahoo! and Hubspot, but I don't know enough about Hubspot to say for sure. 

About re-opening being more natural, systems like ZooKeeper and HDFS have used this pattern of rolling logs; even personal computers use this pattern. At this point, it might be a matter of preference, I'm not sure, but I really don't feel it is more natural to re-open. 
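
As an aside on that scenario: the agreement in question is what recovery on open provides today. A hub that opens the ledger with openLedger (rather than openLedgerNoRecovery) fences further writes and fixes the last entry, so hubs recovering the same ledger read the same content. A minimal sketch, with the ledger id and password as placeholders:

import java.util.Enumeration;

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

class RecoveringReader {
    // openLedger runs recovery: it fences further writes and agrees on the
    // last entry, so every hub that recovers the ledger reads the same set.
    static void deliverAll(BookKeeper bk, long ledgerId, byte[] passwd) throws Exception {
        LedgerHandle lh = bk.openLedger(ledgerId, DigestType.CRC32, passwd);
        long last = lh.getLastAddConfirmed();
        if (last >= 0) {
            Enumeration<LedgerEntry> entries = lh.readEntries(0, last);
            while (entries.hasMoreElements()) {
                byte[] payload = entries.nextElement().getEntry();
                // deliver payload to the subscriber
            }
        }
        lh.close();
    }
}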

> 
>> 
>> 
>>> An idea like 'placement group' in ceph could be deployed in BookKeeper in
>>> future, which grouping ledgers to share same ensemble configuration. It
>>> would reduce the number of mappings between ledger name to ledger
>> metadata.
>>> but it should be a different topic to discuss. All what I said here is
>> just
>>> to demonstrate the proposal I made here could make the improvement
>> proceed
>>> in right direction.
>>> 
>>> 
>> 
>> That's a good idea, although I wonder how much each group would actually
>> share. For example, if I delete one ledger in the group, would I delete the
>> ledger only when all other ledgers in the group are also deleted?
>> 
> 
> I don't have a whole view about this idea. But this initial idea is:
> 
> for a 'placement group', there is only one ensemble. it is a kind of super
> ensemble. there is no ensemble change in this 'placement group'. so:
> 
> 1) how to face bookie failure w/o ensemble change. we already have write
> quorum and ack quorum, it would help facing bookie failure. this is why I
> called it a super ensemble, it means that the number of bookies might need
> to be enough to tolerance enough ack quorum for some granularity
> of availability.
> 2) bookie failure forever and re-replication. a placement group is also
> friendly for re-replication. re-replication could only happen in the
> bookies in this placement group, it is self-organized and make the
> re-replication purely distributed.  Facing a bookie failure forever,
> re-replication could pickup another node to replace it. The re-build of the
> failed bookie, we just need to contact the other bookies in the placement
> group to ask the entries they owned for this placement group. we don't need
> to contact any metadata.
> 3) garbage collection. as there is no ensemble change, it means there is no
> zombie entries. improved gc algorithm would work very well for it.
> 4) actually I am thinking we might not need to record ledger metadata any
> more. we could use some kind of hashing or prefixing mechanism to identify
> which placement group a ledger is in. Only one point need to consider is
> how to tell a ledger is not existed or a ledger just have empty entries. if
> it doesn't matter then it might work.
> 
> This is just some initial ideas. If you are interested on it, we could open
> another thread for discussion.
> 

It would be good to discuss this, but we definitely need to separate and prioritize the discussions, otherwise it becomes unmanageable.

-Flavio



Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
Thanks Flavio. Please see comments in line.


On Fri, Jan 18, 2013 at 12:31 AM, Flavio Junqueira <fp...@yahoo.com>wrote:

> Hi Sijie,
>
> See a couple of comments below:
>
> On Jan 17, 2013, at 7:58 AM, Sijie Guo <gu...@gmail.com> wrote:
>
> >> If this observation is correct, can we acquire those topics lazily
> > instead of eagerly?
> >
> > could you add entries to a ledger without opening ledger?
> >
>
> By acquiring topics lazily, I meant to say that we delay topic acquisition
> until there is an operation for such a topic. This is a pretty standard
> technique to avoid load spikes.
>
> I'm sorry if you know the concept already, but to make sure we understand
> each other, this is the technique what I'm referring to:
>
>         http://en.wikipedia.org/wiki/Lazy_loading
>
> Jiannan pointed out one issue with this approach, though. Acquiring topics
> lazily implies that some operation triggers the acquisition of the topic.
> Such a trigger could be say a publish operation. In the case of
> subscriptions, however, there is no operation that could trigger the
> acquisition.
>

You need to first check out the operations in Hedwig before talking about
lazy loading. For Hedwig, most of the operations are publish and subscribe.
A publish request requires a ledger to be created to add entries, so you
could not load the persistence info (ledger metadata) lazily; a subscribe
request means someone needs to read entries to deliver, which requires a
ledger to be opened, so you still could not load the persistence info
(ledger metadata) lazily. So for persistence info, you don't have any
chance to do lazy loading. That is why I asked the question.

The issue Jiannan created to work on this idea is actually something I
suggested to him. Since you already talked about lazy loading, let me
clarify the idea of lazy loading here:

1) For a publish request triggering a topic acquire, we don't need to load
the subscriptions of this topic. This is lazy loading of subscription
metadata.
2) For a subscribe request triggering a topic acquire, we cannot avoid
loading the persistence info for the topic, but we could defer the ledger
creation and persistence info update until the next publish request. This
is lazy creation of the new ledger (see the rough sketch below).

These are the only two points I can see, based on the current
implementation, where we could do lazy loading. If I missed something,
please point it out.
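
For illustration, a minimal sketch of point 2: defer the ledger creation to
the first publish after a subscribe-triggered acquire. The TopicWriteState
class and its methods are hypothetical, not existing Hedwig code; only the
BookKeeper client calls (createLedger/addEntry) are the real API.

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

// Hypothetical per-topic write state kept by a hub after acquiring a topic.
class TopicWriteState {
    private final BookKeeper bk;
    private LedgerHandle currentLedger; // stays null until the first publish

    TopicWriteState(BookKeeper bk) {
        this.bk = bk;
    }

    // Subscribe-triggered acquire: read existing persistence info if needed,
    // but do not create a new ledger yet; that metadata write is deferred.
    void onAcquire() {
        // load persistence info here; no ledger creation
    }

    // The first publish lazily creates the write ledger; in real code it
    // would also update the topic persistence info with the new ledger id.
    synchronized long publish(byte[] message) throws Exception {
        if (currentLedger == null) {
            currentLedger = bk.createLedger(DigestType.CRC32, "passwd".getBytes());
        }
        return currentLedger.addEntry(message);
    }
}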



>
>
> >> Another approach would be to group topics into buckets and have hubs
> > acquiring buckets of topics, all topics of a bucket writing to the same
> > ledgers. There is an obvious drawback to this approach when reading
> > messages for a given topic.
> >
> > for topic ownership, we could proceed it in groups as I proposed in
> > BOOKKEEPER-508.
> > for topic persistence info, we should not do that, otherwise it volatile
> > the contract that Hedwig made and you add complexity into Hedwig itself.
> > Grouping could be considered in application side, but if application
> could
> > not or is difficult to do that, we still have to face such many topics to
> > make Hedwig as a scalable platform.
>
> I didn't mean to re-propose groups, and I'm happy you did it already. My
> point was simply that the complexity is still linear, and I think that to
> have lower complexity we would need some technique like grouping.
>
> >
> > I don't like what you said 'the number of accesses to the metadata is
> still
> > linear on the number of topics'. Metadata accesses problem is an 'entity
> > locating' problem. Two categories of solutions are used to locate an
> > entity, one is mapping, the other one is computation.
> >
>
>
> We may not like it, but it is still linear. :-)
>
>
> > Mapping means you need to find a place to store this entity or store the
> > location of this entity. Like HDFS NameNode storing the locations of
> blocks
> > in Namenode. If an entity existed in this way, you had at least one
> access
> > to find it. no matter it was stored in ZooKeeper, in memory, embedded in
> > bookie servers or grouped in somewhere. You could not make any changes to
> > the 'linear' relationship between number of accesses and number of
> > entities. To make it more clearly, in HDFS, 'entity' means a data block,
> if
> > you had an entity location, read data is fast, but cross boundary of
> > blocks, a metadata lookup is required. Similarly, in Hedwig, 'entity'
> means
> > a topic, if the topic is owned, read messages in this topic is fast. But
> in
> > either case, the number of accesses is still linear on the number of
> > entities.
>
>
> Agreed, you need one access per entity. The two techniques I know of and
> that are often used to deal efficiently with the large number of accesses
> to such entities are lazy loading, which spreads more the accesses over
> time and allows the system to make progress without waiting for all of the
> accesses to occur, and grouping, which makes the granularity of entities
> coarser as you say.
>

Agreed on the lazy loading and grouping ideas.

But as I also clarified in previous emails, reducing the complexity can
happen along multiple dimensions. The proposal is one dimension; it is not
perfect, but it is a starting point, and we have to improve it through
iterations.


>
> >
> > Computation means I could compute the location of an entity based on its
> > name, without looking up somewhere. E.g, Consistent Hashing.
> >
> > Currently, BookKeeper only gives a ledger id, so Hedwig has to find a
> place
> > to record this ledger id. This is a 'Mapping'. So it is 2 levels mapping.
> > one is topic name to Hedwig's metadata, the other is Hedwig metadata to
> > Ledger's metadata. You could not get rid of 'linear' limitation w/o
> > breaking the mapping between Hedwig metadata and Ledger metadata.
>
> This is one kind of task that we have designed ZK for, storing metadata on
> behalf of the application, although you seem to have reached the limits of
> ZK ;-). I don't think it is bad to expect the application to store the
> sequence of ledgers it has written elsewhere. But again, this is one of the
> things that managed ledgers could do for you. It won't solve the metadata
> issue, but it does solve the abstraction issue.
>

> I don't think it is bad to expect the application to store the sequence
of ledgers it has written elsewhere.

It depends. Some applications are happy to do that, but for some
applications such a style might be a bad fit. To clarify again, I am not
saying we should get rid of the ledger id style. But as I clarified, ledger
ids could be built on a ledger id generation plus ledger name solution. The
proposal just provides a lower level api for applications to optimize their
performance.

Secondly, as I clarified before and again here, storing the ledger id and
rolling a new ledger to write to not only introduces a spike of metadata
accesses during startup but also introduces a latency spike in the publish
path. Applications could not get rid of this problem if they stuck with
such a style. The reason is the same as above: 'could you add entries
before opening a ledger'.

One more thing you might ask is 'why do we need to change ledgers during
the life of a topic'. Let me clarify before you ask: the ledger-changing
feature was added so that a topic has a chance to consume ('delete') its
entries. You could configure a higher threshold to defer the ledger change,
but that doesn't change the fact that there is a latency spike when a
ledger change happens (see the rough sketch below).

What I tried to propose removes any potential metadata accesses from the
critical path, like publish. The proposal is friendly to consuming
(deleting) entries w/o affecting publish latency on the critical path.
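
For illustration, a rough sketch of the 'close one, create a new one,
record the id' pattern being discussed, using the existing client API. The
LedgerRoller class and the persistence-info bookkeeping are simplified,
hypothetical stand-ins; the point is that the publish which crosses the
roll threshold pays for several metadata operations before its own addEntry.

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

class LedgerRoller {
    private final BookKeeper bk;
    private final long rollThreshold;
    private LedgerHandle current;
    private long entriesInCurrent = 0;

    LedgerRoller(BookKeeper bk, LedgerHandle initial, long rollThreshold) {
        this.bk = bk;
        this.current = initial;
        this.rollThreshold = rollThreshold;
    }

    long publish(byte[] message) throws Exception {
        if (entriesInCurrent >= rollThreshold) {
            current.close();                                               // metadata write
            current = bk.createLedger(DigestType.CRC32, "pwd".getBytes()); // metadata write
            // ... update topic persistence info with the new ledger id ... // metadata write
            entriesInCurrent = 0;
        }
        entriesInCurrent++;
        return current.addEntry(message);
    }
}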

> But again, this is one of the things that managed ledgers could do for
you.

Managed ledgers bring in too many concepts, like cursors. The interface
could look like managed ledgers, I agree. If the interface is a concern for
you, I am OK with making a change.


>
> >
> > What I proposed is still 'linear'. But it consolidates these two metadata
> > into one, and breaks the mapping between Hedwig metadata and Ledger
> > metadata. Although Hedwig still requires one lookup from topic name to
> > metadata, it gives the opportunity to get rid of this lookup in future
> > since we could use 'ledger name' to access its entries now.
> >
>
> Closing a ledger has one fundamental property, which is having consensus
> upon the content of a ledger. I was actually wondering how much this
> property matters to Hedwig, it certainly does for HDFS.
>

Being closed only matters for applications that have multiple readers.
They need agreement on the set of entries in a ledger. But so far, I don't
see any applications using multiple readers with BookKeeper. Hedwig
doesn't, and the Namenode also doesn't (if I understand correctly; please
point it out if I am wrong). For most applications, the reader and writer
are actually the same one, so why do they need to close a ledger? Why can't
they just re-open it and append entries? That is a more natural way to do
it than "close one, create a new one, and record ledger ids to track them".

> 'it certainly does for HDFS'

I doubt your certainty about 'close' for HDFS. To my knowledge, the
Namenode uses BookKeeper for failover, so the entries are not read until a
failover happens. Actually, it doesn't need to close the ledger, per my
explanation above.


>
>
> > An idea like 'placement group' in ceph could be deployed in BookKeeper in
> > future, which grouping ledgers to share same ensemble configuration. It
> > would reduce the number of mappings between ledger name to ledger
> metadata.
> > but it should be a different topic to discuss. All what I said here is
> just
> > to demonstrate the proposal I made here could make the improvement
> proceed
> > in right direction.
> >
> >
>
> That's a good idea, although I wonder how much each group would actually
> share. For example, if I delete one ledger in the group, would I delete the
> ledger only when all other ledgers in the group are also deleted?
>

I don't have a complete view of this idea yet. But the initial idea is:

For a 'placement group', there is only one ensemble. It is a kind of super
ensemble, and there is no ensemble change within a 'placement group'. So:

1) How do we handle bookie failure w/o ensemble change? We already have the
write quorum and ack quorum, which help in the face of bookie failures.
This is why I called it a super ensemble: the number of bookies might need
to be large enough to still form an ack quorum under failures, for some
level of availability.
2) Permanent bookie failure and re-replication. A placement group is also
friendly to re-replication. Re-replication only happens among the bookies
in the placement group, so it is self-organized and makes re-replication
purely distributed. Facing a permanent bookie failure, re-replication could
pick another node to replace it. To rebuild the failed bookie, we just need
to contact the other bookies in the placement group and ask for the entries
they own for this placement group; we don't need to contact any metadata.
3) Garbage collection. Since there is no ensemble change, there are no
zombie entries, and an improved gc algorithm would work very well for it.
4) Actually, I am thinking we might not need to record ledger metadata any
more. We could use some kind of hashing or prefixing mechanism to identify
which placement group a ledger is in (see the rough sketch below). The only
point to consider is how to tell whether a ledger does not exist or just
has no entries; if that doesn't matter, then it might work.

These are just some initial ideas. If you are interested in them, we could
open another thread for discussion.
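
As a very rough sketch of the hashing idea in point 4, assuming a fixed
number of placement groups; how a group maps to a concrete set of bookies
is left out, and all names here are made up for illustration:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical: compute the placement group of a ledger from its name alone,
// so locating a ledger does not require a per-ledger metadata lookup.
final class PlacementGroups {
    private final int numGroups;

    PlacementGroups(int numGroups) {
        this.numGroups = numGroups;
    }

    int groupOf(String ledgerName) {
        CRC32 crc = new CRC32();
        crc.update(ledgerName.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numGroups); // CRC32 value is non-negative
    }

    public static void main(String[] args) {
        PlacementGroups groups = new PlacementGroups(16);
        System.out.println(groups.groupOf("hedwig/some-topic"));
    }
}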



>
> -Flavio
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Flavio Junqueira <fp...@yahoo.com>.
Hi Sijie,

See a couple of comments below:

On Jan 17, 2013, at 7:58 AM, Sijie Guo <gu...@gmail.com> wrote:

>> If this observation is correct, can we acquire those topics lazily
> instead of eagerly?
> 
> could you add entries to a ledger without opening ledger?
> 

By acquiring topics lazily, I meant to say that we delay topic acquisition until there is an operation for such a topic. This is a pretty standard technique to avoid load spikes. 

I'm sorry if you know the concept already, but to make sure we understand each other, this is the technique I'm referring to:

	http://en.wikipedia.org/wiki/Lazy_loading

Jiannan pointed out one issue with this approach, though. Acquiring topics lazily implies that some operation triggers the acquisition of the topic. Such a trigger could be, say, a publish operation. In the case of subscriptions, however, there is no operation that could trigger the acquisition.
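
To make the technique concrete in this context, a minimal sketch of acquiring a topic only when the first request for it arrives; the LazyTopicManager and TopicState names are hypothetical, not the current hub code:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: instead of acquiring every owned topic at startup,
// the hub acquires a topic the first time an operation arrives for it,
// spreading the metadata reads over time.
class LazyTopicManager {
    private final ConcurrentMap<String, TopicState> acquired = new ConcurrentHashMap<>();

    // Called from the publish/subscribe request paths.
    TopicState getOrAcquire(String topic) {
        return acquired.computeIfAbsent(topic, t -> {
            TopicState state = new TopicState();
            state.acquire(t); // topic persistence/subscription metadata reads happen here
            return state;
        });
    }

    static class TopicState {
        void acquire(String topic) {
            // read persistence info, open or create ledgers, load subscriptions, ...
        }
    }
}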


>> Another approach would be to group topics into buckets and have hubs
> acquiring buckets of topics, all topics of a bucket writing to the same
> ledgers. There is an obvious drawback to this approach when reading
> messages for a given topic.
> 
> for topic ownership, we could proceed it in groups as I proposed in
> BOOKKEEPER-508.
> for topic persistence info, we should not do that, otherwise it volatile
> the contract that Hedwig made and you add complexity into Hedwig itself.
> Grouping could be considered in application side, but if application could
> not or is difficult to do that, we still have to face such many topics to
> make Hedwig as a scalable platform.

I didn't mean to re-propose groups, and I'm happy you did it already. My point was simply that the complexity is still linear, and I think that to have lower complexity we would need some technique like grouping.

> 
> I don't like what you said 'the number of accesses to the metadata is still
> linear on the number of topics'. Metadata accesses problem is an 'entity
> locating' problem. Two categories of solutions are used to locate an
> entity, one is mapping, the other one is computation.
> 


We may not like it, but it is still linear. :-)


> Mapping means you need to find a place to store this entity or store the
> location of this entity. Like HDFS NameNode storing the locations of blocks
> in Namenode. If an entity existed in this way, you had at least one access
> to find it. no matter it was stored in ZooKeeper, in memory, embedded in
> bookie servers or grouped in somewhere. You could not make any changes to
> the 'linear' relationship between number of accesses and number of
> entities. To make it more clearly, in HDFS, 'entity' means a data block, if
> you had an entity location, read data is fast, but cross boundary of
> blocks, a metadata lookup is required. Similarly, in Hedwig, 'entity' means
> a topic, if the topic is owned, read messages in this topic is fast. But in
> either case, the number of accesses is still linear on the number of
> entities.


Agreed, you need one access per entity. The two techniques I know of that are often used to deal efficiently with a large number of accesses to such entities are lazy loading, which spreads the accesses more over time and allows the system to make progress without waiting for all of the accesses to occur, and grouping, which makes the granularity of entities coarser, as you say.

> 
> Computation means I could compute the location of an entity based on its
> name, without looking up somewhere. E.g, Consistent Hashing.
> 
> Currently, BookKeeper only gives a ledger id, so Hedwig has to find a place
> to record this ledger id. This is a 'Mapping'. So it is 2 levels mapping.
> one is topic name to Hedwig's metadata, the other is Hedwig metadata to
> Ledger's metadata. You could not get rid of 'linear' limitation w/o
> breaking the mapping between Hedwig metadata and Ledger metadata.

This is one kind of task that we have designed ZK for, storing metadata on behalf of the application, although you seem to have reached the limits of ZK ;-). I don't think it is bad to expect the application to store the sequence of ledgers it has written elsewhere. But again, this is one of the things that managed ledgers could do for you. It won't solve the metadata issue, but it does solve the abstraction issue.

> 
> What I proposed is still 'linear'. But it consolidates these two metadata
> into one, and breaks the mapping between Hedwig metadata and Ledger
> metadata. Although Hedwig still requires one lookup from topic name to
> metadata, it gives the opportunity to get rid of this lookup in future
> since we could use 'ledger name' to access its entries now.
> 

Closing a ledger has one fundamental property, which is having consensus upon the content of a ledger. I was actually wondering how much this property matters to Hedwig; it certainly does for HDFS.


> An idea like 'placement group' in ceph could be deployed in BookKeeper in
> future, which grouping ledgers to share same ensemble configuration. It
> would reduce the number of mappings between ledger name to ledger metadata.
> but it should be a different topic to discuss. All what I said here is just
> to demonstrate the proposal I made here could make the improvement proceed
> in right direction.
> 
> 

That's a good idea, although I wonder how much each group would actually share. For example, if I delete one ledger in the group, would I delete the ledger only when all other ledgers in the group are also deleted?

-Flavio

> 
> 
> On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <fp...@yahoo.com>wrote:
> 
>> Thanks for the great discussion so far. I have a couple of comments below:
>> 
>> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <iv...@apache.org> wrote:
>> 
>>> On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
>>>>> Originally, it was meant to have a number of
>>>>> long lived subscriptions, over which a lot of data travelled. Now the
>>>>> load has flipped to a large number of short lived subscriptions, over
>>>>> which relatively little data travels.
>>>> 
>>>> The topic discussed here doesn't relate to hedwig subscriptions, it just
>>>> about how hedwig use ledgers to store its messages.  Even there are no
>>>> subscriptions, the problem is still there. The restart of a hub server
>>>> carrying large number of topics would hit the metadata storage with many
>>>> accesses. The hit is a hub server acquiring a topic, no matter the
>>>> subscription is long lived or short lived. after topic is acquired,
>>>> following accesses are in memory, which doesn't cause any
>>>> performance issue.
>>> I was using topics and subscriptions to mean the same thing here due
>>> to the usecase we have in Yahoo where they're effectively the same
>>> thing. But yes, I should have said topic. But my point still
>>> stands. Hedwig was designed to deal with fewer topics, which had a lot
>>> of data passing through them, rather than more topics, with very
>>> little data passing though them. This is why zk was consider suffient
>>> at that point, as tens of thousands of topics being recovered really
>>> isn't an issue for zk. The point I was driving at is that, the usecase
>>> has changed in a big way, so it may require a big change to handle it.
>>> 
>> 
>> About the restart of a hub, if I understand the problem, many topics will
>> be acquired concurrently, inducing the load spike you're mentioning. If
>> this observation is correct, can we acquire those topics lazily instead of
>> eagerly?
>> 
>>>> But we should separate the capacity problem from the software problem. A
>>>> high performance and scalable metadata storage would help for resolving
>>>> capacity problem. but either implementing a new one or leveraging a high
>>>> performance one doesn't change the fact that it still need so many
>> metadata
>>>> accesses to acquire topic. A bad implementation causing such many
>> metadata
>>>> accesses is a software problem. If we had chance to improve it, why
>>>> not?
>>> I don't think the implementation is bad, but rather the assumptions,
>>> as I said earlier. The data:metadata ratio has completely changed
>>> completely. hedwig/bk were designed with a data:metadata ratio of
>>> something like 100000:1. What we're talking about now is more like 1:1
>>> and therefore we need to be able to handle an order of magnitude more
>>> of metadata than previously. Bringing down the number of writes by an
>>> order of 2 or 3, while a nice optimisation, is just putting duct tape
>>> on the problem.
>> 
>> 
>> I love the duct tape analogy. The number of accesses to the metadata is
>> still linear on the number of topics, and reducing the number of accesses
>> per acquisition does not change the complexity. Perhaps distributing the
>> accesses over time, optimized or not, might be a good way to proceed
>> overall. Another approach would be to group topics into buckets and have
>> hubs acquiring buckets of topics, all topics of a bucket writing to the
>> same ledgers. There is an obvious drawback to this approach when reading
>> messages for a given topic.
>> 
>>> 
>>>> 
>>>>> The ledger can still be read many times, but you have removed the
>>>> guarantee that what is read each time will be the same thing.
>>>> 
>>>> How we guarantee a reader's behavior when a ledger is removed at the
>> same
>>>> time? We don't guarantee it right now, right? It is similar thing for a
>>>> 'shrink' operation which remove part of entries, while 'delete'
>> operation
>>>> removes whole entries?
>>>> 
>>>> And if I remembered correctly, readers only see the same thing when a
>>>> ledger is closed. What I proposed doesn't volatile this contract.  If a
>>>> ledger is closed (state is in CLOSED), an application can't re-open it.
>> If
>>>> a ledger isn't closed yet, an application can recover previous state and
>>>> continue writing entries using this ledger. for applications, they could
>>>> still use 'create-close-create' style to use ledgers, or evolve to new
>> api
>>>> for efficiency smoothly, w/o breaking any backward compatibility.
>>> Ah, yes, I misread your proposal originally, I thought the reopen was
>>> working with an already closed ledger.
>>> 
>>> On a side note, the reason we have an initial write for fencing, is
>>> that when the reading client(RC) fences, the servers in the ensemble
>>> start returning an error to the writing client (WC). At the moment we
>>> don't distinguish between a fencing error and a i/o error for
>>> example. So WC will try to rebuild a the ensemble by replacing the
>>> erroring servers. Before writing to the new ensemble, it has to update
>>> the metadata, and at this point it will see that it has been
>>> fenced. With a specific FENCED error, we could avoid this write. This
>>> makes me uncomfortable though. What happens if the fenced server fails
>>> between being fenced and WC trying to write? It will get a normal i/o
>>> error. And will try to replace the server. Since the metadata has not
>>> been changed, nothing will stop it, and it may be able to continue
>>> writing. I think this is also the case for the session fencing solution.
>>> 
>>> -Ivan
>> 
>> 


Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
> If this observation is correct, can we acquire those topics lazily
instead of eagerly?

could you add entries to a ledger without opening the ledger?

> Another approach would be to group topics into buckets and have hubs
acquiring buckets of topics, all topics of a bucket writing to the same
ledgers. There is an obvious drawback to this approach when reading
messages for a given topic.

for topic ownership, we could proceed in groups as I proposed in
BOOKKEEPER-508.
for topic persistence info, we should not do that, otherwise it violates
the contract that Hedwig made and adds complexity to Hedwig itself.
Grouping could be considered on the application side, but if the application
cannot do that, or finds it difficult, we still have to face this many
topics to make Hedwig a scalable platform.

I don't like what you said about 'the number of accesses to the metadata is
still linear on the number of topics'. The metadata access problem is an
'entity locating' problem. Two categories of solutions are used to locate an
entity: one is mapping, the other is computation.

Mapping means you need to find a place to store the entity, or to store the
location of the entity, like the HDFS NameNode storing the locations of
blocks. If an entity exists in this way, you need at least one access to
find it, no matter whether it is stored in ZooKeeper, in memory, embedded in
bookie servers or grouped somewhere. You cannot change the 'linear'
relationship between the number of accesses and the number of entities. To
make it clearer: in HDFS, 'entity' means a data block; once you have the
entity's location, reading data is fast, but crossing the boundary of a
block requires a metadata lookup. Similarly, in Hedwig, 'entity' means a
topic; once the topic is owned, reading messages in this topic is fast. But
in either case, the number of accesses is still linear in the number of
entities.

Computation means I could compute the location of an entity from its name,
without looking it up anywhere, e.g. consistent hashing.
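
(Purely as an illustration of locating by computation, not Hedwig's actual implementation, a toy consistent-hash ring mapping topic names to hub servers could look like this.)

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring: the owner of a topic is computed from its name,
// so no metadata lookup is needed to locate it. Illustration only.
class HubRing {
    private static final int VIRTUAL_NODES = 16;
    private final SortedMap<Long, String> ring = new TreeMap<>();

    void addHub(String hub) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(hub + "#" + i), hub);
        }
    }

    String ownerOf(String topic) {
        long h = hash(topic);
        // the first virtual node clockwise from the topic's hash owns the topic
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // take the first 8 bytes of the MD5 digest as a long
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}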

Currently, BookKeeper only gives back a ledger id, so Hedwig has to find a
place to record this ledger id. This is a 'mapping'. So there are two levels
of mapping: one from topic name to Hedwig's metadata, the other from Hedwig
metadata to ledger metadata. You could not get rid of the 'linear' limitation
w/o breaking the mapping between Hedwig metadata and ledger metadata.

What I proposed is still 'linear', but it consolidates these two pieces of
metadata into one and breaks the mapping between Hedwig metadata and ledger
metadata. Although Hedwig still requires one lookup from topic name to
metadata, it opens the opportunity to get rid of this lookup in the future,
since we could use the 'ledger name' to access its entries directly.

An idea like ceph's 'placement groups' could be adopted in BookKeeper in the
future, grouping ledgers to share the same ensemble configuration. It would
reduce the number of mappings from ledger name to ledger metadata, but that
should be a different topic to discuss. All I am saying here is that the
proposal I made could move the improvement in the right direction.




On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <fp...@yahoo.com>wrote:

> Thanks for the great discussion so far. I have a couple of comments below:
>
> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <iv...@apache.org> wrote:
>
> > On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
> >>> Originally, it was meant to have a number of
> >>> long lived subscriptions, over which a lot of data travelled. Now the
> >>> load has flipped to a large number of short lived subscriptions, over
> >>> which relatively little data travels.
> >>
> >> The topic discussed here doesn't relate to hedwig subscriptions, it just
> >> about how hedwig use ledgers to store its messages.  Even there are no
> >> subscriptions, the problem is still there. The restart of a hub server
> >> carrying large number of topics would hit the metadata storage with many
> >> accesses. The hit is a hub server acquiring a topic, no matter the
> >> subscription is long lived or short lived. after topic is acquired,
> >> following accesses are in memory, which doesn't cause any
> >> performance issue.
> > I was using topics and subscriptions to mean the same thing here due
> > to the usecase we have in Yahoo where they're effectively the same
> > thing. But yes, I should have said topic. But my point still
> > stands. Hedwig was designed to deal with fewer topics, which had a lot
> > of data passing through them, rather than more topics, with very
> > little data passing though them. This is why zk was consider suffient
> > at that point, as tens of thousands of topics being recovered really
> > isn't an issue for zk. The point I was driving at is that, the usecase
> > has changed in a big way, so it may require a big change to handle it.
> >
>
> About the restart of a hub, if I understand the problem, many topics will
> be acquired concurrently, inducing the load spike you're mentioning. If
> this observation is correct, can we acquire those topics lazily instead of
> eagerly?
>
> >> But we should separate the capacity problem from the software problem. A
> >> high performance and scalable metadata storage would help for resolving
> >> capacity problem. but either implementing a new one or leveraging a high
> >> performance one doesn't change the fact that it still need so many
> metadata
> >> accesses to acquire topic. A bad implementation causing such many
> metadata
> >> accesses is a software problem. If we had chance to improve it, why
> >> not?
> > I don't think the implementation is bad, but rather the assumptions,
> > as I said earlier. The data:metadata ratio has completely changed
> > completely. hedwig/bk were designed with a data:metadata ratio of
> > something like 100000:1. What we're talking about now is more like 1:1
> > and therefore we need to be able to handle an order of magnitude more
> > of metadata than previously. Bringing down the number of writes by an
> > order of 2 or 3, while a nice optimisation, is just putting duct tape
> > on the problem.
>
>
> I love the duct tape analogy. The number of accesses to the metadata is
> still linear on the number of topics, and reducing the number of accesses
> per acquisition does not change the complexity. Perhaps distributing the
> accesses over time, optimized or not, might be a good way to proceed
> overall. Another approach would be to group topics into buckets and have
> hubs acquiring buckets of topics, all topics of a bucket writing to the
> same ledgers. There is an obvious drawback to this approach when reading
> messages for a given topic.
>
> >
> >>
> >>> The ledger can still be read many times, but you have removed the
> >> guarantee that what is read each time will be the same thing.
> >>
> >> How we guarantee a reader's behavior when a ledger is removed at the
> same
> >> time? We don't guarantee it right now, right? It is similar thing for a
> >> 'shrink' operation which remove part of entries, while 'delete'
> operation
> >> removes whole entries?
> >>
> >> And if I remembered correctly, readers only see the same thing when a
> >> ledger is closed. What I proposed doesn't volatile this contract.  If a
> >> ledger is closed (state is in CLOSED), an application can't re-open it.
> If
> >> a ledger isn't closed yet, an application can recover previous state and
> >> continue writing entries using this ledger. for applications, they could
> >> still use 'create-close-create' style to use ledgers, or evolve to new
> api
> >> for efficiency smoothly, w/o breaking any backward compatibility.
> > Ah, yes, I misread your proposal originally, I thought the reopen was
> > working with an already closed ledger.
> >
> > On a side note, the reason we have an initial write for fencing, is
> > that when the reading client(RC) fences, the servers in the ensemble
> > start returning an error to the writing client (WC). At the moment we
> > don't distinguish between a fencing error and a i/o error for
> > example. So WC will try to rebuild a the ensemble by replacing the
> > erroring servers. Before writing to the new ensemble, it has to update
> > the metadata, and at this point it will see that it has been
> > fenced. With a specific FENCED error, we could avoid this write. This
> > makes me uncomfortable though. What happens if the fenced server fails
> > between being fenced and WC trying to write? It will get a normal i/o
> > error. And will try to replace the server. Since the metadata has not
> > been changed, nothing will stop it, and it may be able to continue
> > writing. I think this is also the case for the session fencing solution.
> >
> > -Ivan
>
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Flavio Junqueira <fp...@yahoo.com>.
Thanks for the great discussion so far. I have a couple of comments below:

On Jan 16, 2013, at 8:00 PM, Ivan Kelly <iv...@apache.org> wrote:

> On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
>>> Originally, it was meant to have a number of
>>> long lived subscriptions, over which a lot of data travelled. Now the
>>> load has flipped to a large number of short lived subscriptions, over
>>> which relatively little data travels.
>> 
>> The topic discussed here doesn't relate to hedwig subscriptions, it just
>> about how hedwig use ledgers to store its messages.  Even there are no
>> subscriptions, the problem is still there. The restart of a hub server
>> carrying large number of topics would hit the metadata storage with many
>> accesses. The hit is a hub server acquiring a topic, no matter the
>> subscription is long lived or short lived. after topic is acquired,
>> following accesses are in memory, which doesn't cause any
>> performance issue.
> I was using topics and subscriptions to mean the same thing here due
> to the usecase we have in Yahoo where they're effectively the same
> thing. But yes, I should have said topic. But my point still
> stands. Hedwig was designed to deal with fewer topics, which had a lot
> of data passing through them, rather than more topics, with very
> little data passing though them. This is why zk was consider suffient
> at that point, as tens of thousands of topics being recovered really
> isn't an issue for zk. The point I was driving at is that, the usecase
> has changed in a big way, so it may require a big change to handle it. 
> 

About the restart of a hub, if I understand the problem, many topics will be acquired concurrently, inducing the load spike you're mentioning. If this observation is correct, can we acquire those topics lazily instead of eagerly?

>> But we should separate the capacity problem from the software problem. A
>> high performance and scalable metadata storage would help for resolving
>> capacity problem. but either implementing a new one or leveraging a high
>> performance one doesn't change the fact that it still need so many metadata
>> accesses to acquire topic. A bad implementation causing such many metadata
>> accesses is a software problem. If we had chance to improve it, why
>> not?
> I don't think the implementation is bad, but rather the assumptions,
> as I said earlier. The data:metadata ratio has completely changed
> completely. hedwig/bk were designed with a data:metadata ratio of
> something like 100000:1. What we're talking about now is more like 1:1
> and therefore we need to be able to handle an order of magnitude more
> of metadata than previously. Bringing down the number of writes by an
> order of 2 or 3, while a nice optimisation, is just putting duct tape
> on the problem.


I love the duct tape analogy. The number of accesses to the metadata is still linear on the number of topics, and reducing the number of accesses per acquisition does not change the complexity. Perhaps distributing the accesses over time, optimized or not, might be a good way to proceed overall. Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.   

> 
>> 
>>> The ledger can still be read many times, but you have removed the
>> guarantee that what is read each time will be the same thing.
>> 
>> How we guarantee a reader's behavior when a ledger is removed at the same
>> time? We don't guarantee it right now, right? It is similar thing for a
>> 'shrink' operation which remove part of entries, while 'delete' operation
>> removes whole entries?
>> 
>> And if I remembered correctly, readers only see the same thing when a
>> ledger is closed. What I proposed doesn't volatile this contract.  If a
>> ledger is closed (state is in CLOSED), an application can't re-open it. If
>> a ledger isn't closed yet, an application can recover previous state and
>> continue writing entries using this ledger. for applications, they could
>> still use 'create-close-create' style to use ledgers, or evolve to new api
>> for efficiency smoothly, w/o breaking any backward compatibility.
> Ah, yes, I misread your proposal originally, I thought the reopen was
> working with an already closed ledger.
> 
> On a side note, the reason we have an initial write for fencing, is
> that when the reading client(RC) fences, the servers in the ensemble
> start returning an error to the writing client (WC). At the moment we
> don't distinguish between a fencing error and a i/o error for
> example. So WC will try to rebuild a the ensemble by replacing the
> erroring servers. Before writing to the new ensemble, it has to update
> the metadata, and at this point it will see that it has been
> fenced. With a specific FENCED error, we could avoid this write. This
> makes me uncomfortable though. What happens if the fenced server fails
> between being fenced and WC trying to write? It will get a normal i/o
> error. And will try to replace the server. Since the metadata has not
> been changed, nothing will stop it, and it may be able to continue
> writing. I think this is also the case for the session fencing solution.
> 
> -Ivan


Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Ivan Kelly <iv...@apache.org>.
On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
> > Originally, it was meant to have a number of
> > long lived subscriptions, over which a lot of data travelled. Now the
> > load has flipped to a large number of short lived subscriptions, over
> > which relatively little data travels.
> 
> The topic discussed here doesn't relate to hedwig subscriptions, it just
> about how hedwig use ledgers to store its messages.  Even there are no
> subscriptions, the problem is still there. The restart of a hub server
> carrying large number of topics would hit the metadata storage with many
> accesses. The hit is a hub server acquiring a topic, no matter the
> subscription is long lived or short lived. after topic is acquired,
> following accesses are in memory, which doesn't cause any
> performance issue.
I was using topics and subscriptions to mean the same thing here due
to the usecase we have in Yahoo where they're effectively the same
thing. But yes, I should have said topic. My point still stands,
though. Hedwig was designed to deal with fewer topics, which had a lot
of data passing through them, rather than more topics, with very
little data passing through them. This is why zk was considered sufficient
at that point, as tens of thousands of topics being recovered really
isn't an issue for zk. The point I was driving at is that the usecase
has changed in a big way, so it may require a big change to handle it.

> But we should separate the capacity problem from the software problem. A
> high performance and scalable metadata storage would help for resolving
> capacity problem. but either implementing a new one or leveraging a high
> performance one doesn't change the fact that it still need so many metadata
> accesses to acquire topic. A bad implementation causing such many metadata
> accesses is a software problem. If we had chance to improve it, why
> not?
I don't think the implementation is bad, but rather the assumptions,
as I said earlier. The data:metadata ratio has changed
completely. hedwig/bk were designed with a data:metadata ratio of
something like 100000:1. What we're talking about now is more like 1:1,
and therefore we need to be able to handle an order of magnitude more
metadata than previously. Bringing down the number of writes by a
factor of 2 or 3, while a nice optimisation, is just putting duct tape
on the problem.

> 
> > The ledger can still be read many times, but you have removed the
> guarantee that what is read each time will be the same thing.
> 
> How we guarantee a reader's behavior when a ledger is removed at the same
> time? We don't guarantee it right now, right? It is similar thing for a
> 'shrink' operation which remove part of entries, while 'delete' operation
> removes whole entries?
> 
> And if I remembered correctly, readers only see the same thing when a
> ledger is closed. What I proposed doesn't volatile this contract.  If a
> ledger is closed (state is in CLOSED), an application can't re-open it. If
> a ledger isn't closed yet, an application can recover previous state and
> continue writing entries using this ledger. for applications, they could
> still use 'create-close-create' style to use ledgers, or evolve to new api
> for efficiency smoothly, w/o breaking any backward compatibility.
Ah, yes, I misread your proposal originally; I thought the reopen was
working with an already closed ledger.

On a side note, the reason we have an initial write for fencing is
that when the reading client (RC) fences, the servers in the ensemble
start returning an error to the writing client (WC). At the moment we
don't distinguish between a fencing error and, for example, an i/o
error. So WC will try to rebuild the ensemble by replacing the
erroring servers. Before writing to the new ensemble, it has to update
the metadata, and at this point it will see that it has been
fenced. With a specific FENCED error, we could avoid this write. This
makes me uncomfortable though. What happens if the fenced server fails
between being fenced and WC trying to write? WC will get a normal i/o
error and will try to replace the server. Since the metadata has not
been changed, nothing will stop it, and it may be able to continue
writing. I think this is also the case for the session fencing solution.
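
(To make the race concrete, here is a rough, self-contained sketch of the writer-side handling described above; every name in it is hypothetical, and it is not the real BookKeeper client code.)

// Rough sketch of the writing-client (WC) behaviour described above.
// FENCED is the hypothetical new error; today the writer only sees a generic
// failure and discovers the fencing when it updates the ledger metadata.
class WriterSketch {
    enum ResponseCode { OK, FENCED, IO_ERROR }

    void onAddFailure(ResponseCode rc, String failedBookie) {
        if (rc == ResponseCode.FENCED) {
            // with an explicit FENCED error the writer can stop immediately,
            // saving the ensemble-change metadata write
            throw new IllegalStateException("ledger fenced");
        }
        // generic i/o error: try to replace the failing bookie. The metadata
        // update is where a fenced writer gets caught today -- unless the
        // fenced bookie crashed before returning the error, which is the
        // race described above.
        String replacement = pickNewBookie();
        updateEnsembleInMetadata(failedBookie, replacement); // CAS on ledger metadata
        resendPendingAdds(replacement);
    }

    String pickNewBookie() { return "bookie-new:3181"; }
    void updateEnsembleInMetadata(String from, String to) { /* conditional metadata write */ }
    void resendPendingAdds(String bookie) { /* re-send unacknowledged entries */ }
}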

-Ivan

Discussion Request on Ledger ID Generation

Posted by Jiannan Wang <ji...@yahoo-inc.com>.
Hello all,
   Currently, ledger id generation is implemented with a zookeeper (persistent-/ephemeral-) sequential node to produce a globally unique id. In code detail,
      - FlatLedgerManager requires a write to zookeeper
      - HierarchicalLedgerManager and MSLedgerManagerFactory use the same approach, which involves a write and a delete operation on zookeeper.
   Obviously, this ledger id generation process is too heavy, since all we want is a globally unique id. Also, JIRA BOOKKEEPER-421<https://issues.apache.org/jira/browse/BOOKKEEPER-421> shows that the current ledger id space is limited to 32 bits by the cversion (int type) of the zookeeper node. So we need to enlarge the ledger id space to 64 bits.

   Then there are two questions:
      1. How to generate a 64-bit globally unique id?
      2. How to maintain the metadata for 64-bit ledger ids in zookeeper? (The current 2-4-4 split for ledger ids is clearly not suitable; see HierarchicalLedgerManager.)

--------------I'm a split line for 64 bits ledger id generation-----------------------------

For 64-bit globally unique id generation, Flavio, Ivan, Sijie and I had a discussion over email; here are the two proposals:
   1. Let the client generate the id itself (proposed by Ivan): leverage the zookeeper session id as a unique part and have the client maintain a counter in memory, so the id would be {session_id}{counter}.
   2. Batch id generation (proposed by Jiannan): use a zookeeper znode as a counter to track generated ids. The client asks zookeeper for a counter range, and after that id generation proceeds locally w/o contacting zookeeper.

   For proposal 1, the performance would be great since generation is entirely local. But Sijie has one concern: "in reality, it seems that it doesn't work. zookeeper session id is long, while ledger id is long, you could not put session id as part of ledger id. otherwise, it would cause id conflict..".
   Flavio and Ivan then suggested that perhaps we could simply use a procedure similar to the one ZooKeeper uses to generate and increment session ids. But Sijie pointed out that this process in zookeeper includes the current system timestamp, which may exhaust the 64-bit id space quickly. Flavio is also thinking of reusing ledger identifiers, but he notes that there are three scenarios if we reuse a ledger identifier:
      1- The previous ledger still exists and its metadata is stored. In this case, we can detect it when trying to create the metadata for the new ledger;
      2- The previous ledger has been fully deleted (metadata + ledger fragments);
      3- Metadata for the previous ledger has been deleted, but the ledger fragments haven't.
   Flavio: "Case 1 can be easily detected, while case 2 causes no problem at all. Case 3 is the problematic one, but I can't remember whether it can happen or not given the way we do garbage collection currently. I need to review how we do it, but in the case scenario 3 can happen, we could have the ledger writers using different master keys, which would cause the bookie to return an error when trying to write to a ledger that already exists."

   For proposal 2, it still requires access to zookeeper, but the write frequency could be quite small once we set a large batch size (like 10000).
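
(A minimal sketch of what proposal 2 could look like, assuming a single counter znode whose path and initial numeric content are set up elsewhere; this is not the BOOKKEEPER-438 patch, just an illustration using plain ZooKeeper getData/setData with a version-based compare-and-swap.)

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch of batch ledger-id allocation: grab a range of BATCH_SIZE ids from a
// single counter znode, then hand out ids locally until the range is used up.
class BatchIdGenerator {
    private static final long BATCH_SIZE = 10000;
    private final ZooKeeper zk;
    private final String counterPath; // e.g. "/ledgers/idgen" (assumed layout)
    private long next = 0;
    private long end = 0; // exclusive

    BatchIdGenerator(ZooKeeper zk, String counterPath) {
        this.zk = zk;
        this.counterPath = counterPath;
    }

    synchronized long nextId() throws KeeperException, InterruptedException {
        if (next >= end) {
            allocateBatch();
        }
        return next++;
    }

    private void allocateBatch() throws KeeperException, InterruptedException {
        while (true) {
            Stat stat = new Stat();
            byte[] data = zk.getData(counterPath, false, stat);
            long current = Long.parseLong(new String(data, StandardCharsets.UTF_8));
            byte[] newData = Long.toString(current + BATCH_SIZE)
                    .getBytes(StandardCharsets.UTF_8);
            try {
                // conditional setData acts as a compare-and-swap on the counter
                zk.setData(counterPath, newData, stat.getVersion());
                next = current;
                end = current + BATCH_SIZE;
                return;
            } catch (KeeperException.BadVersionException e) {
                // another client allocated a batch concurrently; retry
            }
        }
    }
}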

   In summary, proposal 1 aims to generate a UUID/GUID-like id in the 64-bit space, but the possibility of conflict must be taken into account, and if the generated ids are not monotonic we have to take care of case 3 listed above. Proposal 2 has no problem producing quick, monotonic ids, but the process involves zookeeper.
   By the way, I've submitted a patch in BOOKKEEPER-438<https://issues.apache.org/jira/browse/BOOKKEEPER-438> to move ledger id generation out of LedgerManager, and I'll add a conf setting in another JIRA to give the bookkeeper client a chance to customize its own id generation. I'd appreciate it if anyone could help review the patch (thanks Sijie first).

--------------I'm a split line for 64 bits ledger id metadata management-----------------------------

   HierarchicalLedgerManager uses a 2-4-4 split of the current 10-character ledger id. E.g. ledger 0000000001 is split into 3 parts, 00/0000/0001, and stored at the zookeeper path "(ledgersRootPath)/00/0000/L0001". So each znode can hold at most 10000 ledgers, which avoids errors during garbage collection due to lists of children that are too long.
   After we enlarge the ledger id space to 64 bits, managing such large ledger ids this way becomes a big problem.

   My idea is to split the ledger id in radix 2^13=8192 and then arrange it in a radix tree. For example, for ledger ids 2, 5, and 41093 (== 5*8192 + 133), the tree in zookeeper would be:
         (ledger id root)
            /      \
        2 (meta)   5 (meta)
                     \
                  133 (meta)
   So there will be at most 8192 children under each znode and the depth is at most ceil(64/13) = 5.
   Note that the inner znodes also record metadata, so if ledger ids are generated incrementally, the depth of this radix tree only grows as needed. And I guess it can handle all 2^64 ledger ids ideally.
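
(A small sketch of the path computation under this scheme; the root path "/ledgers/idtree" is just an assumed example.)

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: map a 64-bit ledger id to a znode path by splitting it into
// base-8192 (2^13) digits, so each level has at most 8192 children and the
// tree is at most ceil(64/13) = 5 levels deep.
class RadixIdPath {
    private static final int RADIX_BITS = 13;
    private static final long RADIX = 1L << RADIX_BITS; // 8192

    static String pathFor(String root, long ledgerId) {
        Deque<Long> digits = new ArrayDeque<>();
        long remaining = ledgerId;
        do {
            digits.push(Long.remainderUnsigned(remaining, RADIX));
            remaining = Long.divideUnsigned(remaining, RADIX);
        } while (remaining != 0);
        StringBuilder sb = new StringBuilder(root);
        for (long d : digits) {
            sb.append('/').append(d);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 41093 == 5 * 8192 + 133, matching the example above
        System.out.println(pathFor("/ledgers/idtree", 2L));      // /ledgers/idtree/2
        System.out.println(pathFor("/ledgers/idtree", 5L));      // /ledgers/idtree/5
        System.out.println(pathFor("/ledgers/idtree", 41093L));  // /ledgers/idtree/5/133
    }
}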

   Since we are speaking of metadata, I would like to share a test result we obtained over the last two days. For HierarchicalLedgerManager, we observe that a ledger metadata entry consumes 700+ bytes in zookeeper; this may be because LedgerMetadata.serialize() uses a pure text format. But the data size is only 300+ bytes in the ledger id node, and I guess the extra space is occupied by the overhead of the inner hierarchical nodes. What's more, the memory a topic consumes is 2k with only 1 subscriber and no publishes: there is no metadata for topic ownership (since we now use consistent hashing for topic ownership), and the metadata sizes for subscription and persistence are both 8 bytes. I'll investigate more and then start a new thread on it.


Best,
Jiannan


Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
> Originally, it was meant to have a number of
> long lived subscriptions, over which a lot of data travelled. Now the
> load has flipped to a large number of short lived subscriptions, over
> which relatively little data travels.

The topic discussed here doesn't relate to hedwig subscriptions; it is just
about how hedwig uses ledgers to store its messages. Even if there are no
subscriptions, the problem is still there. The restart of a hub server
carrying a large number of topics would hit the metadata storage with many
accesses. The hit is a hub server acquiring a topic, no matter whether the
subscription is long lived or short lived. After a topic is acquired, the
following accesses are in memory, which doesn't cause any performance issue.

> We can embed some other or
write our own. Anything we would do would need to be persistent,
support CAS, and replicated.

First of all, I am interested in the idea of having a built-in metadata
storage for both Hedwig and BookKeeper, although I think it is too complex
to implement a distributed, robust and scalable metadata storage from
scratch.

But we should separate the capacity problem from the software problem. A
high performance and scalable metadata storage would help resolve the
capacity problem, but neither implementing a new one nor leveraging a high
performance one changes the fact that it still needs so many metadata
accesses to acquire a topic. A bad implementation causing that many metadata
accesses is a software problem. If we have a chance to improve it, why not?

> The ledger can still be read many times, but you have removed the
guarantee that what is read each time will be the same thing.

How do we guarantee a reader's behavior when a ledger is removed at the same
time? We don't guarantee it right now, right? It is a similar thing for a
'shrink' operation, which removes part of the entries, while a 'delete'
operation removes all entries.

And if I remember correctly, readers only see the same thing once a ledger
is closed. What I proposed doesn't violate this contract. If a ledger is
closed (state is CLOSED), an application can't re-open it. If a ledger isn't
closed yet, an application can recover the previous state and continue
writing entries using this ledger. Applications could still use the
'create-close-create' style of using ledgers, or evolve to the new api for
efficiency smoothly, w/o breaking any backward compatibility.

And one more point: using a user-defined name or a generated ledger id is
not a big problem for the bookkeeper system. As BOOKKEEPER-438 (
https://issues.apache.org/jira/browse/BOOKKEEPER-438) plans to move
ledger id generation out of LedgerManager, LedgerManager would just focus
on how to store ledger metadata by a ledger key (the key could be a
user-defined string/path, or a generated long ledger id). From the
perspective of keeping the functionality to a bare minimum, a BookKeeper
client with ledger id generation could be built on top of a BookKeeper
client providing custom ledger names, couldn't it? :-)
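
(A hedged sketch of that layering: openLedger(name, mode) stands for the api proposed in this thread, and NamedLedgerClient, LedgerIdGenerator and IdBasedLedgerClient are hypothetical interfaces, not existing BookKeeper code.)

// Sketch of layering: a client that exposes numeric ledger ids can be built
// on top of a client that opens ledgers by name.
interface NamedLedgerClient {
    LedgerHandleSketch openLedger(String ledgerName, Mode mode) throws Exception;
    enum Mode { O_CREATE, O_APPEND, O_RDONLY }
}

interface LedgerIdGenerator {
    long nextId() throws Exception; // e.g. the batch generator sketched earlier
}

class IdBasedLedgerClient {
    private final NamedLedgerClient named;
    private final LedgerIdGenerator idGen;

    IdBasedLedgerClient(NamedLedgerClient named, LedgerIdGenerator idGen) {
        this.named = named;
        this.idGen = idGen;
    }

    // createLedger() keeps today's id-based contract, but internally it is
    // just openLedger(name, O_CREATE) with a generated name.
    LedgerHandleSketch createLedger() throws Exception {
        long id = idGen.nextId();
        return named.openLedger("L" + id, NamedLedgerClient.Mode.O_CREATE);
    }

    LedgerHandleSketch openLedger(long ledgerId) throws Exception {
        return named.openLedger("L" + ledgerId, NamedLedgerClient.Mode.O_RDONLY);
    }
}

class LedgerHandleSketch { /* placeholder for a ledger handle */ }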

-Sijie



On Tue, Jan 15, 2013 at 3:42 AM, Ivan Kelly <iv...@apache.org> wrote:

>
> > Let me clarify the 3 accesses for a close operation.
> >
> > 1) first read the ledger metadata. (1 metadata read)
> > 2) if the ledger state is in open state, write ledger metadata to
> > IN_RECOVERY state. (1 metadata write)
> > 3) do ledger recovery procedure
> > 4) update last entry id and change state to CLOSED. (1 metadata write)
> >
> > I am not very sure about whether we could get ride of step 2). But at
> > least, we still have 1 read and 1 write for a close operation. (NOTE:
> close
> > here is not create a ledger and close it. it is the *close* when open to
> > recover)
> You cannot get rid of 2) without sacrificing correctness.
>
> > The big problem for current bookkeeper is applications (either Hedwig or
> > managed ledger) needs to find extra places to record the ledger ids
> > generated by bookkeeper. It is not efficient and also duplicates the
> > metadata storage, especially for a system facing large number of topics
> or
> > ledgers, it is Achilles' Heel.
> I think a better way to handle this would be just to scale the
> metadata storage along with the system. Part of the problem here is
> that hedwig is being used in a way which is quite different to what it
> was first designed for. Originally, it was meant to have a number of
> long lived subscriptions, over which a lot of data travelled. Now the
> load has flipped to a large number of short lived subscriptions, over
> which relatively little data travels.
>
> This means that the disk capacity of the bookies isn't fully used. So
> how about using that for metadata also? We can embed some other or
> write our own. Anything we would do would need to be persistent,
> support CAS, and replicated. It would be a fairly hefty project, but
> it would give us horizontal scalability and reduce the number of
> moving parts required to provide this scalability.
>
> For writing our own, we could have a metadata store that sits inside
> the bookie, sharing the journal and snapshotting every so often, so it
> should barely affect performance.
>
> > I took care of re-using existing concepts with minmum changes to extend
> the
> > api to provide more flexibility and efficiency for applications. I don't
> > think it violates the principle to make the system a bare minmum. For
> > example, I don't add cursor concept in ledger. The ledger is still could
> be
> > read many times. How to shrink a ledger depends on applications. Hedwig
> or
> > managed ledger would record their cursors (subscriber states) and shrink
> a
> > ledger when they decided to do that.
> The ledger can still be read many times, but you have removed the
> guarantee that what is read each time will be the same thing.
>
> -Ivan
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Ivan Kelly <iv...@apache.org>.
> Let me clarify the 3 accesses for a close operation.
> 
> 1) first read the ledger metadata. (1 metadata read)
> 2) if the ledger state is in open state, write ledger metadata to
> IN_RECOVERY state. (1 metadata write)
> 3) do ledger recovery procedure
> 4) update last entry id and change state to CLOSED. (1 metadata write)
> 
> I am not very sure about whether we could get ride of step 2). But at
> least, we still have 1 read and 1 write for a close operation. (NOTE: close
> here is not create a ledger and close it. it is the *close* when open to
> recover)
You cannot get rid of 2) without sacrificing correctness.

> The big problem for current bookkeeper is applications (either Hedwig or
> managed ledger) needs to find extra places to record the ledger ids
> generated by bookkeeper. It is not efficient and also duplicates the
> metadata storage, especially for a system facing large number of topics or
> ledgers, it is Achilles' Heel.
I think a better way to handle this would be just to scale the
metadata storage along with the system. Part of the problem here is
that hedwig is being used in a way which is quite different to what it
was first designed for. Originally, it was meant to have a number of
long lived subscriptions, over which a lot of data travelled. Now the
load has flipped to a large number of short lived subscriptions, over
which relatively little data travels. 

This means that the disk capacity of the bookies isn't fully used. So
how about using that for metadata also? We could embed some existing store
or write our own. Anything we would do would need to be persistent,
support CAS, and be replicated. It would be a fairly hefty project, but
it would give us horizontal scalability and reduce the number of
moving parts required to provide this scalability.

For writing our own, we could have a metadata store that sits inside
the bookie, sharing the journal and snapshotting every so often, so it
should barely affect performance.

> I took care of re-using existing concepts with minmum changes to extend the
> api to provide more flexibility and efficiency for applications. I don't
> think it violates the principle to make the system a bare minmum. For
> example, I don't add cursor concept in ledger. The ledger is still could be
> read many times. How to shrink a ledger depends on applications. Hedwig or
> managed ledger would record their cursors (subscriber states) and shrink a
> ledger when they decided to do that.
The ledger can still be read many times, but you have removed the
guarantee that what is read each time will be the same thing.

-Ivan

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Sijie Guo <gu...@gmail.com>.
>  I need to review it because I'm not sure why we currently need 3
accesses to the metadata store.

Let me clarify the 3 accesses for a close operation.

1) first read the ledger metadata. (1 metadata read)
2) if the ledger state is in open state, write ledger metadata to
IN_RECOVERY state. (1 metadata write)
3) do ledger recovery procedure
4) update last entry id and change state to CLOSED. (1 metadata write)

I am not very sure whether we could get rid of step 2). But at least, we
still have 1 read and 1 write for a close operation. (NOTE: close here does
not mean creating a ledger and closing it; it is the *close* performed when
opening to recover.)
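
(Spelled out as a sketch, with hypothetical names rather than the actual client code, the accounting of steps 1)-4) above looks like this.)

// Sketch of the metadata traffic of a recovery-close, matching steps 1)-4):
// 1 metadata read plus 2 metadata writes in the common case. All names are
// hypothetical; this only makes the accounting above concrete.
class RecoveryCloseSketch {
    enum State { OPEN, IN_RECOVERY, CLOSED }

    static class LedgerMeta {
        State state = State.OPEN;
        long lastEntryId = -1;
    }

    interface MetaStore {
        LedgerMeta read(long ledgerId);                 // step 1: 1 metadata read
        void casWrite(long ledgerId, LedgerMeta meta);  // steps 2 and 4: 1 write each
    }

    static void recoveryClose(MetaStore store, long ledgerId) {
        LedgerMeta meta = store.read(ledgerId);               // read 1
        if (meta.state == State.CLOSED) {
            return;                                           // already closed, nothing to do
        }
        meta.state = State.IN_RECOVERY;
        store.casWrite(ledgerId, meta);                       // write 1: mark recovery in progress
        meta.lastEntryId = recoverLastAddConfirmed(ledgerId); // step 3: talks to bookies, no metadata
        meta.state = State.CLOSED;
        store.casWrite(ledgerId, meta);                       // write 2: record last entry and close
    }

    static long recoverLastAddConfirmed(long ledgerId) {
        return 0L; // placeholder for the ledger recovery procedure
    }
}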

> According to your analysis, the size of the spike is linear on the number
of topics the faulty hub owns. Your proposal just changes the constant, but
it is still linear, yes?

Mathematically, it is still linear, but in reality it would be much better.
And the number of topics is the one factor we cannot change for an
application that requires this many topics.

>  In fact, isn't it the case that you can do at least some of what you're
proposing with managed ledgers?

Actually, that is not the whole of managed ledgers. A managed ledger is
essentially the BookKeeper persistence manager combined with the
subscriptions manager (cursor) in Hedwig. Managed ledgers still have the
same problems I raised in the proposal.

The big problem for current bookkeeper is that applications (either Hedwig
or managed ledgers) need to find extra places to record the ledger ids
generated by bookkeeper. It is not efficient and also duplicates the
metadata storage; for a system facing a large number of topics or ledgers,
it is an Achilles' heel.

> It adds a number of new concurrent scenarios, increasing the complexity
of what we expose, and it violates one principle that we used when first
designing this system

Could you point out the concurrent scenarios and the complexity I added in
the proposal? I am not very sure about that.

I took care to re-use existing concepts, with minimum changes, to extend the
api to provide more flexibility and efficiency for applications. I don't
think it violates the principle of keeping the system a bare minimum. For
example, I don't add a cursor concept to ledgers. The ledger can still be
read many times. How to shrink a ledger depends on the application. Hedwig
or managed ledgers would record their cursors (subscriber states) and shrink
a ledger when they decide to do so.
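
(A purely illustrative example of keeping the cursor on the application side and calling the proposed shrink(endEntryId, force) api; ProposedLedgerHandle is a hypothetical stand-in for a handle exposing that api.)

// Illustrative only: the application tracks its own cursor (subscriber state)
// and uses the proposed shrink(endEntryId, force) call to reclaim entries.
interface ProposedLedgerHandle {
    void shrink(long endEntryId, boolean force) throws Exception;
}

class SingleSubscriberTopic {
    private static final long GC_INTERVAL = 1000; // entries between forced GCs
    private final ProposedLedgerHandle ledger;
    private long consumedUpTo = -1;        // application-side cursor
    private long lastGarbageCollected = -1;

    SingleSubscriberTopic(ProposedLedgerHandle ledger) {
        this.ledger = ledger;
    }

    void onConsumed(long entryId) throws Exception {
        consumedUpTo = entryId;
        boolean force = consumedUpTo - lastGarbageCollected >= GC_INTERVAL;
        // advance startEntryId past everything consumed; only occasionally ask
        // the bookies to actually garbage collect the reclaimed range
        ledger.shrink(consumedUpTo + 1, force);
        if (force) {
            lastGarbageCollected = consumedUpTo;
        }
    }
}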


On Mon, Jan 14, 2013 at 1:06 AM, Flavio Junqueira <fp...@yahoo.com>wrote:

> Thanks for your proposal, Sijie. The cost of owning a topic seems indeed
> high, and your motivation to raise that point seems to be that upon hub
> crashes there will be a spike of accesses to the metadata store. According
> to your analysis, the size of the spike is linear on the number of topics
> the faulty hub owns. Your proposal just changes the constant, but it is
> still linear, yes?
>
> Although I think I understand the motivation for your proposal, I'm not
> really in favor of extending the BK API like this. It adds a number of new
> concurrent scenarios, increasing the complexity of what we expose, and it
> violates one principle that we used when first designing this system, which
> is keeping functionality to a bare minimum so that we can implement it
> efficiently. Other functionality needed can be implemented on top. In fact,
> isn't it the case that you can do at least some of what you're proposing
> with managed ledgers?
>
> One point you raised that concerns me a bit is the cost of a close
> operation. I need to review it because I'm not sure why we currently need 3
> accesses to the metadata store. It should really be just one in the regular
> case (no concurrent attempts to close a ledger through ledger recovery). I
> agree though that we need to think about how to deal efficiently with the
> hedwig issue you're raising.
>
> -Flavio
>
> On Jan 14, 2013, at 4:53 AM, Sijie Guo <gu...@gmail.com> wrote:
>
> > Hello all,
> >
> > Currently Hedwig used *ledgers* to store messages for a topic. It
> requires
> > lots of metadata operations when a hub server owned a topic. These
> metadata
> > operations are:
> >
> >   1. read topic persistence info for a topic. (1 metadata read operation)
> >   2. close the last opened ledger. (1 metadata read operation, 2 metadata
> >   write operations)
> >   3. create a new ledger to write. (1 metadata write operation)
> >   4. update topic persistence info fot the topic to track the new ledger.
> >   (1 metadata write operation)
> >
> > so there are at least 2 metadata read operations and 4 metadata write
> > operations when acquiring a topic. if a hub server owned lots of topics
> > restarts, it would introduce a spike of metadata accesses to the metadata
> > storage (e.g. ZooKeeper).
> >
> > Currently hedwig's design is originated from ledger's *"write once, read
> > many"* semantic.
> >
> >   1. Ledger id is generated by bookkeeper. Hedwig needs to record ledger
> >   id in extra places, which introduce extra metadata accesses.
> >   2. A ledger could not wrote any more entries after it was closed => so
> >   hedwig has to create a new ledger to write new entries after the
> ownership
> >   of a topic is changed (e.g. hub server failure, topic release).
> >   3. A ledger's entries could not be *deleted* only after a ledger is
> >   deleted => so hedwig has to change ledgers, which let entries could be
> >   consumed by *deleting* ledger after all subscribers consumed.
> >
> > I proposed two new apis accompanied with "re-open, append" semantic in
> > BookKeeper, for high performance metadata access and easy metadata
> > management for applications.
> >
> > public void openLedger(String ledgerName, DigestType digestType,
> > byte[] passwd, Mode mode);
> >
> > *Mode* indicates the access mode of a ledger, which would be *O_CREATE*,
> *
> > O_APPEND*, *O_RDONLY*.
> >
> >   - O_CREATE: create a new ledger with the given ledger name. if there is
> >   a ledger existed already, fail the creation. similar as createLedger
> now.
> >   - O_APPEND: open a new ledger with the given ledger name and continue
> >   write entries.
> >   - O_RDONLY: open a new ledger w/o changing any state just reading
> >   entries already persisted. similar as openLedgerNoRecovery now.
> >
> > *ledgerName* indicates the name of a ledger. user could pick up either
> name
> > he likes, so he could manage his ledgers in his way like introducing
> > namespace over it, instead of bookkeeper generatating ledger id for them.
> > (in most of cases, application needs to find another place to store the
> > generated ledger id. the practise is really bad)
> >
> > public void shrink(long endEntryId, boolean force) throws BKException;
> >
> > *Shrink* means cutting the entries starting from *startEntryId* to *
> > endEntryId* (endEntryId is non-inclusive). *startEntryId*is implicit in
> > ledger metadata, which is 0 for a non-shrinked ledger, while it is *
> > endEntryId* from previous valid shrink.
> >
> > 'Force' flag indicate whether to issue garbage collection request after
> we
> > just move the *startEntryId* to *endEntryId*. If the flag is true, we
> issue
> > garbage collection request to notify bookie server to do garbage
> > collection; otherwise, we just move *startEntryId* to *endEntryId*. This
> > feature might be useful for some applications. Take Hedwig for example,
> we
> > could leverage this feature not to store the subscriber state for those
> > topics which have only one subscriber for each. Each time after specific
> > number of messages consumed, we move the entry point by*shrink(entryId,
> > false)*. After several messages consumed, we garbage collected them by
> > *shrink(entryId,
> > true)*.
> >
> > Using *shrink*, application could relaim the disk space occupied by a
> > ledger w/o creating new ledger and deleting old one.
> >
> > These two operations are based on two mechanisms: one is 'session
> fencing',
> > and the other one is 'improved garbage collection (BOOKKEEPER-464)'.
> > Details are in the gist https://gist.github.com/4520260 . I would try to
> > start working on some drafts based on the idea to demonstrate its
> > correctness.
> >
> > Welcome for comments and discussions.
> > -Sijie
>
>

Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Posted by Flavio Junqueira <fp...@yahoo.com>.
Thanks for your proposal, Sijie. The cost of owning a topic seems indeed high, and your motivation to raise that point seems to be that upon hub crashes there will be a spike of accesses to the metadata store. According to your analysis, the size of the spike is linear on the number of topics the faulty hub owns. Your proposal just changes the constant, but it is still linear, yes?

Although I think I understand the motivation for your proposal, I'm not really in favor of extending the BK API like this. It adds a number of new concurrent scenarios, increasing the complexity of what we expose, and it violates one principle that we used when first designing this system, which is keeping functionality to a bare minimum so that we can implement it efficiently. Other functionality needed can be implemented on top. In fact, isn't it the case that you can do at least some of what you're proposing with managed ledgers?

One point you raised that concerns me a bit is the cost of a close operation. I need to review it because I'm not sure why we currently need 3 accesses to the metadata store. It should really be just one in the regular case (no concurrent attempts to close a ledger through ledger recovery). I agree though that we need to think about how to deal efficiently with the hedwig issue you're raising.

-Flavio

On Jan 14, 2013, at 4:53 AM, Sijie Guo <gu...@gmail.com> wrote:

> Hello all,
> 
> Currently Hedwig used *ledgers* to store messages for a topic. It requires
> lots of metadata operations when a hub server owned a topic. These metadata
> operations are:
> 
>   1. read topic persistence info for a topic. (1 metadata read operation)
>   2. close the last opened ledger. (1 metadata read operation, 2 metadata
>   write operations)
>   3. create a new ledger to write. (1 metadata write operation)
>   4. update topic persistence info fot the topic to track the new ledger.
>   (1 metadata write operation)
> 
> so there are at least 2 metadata read operations and 4 metadata write
> operations when acquiring a topic. if a hub server owned lots of topics
> restarts, it would introduce a spike of metadata accesses to the metadata
> storage (e.g. ZooKeeper).
> 
> Currently hedwig's design is originated from ledger's *"write once, read
> many"* semantic.
> 
>   1. Ledger id is generated by bookkeeper. Hedwig needs to record ledger
>   id in extra places, which introduce extra metadata accesses.
>   2. A ledger could not wrote any more entries after it was closed => so
>   hedwig has to create a new ledger to write new entries after the ownership
>   of a topic is changed (e.g. hub server failure, topic release).
>   3. A ledger's entries could not be *deleted* only after a ledger is
>   deleted => so hedwig has to change ledgers, which let entries could be
>   consumed by *deleting* ledger after all subscribers consumed.
> 
> I proposed two new apis accompanied with "re-open, append" semantic in
> BookKeeper, for high performance metadata access and easy metadata
> management for applications.
> 
> public void openLedger(String ledgerName, DigestType digestType,
> byte[] passwd, Mode mode);
> 
> *Mode* indicates the access mode of a ledger, which would be *O_CREATE*, *
> O_APPEND*, *O_RDONLY*.
> 
>   - O_CREATE: create a new ledger with the given ledger name. if there is
>   a ledger existed already, fail the creation. similar as createLedger now.
>   - O_APPEND: open a new ledger with the given ledger name and continue
>   write entries.
>   - O_RDONLY: open a new ledger w/o changing any state just reading
>   entries already persisted. similar as openLedgerNoRecovery now.
> 
> *ledgerName* indicates the name of a ledger. user could pick up either name
> he likes, so he could manage his ledgers in his way like introducing
> namespace over it, instead of bookkeeper generatating ledger id for them.
> (in most of cases, application needs to find another place to store the
> generated ledger id. the practise is really bad)
> 
> public void shrink(long endEntryId, boolean force) throws BKException;
> 
> *Shrink* means cutting the entries starting from *startEntryId* to *
> endEntryId* (endEntryId is non-inclusive). *startEntryId*is implicit in
> ledger metadata, which is 0 for a non-shrinked ledger, while it is *
> endEntryId* from previous valid shrink.
> 
> 'Force' flag indicate whether to issue garbage collection request after we
> just move the *startEntryId* to *endEntryId*. If the flag is true, we issue
> garbage collection request to notify bookie server to do garbage
> collection; otherwise, we just move *startEntryId* to *endEntryId*. This
> feature might be useful for some applications. Take Hedwig for example, we
> could leverage this feature not to store the subscriber state for those
> topics which have only one subscriber for each. Each time after specific
> number of messages consumed, we move the entry point by*shrink(entryId,
> false)*. After several messages consumed, we garbage collected them by
> *shrink(entryId,
> true)*.
> 
> Using *shrink*, application could relaim the disk space occupied by a
> ledger w/o creating new ledger and deleting old one.
> 
> These two operations are based on two mechanisms: one is 'session fencing',
> and the other one is 'improved garbage collection (BOOKKEEPER-464)'.
> Details are in the gist https://gist.github.com/4520260 . I would try to
> start working on some drafts based on the idea to demonstrate its
> correctness.
> 
> Welcome for comments and discussions.
> -Sijie