Posted to user@bookkeeper.apache.org by Istvan Soos <is...@gmail.com> on 2017/11/07 18:47:28 UTC

log compaction of entries

On the website [0] I gather that data compaction is mostly about
cleaning up after we delete a ledger. Is there a feature or plan to
implement entry-level compaction, e.g. to have an ID that uniquely
identifies an entity, and if there are two events for that entity,
only retain the last one?

[0]: https://bookkeeper.apache.org/docs/latest/getting-started/concepts/

Or do you implement it by using different ledgers, migrating from one
to another? How does it work out with handovers of what is considered
the main ledger to write to or read from?

Thanks,
  Istvan

Re: log compaction of entries

Posted by Enrico Olivelli <eo...@gmail.com>.

On Wed, Nov 8, 2017 at 4:27 PM, Ivan Kelly <iv...@apache.org> wrote:

> > I believe I'm missing the implication of this. Does that mean we need
> > to name ledgers logically so that we can keep track of them, because each
> > one allows only a single uninterrupted session of writes and is read-only
> > afterwards?
> It's not possible to specify a name on ledger creation. When you
> create a ledger that ledger is assigned an ID. If you want to keep
> track of different ledgers, you need to store a mapping from a logical
> name to ledger ID somewhere.
>

Nowadays you can manually specify a ledger ID, but it still has to be a
positive long.
Enrico

>
> -Ivan
>
-- Enrico Olivelli

Re: log compaction of entries

Posted by Ivan Kelly <iv...@apache.org>.

> I believe I'm missing the implication of this. Does that mean we need
> to name ledgers logically so that we can keep track of them, because each
> one allows only a single uninterrupted session of writes and is read-only
> afterwards?
It's not possible to specify a name on ledger creation. When you
create a ledger that ledger is assigned an ID. If you want to keep
track of different ledgers, you need to store a mapping from a logical
name to ledger ID somewhere.
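
As a rough illustration only (this is not BookKeeper API; the class, its
methods, and the "crawl-results" name are all hypothetical), such a mapping
could be sketched in Python with a dict standing in for whatever external
store is used. Since a closed ledger cannot be reopened for writing, the
sketch keeps an ordered list of ledger IDs per logical name rather than a
single ID:

```python
from collections import defaultdict
from itertools import count


class LedgerDirectory:
    """Hypothetical in-memory stand-in for an external metadata store."""

    def __init__(self):
        self._ledgers = defaultdict(list)  # logical name -> ordered ledger IDs
        self._next_id = count(1)           # stand-in for system-assigned IDs

    def roll(self, name):
        """Record a fresh ledger ID as the current write target for `name`."""
        ledger_id = next(self._next_id)
        self._ledgers[name].append(ledger_id)
        return ledger_id

    def current(self, name):
        """Ledger ID that is currently being written under `name`."""
        return self._ledgers[name][-1]

    def all_ledgers(self, name):
        """Every ledger ID a reader has to scan, oldest first."""
        return list(self._ledgers[name])


directory = LedgerDirectory()
first = directory.roll("crawl-results")
# The writer restarted: the old ledger is now read-only, so roll to a new one.
second = directory.roll("crawl-results")
```

A reader would walk `all_ledgers(name)` in order, while a writer only ever
appends to the ledger returned by `current(name)`.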

-Ivan

Re: log compaction of entries

Posted by Istvan Soos <is...@gmail.com>.

On Tue, Nov 7, 2017 at 10:42 PM, Sijie Guo <gu...@gmail.com> wrote:
> Yeah, if you are looking for this feature, you should probably check out
> Pulsar (which is a BookKeeper-based pub/sub system):
> https://pulsar.incubator.apache.org/
>
> The topic compaction feature might come in the next release or so.

Ok, good to know!

>> My use case is really simple: a website is crawled at regular intervals,
>> and it is easy to create a content identifier out of it. I would store
>> the different versions of it in a ledger for downstream processing,
>> but there is really no need to preserve all of the versions of the
>> past if the content identifier is the same.
>
>
> It seems a pub/sub messaging system like Pulsar is a good fit for your use case.
>
> Just FYI, a BookKeeper ledger has single-writer semantics: once the ledger is
> closed or the writer fails, you cannot reopen the ledger to write. Is this
> the behavior you expect?

I believe I'm missing the implication of this. Does that mean we need
to name ledgers logically so that we can keep track of them, because each
one allows only a single uninterrupted session of writes and is read-only
afterwards?

Thanks,
  Istvan

Re: log compaction of entries

Posted by Sijie Guo <gu...@gmail.com>.

On Tue, Nov 7, 2017 at 12:18 PM, Istvan Soos <is...@gmail.com> wrote:

> On Tue, Nov 7, 2017 at 8:09 PM, Sijie Guo <gu...@gmail.com> wrote:
> > But I would like to learn more about your use case and to see how we can
> > support you.
>
> It is a nice feature in Kafka, and I've seen a complex app using it:
> https://kafka.apache.org/documentation.html#compaction

Yeah, if you are looking for this feature, you should probably check out
Pulsar (which is a BookKeeper-based pub/sub system):
https://pulsar.incubator.apache.org/

The topic compaction feature might come in the next release or so.

>
>
> My use case is really simple: a website is crawled at regular intervals,
> and it is easy to create a content identifier out of it. I would store
> the different versions of it in a ledger for downstream processing,
> but there is really no need to preserve all of the versions of the
> past if the content identifier is the same.
>

It seems a pub/sub messaging system like Pulsar is a good fit for your use case.

Just FYI, a BookKeeper ledger has single-writer semantics: once the ledger is
closed or the writer fails, you cannot reopen the ledger to write. Is this
the behavior you expect?

>
> I know this specific case can be handled in many different ways, but
> if the ledger could do that on its own, it could simplify the overall
> architecture (free GC).
>

Yeah, I totally see the value in it.

>
> Istvan
>

Re: log compaction of entries

Posted by Istvan Soos <is...@gmail.com>.

On Tue, Nov 7, 2017 at 8:09 PM, Sijie Guo <gu...@gmail.com> wrote:
> But I would like to learn more about your use case and to see how we can support
> you.

It is a nice feature in Kafka, and I've seen a complex app using it:
https://kafka.apache.org/documentation.html#compaction

My use case is really simple: a website is crawled at regular intervals,
and it is easy to create a content identifier out of it. I would store
the different versions of it in a ledger for downstream processing,
but there is really no need to preserve all of the versions of the
past if the content identifier is the same.

I know this specific case can be handled in many different ways, but
if the ledger could do that on its own, it could simplify the overall
architecture (free GC).

Istvan

Re: log compaction of entries

Posted by Sijie Guo <gu...@gmail.com>.

On Tue, Nov 7, 2017 at 10:47 AM, Istvan Soos <is...@gmail.com> wrote:

> On the website [0] I gather that data compaction is mostly about
> cleaning up after we delete a ledger. Is there a feature or plan to
> implement entry-level compaction, e.g. to have an ID that uniquely
> identifies an entity, and if there are two events for that entity,
> only retain the last one?
>
> [0]: https://bookkeeper.apache.org/docs/latest/getting-started/concepts/

Currently we don't have an open item about supporting this "log compaction"
feature.
But I would like to learn more about your use case and to see how we can support
you.

>
>
> Or do you implement it by using different ledgers, migrating from one
> to another?

In the Pulsar community, we are actually discussing a similar "log compaction"
feature.
Pulsar is a pub/sub messaging system built on Apache BookKeeper.
The idea is almost the same as what you described: it would compact the
messages/entries based on some key, and write the compacted messages to a
separate ledger.
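
To make the idea concrete, here is a minimal Python sketch of key-based
compaction. It is not Pulsar or BookKeeper code: ledgers are modeled as plain
lists of (key, value) pairs, and the `compact` function and the sample page
keys are made up for illustration.

```python
def compact(entries):
    """Keep only the last entry seen for each key.

    Survivors stay in the order in which they appeared in the source log.
    """
    latest = {}
    for offset, (key, value) in enumerate(entries):
        latest[key] = (offset, value)  # a later entry overwrites an earlier one
    survivors = sorted(
        (offset, key, value) for key, (offset, value) in latest.items()
    )
    return [(key, value) for _, key, value in survivors]


# The source ledger holds two versions of "page-a"; only the newer one survives.
source_ledger = [("page-a", "v1"), ("page-b", "v1"), ("page-a", "v2")]
compacted_ledger = compact(source_ledger)
```

The source list is left untouched and the result is a new list, which mirrors
writing the compacted entries to a separate ledger instead of rewriting the
original one.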

> How does it work out with handovers of what is considered
> the main ledger to write to or read from?
>

You need some sort of metadata to track the list of ledgers and update the
metadata once a compacted ledger is generated.
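
A sketch of what that metadata update could look like, in Python. Everything
here is hypothetical (the class, the version counter, the ledger IDs); it also
assumes the ledgers being compacted are the oldest ones in the list, so the
compacted ledger takes their place at the front. A version check guards
against two updaters racing each other:

```python
class LedgerListMetadata:
    """Hypothetical versioned record of the ledgers that make up one log."""

    def __init__(self, ledger_ids):
        self.ledger_ids = list(ledger_ids)
        self.version = 0

    def read(self):
        """Return the current version together with a copy of the ledger list."""
        return self.version, list(self.ledger_ids)

    def replace_compacted(self, expected_version, inputs, compacted_id):
        """Swap the `inputs` ledgers for `compacted_id` if nobody else updated."""
        if expected_version != self.version:
            return False  # stale view: re-read and retry
        remaining = [lid for lid in self.ledger_ids if lid not in inputs]
        self.ledger_ids = [compacted_id] + remaining
        self.version += 1
        return True


meta = LedgerListMetadata([1, 2, 3])
version, _ = meta.read()
# Ledgers 1 and 2 were compacted into the new ledger 10.
swapped = meta.replace_compacted(version, {1, 2}, 10)
# A second update that still holds the old version is rejected.
stale = meta.replace_compacted(version, {3}, 11)
```

Readers that pick up the new list see ledger 10 followed by ledger 3; the old
ledgers 1 and 2 can be deleted once no reader is still using them.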

Hope this answers your questions. Would love to chat more about your use
case.

>
> Thanks,
>   Istvan
>